E2 E3 Infosphere Datastage - Introduction To The Parallel Architecture
Parallel Architecture
Agenda
• Parallel processing architecture
• Partition parallelism
• Pipeline parallelism
• Parallel development environment
• Parallel framework
• Parallel job execution model
• Learning DataStage at the GUI job design level is not enough. To develop the ability to
design sound, scalable jobs, you need to understand the underlying architecture.
• Development environment
How to take advantage of the parallel framework
How to debug and change the GUI design based on parallel framework messages in
the job log
• To be able to design robust parallel jobs, we need to get behind and beyond the GUI. We
need to understand what gets generated from the GUI design and how this gets executed
by the parallel framework. We also need to be able to debug and modify our job designs
based on what we see happen at runtime.
Parallel engine
Saturday, August 07, 2021 6
Key Parallel Concepts
• Parallel processing:
Executing the job on multiple CPUs
• Scalable processing:
Add more resources (CPUs and disks) to increase system performance
• The parallel engine uses the processing node concept. Standalone processes, rather than
threads, are used. This process-based architecture is platform independent and allows
greater scalability across resources within the processing pool.
● 2 to 1000s of CPUs
● Complex to manage
● Lots of small jobs
• This processing may exist outside of a database (using flat files for intermediate results) or
within a database (using SQL, stored procedures, and temporary tables).
• Advantages:
Reduces disk usage for staging areas
Keeps processors busy
• Partitioning breaks a dataset into smaller sets. This is a key to scalability. However, the
data needs to be evenly distributed across the partitions; otherwise, the benefits of
partitioning are reduced.
• It is important to note that the same processing is applied to every partition: each partition
of the data is processed and transformed in exactly the same way.
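As an illustration only (plain Python, not DataStage itself), key-based partitioning followed by an identical per-partition operation can be sketched like this; the column names and partition count are hypothetical:

```python
# Hypothetical sketch of hash partitioning: rows are distributed by a
# key column, and the same operation then runs on every partition.

def hash_partition(rows, key, num_partitions):
    """Distribute rows across partitions by hashing a key column."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[hash(row[key]) % num_partitions].append(row)
    return partitions

rows = [{"cust_id": i, "amount": i * 10} for i in range(8)]
parts = hash_partition(rows, "cust_id", 4)

# The identical transformation is applied to each partition.
totals = [sum(r["amount"] for r in p) for p in parts]
```

If the key values are skewed, some partitions receive far more rows than others, which is exactly the uneven-distribution problem the note above warns about.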
• By combining both pipelining and partitioning DataStage creates jobs with higher volume
throughput. The configuration file drives the parallelism by specifying the number of
partitions.
At runtime, this job runs in parallel under any configuration (1 node, 4 nodes, N nodes).
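Pipelining can be pictured with Python generators (an analogy only, not the engine's actual mechanism): each stage starts consuming rows as soon as the upstream stage emits them, with no intermediate staging file between stages.

```python
# Each stage is a generator: rows stream through the pipeline one at a
# time instead of being fully materialized between stages.

def extract():
    for i in range(5):
        yield {"id": i, "value": i * 2}

def transform(rows):
    for row in rows:
        # Downstream stage works on each row as it arrives.
        yield {**row, "value": row["value"] + 1}

def load(rows):
    return [row["value"] for row in rows]

result = load(transform(extract()))
# result -> [1, 3, 5, 7, 9]
```

This is why pipelining reduces disk usage for staging areas and keeps processors busy: no stage waits for the previous one to finish the whole dataset.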
• Scale up by adding processors or nodes with no application change or recompile
• The configuration file specifies the hardware configuration and resources
• The parallel configuration file specifies the hardware resources used to run a particular job.
• Explicit parallelism
• Implicit pipeline "parallelism"
• Implicit data-partition parallelism
• There are really only two types of parallelism (pipeline and partition). “Explicit”
parallelism refers to restricting the degree of parallelism by explicitly “constraining” a
stage to a node or node pool (i.e., a named set of nodes).
• However, constraining pieces of a job to a specific pool of nodes reduces flexibility. Named
node pools give more flexibility than constraining to specific nodes, but any constraints
reduce flexibility.
• Defines number of nodes (logical processing units) with their resources (need not match
physical CPUs)
Dataset, Scratch, Buffer disk (file systems)
Optional resources (Database, SAS, etc.)
Advanced resource optimizations
• “Pools” (named subsets of nodes)
• DataStage jobs can point to different configuration files by using job parameters. Thus, a
job can utilize different hardware architectures without being recompiled.
• It can pay to run a 4-node configuration file on a 2-processor box, for example, if the job is
“resource bound”: we can spread disk I/O among more controllers.
• Following the keyword “node” is the name of the node (logical processing unit).
• The order of resources is significant. The first disk is used before the second, and so on.
• Keywords, such as “sort” and “bigdata”, when used, restrict the signified processes to the
use of the resources that are identified. For example, “sort” restricts sorting to node pools
and scratchdisk resources labeled “sort”.
• Database resources (not shown here) can also be created that restrict database access to
certain nodes.
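A minimal two-node configuration file in the engine's configuration format might look like the sketch below. The server name, file system paths, and the "sort" scratchdisk pool are hypothetical placeholders, not values from this course:

```
{
  node "node1"
  {
    fastname "server1"
    pools ""
    resource disk "/data/ds/disk0" {pools ""}
    resource scratchdisk "/data/ds/scratch0" {pools "" "sort"}
  }
  node "node2"
  {
    fastname "server1"
    pools ""
    resource disk "/data/ds/disk1" {pools ""}
    resource scratchdisk "/data/ds/scratch1" {pools ""}
  }
}
```

Following the keyword "node" is the logical node name; listing the first scratchdisk of node1 in the "sort" pool restricts sorting to that resource, as described above.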
• Question: Can objects be constrained to specific CPUs? No, a request is made to the
operating system and the operating system chooses the CPU.
• On clustered / GRID / MPP configurations, named pools can be used to further specify
resources across physical servers:
Through careful job design, data shipping can be minimized
Pools can specify the server(s) with database connectivity
Row Generator: Columns Generating Integers
• Here the first column cycles through 0-3, the second 0-2, and the third 0-1.
Return Mapped Values
Config File Displayed in Job Log
(Screenshot: a job log message displaying the configuration file, with callouts for the first and second partitions.)