DataStage Architecture
DataStage is a client-server ETL tool: an integrated toolset used for designing, running, monitoring, and administering data acquisition applications, known as jobs.
A job is a graphical representation of the data flow from source to target, and it is designed with source definitions, target definitions, and transformation rules.
The DataStage software consists of client and server components.
(Figure: the DataStage client components, such as the Administrator, communicate over TCP/IP with the server-side DataStage Repository.)
DS Client components:
1) DataStage Designer: It is used to create the DataStage applications known as jobs. The following activities can be
performed in the Designer window:
a) Create the source definition.
b) Create the target definition.
c) Develop transformation rules.
d) Design jobs.
2) DataStage Administrator: This component is used to create or delete projects, clean up metadata stored in the
repository, and install NLS (National Language Support).
3) DataStage Manager: It is used to perform tasks such as the following:
a) Create the table definitions.
b) Perform metadata backup and recovery.
c) Create customized components.
4) DataStage Director: It is used to validate, schedule, run, and monitor DataStage jobs.
5) Web Console: The Web Console is used to create DataStage users and perform administration. It is handled by the
DataStage administrator.
6) Multi-Client Manager: It is used to install multiple client versions (for example, DS 7.5, DS 8.1, or DS 8.5) on the
local PC and switch to any version when required. It is used by DataStage developers, operators, and
administrators.
DataStage Repository: It is a server-side component that stores the metadata and job information needed to build
the data warehouse.
DataStage Server: It is the server-side component that executes the jobs created with the DataStage clients.
Parallel Jobs:
Parallel jobs exploit two kinds of parallelism: pipeline parallelism and partition parallelism.
Pipeline parallelism:- in this parallelism, each stage of the job starts processing records as soon as they arrive from
the upstream stage, so downstream stages do not wait for upstream stages to finish all records.
For example, if the source has 4 records, then as soon as the first record starts processing, the
remaining records are processed simultaneously through the stages.
Let us assume that the input file has only one column, Customer_Name, with the following values:
Customer_Name
Clark
Raj
James
Cameroon
After the 1st record (Clark) is extracted by the 1st stage (Sequential File stage) and moved to the
second stage for processing, the 2nd record (Raj) is extracted immediately, even before the 1st record reaches the
final stage (Peek stage). Thereby, by the time the 1st record reaches the Peek stage, the 3rd
record (James) may already have been extracted by the Sequential File stage.
Data can be buffered in blocks so that each process is not slowed when other components are running.
This approach avoids deadlocks and speeds performance by allowing both upstream and downstream
processes to run concurrently.
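To make the pipelining idea concrete, here is a minimal Python sketch (purely illustrative; this is not how DataStage is implemented). Each stage runs in its own thread and passes records downstream through a small buffered queue, so the 2nd record can be extracted while the 1st record is still moving toward the Peek stage.

# Illustrative sketch of pipeline parallelism (not DataStage internals):
# each stage runs in its own thread and hands records downstream through a
# small buffered queue, so extraction, transformation, and the final stage
# all work on the stream at the same time.
import threading, queue

END = object()  # sentinel marking the end of the stream

def sequential_file_stage(out_q):
    for row in ["Clark", "Raj", "James", "Cameroon"]:
        print(f"extract   -> {row}")
        out_q.put(row)
    out_q.put(END)

def transformer_stage(in_q, out_q):
    while (row := in_q.get()) is not END:
        transformed = row.upper()          # trivial transformation
        print(f"transform -> {transformed}")
        out_q.put(transformed)
    out_q.put(END)

def peek_stage(in_q):
    while (row := in_q.get()) is not END:
        print(f"peek      -> {row}")

q1, q2 = queue.Queue(maxsize=2), queue.Queue(maxsize=2)  # small buffers between stages
threads = [
    threading.Thread(target=sequential_file_stage, args=(q1,)),
    threading.Thread(target=transformer_stage, args=(q1, q2)),
    threading.Thread(target=peek_stage, args=(q2,)),
]
for t in threads: t.start()
for t in threads: t.join()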
Without data pipelining, the following issues arise:
Data must be written to disk between processes, degrading performance and increasing storage
requirements and the need for disk management.
The developer must manage the I/O processing between components.
The process becomes impractical for large data volumes.
The application will be slower, as disk use, management, and design complexities increase.
Each process must complete before downstream processes can begin, which limits performance
and full use of hardware resources.
Partition parallelism:- in this parallelism, the same job is effectively run simultaneously by several
processors. Each processor handles a separate subset of the total records.
Data partitioning is an approach to parallelism that involves breaking the record set into partitions, or
subsets of records. Data partitioning generally provides linear increases in application performance.
The figure shows data that is partitioned by customer surname before it flows into the Transformer stage.
Partition Parallelism divides the incoming stream of data into subsets that will be processed separately
by a separate node/processor. These subsets are called partitions and each partition is processed by
the same operation.
Let us understand this in layman terms:
As we know DataStage can be implemented in an SMP or MPP architecture. This provides us with
additional processors for performing operations.
To leverage this processing capability, Partition Parallelism was introduced in the Information Server
(DataStage).
Let us assume that in our current setup there are 4 processors available for use by DataStage. The
details of these processors are defined in the DataStage configuration file.
For example, if the source has 100 records and 4 partitions, the data will be partitioned equally across the
4 partitions, which means each partition will get 25 records. When the first partition starts processing, the
remaining three partitions start processing simultaneously, in parallel.
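As a purely illustrative sketch (DataStage performs this internally based on the chosen partitioning method), the following Python snippet splits 100 records across 4 partitions round-robin style, so each partition receives 25 records that can then be processed in parallel.

# Illustrative sketch: splitting 100 records across 4 partitions round-robin.
# This only mimics the idea of partitioning; it is not DataStage code.
NUM_PARTITIONS = 4
records = [f"record_{i}" for i in range(1, 101)]    # 100 source records

partitions = [[] for _ in range(NUM_PARTITIONS)]
for i, record in enumerate(records):
    partitions[i % NUM_PARTITIONS].append(record)   # round-robin assignment

for n, part in enumerate(partitions, start=1):
    print(f"partition {n}: {len(part)} records")    # each partition gets 25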
Sample Configuration File
{
  node "node1"
  {
    fastname "newton"
    pools ""
    resource disk "/user/development/datasets" {pools ""}
    resource scratchdisk "/user/development/scratch" {pools ""}
  }
  node "node2"
  {
    fastname "newton"
    pools ""
    resource disk "/user/development/datasets" {pools ""}
    resource scratchdisk "/user/development/scratch" {pools ""}
  }
  node "node3"
  {
    fastname "newton"
    pools ""
    resource disk "/user/development/datasets" {pools ""}
    resource scratchdisk "/user/development/scratch" {pools ""}
  }
  node "node4"
  {
    fastname "newton"
    pools ""
    resource disk "/user/development/datasets" {pools ""}
    resource scratchdisk "/user/development/scratch" {pools ""}
  }
}
Using the configuration file, DataStage can identify the 4 available processors and can utilize them to
perform operations simultaneously.
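As a rough illustration of that idea (this is not the real configuration-file parser, and the file name default.apt is only a placeholder), the number of node entries in the file corresponds to the degree of parallelism:

# Rough illustration (not the real DataStage parser): the number of "node"
# entries in the configuration file determines the degree of parallelism.
import re

with open("default.apt") as f:            # hypothetical path to the config file
    config_text = f.read()

node_names = re.findall(r'node\s+"([^"]+)"', config_text)
print(f"degree of parallelism: {len(node_names)}")   # 4 for the sample file
print("nodes:", node_names)                          # ['node1', ..., 'node4']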
For the same example, assume the Customer_Name column now contains the following values:
Customer_Name
James
Sonata
Yash
Carey
Suppose we have to append _Female to the end of names that start with S, and _Male to names that do not. We
will use the Transformer stage to perform this operation. By selecting the 4-node configuration file, we can
perform the required operation on the names roughly four times faster than before.
The required operation is replicated on each processor (node), and each customer name is processed on a
separate node simultaneously, thereby greatly increasing the performance of DataStage jobs.
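The derivation itself is trivial; the speed-up comes from running it on all nodes at once. Here is an illustrative Python sketch in which a process pool stands in for the four processing nodes (this is only an analogy, not DataStage code):

# Illustrative sketch of the Transformer derivation applied in parallel.
# Names starting with "S" get "_Female" appended; all others get "_Male".
# A multiprocessing pool stands in for the separate processing nodes.
from multiprocessing import Pool

def transform(name):
    suffix = "_Female" if name.startswith("S") else "_Male"
    return name + suffix

if __name__ == "__main__":
    names = ["James", "Sonata", "Yash", "Carey"]
    with Pool(processes=4) as pool:              # one worker per "node"
        print(pool.map(transform, names))
        # ['James_Male', 'Sonata_Female', 'Yash_Male', 'Carey_Male']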
A scalable architecture should support many types of data partitioning (for example, round-robin, hash, range, and entire partitioning).
IBM Information Server automatically partitions data based on the type of partition that the stage
requires. Typical packaged tools lack this capability and require developers to manually create data
partitions, which results in costly and time-consuming rewriting of applications or the data partitions
whenever the administrator wants to use more hardware capacity.
In a well-designed, scalable architecture, the developer does not need to be concerned about the
number of partitions that will run, the ability to increase the number of partitions, or repartitioning
data.
Dynamic repartitioning:
In the examples shown in Figure 1 and Figure 2, data is partitioned based on customer surname, and then
the data partitioning is maintained throughout the flow.
This type of partitioning is impractical for many uses, such as a transformation that requires data
partitioned on surname but must then be loaded into the data warehouse by using the customer
account number.
Dynamic data repartitioning is a more efficient and accurate approach. With dynamic data
repartitioning, data is repartitioned while it moves between processes without writing the data to disk,
based on the downstream process that data partitioning feeds. The IBM Information Server parallel
engine manages the communication between processes for dynamic repartitioning.
Data is also pipelined to downstream processes as soon as it is available.
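As a rough in-memory illustration of repartitioning (the parallel engine does this between processes, without landing data on disk and without any user code), reassigning records that were partitioned by surname to new partitions based on a hash of the account number looks like this:

# Rough illustration of repartitioning by a new key (customer account number)
# as records flow between stages; the real engine does this between processes
# without writing the data to disk. Not DataStage code.
NUM_PARTITIONS = 4

def repartition_by_account(records):
    """Assign each record to a partition based on a hash of its account number."""
    partitions = [[] for _ in range(NUM_PARTITIONS)]
    for rec in records:
        target = hash(rec["account_no"]) % NUM_PARTITIONS  # new partitioning key
        partitions[target].append(rec)
    return partitions

records = [                                   # previously partitioned by surname
    {"surname": "Clark",    "account_no": "A1001"},
    {"surname": "Raj",      "account_no": "A1002"},
    {"surname": "James",    "account_no": "A1003"},
    {"surname": "Cameroon", "account_no": "A1004"},
]
for n, part in enumerate(repartition_by_account(records), start=1):
    print(f"partition {n}: {[r['surname'] for r in part]}")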