01 Ab Initio Basics
• Ab Initio Architecture
• Overview of Graph
• Ab Initio functions
• Basic components
• Partitioning and De-partitioning
• Case Studies
May 18, 2010
Ab Initio Architecture
July 6, 2010
Introduction
• Data processing tool from Ab Initio Software Corporation (http://www.abinitio.com)
Layers: User Applications → Development Environments → Co>Operating System → Operating System
Client–Server View

GDE (client):
– Ability to graphically design batch programs comprising Ab Initio components, connected by pipes
– Ability to test run the graphical design and monitor its progress

Host machines (server side):
– Co>Operating System layered on the native operating system
– Component programs (partitions, transforms, etc.), usually generated using the GDE
– Ab Initio built-in programs and user programs
– A Unix shell script or NT batch file drives the run: it supplies parameter values to the underlying programs through arguments and environment variables, and controls the flow of data through pipes
Ab Initio – Process

The GDE connects to the Co>Operating System on the host using one of: FTP, TELNET, REXEC, RSH, or DCOM.
Co>Operating System
• Layered on top of the operating system
• Unites a network of computing resources into a data-
processing system with scalable performance
• Co>Operating system runs on …
– Sun Solaris 2.6, 7, and 8 (SPARC)
– IBM AIX 4.2, and 4.3
– Hewlett-Packard HP-UX 10.20, 11.00, and 11.11
– Siemens Pyramid Reliant UNIX Release 5.43
– IBM DYNIX/ptx 4.4.6, 4.4.8, 4.5.1, and 4.5.2
– Silicon Graphics IRIX 6.5
– Red Hat Linux 6.2 and 7.0 (x86)
– Windows NT 4.0 (x86) with SP 4, 5 or 6
– Windows 2000 (x86) with no service pack or SP1
– Digital UNIX V4.0D (Rev. 878) and 4.0E (Rev. 1091)
– Compaq Tru64 UNIX Versions 4.0F (Rev 1229) and 5.1 (Rev 732)
– IBM OS/390 Version 2.8, 2.9, and 2.10
– NCR MP-RAS 3.02
Graphical Development Environment
The GDE …

Note: During deployment, the GDE sets AB_COMPATIBILITY to the Co>Operating System version number.
Overview of Graph
The Graph Model
A Graph
• Logical modular unit of an application.
A Component
A Component Organizer
The Graph Model: Naming the pieces
A Sample Graph …
• Datasets (e.g. Customers, Good Customers, Other Customers)
• Components (e.g. Score)
• Flows (connecting the components and datasets)
The Graph Model: A closer look
A Sample Graph …
• Expression Metadata
• Ports
• Layout
Parts of a typical Graph
• Datasets – A table or a file that holds input or output data.
• Metadata – Data about data.
Types of Databases
Teradata, Netezza, DB2, MS SQL Server, Red Brick, Sybase, etc.
Structural Components of a Graph
• Start Script
  – Local to the Graph
• Setup Command
• Graph
• End Script
Runtime Environment
• A graph can be executed from the GDE itself, or from the back-end (as a deployed script run on the host).
A sample graph
Layout
1. Layout determines the location of a resource.
Layout
Examples: a serial file on Host X; a set of files on Host X.
Layout Determines What Runs
Serial vs. Parallel
Controlling Layout
• Propagate (default)
• Construct layout manually
• Run on these hosts
• Database components can use the same layout as a database
Phases of a Graph
Phases are used to break up a graph into blocks for performance tuning.
Breaking an application into phases limits contention for:
– Main memory
– Processors
Breaking an application into phases costs: disk space.
The temporary files created by phasing are deleted at the end of the phase, regardless of whether the run was successful.
Checkpoint & Recovery
A checkpoint is a point at which the Co>Operating
System saves all the information it would need to restore a
job to its state at that point. In case of failure, you can
recover completed phases of a job up to the last completed
checkpoint.
The Phase
A toggle between Phase (P) and Checkpoint After Phase (C):
• Select Phase Number
• View Phase Set
Anatomy of a Running Job
(Figures show the GDE, the Host process, and Agent processes on each host machine.)

• Component Execution
– Component processes do their jobs.
– Component processes communicate directly with datasets and each other to move data around.

• Agent Termination
– When all of an Agent's Component processes exit, the Agent informs the Host process that those components are finished.
– The Agent process then exits.

• Host Termination
– When all Agents have exited, the Host process informs the GDE that the job is complete.
– The Host process then exits.

On failure:

• Agent Termination
– When every Component process of an Agent has been killed, the Agent informs the Host process that those components are finished.
– The Agent process then exits.

• Host Termination
– When all Agents have exited, the Host process informs the GDE that the job failed.
– The Host process then exits.
DML (Data Manipulation Language)
• DML provides different sets of data types, including base, compound, and user-defined data types.
A DML block defines field names and data types:

record
  decimal(4) id;
  string(10) first_name;
  string(6) last_name;
end
More DML Types

Sample data (comma delimiters; the salary field is fixed-width):
0345,01-09-02,1000.00John,Smith
0212,05-07-03, 950.00Sam,Spade
0322,17-01-00, 890.50Elvis,Jones
0492,25-12-02,1000.00Sue,West
0221,28-02-03, 500.00William,Black

record
  decimal(",") id;                  /* delimited by "," */
  date("DD-MM-YY")(",") join_date;
  decimal(7,2) salary_per_day;      /* precision 7, scale 2 */
  string(",") first_name;
  …
end
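To make the mixed delimited/fixed-width layout above concrete, here is a plain-Python sketch (not Ab Initio code) of how one line of the sample data would parse under that record format; the `parse_record` helper is hypothetical:

```python
def parse_record(line):
    """Parse one sample line per the DML record format above:
    comma-delimited id and join_date, a fixed 7-character decimal(7,2)
    salary, then a comma-delimited first_name and the rest of the line."""
    id_, rest = line.split(",", 1)                  # decimal(",") id
    join_date, rest = rest.split(",", 1)            # date("DD-MM-YY")(",")
    salary = rest[:7]                               # decimal(7,2): 7 chars
    first_name, last_name = rest[7:].split(",", 1)  # string(",") first_name
    return {"id": id_, "join_date": join_date,
            "salary_per_day": float(salary),
            "first_name": first_name,
            "last_name": last_name.rstrip("\n")}

rec = parse_record("0345,01-09-02,1000.00John,Smith")
```

Note how the fixed width of decimal(7,2) is what lets "1000.00John" be split without a delimiter between salary and name.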
Built-in Functions
Ab Initio built-in functions are DML expressions that
– can manipulate strings, dates, and numbers
Function categories
– Date functions
– Lookup functions
– Math functions
– Miscellaneous functions
– String functions
Date Functions
• date_day
• date_day_of_month
• date_day_of_week
• date_day_of_year
• date_month
• date_month_end
• date_to_int
• date_year
• datetime_add
• datetime_day
• datetime_day_of_month
• datetime_day_of_week
• datetime_day_of_year
• datetime_difference
• datetime_hour
• datetime_minute
• datetime_second
• datetime_microsecond
• datetime_month
• datetime_year
Inquiry and Error Functions
• fail_if_error
• force_error
• is_error
• is_null
• length_of
• write_to_log
• first_defined
• is_defined
• is_failure
• is_valid
• size_of
• write_to_log_file
Lookup Functions
• lookup
• lookup_count
• lookup_local
• lookup_count_local
• lookup_match
• lookup_next
• lookup_next_local
Math Functions
• ceiling
• decimal_round
• decimal_round_down
• decimal_round_up
• floor
• decimal_truncate
• math_abs
• math_acos
• math_asin
• math_atan
• math_cos
• math_cosh
• math_exp
• math_finite
• math_log
• math_log10
• math_tan
• math_pow
• math_sin
• math_sinh
• math_sqrt
• math_tanh
Miscellaneous Functions
• allocate
• ddl_name_to_dml_name
• ddl_to_dml
• hash_value
• next_in_sequence
• number_of_partitions
• printf
• random
• raw_data_concat
• raw_data_substring
• scanf_float
• scanf_int
• scanf_string
• sleep_for_microseconds
• this_partition
• translate_bytes
• unpack_nibbles
String Functions
• char_string
• decimal_lpad
• decimal_lrepad
• decimal_strip
• is_blank
• is_bzero
• re_index
• re_replace
• string_char
• string_compare
• string_concat
• string_downcase
• string_filter
• string_lpad
• string_length
• string_upcase
• string_trim
• string_substring
• re_replace_first
• string_replace_first
• string_pad
• string_ltrim
• string_lrtrim
Lookup File
• Represents one or more serial files or a multifile
• The file you want to use as a Lookup must fit into main memory
• This allows a transform function to retrieve records much more quickly than
it could retrieve them if they were stored on disk
• Lookup File associates key values with corresponding data values to index
records and retrieve them
• Lookup parameters:
– Key: Name of the key fields against which Lookup File matches its
arguments
– Record Format: The record format you want Lookup File to use
when returning data records
• Lookup functions reference Lookup Files. The first argument to these functions is the name of the Lookup File; the remaining arguments are values to be matched against the fields named by the key parameter: lookup("file-name", key-expression)
• The lookup function returns a record that matches the key values and has the format given by the Record Format parameter.
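The semantics above can be sketched with a plain Python dict (not Ab Initio code; the data and the `None`-on-no-match behavior are illustrative assumptions): the whole file is loaded into memory keyed on the lookup key, so a transform can fetch matching records without disk I/O.

```python
# Hypothetical lookup data, keyed on "custid" (the key parameter).
customers = [
    {"custid": 1, "city": "Pune"},
    {"custid": 2, "city": "Mumbai"},
]
lookup_table = {rec["custid"]: rec for rec in customers}  # whole file in memory

def lookup(table, key):
    """Mimics lookup(file, key-expression): return the matching record,
    or None when no record matches (a simplification of the real behavior)."""
    return table.get(key)
```

Building the dict once and probing it per record is what makes a Lookup File faster than re-reading a file or joining on disk.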
Using Lookup File instead of Join
Lookup File
• Storage Methods
– Serial lookup : lookup()
• whole file replicated to each partition
– Parallel lookup : lookup_local()
• file partitions held separately
• Lookup Functions
Name          Arguments   Purpose
lookup_next() File Label  Returns successive data records from a Lookup File.

NOTE: Data needs to be partitioned on the same key before using lookup_local functions.
Transform Functions : XFRs
• Transform functions direct the behavior of a transform component.
• A transform function is a named, parameterized sequence of local variable definitions, statements, and rules that computes expressions from input values and variables, and assigns the results to the output object.
• Syntax
output-var[,output-var....]::xform-name(input-var[,input-var...])=
begin
local-variable-declaration-list
Variable-list
Rule-list
end;
A transform function definition consists of:
1. A list of output variables followed by a double colon(::)
2. A name for the transform function
3. A list of input variables
4. An optional list of local variable definition
5. An optional list of local statements
6. A series of rules
The list of local variable definitions, if any, must precede the list of statements.
The list of statements, if any, must appear before the list of rules
Example:
1. temp :: trans1(in) =
   begin
     temp.sum :: 0;    /* local variable declaration with field sum */
   end;
2. out, temp :: trans2(temp, in) =
   begin
     temp.sum :: temp.sum + in.amount;
     out.city :: in.city;
     out.sum :: temp.sum;
   end;
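A plain-Python analogue of the two transforms above (not Ab Initio code) may help: trans1 initializes the temporary record, and trans2 updates the running sum and emits one output record per input record.

```python
def trans1():
    """Analogue of trans1: initialize the temp record (temp.sum :: 0)."""
    return {"sum": 0}

def trans2(temp, rec):
    """Analogue of trans2: update the running sum and build the output."""
    temp = {"sum": temp["sum"] + rec["amount"]}   # temp.sum :: temp.sum + in.amount
    out = {"city": rec["city"], "sum": temp["sum"]}
    return temp, out

# Hypothetical input records.
records = [{"city": "Pune", "amount": 100}, {"city": "Pune", "amount": 50}]
temp, outputs = trans1(), []
for rec in records:
    temp, out = trans2(temp, rec)
    outputs.append(out)
```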
Basic Components
• Filter by Expression
• Reformat
• Redefine Format
• Replicate
• Join
• Sort
• Rollup
• Aggregate
• Dedup Sorted
Reformat
1. Reads records from the in port
2. Changes the record format by dropping fields, or by using DML expressions to add fields, combine fields, or transform the data in the records
3. Writes records to the out ports if the transform function returns a success status
4. Writes records to the reject port, with a descriptive message to the error port, if the transform function returns NULL
Ports of Reformat Component
IN
– Records enter into the component from this port
OUT
– Success records are written to this port
Diagnostic Ports :
REJECT
– Input records that caused error are sent to this port
ERROR
– Associated error message is written to this port
LOG
– Logging records are sent to this port
Reformat
Parameters of Reformat Component
Reformat
Typical Limit and Ramp settings:
– Limit = 0, Ramp = 0.0: Abort on any error
– Limit = 50, Ramp = 0.0: Abort after 50 errors
– Limit = 1, Ramp = 0.01: Abort if more than about 2 in 100 records cause errors
– Limit = 1, Ramp = 1: Never abort
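The settings above all come from one formula, tolerance = limit + ramp × records read so far (it also appears on the Filter by Expression slides). A small Python sketch:

```python
def tolerance(limit, ramp, records_read):
    """Reject tolerance at any point in the run:
    tolerance = limit + ramp * number of records read so far.
    The component aborts once reject events exceed this value."""
    return limit + ramp * records_read
```

For example, with Limit = 1 and Ramp = 0.01, after 100 records the tolerance is 2, matching the "more than 2 in 100" reading above.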
• Logging: specifies whether or not you want the component to generate log records for certain events. The value of the logging parameter is True or False; the default is False.
– log_input: indicates how often the component sends an input record to its log port. For example, if you specify 100, the component sends every 100th input record to its log port.
– log_output: indicates how often the component sends an output record to its log port.
– log_reject: indicates how often the component sends a reject record to its log port.
Example of Reformat
The following is the data of the Input file :
Example of Reformat
In this example Reformat has the two transform functions, each of which writes
output to an out port
Reformat uses the following transform function to write output to out port out0:
Example of Reformat
Reformat uses the following transform function to write output to out port out1:
Example of Reformat
The graph produces Output File 0 with the following output :
Filter by Expression
1. Reads records from the in port
2. Applies the expression in the select_expr parameter to each record. If the expression returns:
   – a non-0 value: writes the record to the out port
   – 0: writes the record to the deselect port; if you do not connect the deselect port, discards the record
   – NULL: writes the record to the reject port and a descriptive error message to the error port
3. Filter by Expression stops the execution of the graph when the number of reject events exceeds the tolerance value.
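The three-way routing described for Filter by Expression can be sketched in plain Python (not Ab Initio code; `None` stands in for NULL, and the example expression is hypothetical):

```python
def filter_by_expression(records, select_expr):
    """Route each record per the select_expr result:
    non-zero -> out, 0 -> deselect, None (NULL) -> reject."""
    out, deselect, reject = [], [], []
    for rec in records:
        val = select_expr(rec)
        if val is None:
            reject.append(rec)        # NULL result: record is rejected
        elif val:
            out.append(rec)           # non-0 result: record passes
        else:
            deselect.append(rec)      # 0 result: record is deselected
    return out, deselect, reject

recs = [{"amount": 500}, {"amount": 0}, {"amount": None}]
out, deselect, reject = filter_by_expression(
    recs,
    lambda r: None if r["amount"] is None else (1 if r["amount"] > 100 else 0))
```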
Ports of Filter by Expression
IN
– Records enter the component through this port
DESELECT
– Records for which the expression returns 0 are written to this port

Tolerance value = limit + ramp * total number of records read
Typical Limit and Ramp settings:
– Limit = 0, Ramp = 0.0: Abort on any error
– Limit = 50, Ramp = 0.0: Abort after 50 errors
– Limit = 1, Ramp = 0.01: Abort if more than about 2 in 100 records cause errors
– Limit = 1, Ramp = 1: Never abort
Filter by Expression
• Logging: specifies whether or not you want the component to generate log records for certain events. The value of the logging parameter is True or False; the default is False.
– log_input: indicates how often the component sends an input record to its log port. For example, if you specify 100, the component sends every 100th input record to its log port.
– log_output: indicates how often the component sends an output record to its log port.
– log_reject: indicates how often the component sends a reject record to its log port.
Example of Filter by Expression
The following is the data of the Input file :
Example of Filter by Expression
Filter by Expression uses the following filter expression:
Redefine Format
1. Redefine Format copies data records from its input to its output without changing the values in the data records.
Parameters: None
Example of Redefine Format
Suppose the input record format is:
record
  string(10) first_name;
  string(10) last_name;
  string(30) address;
  decimal(6) postal_code;
  decimal(8.2) salary;
end
You can reduce the number of fields by specifying the output record format as:
record
  string(56) personal_info;
  decimal(8.2) salary;
end
Replicate
Combines the data records it receives into a single flow and writes a copy of that flow to each of its output flows.
Example of Replicate
Suppose you want to aggregate a flow of records and also send them to another computer; you can accomplish this by using the Replicate component.
Aggregate
1. Reads records from the in port
2. If you have defined the select parameter, applies the expression in the select parameter to each record. If the expression returns:
   – a non-0 value: processes the record
   – 0: does not process the record
   – NULL: writes a descriptive error message to the error port and stops the execution of the graph
   If you do not supply an expression for the select parameter, Aggregate processes all the records on the in port.
3. Uses the transform function to aggregate information about groups of records
4. Writes output records that contain the aggregated information to the out port
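The select-then-group behavior of Aggregate can be sketched in plain Python (not Ab Initio code; summing an "amount" field stands in for the user-supplied transform, and the field names are hypothetical):

```python
from collections import defaultdict

def aggregate(records, key, select=None):
    """Optionally filter records with select, then aggregate per key value
    (here: sum of 'amount' per key, standing in for the transform)."""
    groups = defaultdict(int)
    for rec in records:
        if select is not None and not select(rec):
            continue                      # select returned 0: skip the record
        groups[rec[key]] += rec["amount"]
    return [{key: k, "amount": v} for k, v in groups.items()]

rows = [{"custid": 1, "amount": 100}, {"custid": 1, "amount": 50},
        {"custid": 2, "amount": 70}]
result = aggregate(rows, "custid")
```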
Ports of Aggregate
IN
– Records are read from this port
OUT
– aggregated records are written to this port
Diagnostic Ports :
REJECT
– Input records that caused error are written to this
port
ERROR
– Associated error message is written to this port
LOG
– Logging records are written to this port
Aggregate
Parameters of the Aggregate component:
Sorted-input:
– Input must be sorted or grouped: Aggregate requires grouped input, and the max-core parameter is not available
– In memory: Input need not be sorted: Aggregate accepts ungrouped input, and requires the max-core parameter
Default is Input must be sorted or grouped.
Max-core: maximum memory usage in bytes
Key: name of the key field Aggregate uses to group the data records
Transform: either the name of the file containing the transform function, or the transform string
Select: filter for data records before aggregation
Reject-threshold: the component's tolerance for reject events
– Abort on first reject: the component stops the execution of the graph at the first reject event it generates
– Never abort: the component does not stop execution of the graph, no matter how many reject events it generates
– Use limit/ramp: the component uses the settings in the ramp and limit parameters to determine how many reject events to allow before it stops the execution of the graph
Aggregate
Limit: an integer that represents a number of reject events
Ramp: a real number that represents a rate of reject events in the number of records processed
Logging: specifies whether or not you want the component to generate log records for certain events. The value of the logging parameter is True or False; the default is False.
– log_input: indicates how often the component sends an input record to its log port. For example, if you specify 100, the component sends every 100th input record to its log port.
– log_output: indicates how often the component sends an output record to its log port.
– log_reject: indicates how often the component sends a reject record to its log port.
– log_intermediate: indicates how often the component sends an intermediate record to its log port.
Example of Aggregate
The following is the data of the Input File :
Example of Aggregate
The following is the record format of the Input file:
The Aggregate uses the following key specifier to sort the data.
Key
Aggregate uses the following transform function to write output.
Example of Aggregate
The following is the record format of the out port of Aggregate.
After processing, the graph produces the following Output File:
Sort
The Sort component sorts and merges data records. It:
– Reads records from all the flows connected to the in port until it reaches the number of bytes specified in the max-core parameter
– Sorts the records and writes the results to a temporary file on disk
– Repeats this procedure until it has read all the records
– Merges all the temporary files, maintaining the sort order
– Writes the result to the out port
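The sort/spill/merge loop above is a classic external merge sort. A plain-Python sketch (not Ab Initio code; for simplicity max_core is a record count here rather than a byte count, and the spilled "runs" are kept in memory instead of on disk):

```python
import heapq

def external_sort(records, key, max_core=3):
    """Buffer up to max_core records, sort each buffer into a 'run'
    (a temporary file in the real component), then merge all runs in order."""
    runs, buf = [], []
    for rec in records:
        buf.append(rec)
        if len(buf) == max_core:             # max-core reached: sort and spill
            runs.append(sorted(buf, key=key))
            buf = []
    if buf:                                  # final partial run
        runs.append(sorted(buf, key=key))
    return list(heapq.merge(*runs, key=key)) # k-way merge preserves sort order

data = [5, 3, 8, 1, 9, 2, 7]
sorted_out = external_sort(data, key=lambda x: x)
```

The merge step is why Sort can handle inputs far larger than max-core: only one buffer ever needs to fit in memory at a time.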
Sort
Parameters of the Sort component:
Join
1. Reads records from multiple input ports
2. Operates on records with matching keys using a multi-input transform function
3. Writes results to the output ports
Parameters of Join:
1. Key: names of the fields in the input records that must have matching values for Join to call the transform function
Join
Sorted-input:
– Input must be sorted: Join requires sorted input, and the maintain-order parameter is not available
– In memory: Input need not be sorted: Join accepts unsorted input, and permits the use of the maintain-order parameter
Default is Input must be sorted.
Logging: specifies whether or not you want the component to generate log records for certain events. The value of the logging parameter is True or False; the default is False.
– log_input: indicates how often the component sends an input record to its log port. For example, if you specify 100, the component sends every 100th input record to its log port.
– log_output: indicates how often the component sends an output record to its log port.
– log_reject: indicates how often the component sends a reject record to its log port.
– log_intermediate: indicates how often the component sends an intermediate record to its log port.
Join
Max-core: maximum memory usage in bytes
Transform: either the name of the file containing the transform function, or the transform string
Selectn: filter for data records before joining; one per inn port
Reject-threshold: the component's tolerance for reject events
– Abort on first reject: the component stops the execution of the graph at the first reject event it generates
– Never abort: the component does not stop execution of the graph, no matter how many reject events it generates
– Use limit/ramp: the component uses the settings in the ramp and limit parameters to determine how many reject events to allow before it stops the execution of the graph
Limit: an integer that represents a number of reject events
Ramp: a real number that represents a rate of reject events in the number of records processed
Driving: the number of the port to which you connect the driving input. The driving input is the largest input; all the other inputs are read into memory. The driving parameter is only available when the sorted-input parameter is set to In memory: Input need not be sorted. Specify the port number as the value of the driving parameter; Join reads all other inputs into memory. Default is 0.
Max-memory: maximum memory usage in bytes before Join writes temporary files to disk. Only available when the sorted-input parameter is set to Inputs must be sorted.
Join
Maintain-order: set to True to ensure that records remain in the original order of the driving input. Only available when the sorted-input parameter is set to In memory: Input need not be sorted. Default is False.
Override-keyn: alternative names for the key fields for a particular inn port.
Dedupn: set the dedupn parameter to True to remove duplicates from the corresponding inn port before joining. Default is False, which does not remove duplicates.
join-type: choose from the following
– Inner join: sets the record-requiredn parameter for all ports to True. Inner
join is the default.
– Outer join: sets the record-requiredn parameters for all ports to False.
– Explicit: allows you to set the record-requiredn parameter for each
port individually.
record-requiredn: This parameter is available only when the join-type parameter is set to Explicit. There is one record-requiredn parameter per inn port.
When there are 2 inputs, set record-requiredn to True for the input port for
which you want to call the transform for every record regardless of whether
there is a matching record on the other input port.
When there are more than 2 inputs, set record-requiredn to True when you
want to call the transform only when there are records with matching keys on
all input ports for which record-requiredn is True.
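As a rough illustration of the join-type settings above, here is a plain-Python sketch of a two-input join (not Ab Initio code; the per-port record-requiredn control is simplified to just inner vs. full outer, and the sample records are hypothetical):

```python
def join(left, right, key, join_type="inner"):
    """Two-input join on one key: 'inner' emits only matching keys;
    'outer' also emits unmatched records from either side."""
    right_by_key = {}
    for rec in right:
        right_by_key.setdefault(rec[key], []).append(rec)
    out, matched = [], set()
    for l in left:
        matches = right_by_key.get(l[key], [])
        if matches:
            matched.add(l[key])
            out.extend({**l, **r} for r in matches)   # transform: merge fields
        elif join_type == "outer":
            out.append(dict(l))                        # left side unmatched
    if join_type == "outer":
        for k, recs in right_by_key.items():
            if k not in matched:
                out.extend(dict(r) for r in recs)      # right side unmatched
    return out

left = [{"custid": 1, "name": "A"}, {"custid": 2, "name": "B"}]
right = [{"custid": 1, "amount": 100}, {"custid": 3, "amount": 50}]
inner = join(left, right, "custid")
outer = join(left, right, "custid", "outer")
```

Loading one side into a dict mirrors the "all other inputs are read into memory" behavior of the driving parameter.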
Example of Join
The following is the data of the Input File 0 :
Example of Join
The following is the data of the Input File 1:
Example of Join
The sort component uses the following key to sort the data .
Custid
Join uses the following transform function to write output.
Join uses the default value, Inner join, for the join-type parameter.
Example of Join
Rollup
Rollup performs a general aggregation of data, i.e. it reduces each group of records to a single output record.
Parameters of the Rollup component:
Sorted-input:
– Input must be sorted or grouped: Rollup requires grouped input, and the max-core parameter is not available
– In memory: Input need not be sorted: Rollup accepts ungrouped input, and requires the max-core parameter
Default is Input must be sorted or grouped.
Key-method: the method by which the component groups the records
– Use key-specifier: the component uses the key specifier
– Use key_change function: the component uses the key_change transform function
Key: names of the key fields Rollup uses to group or to define groups of data records. If the value of the key-method parameter is Use key-specifier, you must specify a value for the key parameter.
Max-core: maximum memory usage in bytes
Transform: either the name of the file containing the type and transform function, or the transform string
Check-sort: indicates whether or not to abort execution on the first input record that is out of sorted order. The default is True. This parameter is available only when the key-method parameter is Use key-specifier.
Limit: an integer that represents a number of reject events
Rollup
Ramp: a real number that represents a rate of reject events in the number of records processed
Logging: specifies whether or not you want the component to generate log records for certain events. The value of the logging parameter is True or False; the default is False.
– log_input: indicates how often the component sends an input record to its log port. For example, if you specify 100, the component sends every 100th input record to its log port.
– log_output: indicates how often the component sends an output record to its log port.
– log_reject: indicates how often the component sends a reject record to its log port.
– log_intermediate: indicates how often the component sends an intermediate record to its log port.
Reject-threshold: the component's tolerance for reject events
– Abort on first reject: the component stops the execution of the graph at the first reject event it generates
– Never abort: the component does not stop execution of the graph, no matter how many reject events it generates
– Use limit/ramp: the component uses the settings in the ramp and limit parameters to determine how many reject events to allow before it stops the execution of the graph
Rollup
The rollup transform structure:
– Initialize: done for the first record in each group
– Rollup: done for every record in each group
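The initialize/rollup structure above can be sketched in plain Python over grouped input (not Ab Initio code; the "finalize" step that emits the group's single output record, and the summed "amount" field, are illustrative assumptions):

```python
def rollup(records, key):
    """Per group of consecutive equal keys: initialize() on the first record,
    do_rollup() on every record, finalize() when the key changes."""
    def initialize():
        return {"sum": 0}
    def do_rollup(temp, rec):
        temp["sum"] += rec["amount"]
        return temp
    def finalize(temp, k):
        return {key: k, "amount": temp["sum"]}   # one output record per group

    out, temp, current = [], None, object()      # sentinel: no group open yet
    for rec in records:
        if rec[key] != current:                  # key change: close the group
            if temp is not None:
                out.append(finalize(temp, current))
            current, temp = rec[key], initialize()
        temp = do_rollup(temp, rec)
    if temp is not None:
        out.append(finalize(temp, current))
    return out

grouped = [{"custid": 1, "amount": 100}, {"custid": 1, "amount": 50},
           {"custid": 2, "amount": 70}]
summary = rollup(grouped, "custid")
```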
Dedup Sorted
Separates one specified record in each group of records from the rest of the records in that group.
Requires grouped input: Dedup Sorted reads a grouped flow of records from the in port. If your records are not already grouped, use the Sort component to group them.
It applies the expression in the select parameter to each record. If the expression returns:
– a non-0 value: processes the record
– 0: does not process the record
– NULL: writes the record to the reject port and a descriptive error message to the error port
If you do not supply an expression for the select parameter, Dedup Sorted processes all the records on the in port.
Dedup Sorted considers any consecutive records with the same key value to be in the same group.
– If a group consists of one record, Dedup Sorted writes that record to the out port.
– If a group consists of more than one record, Dedup Sorted uses the value of the keep parameter to determine:
  • which record to write to the out port
  • which record or records to write to the dup port
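The out/dup routing driven by the keep parameter can be sketched in plain Python (not Ab Initio code; grouping of consecutive equal keys matches the component's grouped-input requirement):

```python
from itertools import groupby

def dedup_sorted(records, key, keep="first"):
    """Per group of consecutive equal keys, route one record to 'out'
    according to keep, and the rest to 'dup'; unique-only keeps a record
    only when its group has exactly one member."""
    out, dup = [], []
    for _, grp in groupby(records, key=lambda r: r[key]):
        grp = list(grp)
        if keep == "first":
            out.append(grp[0]); dup.extend(grp[1:])
        elif keep == "last":
            out.append(grp[-1]); dup.extend(grp[:-1])
        elif keep == "unique-only":
            (out if len(grp) == 1 else dup).extend(grp)
    return out, dup

recs = [{"id": 1}, {"id": 1}, {"id": 2}]
out, dup = dedup_sorted(recs, "id", keep="unique-only")
```

Note that groupby only groups *consecutive* equal keys, which is exactly why the component requires sorted or grouped input.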
Ports of Dedup Sorted Component
IN
– Records enter into the component from this port
OUT
– Output records are written to this port
DUP
– Duplicate records are written to this port
Diagnostic Ports :
REJECT
– Input records that caused error are written to
this port
ERROR
– Associated error message is written to this port
LOG
– Logging records are written to this port
Dedup Sorted
Parameters of the Dedup Sorted component:
Key: name of the key field you want Dedup Sorted to use when determining groups of data records
Select: filter for records before Dedup Sorted separates duplicates
Keep: determines which record Dedup Sorted writes to the out port
– first: keeps the first record of the group. This is the default.
– last: keeps the last record of the group.
– unique-only: keeps only records with unique key values.
Dedup Sorted writes the remaining records of each group to the dup port.
Reject-threshold: the component's tolerance for reject events
– Abort on first reject: the component stops the execution of the graph at the first reject event it generates
– Never abort: the component does not stop execution of the graph, no matter how many reject events it generates
– Use limit/ramp: the component uses the settings in the ramp and limit parameters to determine how many reject events to allow before it stops the execution of the graph
Limit: an integer that represents a number of reject events
Ramp: a real number that represents a rate of reject events in the number of records processed
Check-sort: indicates whether you want processing to abort on the first record that is out of sorted order
Dedup Sorted
Logging: specifies whether or not you want the component to generate log records for certain events. The value of the logging parameter is True or False; the default is False.
– log_input: indicates how often the component sends an input record to its log port. For example, if you specify 100, the component sends every 100th input record to its log port.
– log_output: indicates how often the component sends an output record to its log port.
– log_reject: indicates how often the component sends a reject record to its log port.
Partitioning and De-partitioning
Multifile
• A global view of a set of ordinary files called partitions, usually located on different disks or systems
• Ab Initio provides shell-level utilities ("m_…" commands, etc.) for working with multifiles
• Example URL: mfile://pluto.us.com/usr/ed/mfs1/new.dat
A Multidirectory
A directory spanning partitions on different hosts:
mfile://host1/u/jo/mfs/mydir
Control directory: //host1/u1/jo/mfs (holds the <.mdir> control file)
Data partitions: //host1/vol4/pA/mydir, //host2/vol3/pB/mydir, //host3/vol7/pC/mydir
A Multifile
A file spanning partitions on different hosts:
mfile://host1/u/jo/mfs/mydir/myfile.dat
Control partition: //host1/u1/jo/mfs/mydir/myfile.dat
Data partitions: //host1/vol4/pA/mydir/myfile.dat, //host2/vol3/pB/mydir/myfile.dat, //host3/vol7/pC/mydir/myfile.dat
A Sample Multifile System
(Figure: a host node and agent nodes; the multifile consists of a control file plus partitions (serial files) held in multidirectories.)
Parallelism
Parallel Runtime Environment: some or all of the components of an application (datasets and components) run in parallel.
Forms of Parallelism:
– Component Parallelism (inherent in Ab Initio)
– Pipeline Parallelism (inherent in Ab Initio)
– Data Parallelism
Component Parallelism
When different instances of the same component run on separate data sets (e.g. two Sort components sorting different Customers files).
Pipeline Parallelism
When multiple components run simultaneously on the same data set, each working on a different record.
Data Parallelism
When data is divided into segments or partitions and processes run simultaneously on each partition. (The figure shows a multifile in global view and expanded view.)
Data parallelism
• Data parallelism scales with data and requires data partitioning
• Data can be partitioned using different partitioning methods.
• The actual way of working in a parallel runtime environment is
transparent to the application developer.
• It can be decided at runtime whether to work in serial or in parallel, as
well as to determine the degree of parallelism
Data Partitioning Components
Data can be partitioned using
• Partition by Round-robin
• Partition by Key
• Partition by Expression
• Partition by Range
• Partition by Percentage
• Broadcast
Partition by Round-robin
• Writes records to each partition evenly
• Block-size records go into one partition before moving on
to the next.
(Figure: records 1–6 are dealt in turn: Record1 and Record4 to Partition 1, Record2 and Record5 to Partition 2, Record3 and Record6 to Partition 3.)
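The round-robin dealing described above is easy to sketch in plain Python (not Ab Initio code):

```python
def round_robin(records, n_partitions, block_size=1):
    """Deal block_size records at a time to each partition in turn,
    giving an even distribution regardless of record contents."""
    parts = [[] for _ in range(n_partitions)]
    for i, rec in enumerate(records):
        parts[(i // block_size) % n_partitions].append(rec)
    return parts

parts = round_robin([1, 2, 3, 4, 5, 6], 3)
```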
Partition by Key
• Distributes data records to its output flow partitions according
to key values
Example with key % 3:
100 % 3 = 1, 91 % 3 = 1, 57 % 3 = 0, 25 % 3 = 1, 122 % 3 = 2, 213 % 3 = 0
Partition 0: 57, 213; Partition 1: 100, 91, 25; Partition 2: 122
• Data may not be evenly distributed across partitions
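The key % 3 example above maps directly to code; a plain-Python sketch (not Ab Initio code; the real component hashes arbitrary key fields, while this uses integer keys directly):

```python
def partition_by_key(keys, n_partitions):
    """Send each key to partition (key % n_partitions), as in the
    slide's % 3 example; skew in key values causes skewed partitions."""
    parts = [[] for _ in range(n_partitions)]
    for key in keys:
        parts[key % n_partitions].append(key)
    return parts

parts = partition_by_key([100, 91, 57, 25, 122, 213], 3)
```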
Partition by Expression
• Distributes data records to partitions according to DML
expression values
DML expression: value / 40 (integer division)
99 / 40 = 2, 91 / 40 = 2, 57 / 40 = 1, 25 / 40 = 0, 22 / 40 = 0, 73 / 40 = 1
Partition 0: 25, 22; Partition 1: 57, 73; Partition 2: 99, 91
• Does not guarantee even distribution across partitions
• Cascaded Filter by Expressions can be avoided
Broadcast
• Combines all the data records it receives into a single flow
• Writes a copy of that flow to each output partition
(Figure: every record of the input flow appears in each of Partition0, Partition1, and Partition2.)
Partition by Range
• Distributes data records to its output flow partitions according to
the ranges of key values specified for each partition.
• Typically used in conjunction with Find Splitter component
for better load balancing
Partition by Range
(Figure: input keys 76, 10, 17, 9, 45, 2, 84, 98, 29, 73 are distributed into 3 partitions by key range; Num_Partitions = 3.)
Summary of Partitioning Methods

Method                   Key-Based   Balancing                        Uses
Round-robin              No          Even                             Record-independent parallelism
Partition by Key         Yes         Depends on the key value         Key-dependent parallelism
Partition by Expression  Yes         Depends on data and expression   Application specific
Broadcast                No          Even                             Record-independent parallelism
Partition by Range       Yes         Depends on splitters             Key-dependent parallelism, global ordering
Departitioning
• Gather
• Concatenate
• Merge
• Interleave
Gather
Concatenate
Interleave
Departitioning
• Summary of Departitioning Methods

Method   Key-Based   Ordering    Uses
Merge    Yes         Sorted      Creating ordered serial flows
Gather   No          Arbitrary   Unordered departitioning
Case Studies
Case Study 1
In a shop, the customer file contains the following fields:
Cust_id amount
215657 1000
462310 1500
462310 2000
215657 2500
462310 5500
215657 4500
Develop the Ab Initio graph which will do the following: take the first three records of each Cust_id and sum the amounts. The output file is as follows:
Case Study 2
Consider the following BP_PRODUCT file , containing the
following fields :
Field Name Data Type Length/Delimiter Format/Mask
product_id Decimal “|”(pipe) None
product_code String “|”(pipe) None
plan_details_id Decimal “|”(pipe) None
plan_id Decimal “|”(pipe) None
Here are some sample data for the BP_PRODUCT file:
product_id   product_code   plan_details_id   plan_id
• Sample data of the file
The output file will contain the following:
Queries???
Thank You!!!