Ab Initio Introduction
======================
1. Input File
2. Output File
3. Reformat
4. Join
5. Rollup
6. Normalize
7. Scan
8. Filter by Expression
9. Dedup Sorted
10. Sort
11. Gather
12. Merge
13. Interleave
14. PBE (Partition by Expression)
15. PBK (Partition by Key)
16. Partition by Round-robin
17. Replicate
18. Lookup Utilities
19. Buffered Copy
====================
1. Input File : Purpose
INPUT FILE represents records read as input to a graph from one or more serial
files or from a multifile.
You can also use INPUT FILE to read files directly from a Hadoop Distributed File
System.
For more information about using the Co>Operating System with Hadoop, see the Hadoop
Integration Guide.
You can also use INPUT FILE to read files directly from Amazon S3 and Google Cloud
Storage. For more information, see “URLs for cloud storage”.
INPUT FILE is not a phased component. The phase number it displays in the GDE
refers to the phase number of the component that reads from it. (If the INPUT FILE
is read in more than one phase, the phases are shown as a range of phase numbers.)
For more information, see “Phases and checkpoints” in the Graph Developer’s Guide.
===================
2. Output File : Purpose
OUTPUT FILE represents records written as output from a graph into one or more
serial files or a multifile.
You can use OUTPUT FILE to write records to a Hadoop Distributed File System
(HDFS). In this case, the file is always written in serial. For more information
about using the Co>Operating System with Hadoop, see the Hadoop Integration Guide.
You can also use OUTPUT FILE to write files directly to Amazon S3 and Google Cloud
Storage. For more information, see “URLs for cloud storage”.
OUTPUT FILE is not a phased component. The phase number it displays in the GDE
refers to the phase number of the component that writes to it. For more information,
see “Phases and checkpoints” in the Graph Developer’s Guide.
==================
3. Reformat : Purpose
REFORMAT changes the record format of records by dropping fields, or by using DML
expressions to add fields, combine fields, or transform the data in the records.
REFORMAT performs an implicit reformat when you do not define a reformat function.
For more information, see “Implicit reformat” in the Graph Developer’s Guide.
(Note: you cannot use aggregation functions in REFORMAT.)
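For example, a minimal reformat transform function in DML might look like the
following sketch (the field names are illustrative, not from any real record
format):

    out :: reformat(in) =
    begin
      // Copy the key field through unchanged.
      out.cust_id   :: in.cust_id;
      // Combine two input fields into one output field.
      out.full_name :: string_concat(in.first_name, " ", in.last_name);
      // Derive a new field with a conditional expression.
      out.status    :: if (in.balance < 0) "OVERDRAWN" else "OK";
    end;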
==================
4. Join : Purpose
JOIN reads data from two or more input ports, combines records with matching keys
according to the transform you specify,
and sends the transformed records to the output port. Additional ports allow you
to collect rejected and unused records.
# Full outer join: set the join-type parameter to "Explicit" with
  record-required0: false and record-required1: false,
  or simply set join-type to "Full outer join".
# Inner join: set join-type to "Inner join" (the default); only records with
  matching key values on all inputs are sent to the output port.
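As a sketch, a join transform in DML combines fields from both inputs into one
output record (cust_id as the key, and all field names here, are illustrative):

    out :: join(in0, in1) =
    begin
      // For a full outer join, either input record can be NULL,
      // so take the key from whichever side is present.
      out.cust_id   :: first_defined(in0.cust_id, in1.cust_id);
      out.cust_name :: in0.cust_name;   // NULL for records found only on in1
      out.order_amt :: in1.order_amt;   // NULL for records found only on in0
    end;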
==============
5. Rollup : Purpose
ROLLUP processes groups of input records that have the same key, generating one
output record for each group. Typically, the output record summarizes or aggregates
the data in some way; for example, a simple ROLLUP might calculate a sum or average
of one or more input fields. ROLLUP can select certain information from each group;
for example, it might output the largest value in a field, or accumulate a vector
of values that conform to specific criteria.
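For example, a template-mode rollup keyed on cust_id might aggregate each group
as follows (field names are illustrative):

    out :: rollup(in) =
    begin
      out.cust_id   :: in.cust_id;        // the group key
      out.total_amt :: sum(in.amount);    // sum over the group
      out.max_amt   :: max(in.amount);    // largest value in the group
      out.txn_count :: count(1);          // number of records in the group
    end;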
===============
6. Normalize : Purpose
NORMALIZE generates multiple output records from each of its input records. You can
directly specify the number of output records for each input record, or you can
make the number of output records dependent on a calculation.
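A minimal sketch of a normalize transform package, assuming each input record
carries a vector field named items and a count field named num_items (both
illustrative):

    // Returns how many output records to generate for this input record.
    out :: length(in) =
    begin
      out :: in.num_items;
    end;

    // Called once per output record; index runs from 0 to length - 1.
    out :: normalize(in, index) =
    begin
      out.order_id :: in.order_id;
      out.item     :: in.items[index];
    end;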
==============
7. Scan : Purpose
For every input record, SCAN generates an output record that consists of a running
cumulative summary for the group to which the input record belongs, up to and
including the current record. For example, the output records might include
successive year-to-date totals for groups of records.
SCAN is similar to ROLLUP. The difference between the two is that SCAN produces one
output record for each input record, while ROLLUP produces one output record for
each key group. (SCAN works like a cumulative ROLLUP: the group of records being
aggregated grows with each successive input record, until a key change occurs.)
You can use a SCAN component in two modes, depending on how you define the
transform parameter:
##Template mode — You define a simple scan function that typically includes
aggregation functions. For more information, see “Using SCAN in template mode”.
##Expanded mode — You create a transform using an expanded scan package. This mode
allows for scans that do not necessarily use regular aggregation functions. For
more information, see “Using SCAN in expanded mode”.
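For example, a template-mode scan looks much like a rollup, but each input record
yields an output record containing the running aggregate so far (field names are
illustrative):

    out :: scan(in) =
    begin
      out.cust_id    :: in.cust_id;
      // Running total for the current key group, up to and
      // including the current record.
      out.ytd_amount :: sum(in.amount);
    end;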
====================
8. Filter by Expression : Purpose (analogous to the WHERE clause in SQL)
FILTER BY EXPRESSION filters records according to a DML expression (the select_expr
parameter), which specifies the selection criteria.
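For example, a select_expr such as the following keeps only the records that
satisfy the condition (field names are illustrative):

    amount > 1000 and state == "CA"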
======================
9. Dedup Sorted : Purpose (analogous to SQL’s DISTINCT)
DEDUP SORTED separates one specified record in each group of records from the rest
of the records in the group.
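The keep parameter controls which record in each group is retained; for example
(the key field is illustrative):

    key:  {cust_id}
    keep: first        (alternatives: last, or unique-only, which drops
                        every group containing more than one record)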
====================
10. Sort : Purpose
SORT sorts and merges records. You can use SORT to order records before you send
them to a component that requires grouped or sorted records.
The in-port of the SORT component accepts fan-in and all-to-all flows (partitioned
data). SORT automatically performs a gather on its in port, so there’s no need to
gather data before sorting it.
Sort stability is not guaranteed: records with identical key values may not
maintain their relative order after being sorted.
If the sort key field contains NULL values, the NULL records are listed first with
ascending sort order and last with descending sort order.
Although it lacks a reformat transform function, SORT supports implicit reformat.
For more information, see “Implicit reformat” in the Graph Developer’s Guide.
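A key specifier for SORT can name several fields, each with an optional direction;
for example (field names are illustrative):

    {cust_id; transaction_date descending}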
=====================
11. Gather : Purpose
GATHER combines records arriving on multiple flows or flow partitions arbitrarily;
it does not maintain sort order. Use GATHER to reduce data parallelism or to
combine flows.
======================
12. Merge : Purpose
MERGE combines records from multiple flows or flow partitions that have been sorted
according to the same key specifier, and maintains the sort order.
MERGE requires sorted input data, but never sorts data itself.
=======================
13. Interleave : Purpose
INTERLEAVE combines blocks of records from multiple flow partitions in round-robin
fashion.
==================
14. Partition by Expression : Purpose
PARTITION BY EXPRESSION distributes records to its output flow partitions according
to the value of a DML expression you supply.
The output port for PARTITION BY EXPRESSION is ordered. (See “Ordered ports” in the
Graph Developer’s Guide.) Although you can use fan-out flows on the outport, we do
not recommend connecting multiple fan-out flows. You may connect a single fan-out
flow; or, preferably, limit yourself to straight flows on the outport.
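For example, an expression as simple as a field name can serve as the partitioning
expression; its value, taken modulo the number of output partitions, selects the
destination partition (the field name is illustrative):

    region_code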
==================
15. Partition by Key : Purpose
PARTITION BY KEY distributes records to its output flow partitions according to key
values.
How PARTITION BY KEY interprets key values depends on the internal representation
of the key. For example, the number 4 in a field of type integer(2) is not
considered identical to the number 4 in a field of type decimal(4).
===================
16. Partition by Round-robin : Purpose
PARTITION BY ROUND-ROBIN distributes blocks of records evenly to each output flow
in round-robin fashion.
===================
17. Replicate : Purpose
REPLICATE arbitrarily combines all records it receives into a single flow and
writes a copy of that flow to each of its output flows. Use REPLICATE to support
component parallelism — such as when you want to perform more than one operation on
a flow of records coming from an active component.
The input port of REPLICATE has an implicit gather. If this port receives more than
one input stream, REPLICATE does not preserve sort order.
REPLICATE is not required when you want to read data from an INPUT FILE component
on multiple flows. Rather, you can connect multiple flows directly to the INPUT
FILE. The downstream components each read the data independently, which sometimes
results in improved graph performance.
=====================
18. Lookup : From a lookup file, returns the first record that matches a specified
expression.
NOTE:
Errors are generated if calls to lookup and lookup_local are made to the same file
from a single component. To avoid this error, use either the lookup or lookup_local
function, but not both, to call a single file from a single component.
When you are looking up data statically, this function returns the first record
that matches the values of expr from the lookup file lookup_file. When you are
looking up data dynamically, it returns the first record that matches the values of
expr from the lookup file referenced by lookup_id or the lookup_template_type
handle. If no matching record exists, the function returns NULL.
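As a sketch, a transform might enrich each record from a lookup file like this
(the file label "Customers" and the field names are illustrative):

    out :: reformat(in) =
    begin
      // Use the first matching lookup record if one exists;
      // otherwise fall back to a default value.
      out.cust_name :: if (lookup_match("Customers", in.cust_id))
                         lookup("Customers", in.cust_id).cust_name
                       else "UNKNOWN";
    end;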
============================
19. Buffered Copy : Purpose
BUFFERED COPY regulates the flow of records in a graph by buffering data when the
downstream components experience a slowdown. This allows upstream components to
continue working while the downstream component is blocked.
============================
1. Ab Initio Introduction:
Ab Initio is Latin for “from the beginning”. The Ab Initio tool is a
fourth-generation data analysis, data manipulation, batch-processing, and graphical
user interface (GUI)-based parallel processing product used to Extract, Transform,
and Load (ETL) data. Ab Initio software is a widely used business intelligence data
processing platform. The Ab Initio Component Library is a set of reusable software
modules for sorting, data transformation, and high-speed database loading and
unloading.
2. Co>Operating System:
This operates on top of the operating system and is the foundation for all Ab Initio
processes.
4. Enterprise Meta>Environment (EME):
This is an Ab Initio environment and repository used for storing and managing
metadata.
It has the capability to store both technical and business metadata. EME metadata
can be accessed from a web browser,
the Ab Initio GDE, and the Co>Operating System command line.
The Enterprise Meta>Environment® (EME®) product is an enterprise data catalog and
metadata management solution.
It provides broad capabilities for all metadata created and used by the
Co>Operating System product and a large number of third-party products and
databases (such as Snowflake, Teradata, DB2, and Oracle).
5. Conduct>It:
This is an environment used for creating Ab Initio data integration systems.
The prime focus of this environment is to create a special type of graph called an
Ab Initio plan.
Ab Initio provides both a command-line and a graphical interface to Conduct>It.
6. Data Profiler:
7. Control Center:
8. Metadata Hub:
The Ab Initio Metadata Hub acts as the data governance component of Ab Initio’s data
management platform.
It can be used as either a system of record or a system of reference; it is able to
govern technical, business,
and even logical assets, and provides both business and technical lineage.
The Metadata Hub is used for handling the interchange and distribution of technical
metadata between decision-processing products.
It is designed for use primarily by technical staff during the development and
maintenance of data warehouses.
9. Express>It:
10. Continuous Flow:
11. Query>It:
The Query>It® product enables large amounts of data to be stored and then queried
via SQL.
The data can be stored in a wide range of file systems, such as S3, Hadoop (HDFS),
or Ab Initio’s own highly efficient parallel and indexed file system (ICFF).
In addition to supporting SQL query access to these data stores, the Query>It
product can also reference data held in third-party databases
(such as Snowflake, Teradata, DB2, and Oracle), allowing federated, push-down query
access to heterogeneous data.
13. Sandbox:
A sandbox is a special directory that contains specific subdirectories for storing
Ab Initio graphs and related files.
The sandbox is the user’s work area.