Ab Initio Preparation

Explain about your project

Data Warehousing

Data Warehousing is a process of collecting, storing, and managing data from different sources to
provide meaningful business insights.

It is a database that stores both current and historical data and is used for reporting and analysis
purposes.

A collection of data marts forms a data warehouse.

Datamart

A data mart is a subset of the data warehouse.

Data Modelling

Data Modelling is the diagrammatic representation showing how the entities are related to each other.
It is the initial step towards database design. We first create the conceptual model, then the logical
model and finally move to the physical model.

Different data models

Conceptual Model will be showing entity names and entity relationships.

Logical Model will be showing entity names, entity relationships, attributes, primary keys and foreign
keys in each entity.

Physical Data Model will be showing primary keys, foreign keys, table names, column names and column
data types.

What are the different design schemas in Data Modelling?

Star Schema: we have a fact table in the center that references multiple dimension tables around it. All
the dimension tables are connected to the fact table.

The primary key in all dimension tables acts as a foreign key in the fact table.

The star schema is quite simple, flexible and it is in de-normalized form.

Snowflake Schema: the level of normalization increases. The fact table here remains the same as in the
star schema. However, the dimension tables are normalized.

Different ETL Tools:

Ab Initio, Informatica, DataStage, Talend, SSIS, Pentaho and OWB.

Ab Initio

The name comes from Latin and means "from the beginning".

It is designed to solve simple to complex business problems in an efficient manner, from the beginning.
It is a secure ETL tool used to extract, transform and load data.

It is used for data analysis, data manipulation, batch processing, and graphical user interface based
parallel processing

It is mainly used to apply transformation rules on data as per business requirements and feed all
business units in the required format.

It is a powerful tool that can handle large volumes of data.

It is used by many organizations to integrate data from different sources.

We will be responsible for designing and developing DW and ETL solutions

Parallelism makes this tool stronger

Scalability – Even if there is an increase in data volume, there won't be much impact to the
performance.

Performance-wise, it is one of the top ETL tools.

Integrated Framework for solving DW/DM creation projects.

Disadvantages: license cost; it is proprietary software.

Co>Operating System (Server Based Application)

It is the heart of Ab Initio.

The Co>Operating System is a high-performance parallel processing framework that is optimized for
data integration and processing.

Co>Operating System is an environment for building, integrating, running and deploying large enterprise
business data applications. It is the foundation for all Ab Initio's technologies and provides a complete
and seamless application development and execution environment.

It operates on top of the native operating system and is the base for all Ab Initio processes. It provides
additional features known as air commands, and it can be installed on a variety of system environments
such as Unix, HP-UX, Linux, IBM AIX and Windows.

It handles the actual execution of Ab Initio jobs: creating the processes, scheduling the processes, file
management and checkpointing.

Co>operating System Features or Services

Parallel and distributed application execution

Manage and execute Ab Initio graphs and control the ETL processes

Metadata management and interaction with the EME

ETL processes Monitoring and Debugging

Graphical Development Environment (GDE)-(User Based Application)


GDE is a GUI for building applications in Ab Initio; it connects to the Co>Operating System using several
protocols such as Telnet, Rexec, SSH, DCOM and FTP (for file transfer).

It is an IDE (Integrated Development Environment) that enables creation of applications by dragging and
dropping components onto a canvas and configuring them using point and click operations.

These applications are then executed by Ab Initio Co-Operating System.

It is an end user environment where we can develop the graph, Run the Graph and check the status of
Graph.

It is used to develop Ab Initio jobs with the help of components, and to debug them.

Enterprise Meta Environment (EME) (Server Side)

EME is a high-performance object-oriented data storage system that version controls and manages
various kinds of information associated with Ab Initio applications, which may range from design
information to operational data

It is a central repository, which contains data about data.

EME is used for storing graphs and managing metadata (graphs and their associated files with respect to
version).

Technical metadata – application-related business rules, record formats and execution statistics.

Business metadata – user-defined documentation of job functions, roles and responsibilities.

EME metadata can also be accessed from the Ab Initio GDE, internet browser or Ab Initio Co>Op
command line (Air commands).

Check-in, Checkout will be part of EME.

Sandbox (User side)

Sandbox is a collection of various directories like bin, mp, dml, run etc which contains the metadata
(Graphs and their associated files).

Sandbox can be a file system copy of a data-store project.

In the graph, instead of specifying the entire path for any file location, we specify only the sandbox
parameter variable.

Standard Environment

Builds the basic infrastructure and environment for running Ab Initio applications.

Contains the SERIAL and MFS locations, enterprise level parameters and values that will be used by
private projects.

Every project includes the stdenv project and inherits the parameters defined there.

Checkout
First-time checkout – the sandbox is created (the project is copied from its EME path to the sandbox project path).

On subsequent checkouts, updated object versions are copied from the EME to the sandbox.

Project-level checkout: all new object versions are copied from the EME to the sandbox.

Object-level checkout: the updated version of that object is copied from the EME to the sandbox.

Checkin

Once the development and testing of the code are complete, we check it in. The code moves to the EME
repository, and a version is created for it.

What happens when you push the “Run” button?

Graph is translated into a script that can be executed in the Shell Development Environment.

This script and any metadata files stored on the GDE client machine are shipped (via FTP) to the server.
The script is invoked (via SSH or REXEC or TELNET) on the server.

The script creates and runs a job that may run across many nodes.

Monitoring information is sent back to the GDE client.

Host Process Creation:

– Pushing “Run” button generates script.

– Script is transmitted to Host node.

– Script is invoked, creating Host process.

Agent Process Creation

– Host process produces Agent processes.

Component Process Creation

Agent processes create Component processes on each processing node.

Component Execution

Component processes do their jobs.

Component processes communicate directly with datasets and each other to move data around.

Successful Component Termination

As each Component process finishes with its data, it exits with a success status.

Agent Termination

When all of an Agent's Component processes exit, the Agent informs the Host process that those
components are finished. The Agent process then exits.

Host Termination
When all Agents have exited, the Host process informs the GDE that the job is complete.

The Host process then exits.

How do you connect to your database?

Source and destination tables are stored in a database server. To extract data from or load data into the
database server from the Ab Initio Co>Operating System, we need to configure a connection between them.

Input table component -> config -> new -> edit database configuration

Contains Database configuration information

database Server Name

database Version

db_home

db_name

db_nodes

db_character_set

user

password

Close editor and save in Sandbox -> db folder -> dbc file

Command to create the dbc file:

m_db gencfg <database_name> <filename.dbc>

To test any database connectivity (using dbc file)

m_db test filename.dbc

To test any database connectivity (using database component in GDE)

I/p table -> config -> test

Parameter

A parameter specifies a value controlling some detail of an object such as a graph or component.

A parameter has two parts: 1. Declaration 2. Definition

Parameters provide the business logic

Parameter Set

The collection of all the parameters defined on an object (e.g. graph) is called its parameter set. By
changing the parameter set's values, the object’s behavior can be controlled.

Upon running the graph which has graph parameters set in, a parameter setting window pops up
showing the input parameters and local parameters declared and defined for that particular graph.

One graph can have multiple psets defined over it, with the same parameter names (declarations) but
different definitions.

Instead of hard-coding specific values directly into the application logic, We can use parameter sets to
easily switch between different configurations or environments. This makes it easier to manage and
maintain Ab Initio applications, especially when dealing with different data sources, target systems, or
processing requirements.

The source graph associated with pset can be changed to another graph if required. Upon opening the
pset, change the source graph by going to

Edit (tab) --> change source --> new source graph name

What is a Graph Parameter?

Local Parameter: we need to initialize the value at the time of declaration.

Formal Parameter / Input Parameter: we don't need to initialize the value; the graph prompts for it at the
time of running the graph.

Graph parameters are evaluated after sandbox parameters are set.

Graph Parameter: scope of usage is that graph.

Component Parameter: scope of usage is that component.

Sandbox Parameter: scope of usage is all graphs of the sandbox.
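A minimal illustration of how such parameters are typically used (the file name here is hypothetical): instead of hard-coding a path in an Input File component, the URL is written in terms of a sandbox parameter, for example

$AI_SERIAL/customer_daily.dat

rather than /prod/data/serial/customer_daily.dat. Because each environment's sandbox defines its own value for AI_SERIAL, the same graph can be promoted from Dev to SIT, UAT and Production without changing the graph itself.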

Software Development Life Cycle:

Dev – SIT - UAT - OAT – Production

SIT – System Integration Testing

UAT – User Acceptance Testing

OAT – Operational Acceptance Testing

How to Improve Performance of graphs in Ab initio?

How would you do performance tuning for an already built graph? Can you give some examples?

Instead of reading many small files one at a time, we can use "Read Multiple Files" once.

Connect multiple flows directly to an Input File component; a Replicate after the input file is not required
just to feed multiple flows.

Filter the data as early as possible. (Drop unwanted data fields/ data records early in the graph)

Avoid using Sort component multiple times. The “SORT” component breaks pipeline parallelism.

Avoid placing a Sort directly in front of a Merge component; Merge expects its inputs to be already sorted
and simply maintains that order.

Sort the data in parallel by using Partition by Key and Sort, rather than sorting it serially

Phases are used to break up a graph into blocks for performance tuning.

Avoid using too many phases.

Use phases with checkpoints when we have large volumes of data to be processed.

Avoid wrong placement of phase breaks: putting a phase break just before a "Filter by Expression" is a bad
idea, since the data volume may be reduced after that component (place the break after it instead).

Never put a checkpoint or a phase break right after a Sort; use a checkpointed Sort instead.

Never put a checkpoint or a phase after the Replicate component.

Use lookup_local rather than lookup when we have large volumes of data to be processed.

Use a lookup instead of a Join/Merge component when the lookup side is small enough to fit in memory.

Maintain lookups for better efficiency

Use partitioning in the graph / Avoid departition of data unnecessarily

Partition the data as early as possible and departition the data as late as possible.

Use reformat with multiple output ports instead of replicating and using multiple reformats. Use
Reformat like Replicate component

Use gather instead of concatenate

Avoid Broadcast as partitioner when we have large volumes of data to be processed.

Avoid a separate Filter by Expression; instead use the select expression in Reformat/Join/Rollup.

For the sorted-input parameter of Rollup/Join/Scan components, prefer the "Input must be sorted" option
(in-memory option false) when the input is already sorted.

Use data parallel processing wherever possible. Make a graph parallel as early as possible and keep it
parallel as long as possible.

Handle data quality issues as early as possible, since that reduces unnecessary data downstream. Do not
spread data quality rules throughout the graph unless necessary.
Maxcore

It is a parameter that specifies the maximum amount of memory a component can use in memory; beyond
that, it spills to disk.

Max memory: the overall in-memory data usage across the components.

Default max_core values:

Join – 64 MB

Rollup – 64 MB

Scan – 64 MB

Sort – 100 MB

Sort within Groups – 10 MB

Why does the Sort component divide data into 1/3 of the Max core?

Sort Component: Orders data according to key specifier.

It is an efficient way to sort large amounts of data. The sort component sorts data in memory and then
writes it to disk. By dividing the data into 1/3 of max core, it ensures that there is enough memory
available for sorting and writing.

Sort max_core is divided into 3 parts:

One part is a buffer,

Another part holds the data,

And the third part is working memory for the sort algorithm.

The default Sort max_core value is 100 MB.
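As a rough worked example: with the default max_core of 100 MB, each of the three parts gets on the order of 100 / 3 ≈ 33 MB, and any data that does not fit in the in-memory portion is written to temporary files on disk and merged afterwards.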

A Sort component is used before Dedup Sorted, Rollup, Scan and Merge components (which expect sorted input) for performance tuning.

Dedup Sorted Component

Separates one specified data record in each group of data records from the rest of the records in the group.

To remove duplicates, the Dedup Sorted component requires sorted input.

keep parameter: first, last, or unique-only

reject-threshold parameter: Abort on first reject, Never abort, or Use limit/ramp

Normalize Component:

It is used to split a single record into N records, where N is given by the length function defined in the
Normalize transform.

Always clean and validate data before normalizing it, because Normalize uses a multistage transform.

Filter by Expression
Filter data records according to specified DML expression

use_package

reject-threshold parameter

(Default) Abort on first reject — the component stops the execution of the graph at the first reject event
it generates.
Never abort — the component does not stop the execution of the graph, no matter how many reject
events it generates.
Use ramp/limit — the component uses the settings in the ramp and limit parameters to determine how
many reject events to allow before it stops the execution of the graph.

What is meant by limit and ramp in Ab-Initio?

The limit and ramp are the variables used to set the reject tolerance for a particular graph. This is one of
the options for the reject-threshold property; the limit and ramp values must be supplied when this option
is enabled. The graph stops execution when the number of rejected records exceeds the value of the formula
limit + (ramp * no_of_records_processed). The default values are 0.0. The limit parameter is an integer
that represents a number of reject events. The ramp parameter is a real number that represents the rate of
reject events per record processed.

Tolerance value=limit + ramp*total number of records read
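For example (illustrative numbers): with limit = 10 and ramp = 0.01, after 5,000 records have been read the component tolerates up to 10 + 0.01 * 5000 = 60 reject events; the 61st reject stops the graph.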

Redefine format

It copies data records from its input to its output without changing the values in the data records.

Redefine is used to rename the fields in the DML.

Reformat Component

To convert data from one format to another format we use reformat component

Transforming the Data (Business Rules / Mapping) we can write multiple transformations

Change DML

Add fields / Drop Fields

Filter Condition / Split the condition

Use it like Replicate component

Convert field types (for example, decimal to string).

Parameters: Count, Transform, Select, Reject Threshold, Logging

Count 1 – Default (Transform 0)

Count 2 – Transform 1

Count 3 – Transform 2
Reject threshold – abort on first reject, never abort, Limit/Ramp

Output Index and Output Indexes

output_index and output_indexes are two transform functions available in the Reformat component in Ab Initio.

They are useful when the count parameter is set to a number greater than 1.

With the output_index function, we can direct the current input record to a single output port:

out :: output_index(in) =
begin
  out :: if (in.region == "USA") 0
         else if (in.region == "Australia") 1
         else if (in.region == "UK") 2;
end;

With the output_indexes function, we can direct the current input record to more than one output port:

out :: output_indexes(in) =
begin
  out :: if (in.region == "USA") [vector 0]
         else if (in.region == "Australia") [vector 1, 2]
         else if (in.region == "UK") [vector 0, 2];
end;

Rollup Component

It is used to group the records based on certain field values.

Template mode: simple summarization using built-in aggregation functions.

Expandable mode: It is a multi-stage transform function which contains functions like

Initialize, Rollup (Computation), Finalize
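A minimal template-mode sketch (field names are hypothetical), grouping on cust_id and producing one summary record per customer:

out :: rollup(in) =
begin
  out.cust_id      :: in.cust_id;
  out.total_amount :: sum(in.amount);
  out.txn_count    :: count(1);
end;

Here sum() and count() are built-in aggregation functions; the Rollup component's key would be set to {cust_id}, and the input is expected to be sorted or grouped on that key (or the in-memory option used).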

Scan Component
Scan is a transform component that generates a series of cumulative summary records for groups of
data records.

The Scan component requires its input records to be grouped (sorted) on the key.

What is the difference between rollup and scan?


The main difference between Scan and Rollup is that Scan generates cumulative result and Rollup
summarizes.

Difference between Aggregate and Rollup

Both are used to summarize the data.

Rollup is much better and convenient to use

Rollup can perform some additional functionality, like input filtering and output filtering of records.
Aggregate does not display the intermediate results in main memory, whereas Rollup can.

Rollup provides more control over record selection, grouping, and aggregation.

Aggregate provides less control

How do you add default rules in transformer?


Double-click the Transform component, click on Parameters and choose the transform parameter.

Then click the Edit option; there we can see the Add Default Rules option. Clicking Add Default Rules opens
the Build Transform Using dialog box.

1. Match Names: generates a set of rules that copy input fields to output fields with the same names.

2. Wildcard (.*) Rule: Generates one rule that copies input fields to output fields with the same name.

Join Component

To join multiple input dataset based on a common key value.

Inner Join: 'record_required' parameter is true for all ports.

Full Outer Join: 'record_required' parameter is false for all ports.

Left Outer Join: 'record_required' is true for the left (in0) port and false for the right (in1) port.

Right Outer Join: 'record_required' is true for the right port and false for the left port.

In Join, the key parameter must be specified, and (for a sorted-input join) the input flows must be sorted on
that key in ascending or descending order.
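A minimal join transform sketch (table and field names are hypothetical), joining employees (in0) to departments (in1) on dept_id:

out :: join(in0, in1) =
begin
  out.emp_id    :: in0.emp_id;
  out.emp_name  :: in0.emp_name;
  out.dept_name :: in1.dept_name;
end;

With the key set to {dept_id} and record_required true on both ports, only employees whose dept_id finds a match in the departments flow appear on the out port.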

What is Semi-Join / Explicit Join?

We put 'record_required' as true for the required component and false for other components.

What is the purpose of override key parameters in the join component?

We need to join 2 fields which have different field names.

Then we can use overridekey0 to specify what is the field name in in0

and overridekey1 to specify what is the field name in in1.

Suppose I have two files, FILE1 and FILE2.

FILE1 has the key "emp_id" and FILE2 has the key "emp_code". After connecting these two to the Join
component, it highlights the override-key option, where we decide which key overrides which, i.e.
"override key0" and "override key1":

"emp_id" as the key for override key0
"emp_code" as the key for override key1

We are matching the key of FILE1 with the key of FILE2, so the join works properly.
Broadcast versus Replicate component.

Broadcast and Replicate are very similar components in that they both copy each input record to
multiple downstream components.

Broadcast (partition component) is used for Data parallelism. It is a fan-out or All-To-All flow

Replicate is used for component parallelism. It is a straight flow.

Broadcast sends all of the data on its input flow to every partition of the output flow.

Layout

Defines where each component will run.

For program components, there is a Layout tab within the Properties dialog.

It is the working directory for a component.

Serial layout: ".work serial" – AI_SERIAL (serial files location)

Multifile / parallel layout: ".work parallel" – AI_MFS (multifile location)

Parallelism

Parallelism is the simultaneous performance of multiple operations.

Component Parallelism: multiple components of an application run on the system simultaneously, each on
separate data. This is achieved through component-level parallel processing.

Pipeline Parallelism: multiple connected components work on the same dataset at the same time, each
processing different records as they flow through. This uses pipeline parallelism.

Data Parallelism: data is split into segments (partitions) and the same operation runs on each segment
simultaneously. This kind of processing is achieved using data parallelism.

Explain how you can run a graph infinitely in Ab initio?

To execute a graph infinitely, the graph end script should call the .ksh file of the graph. Therefore, if the
graph name is abc.mp, the end script of the graph should call abc.ksh. This will run the graph indefinitely.

End script – This script runs after the graph has completed running.
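A minimal sketch of the idea (the sandbox parameter name and graph name are hypothetical): if the deployed script of abc.mp is abc.ksh, the graph end script could contain a line such as

nohup $AI_RUN/abc.ksh &

so that a fresh run of the same job is launched in the background as soon as the current run finishes.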

How do we handle DML changing dynamically?


It can be handled in the start script by creating dynamic SQL and a dynamic DML, so that there is no need
to change the component.

Both DML and XFR have to be passed as graph level parameter during the runtime.

Start script – A script which gets executed before the graph execution starts.

How to Create Surrogate Key using Ab Initio?

A system-generated unique sequential number is called a surrogate key.

next_in_sequence() function in transform

assign_keys component

Scan component
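A minimal sketch of the first approach, next_in_sequence() inside a Reformat transform (field names are hypothetical); when the graph runs in parallel, the partition number is usually folded in so that keys stay unique across partitions:

out :: reformat(in) =
begin
  out.surrogate_key :: (next_in_sequence() - 1) * number_of_partitions() + this_partition() + 1;
  out.* :: in.*;
end;

next_in_sequence() returns 1, 2, 3, ... within each partition, so combining it with this_partition() and number_of_partitions() yields values that are unique across the whole layout.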

What is .abinitiorc and what does it contain?

.abinitiorc is the configuration file of Ab Initio.

It holds connection information such as Connection method, Login id, Password, co>op location.

It contains all configuration variables such as AB_WORK_DIR, AB_DATA_DIR.

This file can be found in the config directory under $AB_HOME of the Co>Operating System; a user-level copy can also be placed in the user's home directory.

What is an Air Command in Ab Initio?

An Air Command is a command-line interface tool used to interact with the Ab Initio server. It allows
developers to perform various tasks such as creating, deploying, and monitoring Ab Initio graphs.

How do I access the Air Commands in Ab Initio?

Ab Initio Command Prompt

Shell from Ab Initio Installation Directory

What are the air commands used in Ab Initio?

• air object ls

• air object rm

• air project modify

• air lock show -user

• air sandbox status

What is the difference between checkpoint and phase?

Checkpoint: Restart Ability. When a graph fails in the middle of the process, a recovery point is created,
known as checkpoint. The rest of the process will be continued after the checkpoint instead of starting
from the beginning.
Phase: Phases are used to break up a graph into blocks for performance tuning. If a graph is created with
phases, each phase is assigned some part of memory, and all the phases run one after another. The
intermediate files of a phase are deleted when it completes (so, unlike checkpoints, they cannot be used for
recovery).

List out the file extensions used in Ab Initio?

.mp - Stores graph

.ksh - Deployed scripts (run)

.dml – record format files (Data Manipulation Language files)

.dbc – database configuration (connection) files

.dat – Data File

.xfr – Transform function file

.mdc – dataset

.sql – SQL statements

.plan – plans

.pset – parameter set files

Runtime Status of graph in Ab Initio

Partition in Ab Initio?

Partition is the process of dividing data sets into multiple sets for further processing

Partition by Round-robin: distributes the data evenly, in blocks of a given size, across the output partitions.

Partition by Range: Split the data evenly based on a key and a range among the nodes.

Partition by Percentage: distributes data so that the output is proportional to fractions of 100.

Partition by Load balance: Dynamic load balancing

Partition by Expression: Data split according to a DML expression

Partition by Key: Data grouping by a key

Broadcast: sends a complete copy of all the data to every output partition.

Explain what is departition in Ab Initio?

Departitioning removes the partitioning and merges the data back into a single dataset.

Combine data records from multiple flow partitions into a single flow.

Gather: Combine data records from multiple flow partitions arbitrarily


Merge: Combine data records from multiple flow partitions that have been sorted according to the key
specifier and maintain the sorted order.

Interleave: Collect blocks of data records from multiple flow partitions in round robin fashion.

Concatenation: Appends multiple flow partitions of data records one after another.

I have one xfr. How do I identify how many graphs are using that xfr?

grep -lr xfrname *   (or use the air object uses command)

grep -i xfrname *.pset in the pset folder

When does Ab Initio create the work directories where it stores the temp files? Are they created when a
component such as Sort uses a particular layout for the first time, or do they have to be created separately?

Ab Initio stores temporary files in $AB_WORK_DIR directory.

The $AB_WORK_DIR is the directory where Ab Initio creates work directories for each graph.

The work directories are used to store temporary files created by Sort component.

Kill vs m_kill

kill is for a Unix process and m_kill is for an Ab Initio graph (job).

kill needs a process id, and m_kill needs the job's .rec (recovery) file.

What is the difference between .dbc and .cfg file?

Answer: .cfg file is for the remote connection and .dbc is for connecting the database.

.cfg contains:

The name of the remote machine

The username/pwd to be used while connecting to the db.

The location of the Co>Operating System on the remote machine.

The connection method.

.dbc file contains:

The database names

Database version

Userid/pwd

What is the difference between look-up file and look-up, with a relevant example?

A lookup is a component where we can store data and retrieve it by using a key parameter.

A lookup file is the physical file where the data for the lookup is stored
What is Local lookup?

When a lookup file is a multifile partitioned on a particular key, the lookup_local function can be used
instead of the lookup function call. The lookup is then local to a particular partition, depending on the key.

A Lookup File represents one or more serial files (flat files) consisting of data records that can be held in
main memory. This lets the transform function retrieve the records much faster than retrieving them
from disk.

It allows the transform component to process the data records of multiple files quickly.
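A minimal usage sketch inside a Reformat transform (lookup-file and field names are hypothetical), assuming a Lookup File component named "Customers" keyed on cust_id:

out.cust_name :: lookup("Customers", in.cust_id).cust_name;

lookup_local() is called the same way when the lookup file is a multifile partitioned on the same key as the incoming data flow.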

m_dump

m_dump command prints the data in a formatted way.

This command is used to print information about data records, their record format, and the evaluations
of expressions.

m_dump <dml> <file.dat> -start 10 -end 20

m_eval

This command is used to evaluate DML expressions and displays their derived types.

This command subtracts 10 days from the current date and represents the result in a specific date
format:

$ m_eval '(date("YYYYMMDD")) (today() - 10)'  =>  "20041130"

m_wc

This command is used to count the number of records from one or more data files/multi files.

$ m_wc myrecfmt.dml mfs/mydata.dat

m_env

This command is used to obtain all the Ab Initio configuration variable settings in an environment.

$ m_env

m_kill

This command is used to kill a running job.

m_kill [signal] {jobname.rec | jobname}

signal can have three values: -TERM, -KILL, -QUIT

TERM Kills the job and triggers a rollback. This is the default.

KILL Kills the job immediately without triggering a rollback.

QUIT Kills the job immediately without triggering a rollback and forces the Xmp run process to dump
core
m_kill -TERM f_so_line_work_F2CMART.rec

m_rollback:

This command is used to perform a manual rollback in case of any Ab Initio job failure.

m_rollback [-d] [-i] [-h] [-kill] recoveryfile

m_mkfs

This command is used to create a multifile system. A multifile system consists of a control directory along
with the partition directories.

m_mkfs [options] control_url partition_url [partition_url ...]

m_db

This command is used for performing database operation from the command prompt.

m_db test dbc_file

This command can be used to check db connections from wrapper scripts

Explain the differences between Api and utility mode?

Utility mode (direct loading): when we load data into a table, all the constraints are disabled first, the data
is loaded, and then the constraints are re-enabled (faster loading).

API mode (conventional loading): records are checked one by one and the constraints remain enabled, so
loading is slower.

Have you ever encountered an error called “depth not equal”?


When two components are linked together, if their layout does not match then this problem can occur
during the compilation of the graph. A solution to this problem would be to use a partitioning
component in between if there was a change in layout.

What is a Wrapper Script?

A wrapper script is a Unix script, which is helpful in running graphs directly from Unix and in automating
their execution.
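A minimal ksh sketch of a wrapper (all paths, the dbc name, and the deployed script name are hypothetical): it checks database connectivity with m_db test and then invokes the deployed graph script, returning a non-zero exit code on failure so that a scheduler can react:

#!/bin/ksh
# check database connectivity before starting the load
m_db test $AI_DB/edw.dbc
if [ $? -ne 0 ]
then
    echo "ERROR: database connectivity check failed" >&2
    exit 1
fi

# run the deployed graph script and propagate its exit status
$AI_RUN/load_customer.ksh
exit $?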

What is Generic graph

It is a graph where everything is parameterized. The benefit of this kind of graph is that it is built once and
run multiple times with different kinds of data.

String Functions

string_split

string_split("mango eats daily", " ") --> [vector "mango", "eats", "daily"]

string_replace

string_replace("a man can do anything for money", "an", "en") --> "a men cen do enything for money"

string_replace("abcdef", "", "*") --> "*a*b*c*d*e*f*"

string_prefix

string_prefix("Ab Initio", 5) --> "Ab In"

string_suffix

string_suffix("Ab Initio", 5) --> "nitio"

string_index

string_index("to be or not to be", "be") --> 4

string_rindex

string_rindex("to be or not to be", "be") --> 17

string_like(str, pattern) returns 1/0

string_like("abcdef", "abc%") --> 1

string_like("abcdef", "abc_") --> 0

string_like("abcdef", "abc_ef") --> 1

string_substring

string_substring("Kolli Narendra", 3, 4) --> "lli " (includes the trailing space)

string_substring("Kolli Narendra", 7, 5) --> "Naren"

string_length

string_length("") --> 0

string_length("abc def") --> 7

Validation Functions

is_defined: It is used to check if a field is defined or not, before applying any string functions to avoid
errors.

is_error(): It is used to check if an error has occurred in a component.

is_null(): It is used to check if a field value is null.

is_valid(): It is used to check if a field value is valid.

is_blank(): It tests whether a string contains only blank characters

first_defined() function
The first_defined() function returns the first non-NULL argument passed to it. Here, it ensures that for
any groups that do not have any body records, zero will be returned instead of NULL.
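Two small usage sketches (field names are hypothetical):

In a Filter by Expression select expression, keep only records with a usable name and id:

!is_blank(cust_name) && is_defined(cust_id)

In a Rollup output rule, substitute zero when a group contributes no non-NULL amounts:

out.total_amount :: first_defined(sum(in.amount), 0);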

Vector Functions

vector_append: adds an object (element) to an existing vector.

vector_concat: combines two vectors into one.

vector_slice(input_vector, start_index, end_index): returns a sub-vector of a given vector.

vector_bsearch:

It is used to check the presence of an element. If the element is present, it returns only the first index of
the element

vector_sort:

It is used for reordering data.

vector_dedup_first:

It is used to remove duplicates from a vector

vector_rank

What are the primary keys and foreign keys?

The relationship between two tables is represented by a primary key and a foreign key.

The primary key table is the parent table and the foreign key table is the child table.

The criterion is that both tables should have a matching column.

Conduct>It (integrated with the GDE)

It groups a set of Ab Initio jobs by setting up file-based, time-based and job-based dependencies.

It is an environment for creating enterprise Ab Initio data integration systems. Its main role is to create
Ab Initio plans, which are a special type of graph constructed from other graphs and scripts. Ab Initio
provides graphical and command-line interfaces for Conduct>It.

It provides sophisticated job automation and monitoring for Ab Initio applications and other programs,
linking them together into a single execution environment.

Job A (Extract the data from source) – Job B (Source to stage transformation) – Job C (Stage to Load)

3 separate jobs = 3 items to schedule in the scheduler.

Job A (Extract the data from source) – Job B (Source to stage transformation) – Job C (Stage to Load)

Wrapped in a Conduct>It plan:
1 plan job => schedule only that one from the scheduler.

Scheduler – for automating the execution of jobs.

Examples: AutoSys, Control-M, Tivoli Workload Scheduler (TWS)

File – Based on the file arrival, we should start the Ab Initio job.

Time – Job should get started at a particular time.

Job – based on the successful completion (SU) of the predecessor job, the successor job gets executed.

Ex. Predecessor for Job A – No

Successor for Job A – Job B

Predecessor for Job B – Job A

Successor for Job B – Job C

Plan>It

It makes it easy to connect different graphs, creating a graph of graphs, which is known as a plan.

It helps to create, manage and run large-scale data processing systems by sequencing the ETL jobs.

Plan>It provides a framework to create complete production systems consisting of Ab Initio graphs,
custom scripts and third-party programs.

Plan consists of elements.

Tasks – which can be graphs, scripts and other programs.

Methods – which performs the actions of the tasks.

Parameters – which pass information between tasks.

Relationships – which determine the order in which the tasks are executed.

Different types of Tasks

Graphs Tasks - Perform method is a Graph

Program Tasks - Perform method is a script or program

Plan Tasks - Perform method is a plan

Conditional Tasks - Perform method consists of evaluating an expression

Perform method gives the task its identity

Different types of Methods

Trigger, Perform, At Start, At Success, At Failure, At Warning, At Rollback, At Shutdown, At Period

Batch Graphs
A batch graph stops after it processes all the input data that was available when it started.

If new data arrives while the graph is running, the graph must be restarted to process it.

Continuous Graph

Continuous Graph is a type of graph that continuously reads input data from a source and processes it in
real-time without having to stop and restart the graph. This type of graph is useful when dealing with
streaming data or when a high-speed data transfer is required.

A Continuous Graph consists of one or more input components, one or more output components, and
one or more transform components. The input components continuously receive data from a source,
such as a messaging system or a sensor network. The transform components process the data and
perform any necessary transformations, such as filtering or aggregating the data. The output
components continuously write the processed data to a target, such as a database or a file.

A continuous graph includes:

A subscriber as the only allowed data source.

A publisher at the end of each data flow.

In between the subscribe and publish components, there can be a number of continuous components,
depending upon the requirements.

Subscribers

Data enters the continuous graph through a subscriber component.

It brings data from various sources into a continuous graph.

Creates computepoints and checkpoints.

Batch Subscribe, Generate Records, JMS Subscribe, MQ Subscribe, Subscribe and Universal Subscribe

Publishers

Data leaves the continuous graph through a publisher component.

It is used to write the data to various destinations

Consumes computepoints and checkpoints.

Multipublish, MQ Publish, JMS Publish, Publish, Trash, Continuous Multi Update table

Compute-points And Checkpoints

Continuous graphs are made possible by the continuous components and by the compute-points and
checkpoints.

Computepoints and checkpoints are the extra packets of information sent between records on the data
flow.

Subscribers create these packets, and the publishers consume them.


Computepoints

They mark blocks of records that are to be processed as a group.

In a continuous graph with multiple input flows, computepoints indicate which block of data on one flow
corresponds to which block of data on another flow.

Checkpoints

These graphs periodically save intermediate states at special markers in data flows called checkpoints

How do you design and develop a Continuous Graph in Ab Initio?

1. Define the requirements: Understanding the input data sources, the output targets, and the
processing requirements.
2. Design the input and output components: Selecting the appropriate component type and
configuring the runtime properties, such as the buffer size and the timeout settings.
3. Design the transform components: Selecting the appropriate transform type, configuring the
inputs and outputs, and designing the processing logic using Ab Initio's graphical language.
4. Configure the Continuous Graph runtime properties: This includes setting the runtime
environment, defining the scheduling and concurrency options, and specifying any required
system resources.
5. Test and validate the Continuous Graph: This involves running the graph in a development or
test environment and verifying that it processes the data correctly.
6. Deploy the Continuous Graph: This involves configuring the runtime properties and scheduling
options and monitoring the graph for performance and errors.

How do you handle errors and exceptions in a Continuous Graph?

Use error handling components: Ab Initio provides several error handling components, such as the Error
Logging and Recovery component, which can be used to detect and log errors and exceptions in a
Continuous Graph. These components can also be used to recover from errors and resume processing.

Implement retries: Retrying failed operations is another common technique for handling errors in a
Continuous Graph. You can implement retries by configuring the retry count and interval for each
operation and adding logic to check for errors and retry the operation as needed.

Use notifications: Sending notifications when errors or exceptions occur is a useful technique for
alerting operators or administrators to potential issues. You can configure notifications using Ab Initio's
built-in notification components or by integrating with external monitoring tools.

Monitor the graph: Monitoring the Continuous Graph for errors and exceptions is another important
best practice. This involves configuring alerts and dashboards to track key metrics, such as input and
output rates, error rates, and processing times. You can also use Ab Initio's Command Center to monitor
and manage graphs in real-time.

Test and validate error handling: Finally, it's important to test and validate the error handling logic in a
Continuous Graph to ensure that it functions as expected. This involves running the graph in a test
environment and intentionally triggering errors to verify that the error handling and recovery logic
works correctly.

Explain the difference between a synchronous and asynchronous input in a Continuous Graph?
Synchronous input: Continuous Graph reads data from the input source in a blocking manner. This
means that the graph waits for data to be available before processing it. Once the data is available, the
graph processes it and then waits for the next data block. Synchronous input is typically used when the
input data source is slow.

Asynchronous input: Continuous Graph reads data from the input source in a non-blocking manner. This
means that the graph continues to process data while waiting for new data to arrive. Once the new data
is available, the graph processes it along with any previously buffered data. Asynchronous input is
typically used when the input data source is fast.

What is the role of the Continuous Graph Development Environment (CGDE) in Ab Initio, and what tools
does it provide?

The role of the CGDE is to provide a centralized environment where developers can design, develop, and
manage Continuous Graphs in a consistent and efficient manner. It provides a graphical interface for
designing and editing graphs, as well as a range of debugging, testing, and performance analysis tools

Tools:

Graph Editor

Debugging

Profiler

Test Framework

Deployment Manager

ICFF stands for Indexed Compressed Flat File

What is ICFF in Ab Initio?

ICFF is a file format used in Ab Initio for storing and processing large volumes of data.

ICFF format is designed to optimize the performance and storage space by compressing and indexing
data in a way that enables fast random access.

Ab Initio's own highly efficient parallel and indexed file system can store data in a wide range of file
systems such as S3, Hadoop (HDFS)

How does ICFF differ from other file formats in Ab Initio?

Indexing: ICFF files are indexed, which means that data can be accessed more quickly than in other file
formats, index stores the offsets of the blocks in the file, which allows for efficient random access.
Compression: ICFF files are compressed, which means that they take up less space than other file
formats. This can be especially useful when working with large datasets.

Record Layout: ICFF files have a fixed record layout, which means that the fields in each record are
defined in advance. This can help ensure data consistency and make it easier to work with the data.

Partitioning: ICFF files can be partitioned, which means that the data can be split into multiple files
based on the partition key. This can be useful when working with large datasets.

What are the ICFF Advantages and Benefits?

Less disk space required – compressed data store, efficient storage

Less memory required – chunked blocks, fast access

Speed

Performance

Large volumes of data

Parallel processing

Fixed record layout

Explain the structure of an ICFF file?

Header: The header contains metadata about the file, including the version number, record length, and
compression type.

• Version: 2
• Record length: 100
• Compression type: gzip

Data Blocks: The data blocks contain the actual data in the file. Each data block is preceded by an index
entry that stores the offset of the block within the file. The data blocks are compressed using a variety of
compression algorithms, including gzip and bzip2.

• Block 1: [compressed data]
• Block 2: [compressed data]
• Block 3: [compressed data]

Index: The index contains a list of pointers to the data blocks in the file. The index is used to locate the
data blocks when reading the file and is stored in a separate section of the file.

• Entry 1: offset=0, length=1000
• Entry 2: offset=1000, length=1200
• Entry 3: offset=2200, length=800

How do you create an ICFF file in Ab Initio?
Create a graph

Drag and drop a "Write" component onto canvas

Configure the "Write" component:

Output File format tab: Set ICFF

Output file name: Specify the name and location of the output file

Record format: Define the record format for the output file

This includes the length and format of each field in the record

Compression settings: Specify the compression algorithm gzip or bzip2

Connect the input data to the "Write" component

Save and run the graph to create output ICFF file.

How do you read data from an ICFF file in Ab Initio?

Create a graph

Drag and drop a "read" component onto canvas

Configure the "read" component:

input File format tab: Set ICFF

input file name: Specify the name and location of the input file

Record format: Define the record format for the input file

This includes the length and format of each field in the record

Compression settings: Specify the compression algorithm gzip or bzip2

Drag and drop an output component onto the canvas, "Transform" or "Write" component.

Connect the "Read" component to the output component.

Save and run the graph to read input ICFF file and write it to output component.

Parallel Processing:

It is a process of splitting data into partitions, distributing the processing across multiple nodes, and merging the results.

ICFF Parallel Processing:

To use ICFF files in parallel processing in Ab Initio,

We can use a partition component to split the input ICFF file into multiple partitions, each of which can be
processed independently. We can then use components such as Join, Merge, and Rollup to combine the
results of the partitions into a final output.
What is Flow?

A flow carries a stream of data between components in a graph. Flows connect components via ports.
Ab Initio supplies four kinds of flows with different patterns: straight, fan-in, fan-out, and all-to-all.

Evaluation of Parameters order

When we run a graph, parameters are evaluated and are determined by the order of their dependencies.

Host setup script is run.

Common (that is, included) sandbox parameters are evaluated.

Sandbox parameters are evaluated.

Project-start.ksh script is run.

Formal parameters are evaluated.

Graph parameters are evaluated.

Graph Start Script is run.

DML Evaluation

Graph end script is executed

PDL stands for Parameter Definition Language.

It is a scripting language used in Ab Initio for defining parameters that are used in data processing jobs.

It is used in Ab Initio to make a graph behave dynamic.

Suppose there is a need for a dynamic field to be added to a predefined DML while executing the graph;
then a graph-level parameter can be defined and used when embedding the DML in the output port.

For example, We have a graph that reads data from a file and writes it to another file. We want to pass
the input and output file names as parameters so that we can reuse the same graph for different files.
We can define two parameters in PDL, one for input file name and another for output file name. Then
we can use these parameters in our graph wherever we want to use input and output file names.
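A minimal sketch of the dynamic-DML idea (parameter and field names are hypothetical, and the embedded record format is assumed to be interpreted as PDL): the output port's embedded DML could be written as

record
  string(",") cust_id;
  string(",") cust_name;
  $EXTRA_FIELDS
  string("\n") amount;
end

where EXTRA_FIELDS is a graph parameter that expands at run time either to an empty string or to extra field declarations such as decimal(",") discount;. The same graph then handles both record layouts without being edited.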

Metadata Hub

In Ab Initio, the Metadata Hub provides a centralized repository for managing metadata across the
enterprise. It is a web-based platform that allows users to store, search, and manage metadata across
multiple Ab Initio applications and environments.

The Metadata Hub serves as a single source of truth for metadata: all metadata information is stored in
one location, making it easier for different applications and teams to manage and access.

Metadata Discovery: The Metadata Hub can automatically discover metadata from different sources and
store it in a central repository.
Metadata Management: The Metadata Hub provides a set of tools for managing metadata, including the
ability to create, modify, and delete metadata objects. It also allows users to assign ownership, add tags,
and create relationships between metadata objects.

Collaboration: The Metadata Hub provides a platform for collaboration between different teams and
users.

Data Lineage: The Metadata Hub provides visibility into the data lineage of metadata objects, which
means that users can track the origin and transformation of data across different systems and
applications.

Overall, the Metadata Hub is a powerful tool for managing metadata across the enterprise, providing a
centralized repository for storing and managing metadata information, and enabling collaboration and
data lineage tracking across different teams and applications

Business Rules Environment (BRE):

This is a new way of easily creating XFRs from mapping rules written in English.

Even the business team can understand these generic transformation rules.

Rules are written on a spreadsheet-like page and then converted to a component with the click of a button.

BRE connects to the host and the datastore (EME).

Advantages of BRE

Business users and GDE developers use BRE to develop and implement the business rules.

The time taken to implement the rules is reduced.

This tool is more transparent for business users/analysts as well as developers.

Traceability is another benefit of BRE.

Create a new Ruleset

Open the project path of EME


Select the ruleset directory/subdirectory.

Name the ruleset

Open an existing Ruleset

Once you open the rule set, you may view it in two ways, either by Rules or by Output

BRE Rulesets

The object created by BRE is a ruleset, which consists of one or more closely related generic
transformation rules. Each rule computes a value for one field in the output record.

A rule consists of cases, and each case has a set of conditions that determine one of the possible values for
the output field.

The case grid contains a set of conditions (in Triggers) and determines the value of output (in Outputs)
based on the input value.

Reformat Rulesets: This ruleset takes the input one by one and applies the transformation and produces
the output.

Filter Rulesets: This ruleset reads the input and, based on the conditions specified, either keeps or discards
the record and assigns the value to the output variable.

Join Rulesets: Reads the inputs from multiple sources and combines them, then apply the
transformation specified in ruleset and moves the value to the output.

The ruleset generally contains:

i. Input and Output datasets

ii. Lookup files

iii. Other set of Business rules with repetitive actions.

iv. Parameters whose values are determined upon running the graph

v. Special formulae and functions

Explain what does dependency analysis means in Ab Initio?

In Ab Initio, dependency analysis is the process through which the EME examines a project entirely and
traces how data is transformed, from component to component and field by field, within and between
graphs.

Lineage Diagrams

Graph Lineage and Dataset Lineage are techniques used to track the dependencies between different
data processing components and datasets within an Ab Initio application

Lineage diagrams represent the data flows between a ruleset's elements; they are used to visually
represent the data flow of a particular process or job.

These diagrams are useful for understanding the relationships between various components in a job and
for identifying issues that may arise during execution.

By examining the data flow through a job, it is possible to identify where data may be getting stuck or
where errors are occurring.

These diagrams enable developers and operators to identify potential issues and make improvements
that can improve the efficiency and effectiveness of data processing jobs.

This tool is used for understanding and optimizing data processing workflows in Ab Initio.

In Ab Initio, lineage diagrams can be generated automatically using a breadth-first search algorithm, which
traces the data flow through the various components of a job. The resulting diagram shows each
component as a node, with arrows indicating the direction of data flow between nodes.

XML Processing

Ab Initio provides a flexible and powerful environment for processing XML data, allowing users to easily
read, transform, and write XML files using a wide range of components and functions.

Reading XML Files: The first step is to read the XML files using the XML_Reader component in Ab Initio.
This component can parse the XML file and convert it into a hierarchical data structure that can be
processed further.

Transforming XML Data: Once the XML data is read, it can be transformed using various Ab Initio
components such as Reformat, Rollup, and Join. These components can be used to aggregate, filter, and
join data from different sources to create the desired output.

Writing XML Data: After the data is transformed, it can be written to an XML file using the XML_Write
component in Ab Initio. This component can generate the XML output based on a predefined schema or
format.

Error Handling: When processing XML data in Ab Initio, it is important to handle errors properly. The
XML Validator component can be used to validate the XML data against a schema and identify any
errors.
For common and custom format XML processing

READ XML: Reads a stream of characters, bytes, or records; then translates the data stream into DML
records.

READ XML TRANSFORM: Reads a record containing a mixture of XML and non-XML data; transforms the
data as needed, and translates the XML portions of the data into DML records

WRITE XML: Reads records and translates them to XML, writing out an XML document as a string.

WRITE XML TRANSFORM: Translates records to a string containing XML.

For common XML processing only

XML SPLIT: Reads, normalizes, and filters hierarchical XML data. This component is useful when you
need to extract and process subsets of data from an XML document.

xml-to-dml: Derives the DML-record description of XML data. You access this utility primarily through
the Import from XML dialog, though you can also run it directly from a shell.

Import from XML: Graphical interface for accessing the xml-to-dml utility from within XML-processing
graph components.

Import for XML Split: Graphical interface for accessing the xml-to-dml utility from within the XML SPLIT
component.

For XML validation

VALIDATE XML TRANSFORM: Separates records containing valid XML from records containing invalid
XML. You must provide an XML Schema to validate against

Common processing approaches cover roughly 98% of the cases where you want to bring XML data into
graphs, transform the data, and write it back out as XML.
