Ab Initio Preparation
Data Warehousing
Data Warehousing is a process of collecting, storing, and managing data from different sources to
provide meaningful business insights.
It is a database that stores both current and historical data and is used for reporting and analysis
purposes.
Datamart
A data mart is a subset of the data warehouse that focuses on a specific business line or department.
Data Modelling
Data Modelling is the diagrammatic representation showing how the entities are related to each other.
It is the initial step towards database design. We first create the conceptual model, then the logical
model and finally move to the physical model.
The logical model shows entity names, entity relationships, attributes, and the primary and foreign keys
of each entity.
The physical data model shows table names, column names, column data types, primary keys, and foreign
keys.
Star Schema: a fact table sits at the center and references multiple dimension tables around it. All
the dimension tables are connected to the fact table.
The primary key of each dimension table acts as a foreign key in the fact table.
Snowflake Schema: the level of normalization increases. The fact table here remains the same as in the
star schema. However, the dimension tables are normalized.
Ab Initio
The name comes from Latin and means "from the beginning".
It is designed to solve simple to complex business problems efficiently, from the beginning.
It is a secured ETL tool used to extract, transform and load data.
It is used for data analysis, data manipulation, batch processing, and graphical user interface based
parallel processing
It is mainly used to apply transformation rules on data as per business requirements and feed all
business units in the required format.
Scalability – performance is not significantly affected even as data volume increases.
Co>Operating System
It is the heart of Ab Initio.
The Co>Operating System is a high-performance parallel processing framework that is optimized for
data integration and processing.
Co>Operating System is an environment for building, integrating, running and deploying large enterprise
business data applications. It is the foundation for all Ab Initio's technologies and provides a complete
and seamless application development and execution environment.
It sits on top of the native operating system and is the base for all Ab Initio processes. It provides
additional commands, known as air commands, and can be installed on a variety of environments such as
Unix, HP-UX, Linux, IBM AIX, and Windows.
It handles the actual execution of Ab Initio jobs: creating processes, scheduling processes, file
management, and checkpointing.
It manages and executes Ab Initio graphs and controls the ETL processes.
GDE (Graphical Development Environment)
It is an IDE (Integrated Development Environment) that enables creation of applications by dragging and
dropping components onto a canvas and configuring them using point-and-click operations.
It is an end-user environment where we can develop a graph, run the graph, and check its status.
It is used to develop and debug Ab Initio jobs with the help of components.
EME is a high-performance object-oriented data storage system that version controls and manages
various kinds of information associated with Ab Initio applications, which may range from design
information to operational data
EME is used for storing graphs and managing metadata (graphs and their associated files, with respect to
version).
Technical metadata – application-related business rules, record formats, and execution statistics.
EME metadata can also be accessed from the Ab Initio GDE, internet browser or Ab Initio Co>Op
command line (Air commands).
Sandbox is a collection of directories such as bin, mp, dml, run, etc., which contain the metadata
(graphs and their associated files).
In the graph, instead of specifying the entire path for any file location, we specify only the sandbox
parameter variable.
Standard Environment
Builds the basic infrastructure and environment for running Ab Initio applications.
Contains the SERIAL and MFS locations, enterprise level parameters and values that will be used by
private projects.
Every project includes the stdenv project and inherits the parameters defined there.
Checkout
First-time checkout – the sandbox is created (the project is copied from its EME project path to the
sandbox project path).
On subsequent checkouts, updated project versions are moved from EME to the sandbox.
Project-level checkout: all new object versions are moved from EME to the sandbox.
Object-level checkout: the updated object version is moved from EME to the sandbox.
Checkin
Once development and testing of the code are complete, we check it in. The code moves to the EME
repository, and a new version is created for it.
Graph is translated into a script that can be executed in the Shell Development Environment.
This script and any metadata files stored on the GDE client machine are shipped (via FTP) to the server.
The script is invoked (via SSH or REXEC or TELNET) on the server.
The script creates and runs a job that may run across many nodes.
Component Execution
Component processes communicate directly with datasets and each other to move data around.
As each component process finishes with its data, it exits with a success status.
Agent Termination
When all of an Agent's component processes exit, the Agent informs the Host process that those
components are finished. The Agent process then exits.
Host Termination
When all Agents have exited, the Host process informs the GDE that the job is complete.
Source/destination tables are stored on a database server. To extract data from or load data to the
database server through the Ab Initio Co>Op, we need to configure the connection between them.
Input table component -> config -> new -> edit database configuration
database Version
db_home
db_name
db_nodes
db_character_set
user
password
Close the editor and save it in the sandbox -> db folder as a .dbc file.
Parameter
Parameter Set
The collection of all the parameters defined on an object (e.g. graph) is called its parameter set. By
changing the parameter set's values, the object’s behavior can be controlled.
Upon running the graph which has graph parameters set in, a parameter setting window pops up
showing the input parameters and local parameters declared and defined for that particular graph.
One graph can have multiple psets defined over it, with the same parameter names and declarations but
different definitions.
Instead of hard-coding specific values directly into the application logic, we can use parameter sets to
easily switch between different configurations or environments. This makes it easier to manage and
maintain Ab Initio applications, especially when dealing with different data sources, target systems, or
processing requirements.
The source graph associated with pset can be changed to another graph if required. Upon opening the
pset, change the source graph by going to
Edit (tab) --> change source --> new source graph name
Formal parameter / input parameter: we don't need to initialize the value; the graph prompts for that
parameter at run time.
How would you do performance tuning for an already built graph? Can you give some examples?
Instead of reading many small files many times, we can directly use “Read Multiple Files” one time.
Connect multiple flows directly to an Input File component; a Replicate after the input file is not
required to feed multiple flows.
Filter the data as early as possible. (Drop unwanted data fields/ data records early in the graph)
Avoid using Sort component multiple times. The “SORT” component breaks pipeline parallelism.
If a Sort is placed in front of a Merge component even though the inputs are already sorted, the Sort adds
no value, because Merge maintains the existing sort order while combining the flows.
Sort the data in parallel by using Partition by Key and Sort, rather than sorting it serially
Phases are used to break up a graph into blocks for performance tuning.
Use phases with checkpoints when we have large volumes of data to be processed.
Avoid wrong placement of phase breaks: putting a phase break just before a Filter by Expression is a bad
idea, since the size of the data may get reduced after this component.
Never put a checkpoint or a phase break after a Sort; instead use a checkpointed sort.
Use lookup_local rather than lookup when we have large volumes of data to be processed.
Partition the data as early as possible and departition the data as late as possible.
Use a Reformat with multiple output ports instead of a Replicate followed by multiple Reformats; that is,
use Reformat like a Replicate component.
For the sorted-input parameter of Rollup/Join/Scan components, set the in-memory option to false, i.e.
choose the "Input must be sorted" option, when the data is already sorted.
Use data parallel processing wherever possible. Make a graph parallel as early as possible and keep it
parallel as long as possible.
Handle data quality issues as soon as possible, since it will reduce the data which is unnecessary. Do not
spread the data quality rules throughout the graph until necessary.
Maxcore
It is a parameter that specifies the maximum amount of main memory a component can use; beyond that,
the component spills to disk.
Rollup – 64 MB
Scan – 64 MB
Sort – 100 MB
Why does the Sort component divide data into 1/3 of the Max core?
It is an efficient way to sort large amounts of data. The sort component sorts data in memory and then
writes it to disk. By dividing the data into 1/3 of max core, it ensures that there is enough memory
available for sorting and writing.
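The idea behind reserving memory beyond the raw data can be illustrated with a generic external merge
sort, sketched below in plain Python (a conceptual illustration only, not Ab Initio code; the function and
variable names are invented for the example): chunks that fit in memory are sorted and spilled to disk,
and the sorted runs are then merged.

import heapq
import os
import tempfile

def spill_sorted_run(sorted_chunk):
    # Write one sorted chunk to a temporary file and return its path.
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "w") as f:
        f.writelines(rec + "\n" for rec in sorted_chunk)
    return path

def external_sort(records, max_in_memory):
    # Phase 1: sort chunks that fit within the memory budget and spill them to disk.
    run_files, chunk = [], []
    for rec in records:
        chunk.append(rec)
        if len(chunk) >= max_in_memory:
            run_files.append(spill_sorted_run(sorted(chunk)))
            chunk = []
    if chunk:
        run_files.append(spill_sorted_run(sorted(chunk)))
    # Phase 2: merge the sorted runs, holding only one record per run in memory.
    streams = [open(path) for path in run_files]
    try:
        return [line.rstrip("\n") for line in heapq.merge(*streams)]
    finally:
        for f in streams:
            f.close()
        for path in run_files:
            os.remove(path)

print(external_sort(["pear", "apple", "kiwi", "fig", "plum"], max_in_memory=2))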
The Sort component is used before Dedup Sorted, Rollup, Scan, and Merge components, since they expect
sorted input.
Dedup Sorted Component
Separates one specified data record in each group of data records from the rest of the records in the
group.
Reject threshold parameter: Abort on first reject, Never abort, Use limit/ramp.
Normalize Component:
It is used to split a single record into N records, based on the length function defined in the Normalize
transform.
Always clean and validate data before normalizing it, because Normalize uses a multistage transform.
Filter by Expression
Filter data records according to specified DML expression
use_package
reject-threshold parameter
(Default) Abort on first reject — the component stops the execution of the graph at the first reject
event it generates.
Never abort — the component does not stop the execution of the graph, no matter how many reject
events it generates.
Use ramp/limit — the component uses the settings in the ramp and limit parameters to determine how
many reject events to allow before it stops the execution of the graph.
The limit and ramp are variables used to set the reject tolerance for a particular graph. This is one of
the options for the reject-threshold property; limit and ramp values must be supplied when this option is
enabled. The graph stops execution when the number of rejected records exceeds
limit + (ramp * number_of_records_processed). The default value is 0.0. The limit parameter contains an
integer that represents a number of reject events. The ramp parameter contains a real number that
represents the rate of reject events relative to the number of records processed.
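The tolerance formula above can be sketched in plain Python (illustrative only; not Ab Initio syntax, and
the names are invented for the example):

def should_abort(rejects, records_processed, limit=0, ramp=0.0):
    # Abort when the number of rejects exceeds limit + ramp * records_processed.
    tolerance = limit + ramp * records_processed
    return rejects > tolerance

# With the defaults (limit=0, ramp=0.0) any reject aborts the run.
print(should_abort(rejects=1, records_processed=100))                 # True
# A ramp of 0.05 tolerates roughly 5 rejects per 100 records processed.
print(should_abort(rejects=4, records_processed=100, ramp=0.05))      # False
print(should_abort(rejects=6, records_processed=100, ramp=0.05))      # True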
Redefine Format Component
It copies data records from its input to its output without changing the values in the data records.
Reformat Component
To convert data from one format to another format we use reformat component
It transforms the data (business rules / mapping); we can write multiple transformations.
It can change the DML (output record format).
count 2 – transform1; count 3 – transform2 (each increment of the count parameter adds another transform
and output port).
Reject threshold – abort on first reject, never abort, Limit/Ramp
output_index and output_indexes are two functions available in the Reformat component in Ab Initio.
They are useful when the count parameter is set to a number greater than 1.
With the output_index function, we can direct the current input record to a single transform/output port.
out :: output_index(in) =
begin
  out :: if (in.region == "USA") 0
    else if (in.region == "Australia") 1
    else if (in.region == "UK") 2
    else 0; /* default port for any other region */
end;
With the output_indexes function, we can direct the current input record to more than one output port.
out :: output_indexes(in) =
begin
  out :: if (in.region == "USA") [vector 0]
    else if (in.region == "Australia") [vector 1, 2]
    else if (in.region == "UK") [vector 0, 2]
    else [vector 0]; /* default ports for any other region */
end;
Rollup Component
Rollup is a transform component that generates one summary (aggregate) record for each group of data
records (see the sketch below for a comparison with Scan).
Scan Component
Scan is a transform component that generates a series of cumulative summary records for groups of
data records.
Rollup can perform some additional functionality, like input filtering and output filtering of records.
Aggregate does not display the intermediate results in main memory, whereas Rollup can.
Rollup provides more control over record selection, grouping, and aggregation.
Then click the Edit option; there we can see the Add Default Rules option. On clicking Add Default Rules,
the Build Transform Using dialog box opens.
1. Match Names: generates a set of rules that copies input fields to output fields with the same names.
2. Wildcard (.*) Rule: generates one wildcard rule that copies input fields to output fields with the
same names.
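To make the Rollup/Scan distinction described above concrete, here is a plain-Python sketch (conceptual
only, not Ab Initio DML): Rollup emits one summary record per key group, while Scan emits a running
(cumulative) summary for every input record.

from itertools import groupby

records = [("USA", 10), ("USA", 20), ("UK", 5), ("UK", 7), ("UK", 3)]  # already sorted by key

def rollup(recs):
    # One output record per key group (like Rollup).
    for key, group in groupby(recs, key=lambda r: r[0]):
        yield key, sum(amount for _, amount in group)

def scan(recs):
    # A running total for every input record within its key group (like Scan).
    for key, group in groupby(recs, key=lambda r: r[0]):
        running = 0
        for _, amount in group:
            running += amount
            yield key, running

print(list(rollup(records)))  # [('USA', 30), ('UK', 15)]
print(list(scan(records)))    # [('USA', 10), ('USA', 30), ('UK', 5), ('UK', 12), ('UK', 15)]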
Join Component
Left join: set record_required as true for the left input and false for the right input.
Right join: set record_required as true for the right input and false for the left input.
In Join, the key parameter must be specified, and the input flows must be in ascending or descending order
on that key.
We set record_required as true for the required input and false for the other inputs.
Then we can use overridekey0 to specify the corresponding field name in in0.
Broadcast and Replicate are very similar components in that they both copy each input record to
multiple downstream components.
Broadcast (a partition component) is used for data parallelism. It is a fan-out or all-to-all flow.
It sends all the data on the input flow to every partition of the output flow.
Layout
Layout determines where a component runs and with how many partitions – a serial layout or a multifile
(MFS) layout.
Parallelism
Component Parallelism: An application that has multiple components running on the system
simultaneously. But the data are separate. This is achieved through component level parallel processing.
Pipeline Parallelism: An application with multiple connected components running simultaneously on the
same dataset, each component working on different records as they flow through. This uses pipeline
parallelism.
Data Parallelism: Data is split into segments and runs the operations simultaneously. This kind of process
is achieved using data parallelism.
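A minimal sketch of the data-parallel pattern in plain Python (conceptual only, not Ab Initio; the
transform and partition count are invented for the example): partition the data, process the partitions
concurrently, and merge the results.

from concurrent.futures import ProcessPoolExecutor

def transform(partition):
    # Stand-in for a per-partition transformation (e.g. a Reformat).
    return [value * 2 for value in partition]

def run_data_parallel(data, partitions=4):
    # Partition: round-robin split of the input into N partitions.
    parts = [data[i::partitions] for i in range(partitions)]
    # Process each partition concurrently, then merge (departition) the results.
    with ProcessPoolExecutor(max_workers=partitions) as pool:
        results = pool.map(transform, parts)
    return [rec for part in results for rec in part]

if __name__ == "__main__":
    print(run_data_parallel(list(range(10))))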
To execute a graph infinitely, the graph's end script should call the graph's .ksh file. Therefore, if the
graph name is abc.mp, the end script of the graph should call abc.ksh. This will run the graph
infinitely.
End script – This script runs after the graph has completed running.
Both DML and XFR have to be passed as graph level parameter during the runtime.
Start script – A script which gets executed before the graph execution starts.
assign_keys component
Scan component
It holds connection information such as Connection method, Login id, Password, co>op location.
An Air Command is a command-line interface tool used to interact with the Ab Initio server. It allows
developers to perform various tasks such as creating, deploying, and monitoring Ab Initio graphs.
• air object ls
• air object rm
Checkpoint: Restart Ability. When a graph fails in the middle of the process, a recovery point is created,
known as checkpoint. The rest of the process will be continued after the checkpoint instead of starting
from the beginning.
Phase: Phases are used to break up a graph into blocks for performance tuning. If a graph is created with
phase, each phase is assigned to some part of memory one after another. All the phases will run one by
one. The intermediate file will be deleted.
.plan – plan files (Conduct>It plans)
.pset – parameter set files
Partition in Ab Initio?
Partition is the process of dividing data sets into multiple sets for further processing.
Partition by Round-Robin: distributes the data evenly, in blocks of a given size, across the output
partitions (see the sketch after this list).
Partition by Range: splits the data evenly among the nodes based on a key and a range.
Departition: combining data records from multiple flow partitions into a single flow.
Interleave: Collect blocks of data records from multiple flow partitions in round robin fashion.
Concatenation: Appends multiple flow partitions of data records one after another.
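The partition and departition methods listed above can be sketched in plain Python (conceptual only, not
Ab Initio components; partition by key is shown as simple hash partitioning).

from itertools import chain, zip_longest

def partition_round_robin(records, n):
    # Deal records one at a time across n partitions.
    parts = [[] for _ in range(n)]
    for i, rec in enumerate(records):
        parts[i % n].append(rec)
    return parts

def partition_by_key(records, n, key):
    # Send records with the same key value to the same partition (hash partitioning).
    parts = [[] for _ in range(n)]
    for rec in records:
        parts[hash(key(rec)) % n].append(rec)
    return parts

def concatenate(parts):
    # Departition: append partitions one after another.
    return list(chain.from_iterable(parts))

def interleave(parts):
    # Departition: collect records from the partitions in round-robin fashion.
    sentinel = object()
    return [r for group in zip_longest(*parts, fillvalue=sentinel)
            for r in group if r is not sentinel]

data = list(range(7))
parts = partition_round_robin(data, 3)
print(parts)                 # [[0, 3, 6], [1, 4], [2, 5]]
print(concatenate(parts))    # [0, 3, 6, 1, 4, 2, 5]
print(interleave(parts))     # [0, 1, 2, 3, 4, 5, 6]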
I have one xfr. I would like to identify how many graphs are using that xfr?
When does Ab Initio create the work directories where it stores temp files? Are they created when the
Sort component uses a particular layout for the first time, or do they have to be created separately?
The $AB_WORK_DIR is the directory where Ab Initio creates work directories for each graph.
The work directories are used to store temporary files created by Sort component.
Kill vs m_kill
kill terminates a Unix process directly, whereas m_kill terminates a running Ab Initio job (identified by
its recovery file) and, by default, triggers a rollback.
What is the difference between a .cfg file and a .dbc file?
Answer: the .cfg file is for the remote connection and the .dbc file is for connecting to the database.
.cfg contains:
Database version
Userid/pwd
What is the difference between look-up file and look-up, with a relevant example?
A lookup is a component where we can store data and retrieve it by using a key parameter.
A lookup file is the physical file where the data for the lookup is stored
What is Local lookup?
If the lookup file is a multifile partitioned on a particular key, then the lookup_local function can be
used instead of the lookup function call. The lookup is then local to a particular partition, depending on
the key.
Lookup File represents one or more serial files (flat files) consisting of data records that can be held
in main memory. This lets the transform function retrieve records much faster than retrieving them from
disk.
It allows the transform component to process the data records of multiple files quickly.
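A conceptual sketch in plain Python (not Ab Initio): the lookup data is held in memory and probed by key,
which is why a lookup is faster than re-reading from disk. The table contents and field names are
invented for the example.

# In-memory lookup table keyed on country code (the role a lookup file plays).
country_lookup = {
    "US": {"code": "US", "name": "United States"},
    "UK": {"code": "UK", "name": "United Kingdom"},
}

def enrich(record, lookup_table):
    # Probe the in-memory table by key, as a transform would call lookup().
    match = lookup_table.get(record["country_code"])
    record["country_name"] = match["name"] if match else None
    return record

print(enrich({"id": 1, "country_code": "US"}, country_lookup))
print(enrich({"id": 2, "country_code": "FR"}, country_lookup))   # no match -> None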
m_dump
This command is used to print information about data records, their record format, and the evaluations
of expressions.
m_eval
This command is used to evaluate DML expressions and displays their derived types.
For example, m_eval can subtract 10 days from the current date and present the result in a specific date
format.
m_wc
This command is used to count the number of records from one or more data files/multi files.
m_env
This command is used to obtain all the Ab Initio configuration variable settings in an environment.
$ m_env
m_kill
TERM Kills the job and triggers a rollback. This is the default.
QUIT Kills the job immediately without triggering a rollback and forces the Xmp run process to dump
core
m_kill -TERM f_so_line_work_F2CMART.rec
m_rollback:
This command is used to perform a manual rollback in case of an Ab Initio job failure.
m_mkfs
This command is used to create a multifile system. A multifile system consists of a control directory
along with the partition directories.
m_db
This command is used for performing database operations from the command prompt.
Utility mode (direct loading): when we load data into a table, all the constraints are disabled and then
the data is loaded; once the data is loaded, all the constraints are enabled again (faster loading).
API mode (conventional loading): records are checked one by one and constraints remain enabled, so
loading is slower.
A wrapper script is a Unix script that is helpful in running graphs directly through Unix and running
them automatically.
A generic graph is a graph where everything is parameterized. The benefit of this kind of graph is that it
is built once and run multiple times with different kinds of data.
String Functions
string_split
string_replace
string_replace(“a man can do anything for money”,”an”,”en”) --> a men cen do enything for money
string_prefix
string_suffix
string_index
string_rindex
string_like(str, pattern) --> 1/0 (see the sketch after this list)
string_like("abcdef", "abc%") --> 1
string_like("abcdef", "abc_") --> 0
string_like("abcdef", "abc_ef") --> 1
string_substring
string_length
string_length(“”) --> 0
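The string_like pattern rules above (% matches any run of characters, _ matches exactly one character)
can be sketched in plain Python (illustrative only, not DML):

import re

def like(s, pattern):
    # Translate a LIKE-style pattern: '%' -> any run of characters, '_' -> exactly one character.
    regex = "".join(".*" if c == "%" else "." if c == "_" else re.escape(c) for c in pattern)
    return 1 if re.fullmatch(regex, s) else 0

print(like("abcdef", "abc%"))    # 1
print(like("abcdef", "abc_ef"))  # 1
print(like("abcdef", "abc_"))    # 0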
Validation Functions
is_defined: It is used to check if a field is defined or not, before applying any string functions to avoid
errors.
first_defined() function
The first_defined() function returns the first non-NULL argument passed to it. For example, it can ensure
that for any group that has no records, zero is returned instead of NULL.
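The behaviour resembles a null-coalescing function; a one-function sketch in plain Python (illustrative
only):

def first_defined(*args):
    # Return the first argument that is not None (NULL); None if every argument is NULL.
    return next((a for a in args if a is not None), None)

print(first_defined(None, None, 0, 42))   # 0  (zero is defined; only NULL/None is skipped)
print(first_defined(None, "total"))       # total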
Vector Functions
vector_bsearch:
It is used to check for the presence of an element in a sorted vector. If the element is present, it
returns the first index of the element (see the sketch after this list).
vector_sort:
vector_dedup_first:
vector_rank
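The vector_bsearch behaviour described above (binary search on a sorted vector, returning the first index
of the element) can be sketched with Python's bisect module; the not-found return value of -1 is just a
convention for this sketch, not necessarily what DML returns.

from bisect import bisect_left

def vector_bsearch(sorted_vec, element):
    # Binary search a sorted vector: return the first index of element, or -1 if absent.
    i = bisect_left(sorted_vec, element)
    return i if i < len(sorted_vec) and sorted_vec[i] == element else -1

vec = [3, 7, 7, 7, 9, 12]
print(vector_bsearch(vec, 7))    # 2  -- first index of the element
print(vector_bsearch(vec, 10))   # -1 -- not present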
The relationship between two tables is represented by a primary key and a foreign key.
The primary-key table is the parent table and the foreign-key table is the child table.
The criterion for relating the two tables is that there should be a matching column.
Conduct>It
It groups a set of Ab Initio jobs by setting up dependencies – file-based, time-based, or job-based.
It is an environment for creating enterprise Ab Initio data integration systems. Its main role is to
create Ab Initio plans, a special type of graph constructed from other graphs and scripts. Ab Initio
provides graphical and command-line interfaces to Conduct>It.
It provides sophisticated job automation and monitoring for Ab Initio applications and other programs,
linking them together into a single execution environment.
Job A (Extract the data from source) – Job B (Source to stage transformation) – Job C (Stage to Load)
Conduct>It Plan
One plan job => schedule it from the scheduler.
File – based on the file arrival, the Ab Initio job should start.
Job – based on the successful completion of the predecessor job, the successor job should get executed.
Plan>It
It makes it easy to connect different graphs, creating graphs of graphs, which is known as a plan.
It helps to create, manage and run large-scale data processing systems by sequencing the ETL jobs.
Plan>It provides a framework to create complete production systems consisting of Ab Initio graphs,
custom scripts and third-party programs.
Batch Graphs
A batch graph stops after it processes all the input data that was available when it started.
If new data arrives while the graph is running, the graph must be restarted to process it.
Continuous Graph
Continuous Graph is a type of graph that continuously reads input data from a source and processes it in
real-time without having to stop and restart the graph. This type of graph is useful when dealing with
streaming data or when a high-speed data transfer is required.
A Continuous Graph consists of one or more input components, one or more output components, and
one or more transform components. The input components continuously receive data from a source,
such as a messaging system or a sensor network. The transform components process the data and
perform any necessary transformations, such as filtering or aggregating the data. The output
components continuously write the processed data to a target, such as a database or a file.
In between the subscribe and publish component there could be a number of continuous components
depending upon the requirements.
Subscribers
Subscriber components are used to bring data from various sources into a continuous flow graph.
Batch Subscribe, Generate Records, JMS Subscribe, MQ Subscribe, Subscribe and Universal Subscribe
Publishers
Publisher components are used to write data from a continuous flow graph out to its targets.
Multipublish, MQ Publish, JMS Publish, Publish, Trash, Continuous Multi Update Table
Continuous graphs are made possible by the continuous components and by the compute-points and
checkpoints.
Computepoints and checkpoints are extra packets of information sent between records on the data flow.
In a continuous graph with multiple input flows, computepoints are used to indicate which block of data on
one flow corresponds to which block of data on another flow.
Checkpoints
These graphs periodically save intermediate states at special markers in data flows called checkpoints
1. Define the requirements: Understanding the input data sources, the output targets, and the
processing requirements.
2. Design the input and output components: Selecting the appropriate component type and
configuring the runtime properties, such as the buffer size and the timeout settings.
3. Design the transform components: Selecting the appropriate transform type, configuring the
inputs and outputs, and designing the processing logic using Ab Initio's graphical language.
4. Configure the Continuous Graph runtime properties: This includes setting the runtime
environment, defining the scheduling and concurrency options, and specifying any required
system resources.
5. Test and validate the Continuous Graph: This involves running the graph in a development or
test environment and verifying that it processes the data correctly.
6. Deploy the Continuous Graph: This involves configuring the runtime properties and scheduling
options and monitoring the graph for performance and errors.
Use error handling components: Ab Initio provides several error handling components, such as the Error
Logging and Recovery component, which can be used to detect and log errors and exceptions in a
Continuous Graph. These components can also be used to recover from errors and resume processing.
Implement retries: Retrying failed operations is another common technique for handling errors in a
Continuous Graph. You can implement retries by configuring the retry count and interval for each
operation and adding logic to check for errors and retry the operation as needed.
Use notifications: Sending notifications when errors or exceptions occur is a useful technique for
alerting operators or administrators to potential issues. You can configure notifications using Ab Initio's
built-in notification components or by integrating with external monitoring tools.
Monitor the graph: Monitoring the Continuous Graph for errors and exceptions is another important
best practice. This involves configuring alerts and dashboards to track key metrics, such as input and
output rates, error rates, and processing times. You can also use Ab Initio's Command Center to monitor
and manage graphs in real-time.
Test and validate error handling: Finally, it's important to test and validate the error handling logic in a
Continuous Graph to ensure that it functions as expected. This involves running the graph in a test
environment and intentionally triggering errors to verify that the error handling and recovery logic
works correctly.
Explain the difference between a synchronous and an asynchronous input in a Continuous Graph.
Synchronous input: Continuous Graph reads data from the input source in a blocking manner. This
means that the graph waits for data to be available before processing it. Once the data is available, the
graph processes it and then waits for the next data block. Synchronous input is typically used when the
input data source is slow.
Asynchronous input: Continuous Graph reads data from the input source in a non-blocking manner. This
means that the graph continues to process data while waiting for new data to arrive. Once the new data
is available, the graph processes it along with any previously buffered data. Asynchronous input is
typically used when the input data source is fast.
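A plain-Python sketch (conceptual only, not Ab Initio) contrasting a blocking read with a non-blocking
read on a queue of incoming records:

import queue

incoming = queue.Queue()
incoming.put({"event": "order_created"})

# Synchronous (blocking) read: wait until a record is available, then process it.
record = incoming.get()          # blocks if the queue is empty
print("processed", record)

# Asynchronous (non-blocking) read: keep doing other work while no record has arrived.
try:
    record = incoming.get_nowait()
except queue.Empty:
    print("no new data yet; continue processing buffered work")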
What is the role of the Continuous Graph Development Environment (CGDE) in Ab Initio, and what tools
does it provide?
The role of the CGDE is to provide a centralized environment where developers can design, develop, and
manage Continuous Graphs in a consistent and efficient manner. It provides a graphical interface for
designing and editing graphs, as well as a range of debugging, testing, and performance analysis tools
Tools:
Graph Editor
Debugging
Profiler
Test Framework
Deployment Manager
ICFF (Indexed Compressed Flat File) is a file format used in Ab Initio for storing and processing large
volumes of data.
The ICFF format is designed to optimize performance and storage space by compressing and indexing data in
a way that enables fast random access.
Ab Initio's own highly efficient parallel and indexed file system can store data in a wide range of file
systems such as S3, Hadoop (HDFS)
Indexing: ICFF files are indexed, which means that data can be accessed more quickly than in other file
formats, index stores the offsets of the blocks in the file, which allows for efficient random access.
Compression: ICFF files are compressed, which means that they take up less space than other file
formats. This can be especially useful when working with large datasets.
Record Layout: ICFF files have a fixed record layout, which means that the fields in each record are
defined in advance. This can help ensure data consistency and make it easier to work with the data.
Partitioning: ICFF files can be partitioned, which means that the data can be split into multiple files
based on the partition key. This can be useful when working with large datasets.
Speed
Performance
Parallel processing
Header: The header contains metadata about the file, including the version number, record length, and
compression type
• Version: 2
• Record length: 100
• Compression type: gzip
Data Blocks: The data blocks contain the actual data in the file. Each data block is preceded by an index
entry that stores the offset of the block within the file. The data blocks are compressed using a variety of
compression algorithms, including gzip, bzip2.
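The header / index-of-offsets / data-block layout described above can be illustrated with a toy sketch in
plain Python (this is not the real ICFF format): an index of byte offsets lets a reader seek straight to a
record instead of scanning the whole file.

import io

def write_indexed(records):
    # Write records to a buffer and remember the byte offset of each one (a toy "index").
    buf = io.BytesIO()
    offsets = []
    for rec in records:
        offsets.append(buf.tell())          # where this record starts
        buf.write(rec.encode() + b"\n")
    return buf, offsets

def read_record(buf, offsets, i):
    # Random access: seek straight to record i using the index instead of scanning.
    buf.seek(offsets[i])
    return buf.readline().decode().rstrip("\n")

buf, index = write_indexed(["alpha", "bravo", "charlie", "delta"])
print(read_record(buf, index, 2))   # charlie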
Output file name: Specify the name and location of the output file
Record format: Define the record format for the output file
This includes the length and format of each field in the record
Create a graph
input file name: Specify the name and location of the input file
Record format: Define the record format for the input file
This includes the length and format of each field in the record
Drag and drop an output component onto the canvas, such as a Transform or Write component.
Save and run the graph to read input ICFF file and write it to output component.
Parallel Processing:
It is a process that splits data into partitions, distributes processing across multiple nodes, and merges
the results.
We can use a Partition component to split the input ICFF file into multiple partitions, each of which can
be processed independently. We can then use components such as Join, Merge, and Rollup to combine the
results of the partitions into a final output.
What is Flow?
A flow carries a stream of data between components in a graph. Flows connect components via ports.
Ab Initio supplies four kinds of flows with different patterns: straight, fan-in, fan-out, and all-to-all.
When we run a graph, parameters are evaluated and are determined by the order of their dependencies.
PDL (Parameter Definition Language)
PDL is a scripting language used in Ab Initio for defining parameters that are used in data processing
jobs.
Suppose there is a need to have a dynamic field that is to be added to a predefined DML while executing
the graph, then a graph level parameter can be defined and utilized while embedding the DML in output
port.
For example, We have a graph that reads data from a file and writes it to another file. We want to pass
the input and output file names as parameters so that we can reuse the same graph for different files.
We can define two parameters in PDL, one for input file name and another for output file name. Then
we can use these parameters in our graph wherever we want to use input and output file names.
Metadata Hub
In Ab Initio, the Metadata Hub provides a centralized repository for managing metadata across the
enterprise. It is a web-based platform that allows users to store, search, and manage metadata across
multiple Ab Initio applications and environments.
The Metadata Hub serves as a single source of truth for metadata, which means that all metadata
information is stored in one location, making it easier to manage and access different applications and
teams.
Metadata Discovery: The Metadata Hub can automatically discover metadata from different sources and
store it in a central repository.
Metadata Management: The Metadata Hub provides a set of tools for managing metadata, including the
ability to create, modify, and delete metadata objects. It also allows users to assign ownership, add tags,
and create relationships between metadata objects.
Collaboration: The Metadata Hub provides a platform for collaboration between different teams and
users.
Data Lineage: The Metadata Hub provides visibility into the data lineage of metadata objects, which
means that users can track the origin and transformation of data across different systems and
applications.
Overall, the Metadata Hub is a powerful tool for managing metadata across the enterprise, providing a
centralized repository for storing and managing metadata information, and enabling collaboration and
data lineage tracking across different teams and applications
BRE (Business Rules Environment)
BRE is a new way of creating XFRs with ease from mapping rules written in plain English.
Even the business team can understand these generic transformation rules.
Rules are written on a spreadsheet-like page and then converted to a component with the click of a button.
Advantages of BRE
Business users and GDE developers use BRE to develop and implement business rules.
Once you open a ruleset, you may view it in two ways: either by Rules or by Output.
BRE Rulesets
The object created by BRE is a ruleset, which consists of one or more closely related generic
transformation rules. Each rule computes a value for one field in the output record.
A rule consists of cases, and each case has a set of conditions that determines one of the possible values
for the output field.
The case grid contains a set of conditions (in Triggers) and determines the value of output (in Outputs)
based on the input value.
Reformat Rulesets: This ruleset takes the input one by one and applies the transformation and produces
the output.
Filter Rulesets: This ruleset reads the input and, based on the conditions specified, either keeps or
discards the record and gives the value to the output variable.
Join Rulesets: Reads the inputs from multiple sources and combines them, then applies the transformation
specified in the ruleset and moves the value to the output.
iv. Parameters whose values are determined upon running the graph
In Ab Initio, dependency analysis is a process through which the EME examines a project entirely and
traces how data is transformed from component to component, field by field, within and between graphs.
Lineage Diagrams
Graph Lineage and Dataset Lineage are techniques used to track the dependencies between different
data processing components and datasets within an Ab Initio application
Lineage diagrams represent the data flows between elements (for example, between a ruleset's elements)
and are used to visually represent the data flow of a particular process or job.
These diagrams are useful for understanding the relationships between various components in a job and
for identifying issues that may arise during execution.
By examining the data flow through a job, it is possible to identify where data may be getting stuck or
where errors are occurring.
These diagrams enable developers and operators to identify potential issues and make improvements
that can improve the efficiency and effectiveness of data processing jobs.
This tool is used for understanding and optimizing data processing workflows in Ab Initio.
In Ab Initio, lineage diagrams can be generated automatically using a breadth-first search, which traces
the data flow through the various components of a job. The resulting diagram shows each component as a
node, with arrows indicating the direction of data flow between nodes.
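A short plain-Python sketch of that breadth-first traversal, walking a toy component-dependency graph to
list everything downstream of a given component (the component names are invented for the example):

from collections import deque

# Toy dependency graph: component -> downstream components it feeds.
flows = {
    "input_file": ["reformat"],
    "reformat": ["rollup", "reject_log"],
    "rollup": ["output_file"],
    "reject_log": [],
    "output_file": [],
}

def downstream(graph, start):
    # Breadth-first search: visit components level by level, following data-flow edges.
    seen, order, q = {start}, [], deque([start])
    while q:
        node = q.popleft()
        order.append(node)
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                q.append(nxt)
    return order

print(downstream(flows, "input_file"))
# ['input_file', 'reformat', 'rollup', 'reject_log', 'output_file']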
XML Processing
Ab Initio provides a flexible and powerful environment for processing XML data, allowing users to easily
read, transform, and write XML files using a wide range of components and functions.
Reading XML Files: The first step is to read the XML files using the XML_Reader component in Ab Initio.
This component can parse the XML file and convert it into a hierarchical data structure that can be
processed further.
Transforming XML Data: Once the XML data is read, it can be transformed using various Ab Initio
components such as Reformat, Rollup, and Join. These components can be used to aggregate, filter, and
join data from different sources to create the desired output.
Writing XML Data: After the data is transformed, it can be written to an XML file using the XML_Write
component in Ab Initio. This component can generate the XML output based on a predefined schema or
format.
Error Handling: When processing XML data in Ab Initio, it is important to handle errors properly. The
XML Validator component can be used to validate the XML data against a schema and identify any
errors.
Components for common and custom-format XML processing:
READ XML: Reads a stream of characters, bytes, or records; then translates the data stream into DML
records.
READ XML TRANSFORM: Reads a record containing a mixture of XML and non-XML data; transforms the
data as needed, and translates the XML portions of the data into DML records
WRITE XML: Reads records and translates them to XML, writing out an XML document as a string.
XML SPLIT: Reads, normalizes, and filters hierarchical XML data. This component is useful when you
need to extract and process subsets of data from an XML document.
xml-to-dml: Derives the DML-record description of XML data. You access this utility primarily through
the Import from XML dialog, though you can also run it directly from a shell.
Import from XML: Graphical interface for accessing the xml-to-dml utility from within XML-processing
graph components.
Import for XML Split: Graphical interface for accessing the xml-to-dml utility from within the XML SPLIT
component.
VALIDATE XML TRANSFORM: Separates records containing valid XML from records containing invalid
XML. You must provide an XML Schema to validate against
Common processing approaches cover about 98% of cases: bringing XML data into graphs, transforming the
data, and writing it back out as XML.