Difference Between Lookup Join and Merge Stage
DataStage has three processing stages that can join tables based on the values of key columns:
Lookup, Join and Merge. This post discusses when to choose each stage, how the stages differ,
and gives a development reference for each of them.
Choose the Merge stage when multiple update and reject links are needed (e.g. combining a
master data set with one or more update data sets).
Lookup Stage
Key Points
The Lookup stage has a reference link, a single input link, a single output link and a
single rejects link.
It does not require the data on the input link or reference link to be sorted.
The Lookup stage is an in-memory processing stage. A large lookup table will cause the job
to fail if the DataStage engine server runs out of memory.
The key column names in the main and lookup tables do not need to be the same, because you
map them to each other in the stage.
Make sure to select the right Lookup Stage Conditions (see step 2 of the Development Reference).
Development Reference
In this example, we add employee information to the sales records by joining the two tables on
the key column, Empl_Id.
(1) Map the key column and map the output in the Lookup stage.
(2) Select the Lookup Stage Conditions to specify the action taken when the lookup condition is
not met or the lookup fails.
Continue: When the lookup table does not have a value that appears in the main table, the
lookup table columns are set to null. In other words, this option works like a
left join.
Drop: When the lookup table does not have a value that appears in the main table, the row
is dropped altogether. In other words, this option works like an inner join.
Fail: When the lookup table does not have a value that appears in the main table, the job
fails. This is the default option for the Lookup stage.
Reject: When the lookup table does not have a value that appears in the main table, the row
is sent to the reject output (as in this example).
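The four failure options can be sketched in plain Python. This is only an illustration of the row-level behaviour described above, not DataStage code: the function name `lookup`, the `on_failure` parameter, and every column except Empl_Id are hypothetical.

```python
def lookup(main_rows, reference_rows, key, on_failure="fail"):
    """Emulate the Lookup failure options. Returns (output_rows, rejected_rows)."""
    # The whole reference table is held in memory, keyed by the lookup key.
    ref_by_key = {row[key]: row for row in reference_rows}
    ref_columns = [c for c in reference_rows[0] if c != key] if reference_rows else []

    output, rejects = [], []
    for row in main_rows:
        match = ref_by_key.get(row[key])
        if match is not None:
            output.append({**row, **match})
        elif on_failure == "continue":
            # Keep the row and set the lookup columns to null (like a left join).
            output.append({**row, **{c: None for c in ref_columns}})
        elif on_failure == "drop":
            # Discard the row (like an inner join).
            continue
        elif on_failure == "reject":
            # Send the unmatched row to the reject output.
            rejects.append(row)
        elif on_failure == "fail":
            # Abort the whole job.
            raise LookupError(f"no reference row for {key}={row[key]!r}")
        else:
            raise ValueError(f"unknown on_failure option: {on_failure!r}")
    return output, rejects


# Hypothetical sample data; only the Empl_Id column name comes from the example.
sales = [
    {"Empl_Id": 1, "Units": 4},
    {"Empl_Id": 2, "Units": 7},
    {"Empl_Id": 9, "Units": 5},  # no matching employee
]
employees = [
    {"Empl_Id": 1, "Name": "Ann"},
    {"Empl_Id": 2, "Name": "Bob"},
]

joined, rejected = lookup(sales, employees, "Empl_Id", on_failure="reject")
```

With `on_failure="reject"`, the two matched sales rows go to the output and the row for Empl_Id 9 goes to the reject list.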
Join Stage
Key Points
Development Reference
In this example, we join the Employee and Products tables to Sales_Records based on Empl_Id and
Product_Id. We then calculate the revenue by multiplying the price column from Products by the
number of units sold.
(1) In each Join stage, make sure to choose the join key and the join type (left outer, right
outer, full outer, etc.).
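The result of this job can be approximated with a small Python sketch. It illustrates the output of two chained joins followed by the revenue calculation, not how the Join stage processes data internally. The `join` helper, the sample rows, and the column names Name, Price and Units_Sold are assumptions made for illustration.

```python
from collections import defaultdict


def join(left_rows, right_rows, key, how="inner"):
    """Join two lists of dict rows on `key`. Supports 'inner' and 'left'."""
    right_by_key = defaultdict(list)
    for right in right_rows:
        right_by_key[right[key]].append(right)
    right_columns = [c for c in right_rows[0] if c != key] if right_rows else []

    joined = []
    for left in left_rows:
        matches = right_by_key.get(left[key], [])
        if matches:
            joined.extend({**left, **match} for match in matches)
        elif how == "left":
            # Left outer join: keep the left row, null the right-hand columns.
            joined.append({**left, **{c: None for c in right_columns}})
    return joined


# Hypothetical sample data.
sales_records = [
    {"Empl_Id": 1, "Product_Id": "P1", "Units_Sold": 3},
    {"Empl_Id": 2, "Product_Id": "P2", "Units_Sold": 2},
]
employee = [{"Empl_Id": 1, "Name": "Ann"}, {"Empl_Id": 2, "Name": "Bob"}]
products = [{"Product_Id": "P1", "Price": 10}, {"Product_Id": "P2", "Price": 25}]

# First join adds employee details, second join adds product details.
with_employee = join(sales_records, employee, "Empl_Id", how="left")
with_product = join(with_employee, products, "Product_Id", how="inner")

# Revenue = price from Products multiplied by the number of units sold.
report = [{**row, "Revenue": row["Price"] * row["Units_Sold"]} for row in with_product]
```

The second join is an inner join here so that every remaining row has a price to multiply.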
Merge Stage
Key Points
The Merge stage can have any number of input links, a single output link, and the same
number of reject output links as there are update input links.
A master record and an update record are merged only if both of them have the same
values for the specified merge key. In other words, the Merge stage does not do range
lookups.
To minimise memory requirements, we can use partitioning to ensure that rows with the same
key column values are located in the same partition and are processed on the same node.
However, the ‘auto’ option for partitioning usually works fine.
As part of preprocessing, duplicate records need to be removed from the master. If there
is more than one update data set, only the first record is updated.
Development Reference
(1) The Merge stage has only three options: Unmatched Master Mode, Warn On Reject Updates and
Warn On Unmatched Master. All the tables must have the same column names for the merge
keys.
(2) Configure the input and output links, and map them to the right link order.
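The link layout described above (one master input, several update inputs, one output, and one reject link per update input) can be pictured with the following Python sketch. It is a simplified model rather than DataStage behaviour in full: it assumes each input has already been de-duplicated on the merge key, and the function name `merge`, the `unmatched_master_mode` parameter and the sample columns are hypothetical.

```python
def merge(master_rows, update_sets, key, unmatched_master_mode="keep"):
    """Merge update rows into master rows on an exact key match.

    Returns (merged_rows, rejects), where rejects[i] holds the rows of
    update_sets[i] that matched no master row.
    """
    master_keys = {row[key] for row in master_rows}

    update_indexes, rejects = [], []
    for updates in update_sets:
        # Assumes keys are unique within each update set.
        update_indexes.append({r[key]: r for r in updates if r[key] in master_keys})
        # One reject list per update input link.
        rejects.append([r for r in updates if r[key] not in master_keys])

    merged = []
    for master in master_rows:
        matches = [idx[master[key]] for idx in update_indexes if master[key] in idx]
        if not matches and unmatched_master_mode == "drop":
            continue  # drop master rows that no update set matched
        out = dict(master)
        for match in matches:
            out.update(match)
        merged.append(out)
    return merged, rejects


# Hypothetical sample data.
master = [
    {"Empl_Id": 1, "Name": "Ann"},
    {"Empl_Id": 2, "Name": "Bob"},
    {"Empl_Id": 3, "Name": "Cy"},
]
updates_a = [{"Empl_Id": 1, "Dept": "Sales"}, {"Empl_Id": 7, "Dept": "HR"}]
updates_b = [{"Empl_Id": 2, "Region": "East"}]

merged, rejects = merge(master, [updates_a, updates_b], "Empl_Id")
```

In this run the update row for Empl_Id 7 lands in the first reject list because no master row has that key, and the master row for Empl_Id 3 is kept unchanged because the mode is "keep".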
Reference Datasets
Sales_Records
Employee
Joined
Difference Between Normal Lookup and Sparse Lookup
Normal Lookup:-
A normal lookup might perform poorly if the reference data is huge, because it has to
load all of the data into memory.
Sparse Lookup:-
If the input stream is small relative to the reference data (a ratio of about 1:100 or
more), a sparse lookup is the better choice.
A sparse lookup sends an individual SQL statement for every incoming row. This avoids
loading a huge reference table into memory, but costs one database query per input row.
The Lookup Type option can be found in the Oracle and DB2 stages. The default is Normal.
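The trade-off between the two lookup types can be shown with a short Python sketch. It uses an in-memory SQLite database as a stand-in for the Oracle or DB2 reference table, which is an assumption made so the example is self-contained. The function names, the Employee table layout and the sample rows are hypothetical.

```python
import sqlite3


def make_reference_db():
    """Create a small in-memory reference table (stand-in for Oracle/DB2)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Employee (Empl_Id INTEGER PRIMARY KEY, Name TEXT)")
    conn.executemany(
        "INSERT INTO Employee (Empl_Id, Name) VALUES (?, ?)",
        [(1, "Ann"), (2, "Bob"), (3, "Cy")],
    )
    return conn


def normal_lookup(conn, input_keys):
    # One query loads the entire reference table into memory up front.
    reference = dict(conn.execute("SELECT Empl_Id, Name FROM Employee"))
    return [(key, reference.get(key)) for key in input_keys]


def sparse_lookup(conn, input_keys):
    # One SQL statement is sent to the database for every incoming row.
    results = []
    for key in input_keys:
        row = conn.execute(
            "SELECT Name FROM Employee WHERE Empl_Id = ?", (key,)
        ).fetchone()
        results.append((key, row[0] if row else None))
    return results


conn = make_reference_db()
input_keys = [2, 9]  # a small input stream against a larger reference table
normal_result = normal_lookup(conn, input_keys)
sparse_result = sparse_lookup(conn, input_keys)
```

Both functions return the same pairs. The difference is where the cost falls: the normal version pays in memory for the whole reference table, while the sparse version pays one query per input row.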