DATAWAREHOUSE AND
AUTOMATION
All about knowledge of datawarehouse and automation of routine tasks
ETL : JOB CONTROL TABLE AND ITS IMPLEMENTATION FOR INCREMENTAL LOAD
A Job Control table is used with ETL tools such as Informatica, DataStage and SSIS to fetch the data that was
inserted or updated since the last run of the ETL jobs. The diagram below was drawn with Informatica as the
ETL tool. The same design can be implemented in other ETL tools with some modifications.
Tables used are as below:
1. ETL Batch table.
2. ETL Control table.
To view the image clearly, save it to local disk and zoom in.
Initial values in the ETL Control table: The High and Low Watermark dates are initially set to
1/1/1900 12:00 AM (shown as 0:00 in the table below), and a row with process name = <name of the job> is
inserted into the ETL Control table (also referred to here as the Job Control table) for every dataflow job.
This can be done in the deployment script as a one-time activity.
ETL_Control_Table

| Batch_ID | Job_Name | Process Start Date | Process End Date | LWM Date | HWM Date | Process Status Code | Process Status Description | Failure Reason |
|---|---|---|---|---|---|---|---|---|
| -1 | wf_Appointment | NULL | NULL | 1/1/1900 0:00 | 1/1/1900 0:00 | | | |
| -1 | wf_Patient | NULL | NULL | 1/1/1900 0:00 | 1/1/1900 0:00 | | | |
Explanation of the flow:
1. The Batch Identifier is a sequentially generated number that is unique for each run of the jobs. A batch ID
is generated when the jobs start, and the batch start date is inserted into the batch table. The batch end date
is updated at the end of each workflow.
The batch table is used to monitor the performance of the jobs over a period of time.
2. The dataflow jobs, which run after the Batch Identifier job, look up the most recent successful run of the
respective dataflow in the Job Control table. The High Watermark date of that run is used as the Low
Watermark date of the current run.
The High Watermark date of the current run is determined by the max date in the source system.
Low Watermark date = High Watermark date of the most recent successful run.
High Watermark date = Max date of source records.
The incremental data is retrieved using these two watermark dates.
3. Once the dataflow completes, its execution status is recorded in the Job Control table together with the
Low Watermark and High Watermark dates. This record is used to get the Low Watermark of the next run.
In case of failure, the error message is also recorded in the control table.
4. The batch end date is updated with the dataflow's completion date.
Note: If a particular workflow is restarted without restarting the entire set of jobs, the same batch ID is
used, and on completion the batch end date is updated in the batch table.
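The watermark logic in steps 2 and 3 can be sketched in a few lines of Python against an in-memory SQLite database. This is only an illustration under stated assumptions, not the Informatica implementation: the source table `src_appointment`, its `last_updated` column, the helper `run_dataflow`, and the choice to fetch rows with `LWM < last_updated <= HWM` are all hypothetical, and dates are stored as ISO-style text so that string comparison matches date order.

```python
import sqlite3
from datetime import datetime


def now():
    # ISO-style text timestamps sort correctly as plain strings in SQLite.
    return datetime.now().isoformat(sep=" ", timespec="seconds")


def run_dataflow(conn, batch_id, job_name, source_table):
    """Load one increment and record its watermarks (hypothetical helper)."""
    started = now()
    # LWM = HWM of the latest successful run. The seed row (Batch_ID = -1)
    # carries no status code, so it is accepted explicitly.
    lwm = conn.execute(
        "SELECT MAX(HWMDate) FROM ETL_Control_Table "
        "WHERE Process_Name = ? AND (Process_Status_Code = 'Y' OR Batch_ID = -1)",
        (job_name,),
    ).fetchone()[0]
    # HWM = max date of the source records (falls back to LWM if the source is empty).
    hwm = conn.execute(
        f"SELECT MAX(last_updated) FROM {source_table}"
    ).fetchone()[0] or lwm
    try:
        # Assumed boundary rule: strictly after LWM, up to and including HWM.
        rows = conn.execute(
            f"SELECT id, last_updated FROM {source_table} "
            "WHERE last_updated > ? AND last_updated <= ?",
            (lwm, hwm),
        ).fetchall()
        status = ("Y", "Success", None)
    except sqlite3.Error as exc:
        rows, status = [], ("E", "Error", str(exc))
    # Record the run so the next run can pick up its HWM as the new LWM.
    conn.execute(
        "INSERT INTO ETL_Control_Table VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
        (batch_id, started, job_name, now(), lwm, hwm, *status),
    )
    return rows


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ETL_Control_Table (
    Batch_ID INTEGER, Process_Start_DateTime TEXT, Process_Name TEXT,
    Process_End_DateTime TEXT, LWMDate TEXT, HWMDate TEXT,
    Process_Status_Code TEXT, Process_Status_Description TEXT, Failure_Reason TEXT
);
CREATE TABLE src_appointment (id INTEGER, last_updated TEXT);
INSERT INTO ETL_Control_Table (Batch_ID, Process_Name, LWMDate, HWMDate)
VALUES (-1, 'wf_Appointment', '1900-01-01 00:00:00', '1900-01-01 00:00:00');
INSERT INTO src_appointment VALUES (1, '2014-08-03 10:00:00');
INSERT INTO src_appointment VALUES (2, '2014-08-04 00:00:00');
""")

first = run_dataflow(conn, 1, "wf_Appointment", "src_appointment")
conn.execute("INSERT INTO src_appointment VALUES (3, '2014-08-05 00:00:00')")
second = run_dataflow(conn, 2, "wf_Appointment", "src_appointment")
print(len(first), "rows in the first run,", len(second), "in the second")
```

The first run starts from the seeded 1/1/1900 watermark and therefore picks up every source row. The second run starts from the first run's High Watermark and picks up only the row added afterwards.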
Table structure of ETL Batch and Control table:
ETL Batch table:
| Column Name | DataType | Description |
|---|---|---|
| Batch_ID | number | Batch Identifier, this will be generated sequentially. |
| Batch_Start_DateTime | dateTime | Start DateTime of the batch. |
| Batch_End_DateTime | dateTime | End DateTime of the batch. |
ETL Control Table:
| Column Name | DataType | Description |
|---|---|---|
| Batch_ID | number | Batch Identifier, this will be generated sequentially before the jobs are executed. |
| Process_Start_DateTime | dateTime | Start DateTime of the process. |
| Process_Name | varchar(100) | Name of the process. |
| Process_End_DateTime | dateTime | End DateTime of the process. |
| LWMDate | date | Low Watermark Date, i.e. the date from which records should be fetched from the source. |
| HWMDate | date | High Watermark Date, i.e. the date up to which records should be fetched from the source. |
| Process_Status_Code | char(1) | Status code associated with the process. Refer to the table below for values related to this column. |
| Process_Status_Description | varchar(20) | Status description associated with the process status code. Refer to the table below for values related to this column. |
| Failure_Reason | varchar(255) | The description of the error if the process failed. |
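As a rough illustration, the two structures above could be created as follows. The sketch uses Python with SQLite purely for convenience. The table names `ETL_Batch` and `ETL_Control_Table`, the `AUTOINCREMENT` key used to generate batch IDs, and the helpers `start_batch`, `end_batch` and `seed_control` are assumptions made for this example. A real warehouse would use its own database's sequence and date types.

```python
import sqlite3
from datetime import datetime

DDL = """
CREATE TABLE ETL_Batch (
    Batch_ID             INTEGER PRIMARY KEY AUTOINCREMENT,  -- generated sequentially
    Batch_Start_DateTime DATETIME,
    Batch_End_DateTime   DATETIME
);
CREATE TABLE ETL_Control_Table (
    Batch_ID                   INTEGER,
    Process_Start_DateTime     DATETIME,
    Process_Name               VARCHAR(100),
    Process_End_DateTime       DATETIME,
    LWMDate                    DATE,
    HWMDate                    DATE,
    Process_Status_Code        CHAR(1),
    Process_Status_Description VARCHAR(20),
    Failure_Reason             VARCHAR(255)
);
"""


def now():
    return datetime.now().isoformat(sep=" ", timespec="seconds")


def start_batch(conn):
    """Insert a new batch row with its start time and return the generated Batch_ID."""
    cur = conn.execute(
        "INSERT INTO ETL_Batch (Batch_Start_DateTime) VALUES (?)", (now(),)
    )
    return cur.lastrowid


def end_batch(conn, batch_id):
    """Stamp the batch end time once a workflow has finished."""
    conn.execute(
        "UPDATE ETL_Batch SET Batch_End_DateTime = ? WHERE Batch_ID = ?",
        (now(), batch_id),
    )


def seed_control(conn, job_names, initial="1900-01-01 00:00:00"):
    """One-time deployment step: one seed row per dataflow job with Batch_ID = -1."""
    conn.executemany(
        "INSERT INTO ETL_Control_Table (Batch_ID, Process_Name, LWMDate, HWMDate) "
        "VALUES (-1, ?, ?, ?)",
        [(name, initial, initial) for name in job_names],
    )


conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
seed_control(conn, ["wf_Appointment", "wf_Patient"])

first_batch = start_batch(conn)   # batch IDs start at 1 on an empty table
end_batch(conn, first_batch)
second_batch = start_batch(conn)  # left open: its end time is still NULL
print(first_batch, second_batch)
```

Keeping the seed rows at `Batch_ID = -1` keeps them apart from real runs, whose IDs are always positive.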
Sample Data of the Control tables:
ETL Batch

| Batch_ID | Batch_St_Dt | Batch_End_Dt |
|---|---|---|
| 1 | 8/5/2014 16:21 | 8/5/2014 17:30 |
| 2 | 8/6/2014 16:21 | 8/6/2014 16:37 |
ETL_Control_Table

| Batch_ID | Job_Name | Process Start Date | Process End Date | LWM Date | HWM Date | Process Status Code | Process Status Description | Failure Reason |
|---|---|---|---|---|---|---|---|---|
| -1 | wf_Appointment | NULL | NULL | 1/1/1900 0:00 | 1/1/1900 0:00 | | | |
| -1 | wf_Patient | NULL | NULL | 1/1/1900 0:00 | 1/1/1900 0:00 | | | |
| 1 | wf_Appointment | 8/5/2014 16:21 | 8/5/2014 16:30 | 1/1/1900 0:00 | 8/4/2014 0:00 | Y | Success | |
| 1 | wf_Patient | 8/5/2014 16:30 | 8/5/2014 17:30 | 1/1/1900 0:00 | 8/4/2014 0:00 | Y | Success | |
| 2 | wf_Appointment | 8/6/2014 16:21 | 8/6/2014 16:35 | 8/4/2014 0:00 | 8/5/2014 0:00 | Y | Success | |
| 2 | wf_Patient | 8/6/2014 16:35 | 8/6/2014 16:37 | 8/4/2014 0:00 | 8/5/2014 0:00 | E | Error | Data too large for column |
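To see what the sample control rows above imply for the next run, the small Python sketch below loads just the relevant columns into SQLite and looks up the next Low Watermark per job. It assumes, as the flow describes, that only successful runs (status `Y`) and the seed rows count. The reduced column set and the ISO-style date strings are simplifications made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ETL_Control_Table ("
    "Batch_ID INTEGER, Process_Name TEXT, HWMDate TEXT, Process_Status_Code TEXT)"
)
# Sample rows from the table above, reduced to the columns the lookup needs.
conn.executemany(
    "INSERT INTO ETL_Control_Table VALUES (?, ?, ?, ?)",
    [
        (-1, "wf_Appointment", "1900-01-01 00:00", None),
        (-1, "wf_Patient", "1900-01-01 00:00", None),
        (1, "wf_Appointment", "2014-08-04 00:00", "Y"),
        (1, "wf_Patient", "2014-08-04 00:00", "Y"),
        (2, "wf_Appointment", "2014-08-05 00:00", "Y"),
        (2, "wf_Patient", "2014-08-05 00:00", "E"),
    ],
)

# Next LWM per job = highest HWM among successful runs (or the seed row).
next_lwm = dict(
    conn.execute(
        "SELECT Process_Name, MAX(HWMDate) FROM ETL_Control_Table "
        "WHERE Process_Status_Code = 'Y' OR Batch_ID = -1 "
        "GROUP BY Process_Name"
    )
)
for job, lwm in sorted(next_lwm.items()):
    print(job, "->", lwm)
```

Because wf_Patient failed in batch 2, its next run starts again from 8/4/2014, the High Watermark of batch 1, so the window that failed is reprocessed. wf_Appointment moves on to 8/5/2014.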