Talend Scenario Based Interview Questions

Download as txt, pdf, or txt
Download as txt, pdf, or txt
You are on page 1of 21

TALEND Interview questions and Answers

Talend Interview Questions - Deep In Open Source


Real time Talend interview questions ,with detailed explanations ,to help you
tackle your interview with ease

1. Talend – Merge multiple files into single file with sorting operation.
2. Loading Fact Table Using Talend
3. ROWNUM Analytical Function in Talend
4. SCD-2 Implementations in Talend
5. Deployment strategies in Talend
6. Custom Header Footer in Talend
7. Data Masking Using Talend
8. How to use Shared DB Connection in Talend
9. Load all rows from source to target except last 5
10. Late Arriving Dimension Using Talend
11. Date Dimension Using Talend
12. Dynamic Column Ordering Of Source File Using Talend
13. Incremental Load Using Talend
14. Getting Files From FTP Server
15. Initializing Context At Run Time Using Popup
16. User Define Function In Talend
17. Calling DB Sequence From Talend

1. Difference between tAggregatedRow and tAggregateSortedRow in Talend


2. How to resume job execution from same location if job get failed in Talend
3. How to execute more than one sub jobs parallel in Talend
4. How to iterate filename and directories in Talend
5. What is the difference between OnSubjobOK and OnComponentOK in Talend
6. How can you pass a value form parent job to child job in Talend
7. How to call stored procedure and function in Talend Job
8. How to export job and execute outside from Talend Studio
9. How to pass value from outside in Talend
10. Can I define schema of database or tables at run time
11. What is tReplicate in Talend
12. What is tUnite in Talend Open Studio
13. How to optimize talend job to stop outOfMemory runtime error
14. How to optimize Talend Performance
15. How to execute multipule SQL statements with one component in Talend
16. What is tSystem component in Talend
17. Can I execute multiple commands at one time with a tSystem component
18. What is difference between tMap and tFilterrow in Talend

1. What is the difference between the ETL and ELT components of Talend Open Studio?
2. How does one deploy Talend projects?
3. What are the elements of a Talend project?
4. What is the most current version of Talend Open Studio?
5. How do you implement versioning for Talend jobs?
6. What is the tMap component?
7. What is the difference between the tMap and tJoin components?
8. Which *component* is used to sort data?

1. What is Talend ?
2. What is difference between ETL and ELT components of Talend ?
3. How to deploy talend projects ?
4. What are types of available version of Talend ?
5. How to implement versioning for talend jobs ?
6. What is tMap component ?
7. What is difference between tMap and tJoin compoents ?
8. Which component is used to sort that data ?
9. How to perform aggregate operations/functions on data in talend ?
10. What types of joins are supported by tMap component ?
11. How to schedule a talend job ?
12. How to runs talend job as web service ?
13. How to Integrate SVN with Talend ?
14. How to run talend jobs on Remote server ?
15. How to pass data from parent job to child jobs through trunjob component ?
16. How to load context variables dynamically from file/database ?
17. How to run talend jobs in Parallel ?
18. What is Context variables ?
19. How to export a talend job ?

5.( Talend best practices)


1. Talend workspace path should not contain any spaces.
2. Never forget to perform Null Handling.
3. Create Repository Metadata for DB connections and retrieve database table schema
for DB tables.
4. Use Repository Schema for Files/DB and DB connections.
5. Create Database connection using t<Vendor>Connection component and use this
connection in the Job. Do not make new connection with every component.
6. Always close the connection to database using t<Vendor>Close component.
7. Create a Repository Document corresponding to every Talend job including
revision history.
8. Provide Sub Job title for every sub job to describe the sub job
purpose/objective.
9. Avoid Hard Coding in Talend Job component. Instead use Talend context variables.
10. Create Context Groups in Repository
11. Use Talend.properties file to provide the values to context variables using
tContextLoad.
12. Create Variables in tMap and use the variables to assign the values to target
fields.
13. Create user routines/functions for common transformation and validation.
14. Develop Talend job iteratively.
15. Always Exit Talend open studio before shutting down the PC.
16. Always rename Main Flows in Talend Job to meaningful names.
17. Always design Talend jobs by keeping performance in mind.
Talend Job Design - Performance Optimization Tips
1. Remove Unnecessary fields/columns ASAP using tFilterColumns component.
2. Remove Unnecessary data/records ASAP using tFilterRows component
3. Use Select Query to retrieve data from database
4. Use Database Bulk components
5. Store on Disk Option
6. Allocating more memory to the Jobs
7. Parallelism
8. Use Talend ELT Components when required
9. Use SAX parser over Dom4J whenever required
10. Index Database Table columns
11. Split Talend Job to smaller Subjobs

1. Talend – Merge multiple files into single file with sorting operation.

Scenario: Merging multiple input sources into a single target along with the file
names with a additional column and sorting operations on all files with in the flow
The snapshot below shows the overall mapping.

As a source we have taken tFileList which will pull all our source files from the
specified directory, then tFileInput Component will read all files one by one. In
source file path you have to specify the global variable which holds the address of
the current file in process using–

((String)globalMap.get(“tFileList_1_CURRENT_FILEPATH”))

Later tMap is used to assign one extra column which holds the name of the file and
again name of the file is to be retrieve using global variable in our case it is

((String)globalMap.get(“tFileList_1_CURRENT_FILE”))

After completion of first job we fire a onSubJobOk trigger to do our rest work.
now,
tBufferInput will hold all the source file data at once and later tBufferInput will
pull all the buffered data produced by tBufferOutput.
Finally, tSort Component will sort all the three source file data using some column
as a key and tFileout- component will produce the final output.
that’s all, now you can create a job and simply run it—>>>

2.Loading Fact Table Using Talend


In this post we will discuss how to load a fact table in a data warehouse using
your dimension table and the data staged in a staging table. We will just show you
what is the procedure of loading and further complexities depends upon your
business requirements.
In our example we took THREE dimension tables.
1. DIM_BOOK
2. DIM_CUSTOMER
3. DIM_TIME
All of the dimension tables are SCD-type implemented other than TIME DIMENSION.
STEPS:—->>>
1) create an connection first as i am using oracle as a database that’s why i used
tOracleConnection.
2) After Successful connection run the further part.
From Dimension tables you have to fetch the data using Query Editor and place a
condition where
END_DATE is null, if you have implemented SCD type-2 This will bring all those
recordswho have currently validated state.
Now Fetch the OLTP data stored currently in Staging Table, and put all connections
in tMap.
REMEMBER—-your staging table that has to be loaded must be first linked then only
it will work fine and all other dimensions connected will work as a LOOKUP.

Now in tMap join all the dimensions with your source data using the keys and fetch
down the SURROGATE_KEY and put all those skeys in the fact table. In my join
condition i have used inner join as a join method.
In tmap component i have use some calculations to find out what are the percentage
in discount and total value for the order . It may be anything depend upon your
requirements.

note:- Just keep in mind data type conversion you have to keep in mind other wise
it’ll give you trouble a lot in my case i just converted my data types in staging
itself.
3.ROWNUM Analytical Function in Talend
Analytical funtions are quite useful in SQL and help us avoid extra coding.We
encountered a similar scenario where in we had to implement a function similer to
the ROWNUM() OVER (partition by clause) analytical functionality in Talend.
We ll give an overview of the Talend Job we created in order to attain the
functionality:
Prerequisites:
Create a context variable and Initialize it to 0.
Scenario: Implement the ROW_NUMBER() over (partition by) analytical functionality
in Talend
The image below gives the overview of the job, the components to focus on would be
the tMemorizeRows and the tJavaRow component.

The input file is as shown in the snapshot below:

We need output to look something like this:

The tMemorizeRows is configured as shown in the snapshot below:

This configuration specifies that we MEMORIZE the ID column , and the row count
specifies that we remember upto 2 rows flowing into the tMemorize component.

The most important part of the job ,is configuring the tJavaRow component ,wherein
we specify the logic for implementing this analytical function.
The java code compares the previous and current row using the index for
tMemorizeRows as shown in the snapshot..
4. SCD-2 Implementations in Talend
CREATING STAGING TABLE

1)From tOracleInput_3 bring the max surrogate key present in your dimension table
using query
“SELECT max(user_profile_sur_key) FROM ANKIT_KANSAL.user_profile”
USER PROFILE IS OUR DIMENSION TABKE
2)Create a context variable snumber of integer type and assign a default value 0
and in the next step assign the max surrogate key to the variable using tJavaFlex
component.

4) tOracleInput4 takes all your data from the source oracle system KEEP ALL THE
SOURCE AS A VARCHAR2 FORMAT ONLY.
5)using tMap component calculate your date and also you can find out your GOOD and
BAD records depending upon the business requiremetns.
AS FOLLOWS:-

In the value part perform these operations

checking tweet_id is numeric or not


6) Finally load the data using running the job.
NOW SCD TYPE 2 IMPLEMENTATION

In the above figure


tOracleInput_1 contains the staging data.
tOracleInput_2 contains the data from the dimension table thus lookup from the
dimension table must be performed to check weather the record exists or not.

creating the relationship using equi join

*If both data structure does not matches because staging may consist the data in
string format for that column then you can achieve this join by converting the data
type with in tmap only Integer.parseInt(row1.USER_ID) at the join level only.
CATCHING FRESH INSERTS:-
At the right hand side you can achieve this functionality by enabling catch lookup
inner join reject to true.

IF A ROW ALREADY EXISTS THEN INSERTING A NEW RECORD WITH A FRESH VALUE AND UPDATING
THE OLD RECORD WITH END DATE
TO UPDATE THE EXISTING RECORD WITH ASSIGNING THE END DATE WITH THE CURRENT DATE

5.Deployment strategies in Talend


Talend provides the flexibility to provide multiple deployment strategies namely:
1) Deployment using script/batch files. 2) Deployment using a Web Service.
Deployment using script/batch files
Deployment Stratergies Talend involves generation of .bat or .sh files that can be
executed on their respective OS. When we export a job from Talend, a set of files
will be generated which includes class and jar files specific to that job, a .bat
and .sh files are also generated which can be executed for a particular job. The
job executed through these files run in silent-mode. One of the main advantages of
this strategy is that the client or user is not exposed to the complexity of the
job all he needs is java installed on his system and simply run the batch file.
Procedure to export a job: 1) Right –Click on the job that needs to be exported
from the Repository Panel.
2) Click on the Export Job option.
3) A dialog box appears, specify the path to which you want to export the job.
4) Choose Export type as Autonomous job and click on okay.
5) The job will be exported and be saved as a ZIP file on the specified path.
Deployment using a Web Service(Exeuting jobs REMOTELY!!)
This method involves remote execution of the job via a web service. This method is
independent of the OS used and can be executed via a web browser. As we know Talend
is based on java, the web deployment feature is used extensively.
A WAR file is generated for the specified job, which is used to deploy the job on
the application server.
Procedure to export a job: 1) Right –Click on the job that needs to be exported
from the Repository Panel.
2) Click on the Export Job option.
3) A dialog box appears , specify the path to which you want to export the job.
4) Choose Export type as WAR file and click on okay.
5) The job will be exported and be saved as a WAR file on the specified path.
6) Run Application server ,like Apache Tomcat and add the WAR file to the Tomcat
directory as shown in the screenshot below.

Deployment Stratergies Talend


OR
you can directly go the webapps folder of your Tomcat Server and simply paste the
WAR files over there.
7)Once deployed we can run the job using the following link on the web browser.
http://localhost:8989/File_Input_0.1/services/File_Input?method=runJob
or
localhost:8989/File_Input_0.1/services/File_Input?runJob()
where File_Input refers to the job name. Localhost refers to the server method
refers to the execution method.8989 is the application server port no.
8)On successful execution the following screen is seen.

Deployment Stratergies Talend

6.Custom Header Footer in Talend


Scenario:- Adding custom header and footer using Talend where in Header includes
spaces which are generally not allowed and as well as footer contains no of records
to be written on to the file.
below snapshot shows the overall progress

tFileInputDelimited1_1 is used to take input source file, just select limit=1 while
fetching records and in, tJavaFlex_1 write your own header which you want to
populate on the destination file in the main body code section.
let’s say in my case i have used
row3.first=”First Name”;
row3.last=”Last Name”;
Please be sure that in your tFileOutPutExcel_1 you have to uncheck include header
option. This completes your first job.
now,
In your second job use your input source file and process the data as per your
requirements and further after processing put all your data to the same output file
as defined earlier kindly uncheck include header option and check append existing
file and append existing sheet option.
last,
In your third job while fetching data use limit = 1 for records and only select one
column in your schema definition and further in tJavaFlex component use the
following code to print the number of rows at the header.
row5.first=((Integer)globalMap.get(“tFileInputDelimited_3_NB_LINE”)).toString()+”
“+”rows”;

for last two output_files

*file names in all the three jobs must be same with different select options.
7.Data Masking Using Talend
Data Masking–>>Data Masking is a process of encrypting a data field to protect data
that is classified as personal identifiable data, personal sensitive data or
commercially sensitive data, however the data must remain usable for the purposes
of undertaking valid test cycles. It must also look real and appear consistent.
production systems generally consists of intensively sensitive data and you can not
take chances to make it available for others easily without some changes and for
better performance of applications it’s also required to have similar kind of data
for testing purposes.
If we are talking to current social networking boom if any of user related
information become available without any security measures than it may lead to
various disastrous results. So, to overcome these problems some encrypting
strategies are required which will fulfill the needs.
Now,
Implementing Data Masking Using Talend
Overall Job Descriptive Image:-

JOB DESCRIPTION–>>
A source– file is taken which contains all valid data but which you do not want to
give directly after known risks involved.
Replacement– File contains the data which is to be used to replace against the
source matching data.
Take a tReplaceList component from the palette, and it properties define as:-

Lookup search column is the column which contains the same data present in your
source file column and corresponding Lookup replacement column contains the data
which is used to map/replace with the matches found against the source column.

In column option part you have to check the box on which you want to search
performed and simultaneously replacement should be done.

e.g.
Your source column contains:-
Text(/*Col_name*/)
“Hello this is a text which is to be replaced”

replacement file contains


Text| Replacement
Hello| HO
text|msg
is|tis
replaced|wow
which|haha

output generated
HO this tis a msg haha tis to be wow

and finally using tSendMail component the transformed file is sent to the
Destination with all the security measures applied.
8.How to use Shared DB Connection in Talend
Today we will discuss how to use or register a shared DB connection, across your
jobs. In real time scenarios it’s not recommended to create connections again and
again so instead of creating new connections you can use your old registered
connections through out your sub jobs.
For Demonstration purpose we have created three jobs in which we are using a single
connection through out our all jobs.
JOBS DESC:-
1. shared_conn_demo–>> This job holds our two child jobs (parent,child)
2. parent_db_conns–>>The primary job is the one in which connection(shared) is
registered for our sub jobs.
3. child_job_conn–>>This is the third job which will be using the connection
created in parent job.
STEPS
1) parent_db_conns:- Create a connection as you create normally and then at the
last component property check use or register a shared DB connection and put a name
you want to give for shared connection.
NOTE- The name must be in “”

We cannot share our own db_conn values


just fill all the enteries required to create a connection here.
2) child_job_conn:- Again check the shared DB component tab and write”shared_conn”
or name which you had given in the property box to utilize the connection.
and leave all the field blanks.

2) shared_conn_demo:-Now you just have to connect both the jobs and then run them.
Give the trigger as OnSubjobOk

That’s it now you can use or share your connections through out your all jobs and
save connections.
9.Load all rows from source to target except last 5
SCENARIO
Today we will discuss a situation in which we have to dump all source records to
target except the last 5 records.
SOLUTION
We need a couple of components to achieve this requirement as given below in
snapshot.

SEQUENCE CREATION

By using above class and method you can generate a sequence in Talend.
LOGIC
Pull all the rows from source and further in tmap create a sequence using
Numeric.sequence method further sort all the rows using tSortRow in descending
order key based upon that sequence generated in tMap, now create another sequence
using tmap. Resulting your last rows from source that are currently at top because
of sort method will assigned a sequence from 1 and so on.
Finally, use a tFilter and restrict all rows that you want to limit in my case its
5
Just use filter and give condition as

The number you select in tFilter the process will leave those rows to reach to the
target.
10.Late Arriving Dimension Using Talend
In one of our previous posts we had already covered the concept and implementation
of late arriving dimension using informatica.
So, directly we will show you how to achieve this functionality using Talend DI
tool.
COMPONENTS REQUIRED
• tOracleInput
• tOracleOutput
• tLogRow
• tJavaFlex
• tMap
Given below is the complete job description:-

JOB DESCRIPTION:-
1. Create a first job that will bring the max surrogate key (using sequel query)
from the production dimension table and save in one of context variable using
tJavaFlex Component. In main code section of tJavaFlex write
context.prod_skey=row4.prod_skey;
2. Create a second job that runs after successful completion of first job and here
bring the tOracleInput_1 component that will bring all the source data that resides
into your staging area.
3. Pull two more oracle input sources that will act as a lookup one for prod_dim
table and other one is for cust_dim table.
4. tMap Job Desc as described in the below figure

Use Left Outer Joiner Model at the left hand side in both the lookup i.e. for
customer and product.
THIS JOB IS CREATED TO IMPLEMENT LATE ARRIVING DIMENSION CONCEPT FOR PROD_DIM ONLY.

5. The Top right corner output in tMap is used to populate your normal fact table
but make sure to check whether your prod_dim skey must not be null.
6. The second insert is used again to update your fact table but with the record
that contains no surrogate key as late arriving dimension concept here you just
increment your context variable and populate using ++context.prod_skey.
7. And Finally Insert a record in the product dimension for that just add one more
output in tMap with the name like insert_prod with the same check constraint and
use the incremented context variable.
8.In my job i have used tLogRow for logging purposes in real scenario you have to
use tOracleOutPut component to be effective for DB.
9. For prod_dim insert output in tOracleOutput remember to use update or insert
because if any changes will come then it will directly effect to your database as
SCD1.
11.Date Dimension Using Talend
So, in this post we will demonstrate the same implementation using Talend Open
Studio.
COMPONENTS TAKEN:-
1)Oracle Db(tOracleConnection)
2)tMap(for data manipulation)
3)tConvertType(to change the data type)
4)tLogRow(Displaying Result)
COMPLETE JOB DESCRIPTION:-

job_info
Now,
In tOracleInput component define two columns as a schema both must be of data type
and in sql editor write the query given below.
“select sysdate+level,trunc(sysdate+level,’q’) from dual connect by level < 366″

first column will return you data serially from the current date and second column
will give you first date for every quarter.

Most of the implementation is completed using existing components.


To compute monthOfQuarter–
var monthOfQuarter=Mathematical.MOD(TalendDate.getPartOfDate(“MONTH”,row1.ora_date)
+1,3)
then
(Var.monthOfQuarter==0.0)?3.0:Var.monthOfQuarter

To pull out Weekday in character format use–


TalendDate.formatDate(“EEEE”,row1.ora_date)
To calculate days from first date of quarter–
TalendDate.diffDate(row1.ora_date,Var.quaterFirstDate,”DD”)+1

So, this is how you can implement a basic date-time dimension. Beyond this you can
alter this current implementation as per your requirement.
12.Dynamic Column Ordering Of Source File Using Talend
In of our last post we demonstrate how to handle dynamic column ordering from
source file using Informatica and finally generating a defined output. So, In this
post we will show you how to implement the same functionality using Talend Open
Studio,
OVERALL JOB DESC:-

job desc
First, of all create some context variables which will help you to build up the
complete job as given in the screen shot.

first variable count is used for some process which i will tell you later during
the post, and produce/create a number of context variables as the number of columns
in your source file, i have three columns for demo so i have created three
variables as var1,var2,var3.
I have a source file which is a comma separated file, which is introduced to the
environment using tFileInputDelimited component.
Remenber to bring the data from first row only do not skip the first row.
Now in tJavaRow Component build a schema as given in the snapshot attached.

Finally use the code given below in the tJavaRow component to successfully build
the job.
*****************************CODE BEGINS*************************************
//used to perform checks for the first time on the basis of first row.
context.count=context.count+1;
if(context.count==1)
{
System.out.println(“in first if”);
//checks the position of column using first row coming from source file and assign
variable value if found similar logic will be applied for all the columns involved.
if(input_row.field1.equals(“ename”))
{
context.var1=1;
}
if(input_row.field2.equals(“ename”))
{
context.var2=1;
System.out.println(“ename var2″);
}
if(input_row.field3.equals(“ename”))
{
context.var3=1;
}
if(input_row.field1.equals(“deptno”))
{
context.var1=2;
}
if(input_row.field2.equals(“deptno”))
{
context.var2=2;
}
if(input_row.field3.equals(“deptno”))
{
context.var3=2;
}
if(input_row.field1.equals(“salary”))
{
context.var1=3;
}
if(input_row.field2.equals(“salary”))
{
context.var2=3;
}
if(input_row.field3.equals(“salary”))
{
context.var3=3;
}
}
//at this stage your mapping variables already know which column contains which
data so use all the mapping variables and forward the data to the defined schema
columns.
if(context.var1.equals(1)){ output_row.ename = input_row.field1;}
if(context.var2.equals(1)){ output_row.ename = input_row.field1;}
if(context.var3.equals(1)){ output_row.ename = input_row.field1;}
if(context.var1.equals(2)){ output_row.deptno = input_row.field2;}
if(context.var2.equals(2)){ output_row.deptno = input_row.field2;}
if(context.var3.equals(2)){ output_row.deptno = input_row.field2;}
if(context.var1.equals(3)){ output_row.salary = input_row.field3;}
if(context.var2.equals(3)){ output_row.salary = input_row.field3;}
if(context.var3.equals(3)){ output_row.salary = input_row.field3;}
*Take all your source data as a string data type and later on using tConvertTyoe
change the schema data type def.
******************************CODE ENDS***************************************
So, this is one of the way you can handle dynamic column ordering of your source
file
13.Incremental Load Using Talend
Incremental Load is way of loading your target tables in data warehouse such that
new or updated data from your source system will only affect your target system.
It’s pretty similar with your novel reading, let’s say first day you read some
pages then on other day you will start reading from the point you left and so on.
So, today we will discuss one of the implementation technique for incremental
loading using Talend Open Studio.
Overall Job Desc:-

overall_structure
Execution Steps:
1. Create two context variables one will hold the directory structure of your
parameter file(file will store your last run date info) specify some default value
which you can further change using property file while deployment and the other
variable is used within the job to hold the last run status. Load the last run
context variable using tContextLoad.
2. Create a sql query which will hit your source database and pull down data as per
your needs. In my case query is with condition tdate < sysdate and
tdate>'”+context. Last_Load_Date+”‘”.
As in most of the cases reporting is done till the previous day data, tdate is a
column which holds the record’s updated/insert date status so we are pulling data
as per the given condition.
Total Data Fetched = Data Greater than Last_Time_Run + Data Less then current_date.
3.Simple update your parameter file that is holding your previous run date with
current_date-1.
“select ‘Last_Load_Date;’|| to_char(sysdate-1,’dd-mon-yyyy’) as prev_day from dual”

14.Getting Files From FTP Server


Today, we came up with a scenario which appears in front of most of the developers.
Getting Files From FTP Server Using Talend. The provided solution is achieved using
Talend Open Studio as a tool.
In the given scenario we assume that we don’t know the exact names of the files to
be downloaded from the remote location and we just have a knowledge of the
extensions type and the path of the incoming files .
SOLUTION:-
Take Three Components as
1. tFTPConnection
2. tFTPFileList
3. tFTPGet
Note:- If you know the name of the file then you just need tFTPConnection and
tFTPGet to fulfill the requirement just you need to enter the name of the file in
tFTPGet component.

Getting Files From FTP Server Using Talend


DESC:-
tFTPConnection creates a connection to your FTP server, you must provide all the
necessary entries to create a connection to the FTP server.
For our demo purposes we have use tSendMail component to capture refuse connection
exception and send a mail to the service team if the connection can’t be made to
the server.
Then, in tFileList component provide the remote directory address in double quotes
“/home/—-” and in file list using file mask just specify all the types of files you
want to retrieve from the remote server.
1. *.csv
2. *.tsv
3. *.gz e.t.c
in different lines by pressing add button.
Connect your tFTPList to the tFTPGet component using Iterate link. This will
provide file names list to the tFTPGet component one at a time.
Finally, In tFTPGet component fill all the details required such as the local
directory path to which you want to copy the files and carefully provide the same
remote directory path as provided earlier in the tFTPList and finally in the Files
option of the tFTPGet component write the followingCODE.
((String)globalMap.get(“tFTPFileList_1_CURRENT_FILE”))
The given job will successfully retrieve all the files which will satisfy your
extension condition specified earlier in the tFileList component.
15.Initializing Context At Run Time Using Popup
we will discussing a basic functionality for Talend Open Studio, which can be very
much useful and can be implemented wide range of scenarios.
Initializing Context At Run Time Using Popup window.
Overall Job Info : -

Initializing Context At Run Time Using Popup

PROBLEM: -
It happens to most of the times that your business logic remains the same but your
parameters such as file name, db name, passwords e.t.c. changes frequently, and to
make the things workout either you have to make changes in the Default.properties
or you have to open the Talend Studio and make explicit changes to run the job.
SOLUTION :-
Talend provides a simple solution by providing a context prompt option to your
variables thus every time you run your job a popup-window will appear and there u
just need to simply put your values.

Initializing Context At Run Time Using Popup


Just click on the prompt tab in front of the variable and check the text box for
those variable who you want to assign values dynamically..
RUN JOB : -
This is how it looks during RUN time.

Initializing Context At Run Time Using Popup


This way you can overcome most of the basic problems occurs during your development
and execution….

16.User Define Function In Talend


Talend list of unique features also includes writing your own set of functions,
which provides you great flexibility in your work , and one of its feature that is
widely used is “User Define Function In Talend“
STEPS:-
1) In left panel of the environment right click the routines tab and create routine
by providing suitable names such as user_sum.
(you can write any routine name as per your requirement but it should make sense
for better understandablity)
2) Create a desired function to achieve your desired functionality like f_sum.

User Define Function In Talend


(you can create any function as per your requirement)
Note : – Use of static keyword while creating function helps in calling a that
function without creating a object for that class. If you do not specify static
keyword while creating a function then it becomes mandatory for you to call that
function by first creating an object for that class.
3)Call to a function

User Define Function In Talend


In this job our function f_sum takes two numbers and returns the sum for those two
numbers. A call is made to function in tMap describing by the image given below.

User Define Function In Talend


however you can use this function at any place inside the job where talend permits
to write your own code or to use methods.
CALLING A FUNCTION : – user_sum.f_sum(row1.first, row1.second)
You can also create/import libraries from outside which contains Classes and
functions and after importing those libraries into your job you can use features
from those…
So, this how you can create and use “User Define Function In Talend” and use them
in different situations.
17.Calling DB Sequence From Talend
Today we will discuss how call db sequence from Talend while loading data. For demo
purposes we have taken Oracle as our Database.STEPS:
1) Create a sequence in Database using : -create sequence test_seq start with 1
increment by 1;
The above statement will create a sequence in Oracle Databse. select
test_seq.nextval from dual;
This statement is very important as it will initialize your sequence.2)Take a
source file/Databse and connect it to your oracle output component with all schema
defined at Target Level.

call db sequence from Talend

3)Go to Advance Setting Tab of oracle output component and change the settings as
defined in the screenshot.

call db sequence from Talend


seqence_name.nextval is the command which you have to write in SQL Expression. so
that for every row this expression will call your Sequence present at Database
Level.

1. Difference between tAggregatedRow and tAggregateSortedRow in Talend

tAggregateRow is most useful component in Talend open studio. It works on SQL Group
by features. It receives a flow and aggregates it based on one or more columns. It
is widely used to perform operation on common SQL aggregate functions like min, max
sum etc. After aggregation, tAggregateRow provide output for each line.
Before jumping to Talend tAggregateRow component in Talend, let me explain you
aggregate function in SQL.
Aggregate function always returns a single result based on groups of rows. It
generally used with SELECT statement in SQL queries. For example if we want to get
average of the maximum salary of all department in given table, we can use this SQL
query
SELECT AVG (MAX (salary)) as AVG_SALARY FROM employees GROUP BY department_id;

AVG_SALARY

----------------

18929.33

Talend Open Studio component (tAggregatedRow) performs same calculation and return
same expected result. Some common aggregate function commonly used in Talend are
min, max, avg, sum, first, last, list, count etc.
Talend provide two components for above data aggregation.
• tAggregateRow
• tAggregateSortedRow
The differences between tAggregateRow and tAggregateSortedRow are
tAggregateRow
It can’t sort the value based on any key field, simply return aggregated value
All standard SQL aggregate function applied in this component
Sorting of data either ascending or descending could be achieved by tSortRow

tAggregateSortedRow
This component sort the data based on key field along with aggregation
It also gives facility to provide “Input rows count” where we can pass number that
represent how many row we want to aggregate.
Below screenshot explain how to use tAggregateRow in Talend

Here is the list of records available in my input file. Based on the first column
“Grappe”, I need to aggregate data from this list and try to find
• List of all “Name of BTS” for a particular group.
• Total number of “degree of BTS” in a particular group.

I have used three components (tFileInputDelimited, tAggregateRow & tLogRow) to


check how to aggregate data in Talend, combined all three components via Main flow.

You can double click on first component (inputfile in above screen) and set its
property and schema. Please click on Edit Schema to add schema of input file.

Now you need to set properties for tAggregateRow, simply double click and set it as
per above screenshot.
You need to select “schema” as “Built – In”
For “Group by” you need to provide column name by which you want to grouping.
Here I have applied two aggregate function (count, list) in input file and
generated same output column.

Your job is finally ready, after clicking on run button in tool bar, you have above
output.
In output screen you can see list of column values and their count, but if you
closely have a look into output your data is not sorted.
If you want your result in sorted way, you can use tSortRow in above code, or you
can also used tAggregateSortedRow
Now let me create another sub job with tAggregateSortedRow and combine in main job.
This component will show you the differences between tAggregateRow &
tAggregateSortedRow.

Follow the instruction according to above screenshot. Double click and locate your
input file and add schema by clicking on Edit schema button. Combine all three
components via Main link.

Add tAggregatedSortedRow from component pane and set its property similar to
tAggregateRow.
For “Group by” you need to add column from input file based on that you need for
grouping.
For “Input row count” you need to set number of columns needs to be sorted. You can
use your total number of row value here.

After setting properties for tAggregateSortedRow you need to make it sub job of
main above tAggregateRow job, so that the sub job only executed after completion of
main job.
Now everything is complete. You need to execute this complete job and compare both
the the output.
You will found that the second sub job has sorted value based on “Group by” field
for tAggregateSortedRow.

2. How to resume job execution from same location if job get failed in Talend

3. How to execute more than one sub jobs parallel in Talend


Before you jump to tutorials, you must know that Talend is completely developed
using Java language. So each and every beauties of java can be directly used inside
Talend Open Studio.

Sometime in Talend we need to execute multiple sub jobs in parallel. It means that
while executing one job, Talend will start another sub job without interrupting the
previous running job. This is something what known as parallel execution in Talend
and same concept known as multithreading in Java language.
Before jump to main article let me briefly explain what is multithreading in Java.

Multithreading is a process of executing more than one sub-process simultaneously.


It means two different code blocks (known as thread) from your single Java program
will share same memory location but execute independently to each other.
As it share common memory location so switching between different Java code block
takes lesser time. As well as we save memory for other resource. The idea behind of
multithreading is to provide parallel execution of program

Life cycle of a thread (Sub block)


According to Sun Microsystems, thread can have 4 states. But for better
understanding I am explaining you in 5 states.

1. New: This is the case where you just created instance of thread class but it’s
not started.
2. Runnable: This is the state where a thread or sub program is executing in Java
Virtual Machine (JVM), but the thread scheduler has not selected it to be the
running state. You can simply say that it is ready to run.
3. Running: This is the state where thread scheduler has selected it for running
state.
4. Blocked: This is the state where thread is still alive but currently not
eligible to run.
5. Time waiting: This is the state where a thread is waiting to execute another
thread in specified time.
6. Terminated: A thread that has exited from memory.
The above theory I just explained you only for your knowledge that how Talend work
in multithreading environment and how it handle multiple sub jobs in their
execution.
But you should not worry about above technical details. You can simply follow
screenshot to enable multithreading or parallel execution of sub jobs in Talend
Open Studio.
Note: Make sure that your all sub jobs should be independent to each other, so that
it will not affect to other part of your job.
I simply used three components tRowGenerator, tLogRow, tJavaRow and connected via
main link.

tRowGenerator – basically used to generate random row for testing purpose. I said
it to generate 5 dummy rows with three columns each row for my first job.
Double click on tRowGenerator and click on (+) button to add column name in schema
and you can also assign its value under “function. You can set parameters for
function below the schema tab. Have a look into the screenshot

Once you define your columns name and its length and other parameters, make sure
that you have entered value for “Number of Rows for RowGenerator”. This is the
number which will decide to generate numbers of row. For above example, I am
generating only 5 rows. Click on Ok to complete it.
Map it with tLogRow to display all generated record on the screen. After that
connect it with tJavaRow to display custom message. I simply wrote
System.out.println("Executing 1st Job");
Here is the screenshot.

Repeat same sub job block for three times to make three independent sub jobs.
Make sure that for every tJavaRow you should write different message
For example I have written below 3 different messages for tJavaRow
tJavaRow_1 System.out.println("Executing 1st Job");
tJavaRow_3 System.out.println("Executing 2nd Job");
tJavaRow_4 System.out.println("Executing 3rd Job");
Finally my one main job with three different sub-jobs is ready to execute.
Now click execute this job and you will see the result, its execute in sequential
(first job then second job then third job).

Now let me enable the parallel execution for all three sub jobs.
Go to right bottom corner of Talend Open Studio and click on Parallel job execution
button

Once you clicked on this button it open Parallel Job Execution window as below

Click on “Extra” and check Multithread execution

Finally threading is applied in your job. This exercise will execute all three jobs
independently based on java multithreading concept.
Now save your job and click run to execute it.
Here is your output after execution

If you not able to see Parallel job execution button at the bottom right corner in
your Talend Open Studio, go to Window in menu bar then click on show view then
click on Talend then select Job.After doing this exercise you will be able to show
button on bottom right corner.

4.How to iterate filename and directories in Talend


File operation is very important in data integration. In this tutorial I am going
to explain how to get list of files and folders name from a directory or a sub
directory.

tFileList component in Talend Open studio is used for listing of all files and
directories. You can iterate it and get list of all files, directory from current
folder as well as sub directory.

As usual, I am goring to read configuration setting from an xml file and store into
a global variable. I shall use it to connect to the desired folder.

Here is content written in my config file “Config.xml” and store in my local drive.

<ServerConfigration>
<!-- Input file configuration -->

<arg name="InputFilePath">C:/Config_Files/input_files</arg>
<arg name="LogPath">C:/logs</arg>

</ServerConfigration>

I have connected tPrejob, tFileInputXML, tSetGlobalVar component according to below


screenshot.

tPrejob : It’s always a good practice to start our job from tPrejob. Once tPrejob
component is ok I have connected to tFileInputXML.

tFileInputXML : It is an input component and used to map xml file. Double click on
tFileInputXMLto set its properties setting

Note: you can click on “Edit schema” to add columns, available in your config file
and map them using xPath query.

tSetGlobalVar : It is used to store global values for your next component. Double
click ontSetGlobalVar and provide details.

Here you just created two global variables. It will widely accessible in your
entire Talend application by pre-defined Talend method globalMap.get()

You can access these global variables like this


(String)globalMap.get(“InputPath”);
(String)globalMap.get(“LogPath”);

So you just finished your configuration part of your job. Now let me jump to file
iteration part and list me all files & folder name.

File Iteration

We need three components for file iteration.

1. tFileList
2. tIterateToFlow
3. tLogRow
tFileList: It contain list of all files and directories. It iterate on a set of
files from a particular directory. This component comes with lots of useful
properties. You can set it as per your needs

tFileList basic settings

Directory (String)globalMap.get(“InputPath”)
File Type 1. Both (List all files and all folders)
2. Directories (List only for folders)
3. Files (List only files)
Include subdirectories Check if you want to iterate in child directory.
Files You can decide which type of file you want to iterate.
You can add more file extension by simply clicking on (+) button
Order by You can decide which file iterate first.
Oder action Either file iterates will asc or desc.
For file operation, Talend gives some predefined constant for tFileList

Assume that you are using tFileList_1 your component

tFileList_1_CURRENT_FILE : Get current file name


tFileList_1_CURRENT_FILEPATH : Get current filename with complete path
tFileList_1_CURRENT_FILEEXTENSION : Get extension of current file.
tFileList_1_CURRENT_FILEDIRECTORY: Get current file directory.
tFileList_1_NB_FILE : Get number of total files.

tIterateToFlow : It transform an iterate link to a flow input.


When you iterate a directory using tFileList, there could be a chance to found more
than one files and more inner directories. So we need looping to get all lists.

tIterateToFlow component convert each iteration into an input flow, so that input
flow can be stored some place to get it list.

In this example tIterateToFlow will receive each iteration and convert into a flow
and we can write this flow into a file or store into a log.

Double click on tIterateToFlow then click on Edit schema to add column name.

You can use ((String)globalMap.get("tFileList_1_CURRENT_FILE")) to list of all


filename based on your selection for tFileList.

Now you have successfully designed your job, this job will return all csv file, as
we have used .csv for a mask for tFileList component above.

Here is your complete job screenshot


Once you finish your above job with complete setting, click run button to execute
this job. I am sure you will return list of csv files in log screen

You can also download this example at end of article. If you feel any issue to run
this job, don’t hesitate to contact me, I am sure I shall try to help you.

5.What is the difference between OnSubjobOK and OnComponentOK in Talend


6. How can you pass a value form parent job to child job in Talend
7.How to call stored procedure and function in Talend Job
8.How to export job and execute outside from Talend Studio
9.How to pass value from outside in Talend
10.Can I define schema of database or tables at run time
11.What is tReplicate in Talend
12.What is tUnite in Talend Open Studio
13.How to optimize talend job to stop outOfMemory runtime error
14.How to optimize Talend Performance
15.How to execute multipule SQL statements with one component in Talend
16.What is tSystem component in Talend
17.Can I execute multiple commands at one time with a tSystem component
Talend is an award winning open source ETL (Extract, Transform, Load) tool that
provides data integration, data management, enterprise application integration
services for small projects to enterprise-wide implementations. We from Talend
Tutorials provide Talend Training Courses for individual & corporate professionals.
We offer free class room Talend training in Delhi. You can attend our weekend
Talend ETL training course. We also offer free Talend training materials download,
and you can watch Talend video tutorials, download Talend sample jobs and read how
to work with Talend Open Studio
Talend is an award winning open source ETL (Extract, Transform, Load) tool that
provides data integration, data management, enterprise application integration
services for small projects to enterprise-wide implementations. We from Talend
Tutorials provide Talend Training Courses for individual & corporate professionals.
We offer free class room Talend training in Delhi. You can attend our weekend
Talend ETL training course. We also offer free Talend training materials download,
and you can watch Talend video tutorials, download Talend sample jobs and read how
to work with Talend Open Studio
Talend is an award winning open source ETL (Extract, Transform, Load) tool that
provides data integration, data management, enterprise application integration
services for small projects to enterprise-wide implementations. We from Talend
Tutorials provide Talend Training Courses for individual & corporate professionals.
We offer free class room Talend training in Delhi. You can attend our weekend
Talend ETL training course. We also offer free Talend training materials download,
and you can watch Talend video tutorials, download Talend sample jobs and read how
to work with Talend Open Studio

18.What is difference between tMap and tFilterrow in Talend

What is the difference between the ETL and ELT components of Talend Open Studio?
How does one deploy Talend projects?
What are the elements of a Talend project?
What is the most current version of Talend Open Studio?
How do you implement versioning for Talend jobs?
What is the tMap component?
What is the difference between the tMap and tJoin components?
Which *component* is used to sort data?

1. Talend workspace path should not contain any spaces.


Workspace paths to avoid:
c:\Open Project\Talend Open Studio\workspace
d:\My Projects\Talend\workspace
Recommended Workspace Path:
c:\Talend\workpace
d:\OpenProject\Talend\workspace
c:\MyProject\repository
It is always recommended to not to have any space between the Talend workspace
path. We tend to encounter few issues if we have these spaces in path. Hence we
should always avoid these spaces.
2. Never forget to perform Null Handling.
Example #1 – Bad
if(myString.length() > 0)
System.out.println(myString.toUpperCase());
Example #2 – Good
if(! Relational.ISNULL(myString) && myString.length() > 0)
System.out.println(myString.toUpperCase());
Always perform NULL handling for the field which is going to be used in any kind of
expression. Otherwise Talend Job will throw NullPointerException.
3. Create Repository Metadata for DB connections and retrieve database table schema
for DB tables.
It allows you to quickly retrieve the schema of database tables and help rapid
development. If you will try to create schema for database table one by one it will
take long time. Click here for more details on creating DB connections and
retrieving schema.
4. Use Repository Schema for Files/DB and DB connections.
It allows you to change the schema at one place, without having to change the
schema in every job. Also, you don't need to open every job to find out if the
changed schema is part of the Job or not. Changing at one place in Repository will
allow you to change in every job. Click on links below to know How to create
Repository metadata:

Creating Metadata for Delimited files.


Creating Metadata for XML files.
Creating Metadata for Excel files.
5. Create Database connection using t<Vendor>Connection component and use this
connection in the Job. Do not make new connection with every component.
As most of the database have maximum connection limit. In your Talend Job if you
are using multiple database components then it may fail because of maximum allowed
connection issue. Click here to know, How to share database connection.
6. Always close the connection to database using t<Vendor>Closecomponent.

7. Create a Repository Document corresponding to every Talend job including


revision history.
This will allow you to track the changes done on any Talend Job.

Sample documentation below:

Talend Job Description:


In this paragraph, write down the high level description and functionality of
Talend Job

Revision History
1.0 04-10-2014 Initial Development
1.1 07-10-2014 Modification to Source and Target repository Schema
1.2 09-10-2014 Modification to transformation logic.

8. Provide Sub Job title for every sub job to describe the sub job
purpose/objective.
9. Avoid Hard Coding in Talend Job component. Instead use Talend context variables.

10. Create Context Groups in Repository

Context Group will allow you to use the same context variables in any number of
jobs without having to create again and assign value again to them. Imagine your
Project requires 20 context variables and there are 10 jobs that require those
context variables. Without context groups it will be very difficult to create those
context variables again and again in every job.

You can create different context groups for different functionality of variables.
For example, you can have different context group for database parameters , SMTP
params and SFTP params etc.

Click on links below to know more about context variables and context groups:
1. Understand Context Variables Part 1 (Context Variables, Context groups)
2. Understand Context Variables Part 2 ( Define context variables in Repository,
which can be made available to multiple jobs)
3. Understand Context Variables Part 3 (Populate the values of context variables
from file. tContextLoad)
4. How to Pass Context Variables to Child Jobs.
5. How to Pass context Variables/ Parameters through command line.

11. Use Talend.properties file to provide the values to context variables using
tContextLoad.

Always provide the value of context variables either through database table or
through Talend.properties file. Below is sample of Talend.properties file.

Click hereto understand How to Populate the values of context variables from
fileusing tContextLoad component.

12. Create Variables in tMap and use the variables to assign the values to target
fields.
For multiple use of single expression or for using the same mapping for multiple
target fields, it is always good to create a variable in tMap and assign the value
of that variable in target fields. It will allow to only evaluating the expression
once for multiple number of times.

13.Create user routines/functions for common transformation and validation.

Always create routines/functions for all common transformations and validation


rules which can be used in multiple Talend jobs.

Click hereto know, How to create user routines and functions.

14. Develop Talend job iteratively.

Divide the Talend Job to multiple sub jobs for easy maintainability. First create a
subjob and then test it and then move to next sub job.

15. Always Exit Talend open studio before shutting down the PC.

Talend workspace may get corrupted sometimes, if you shutdown your machine before
exiting Talend Open Studio. So always exit Talend before shutting down PC.

16. Always rename Main Flows in Talend Job to meaningful names.


Thanks to +Balázs Gunics for this point. It is always good to rename the main flows
in Talend Job to more meaning full names so that when you refer the fields in tMap
component or using tFlowIterate it will be easy to refer and understand which data
is coming from which flow.

17. Always design Talend jobs by keeping performance in mind.

Talend Job Design - Performance Optimization Tips


I am going to share few of the Performance tuning tips that I follows while
designing Talend Job. Let me know your comments on the same and also let me know,
if there are any other performance optimization methods you follows and are
helpful.

You might also like