Talend Best Practices
Table of Contents
Cover
Standards Details
Comparison of ETL components
Component Settings
Connectors
Context Propagation
Section 5
Appendix
Document History
Standards Details
1. Comparison of ETL components
https://devcloud.swcoe.ge.com/devspace/pages/viewpage.action?spaceKey=IOPKF&title=04+%28b%29+Talend+Best+Practices#id-04(b)Talen… 1/10
10/29/2019 04 (b) Talend Best Practices - Governance Development Space - Developer Cloud
GPLoad
Preferred when ingesting non-Greenplum data (from other databases or flat files) into Greenplum.
Exception: tInput/tOutput is acceptable when ingesting less than 1000 records.
It is advisable but not required to have only one instance of GPLoad per job/child job, as this helps with error tracking.
tGreenplumRow
Since Greenplum functions are almost always more efficient than Talend functions, tGreenplumRow is preferred wherever possible.
Possible exception: when joining data from two separate databases within Greenplum (tMap is acceptable instead).
tMap
When ingesting data, avoid using tMap as much as possible. When ingesting external data into Greenplum, such as flat files or non-Greenplum database data, tMap may be used if necessary.
When you must use tMap, use only a single tMap component per data flow. That is, do all the transformations and calculations within a single tMap component rather than chaining several tMap components one after another.
2. Component Settings
t...Input
Input components are used to extract data from a database. The results of the SELECT query are returned as a Talend flow.
"Use existing connection" and "Die on Error" should always be checked. "Use cursor" should usually be checked as well.
Be sure to populate the component's schema and use the correct SELECT statement. List each field explicitly and in order with SELECT field1, field2, ... FROM <tablename>; DON'T use SELECT * FROM <tablename>. If the table structure is ever altered, the source column order may change and no longer match the schema, causing an error.
Keep in mind that the schema should conform to what is coming out of the source. For example, if a field is renamed in the SELECT statement (<fieldname> AS <new_fieldname>), the component's schema should reflect the new name. Also note that field names in the schema are case sensitive and must match the SELECT statement exactly.
Checks to perform:
Queries shouldn’t contain line comments (--). These should be refactored into block comments by replacing the -- with /* and putting a */ at the end of the line.
Newer Talend versions (5.5 and up) can handle line comments, but will throw Java code generation errors if there is a blank line after the SQL.
Use Cursor should usually be checked
Make sure the columns are in the correct order in both SELECT and the component schema
Schema must match output exactly, including ... AS <new_fieldname>
Column names are case sensitive
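The comment rule above can be applied mechanically. The following is a minimal sketch, assuming a hypothetical helper class (SqlCommentRewriter is not part of Talend) that rewrites each -- line comment as a block comment while leaving -- inside single-quoted string literals alone. It does not handle string literals that span several lines or comment text that itself contains */.

```java
// Hypothetical helper, not part of Talend: rewrites "--" line comments as block comments.
final class SqlCommentRewriter {

    /** Converts every "-- text" line comment in the query into a block comment. */
    static String toBlockComments(String sql) {
        StringBuilder out = new StringBuilder();
        String[] lines = sql.split("\n", -1);
        for (int i = 0; i < lines.length; i++) {
            if (i > 0) {
                out.append('\n');
            }
            out.append(rewriteLine(lines[i]));
        }
        return out.toString();
    }

    /** Rewrites the first "--" found outside a single-quoted literal on this line. */
    private static String rewriteLine(String line) {
        boolean inString = false;
        for (int i = 0; i + 1 < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '\'') {
                // Toggle on every quote; a doubled quote ('') toggles twice and cancels out.
                inString = !inString;
            } else if (!inString && c == '-' && line.charAt(i + 1) == '-') {
                String comment = line.substring(i + 2).trim();
                return line.substring(0, i) + "/* " + comment + " */";
            }
        }
        return line;
    }
}
```

The rewritten query can then be pasted into the component's Query field, taking care not to leave a blank line after the SQL.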
t...Output
Output components are used to ingest data into a database. The schema should have the target type listed correctly for that
database type. "Die on Error" should always be checked.
Column value
In the advanced settings there’s an option to use database functions to provide a value to the columns. The column shouldn’t
be in the schema. The column name and the database function have to be provided. E.g.: added_on and now()
Batch Size
Batch size, like cursor, keeps only a certain number of rows in working memory at a time. To obtain a good write speed, go to the component's Advanced Settings tab and check the “Use Batch” option. The default batch size is 1000, but if the incoming cursor size is different, update the batch size to match. If you don't see this option, make sure the "Use existing connection" box is unchecked.
Action on Table
This isn’t reliable and should be set to None. For example, MSSQL always uses dbo for the table schema. It is better to do these kinds of steps (delete/truncate) in a separate t…Row component, so that if there are any errors the error message will be more informative.
Action on Data/Row
Action on data can be Insert, Update or Insert & Update. The latter is known to cause problems and should be avoided, since it processes only a few rows per second.
This approach can be much more performant, especially when upserting large data sets, as the first option has to attempt to insert each row one by one and, if the insert fails, rewrite it as an update statement.
Another benefit of the second option is that it can be contained in a single batch/transaction, as you no longer have to deliberately catch a failed insert/update and then update/insert instead. This means you can make your job a lot more robust and handle errors more appropriately.
t...Row
t...Row components can be used to execute one or more queries. "Use Existing Connection" should always be checked.
Checks to perform:
Queries shouldn’t contain line comments (--). These should be refactored into block comments by replacing the -- with /* and putting a */ at the end of the line.
Newer Talend versions (5.5 and up) can handle line comments, but will throw Java code generation errors if there is a blank line after the SQL.
tLogRow
Prints the values to the console.
Checks to perform:
When “Table” mode is used, there shouldn’t be too many lines or memory overflow may occur.
tMap
Allows you to join tables and execute transformations. Code generation errors caused by tMap are likely to be reported at other components.
Checks to perform:
All inputs should be named when multiple inputs are used. Lookups should start with lkp_ and main data flows should start with main. This makes development and debugging easier.
Join type should be checked; by default Talend does a left join.
When modifying the input, the input link can simply be renamed. For example, if you add a tLogRow between tGreenplumInput_2 and tMap_2, you need to make sure the link between tLogRow and tMap_2 uses the name lkp_GPinput_2; then tMap_2 will work as before.
Lookup settings
Lookup Model:
https://help.talend.com/display/KB/tMap+lookup+models
Load once: before processing each record of the main flow, this option loads once (and only once) all the records
from the lookup flow either in the memory or in a local file in case the Store temp data option is set to true. This is the
default setting for join and the best option if you have a large set of records in the main flow to be processed using a
join to the lookup flow.
In case of OutOfMemoryException store the lookup values in files.
Reload at each row: all the records of the lookup flow are loaded again for each record of the main flow. Generally, this option increases the Job execution time due to the repeated loading of the lookup flow for each main flow record. However, this option should be preferred in the following two situations:
The lookup data flow is constantly updated and you want to load the newest lookup data for each record of the main flow in order to get the new data after the join execution.
You have few rows in your main flow and a large data set that comes from a database table in the lookup flow. Using the Load once option might cause an OutOfMemoryException. In this situation, consider the Reload at each row option.
Reload at each row (cache): this model functions like the Reload at each row model: all the records of the lookup flow are loaded again for each record of the main flow. However, this model can't be used with the Store temp data on disk option. The lookup data is cached in memory, and when a new load occurs, only the records that do not yet exist in the cache are loaded, in order to avoid loading the same records twice.
Match Model:
https://help.talend.com/display/KB/The+differences+between+Unique+match,+First+match+and+All+matches
Unique Match: No row duplicates, uses the last matching entry.
First Match: No row duplicates, uses the first matching entry.
All Matches: Rows are duplicated, uses all matching entries.
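The difference between the three match models is easiest to see on a lookup that contains duplicate keys. The following is an illustration only, assuming an in-memory list of {key, value} pairs; the class MatchModelDemo and its methods are hypothetical and do not reflect how tMap is implemented. It also ignores what happens when nothing matches (that depends on the join type).

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: mimics the three match models on an in-memory lookup list.
final class MatchModelDemo {

    /** Unique match: at most one result, taken from the last lookup row with this key. */
    static List<String> uniqueMatch(List<String[]> lookup, String key) {
        List<String> result = new ArrayList<>();
        for (String[] row : lookup) {
            if (row[0].equals(key)) {
                result.clear();       // a later match replaces an earlier one
                result.add(row[1]);
            }
        }
        return result;
    }

    /** First match: at most one result, taken from the first lookup row with this key. */
    static List<String> firstMatch(List<String[]> lookup, String key) {
        List<String> result = new ArrayList<>();
        for (String[] row : lookup) {
            if (row[0].equals(key)) {
                result.add(row[1]);
                break;                // stop at the first match
            }
        }
        return result;
    }

    /** All matches: one result per matching lookup row, so the main row is duplicated. */
    static List<String> allMatches(List<String[]> lookup, String key) {
        List<String> result = new ArrayList<>();
        for (String[] row : lookup) {
            if (row[0].equals(key)) {
                result.add(row[1]);
            }
        }
        return result;
    }
}
```

With lookup rows (A, x), (A, y), (B, z) and a main row keyed A, Unique match yields y, First match yields x, and All matches yields both x and y.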
Store on disk
If the lookup values can’t fit in memory, the data can be stored on disk. For this, a temporary directory should be set in the component's settings. This could be either in the system's temp directory or under the project files. The latter is preferred, e.g.:
context.storagePath +"/tmp/"+jobName
Using the jobName global variable, a separate folder can be created for the job.
If the Job fails, the temporary files created by tMap are not deleted. This can quickly consume the disk space, so if you have access to the server, please remove these files.
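Leftover lookup files can also be cleared by the job itself, for instance from a tJava at the start of the job. The following is a minimal sketch using only the standard java.nio.file API; the class name TempDirCleaner is hypothetical, and it assumes that no other running job is writing to the same folder.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: removes a job's temporary lookup folder and its contents.
final class TempDirCleaner {

    /** Deletes the directory and everything under it; does nothing if it is absent. */
    static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(dir)) {
            // Reverse order puts children before their parent directory.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path path : paths) {
            Files.delete(path);
        }
    }
}
```

Inside a job this could be called as TempDirCleaner.deleteRecursively(java.nio.file.Paths.get(context.storagePath + "/tmp/" + jobName)), reusing the directory expression shown above.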
tFlowMeter
tFlowMeter can be used to log how many records were processed during the job. Unfortunately, this does not provide a real-time result; the value is available only once the job execution is over.
tWarn / tDie
These components can be used to log additional information. When using tDie, the job will die once the information is logged. These components can be used to log the query that is being executed, or the actual query that failed.
tJava
Can be used to initialize variables; use it if you want to have your code executed only once. It can also be used as a dummy component: there are cases when a component doesn’t accept an If link, or only OnComponentOk can be used. In these cases tJava can help to overcome these problems.
tJavaRow
tJavaRow's code is executed once for every row. A variable declared here is re-initialized on every row. The globalMap can be used to store a value for later use, for example for error handling.
The ternary one-liner (<expression> ? <true> : <false>) is commonly used. It works similarly to the DECODE() function in Oracle.
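Both ideas can be combined in one per-row routine. The following is a stand-alone sketch, not actual tJavaRow code: globalMap is a plain HashMap standing in for Talend's globalMap, and the key rowErrorCount and the method processRow are hypothetical names chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-alone sketch: a plain HashMap stands in for Talend's globalMap.
final class JavaRowSketch {
    static final Map<String, Object> globalMap = new HashMap<>();

    /** Mimics a tJavaRow body: runs once per row and returns the cleaned value. */
    static String processRow(String rawAmount) {
        // Ternary one-liner: map null or blank input to a default value.
        String amount = (rawAmount == null || rawAmount.trim().isEmpty()) ? "0" : rawAmount.trim();
        try {
            Integer.parseInt(amount);
        } catch (NumberFormatException e) {
            // A local counter would reset on every row, so the count lives in globalMap.
            Integer errors = (Integer) globalMap.get("rowErrorCount"); // hypothetical key
            globalMap.put("rowErrorCount", errors == null ? 1 : errors + 1);
            amount = "0";
        }
        return amount;
    }
}
```

A later component could then read rowErrorCount from the globalMap and decide whether to raise a warning.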
tJavaFlex
tJavaFlex is an advanced component; please refer to the Talend documentation for all of its features.
It can be used as an exception handler: if we have an iterate link and the target sometimes throws exceptions, tJavaFlex can be used like this to catch and log the exceptions without having the job die:
Begin:
try {
Main:
End:
In the End part we could also collect these messages, then use a tJava -IF-> tWarn combination to push them into the logs.
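Only the opening try { of the original snippet survives above, so the following is an assumed reconstruction of the pattern in plain Java, not the original code. The comments mark where the Begin, Main and End parts of the tJavaFlex would sit; the class FlexExceptionSketch and the Step interface are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of the try/catch pattern; names are hypothetical.
final class FlexExceptionSketch {

    /** Stand-in for the downstream components that may throw. */
    interface Step {
        void run(String item) throws Exception;
    }

    /** Runs the step for every item; failures are recorded instead of stopping the loop. */
    static List<String> runAll(List<String> items, Step step) {
        List<String> errors = new ArrayList<>();
        for (String item : items) {          // one pass per iteration of the iterate link
            // Begin part: open the try block.
            try {
                // Main part: the work that sometimes fails.
                step.run(item);
            // End part: close the try block and record the failure.
            } catch (Exception e) {
                errors.add(item + ": " + e.getMessage());
            }
        }
        return errors;
    }
}
```

Because the exception is caught inside the loop body, the remaining iterations still run, and the collected messages can be logged afterwards.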
tSetGlobalVar / globalMap
Can be used to store values local to the job. When a value is initialized with tSetGlobalVar, it can be accessed from anywhere using CTRL + SPACE. However, it will always return the value as a String. Make sure you cast these objects to their original type. There are exceptions, for example:
int (Integer)
bool (Boolean)
String (String)
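A direct cast fails with a ClassCastException when the stored object is not of the expected type, for example when a number was stored as text. The following is a defensive sketch, assuming hypothetical helper methods (asInt and asBoolean are not Talend functions) that accept either the wrapper type or its String form.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helpers for reading typed values back from a globalMap-like map.
final class GlobalMapCastSketch {

    /** Returns the value as int whether it was stored as Integer or as text. */
    static int asInt(Object value) {
        if (value instanceof Integer) {
            return (Integer) value;
        }
        return Integer.parseInt(String.valueOf(value).trim());
    }

    /** Returns the value as boolean whether it was stored as Boolean or as text. */
    static boolean asBoolean(Object value) {
        if (value instanceof Boolean) {
            return (Boolean) value;
        }
        return Boolean.parseBoolean(String.valueOf(value).trim());
    }

    /** Small demonstration map; the keys are made up for this example. */
    static Map<String, Object> sampleMap() {
        Map<String, Object> globalMap = new HashMap<>();
        globalMap.put("rowLimit", "100");   // stored as text
        globalMap.put("retries", 3);        // stored as Integer
        globalMap.put("dryRun", "true");    // stored as text
        return globalMap;
    }
}
```

For example, asInt(globalMap.get("rowLimit")) works for both representations, whereas (Integer) globalMap.get("rowLimit") would fail for the text form.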
tSchemaComplianceCheck
Can be used to filter for non-schema compliant records. Should be used with text files. In case a row is not valid, that row will
be written to a rejects file that can be sent out in a warning email. The good rows will be ingested without killing the job.
3. Connectors
OnSubjobOk/Error
OnSubjobOk/Error and OnComponentOk/Error basically do the same thing: they tell Talend in what order to execute the tasks. However, there are some important differences. OnSubjobOk/Error can be started only from the first component in a subjob, such as tFileList_1 in the example. It will continue the execution only after the entire subjob has finished, such as the loop in the example below. It's good to use in most cases, as peak memory usage can be reduced for the job.
OnComponentOk/Error
OnComponentOk/Error can be started from nearly any component, but should usually be avoided for memory reasons. It's
different from OnSubjobOk/Error because it continues as soon as its component is completed even if the subjob isn't finished
yet. This is very useful in some cases, such as performing several tasks within a loop. In the above example, the first subjob
iterates through all files in a folder before continuing. For each file, Talend reads in the file and sends the output to Oracle.
You can only attach OnSubjobOk to the first component in a subjob and it will only fire once all iterations are complete. If you
want to perform more tasks within the loop, you have to use OnComponentOk to connect the last component in the iterative
subjob (tJavaFlex_1) to the next component (tFileInputDelimited).
Be careful mixing several different types of connectors on a single component. While Talend 6.1.1 permits you to specify in
which order several OnSubjobOk or OnComponentOk connectors should run, this is not an ideal solution. For one thing, it's
difficult to see at a glance what fires when. In the below case, t_analytics_view(tGreenplumConnection) is firing at the same
time as tCreateTemporaryFile_1, causing the next step - tGreenplumInput_1 - to fire twice.
t_analytics_view(tGreenplumClose) would fire twice, too. tGreenplumGPLoad_1 would probably fail as it tries to write to the
same file.
It's also easier to understand the flow if components are arranged in the order they will fire, from left to right and from the top
down.
Iterate
Used to execute a process for each input row. A Flow can be converted to an iterate job via the tFlowToIterate component. In this case the flow variables are accessible through the globalMap, e.g. ((String)globalMap.get("tFileList_1_CURRENT_FILE")) in the above example.
Parallelism can be configured in the Advanced Settings tab or a context variable can be used for this purpose.
Flow
Also known as main or row. Used to propagate data through contexts. The name can be changed by either triple clicking on
the row or by selecting the row, waiting a few seconds, and clicking on it again. It is advised to rename the rows when using
tMap.
If
Continues the workflow only if certain conditions are met. For example, the below job makes sure the file has a nonzero size
before calling GPLoad. The tJavaFlex component assigns the file size to context.fileSize. The If condition reads:
“context.fileSize > 0”.
The If connector uses Java logic but note that there is no semicolon at the end of the If statement.
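The file-size check can be pictured in plain Java as follows. This is a sketch under assumptions: the Context class is a stand-in for Talend's generated context object, the type of fileSize (long) is a guess, and the method shouldLoad is hypothetical.

```java
import java.io.File;

// Plain-Java sketch of the file-size check; Context is a stand-in for Talend's context.
final class IfConditionSketch {

    /** Stand-in for the job context; fileSize is the variable named in the text. */
    static final class Context {
        long fileSize; // assumed type
    }

    /** Records the file size and evaluates the same expression the If connector holds. */
    static boolean shouldLoad(Context context, File file) {
        // What the tJavaFlex step would do: File.length() returns 0 for a missing file.
        context.fileSize = file.length();
        // The If connector contains only the expression, with no trailing semicolon:
        //     context.fileSize > 0
        return context.fileSize > 0;
    }
}
```

Any Java expression that evaluates to a boolean can be used in the same way.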
A single component can have many If connectors leading to various targets. Talend will check each condition in the order
assigned. The order can be adjusted by right-clicking on a connector and selecting “Modify If links order”:
4. Context Propagation
Let's assume we have a master (parent) and a worker (child) job. We schedule the master job in TAC.
Several settings affect which context is used. In override order for the worker job (and the master, of course):
In the contexts we can set a default context group per context.
In the job we can also set a default context group.
We can select the context group we'd like to use in tRunJob.
By enabling "Apply Context to Children" in TAC we can override these settings.
Please note that this is case-sensitive, so if you have PROD in your master and Prod in the worker, the worker will not use Prod.
By enabling "Transmit whole context" in tRunJob all the contexts (KEY-VALUES) available in your job will be transferred
to the worker job.
This means that if you have a 100-level hierarchy, the contexts (keys, values) from the first job will be transferred to the last job, no matter whether that context was defined in the interim jobs or not.
Contexts can be overridden:
in tRunJob, where we can set a value for the key
in TAC where we can also set a value for the key
Before execution in studio when a popup comes up
By using the context.setProperty() function in the master job.
By using tJava or tContextLoad and explicitly setting the context variable's value. (context.key = 1)
5. Custom Components
6. Debugging Talend
8. Job Examples
9. Miscellaneous
Appendix
Acronym / Term Description
Document History
Update description Updated date