Tos DQS en
https://help.talend.com/internal/api/webapp/print/7c156cc9-9c1d-4bdf-a254-62d723661df1 1/288
26/07/2022 09:49 PDF Export
Procedure
TOS_DQ-win-x86_64.exe               Windows
TOS_DQ-linux-gtk-x86_64             Linux on x86
TOS_DQ-gtk-aarch64                  Linux on ARM
TOS_DQ-macosx-cocoa.app             MacOS on x86
TOS_DQ-macosx-cocoa-aarch64.app     MacOS on ARM
Tip: If you work on MacOS and install Talend Studio manually using the zip archive file, you might get one of two messages
when trying to launch Talend Studio for the first time. To fix the issue, open the terminal, go to the directory above the
Talend Studio top directory, execute the command xattr -d com.apple.quarantine <Talend-Studio>/* , where <Talend-Studio>
is the root folder name of your Talend Studio, and then relaunch Talend Studio.
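The command in the tip above relies on the shell expanding <Talend-Studio>/* into one argument per entry. The following is a minimal sketch of the same expansion in Python, for cases where the fix is scripted. The function name quarantine_removal_commands is hypothetical and not part of Talend Studio; the sketch only builds the argument lists and does not run anything.

```python
from pathlib import Path


def quarantine_removal_commands(studio_root):
    """Build one xattr command per visible entry under the Studio root folder.

    Mirrors the shell glob <Talend-Studio>/*, which skips hidden (dot) entries.
    The function name is hypothetical; only the xattr arguments come from the tip.
    """
    root = Path(studio_root)
    entries = sorted(p for p in root.iterdir() if not p.name.startswith("."))
    return [["xattr", "-d", "com.apple.quarantine", str(p)] for p in entries]
```

On MacOS, each returned list could then be passed to subprocess.run. Note that xattr -d reports an error for entries that do not carry the attribute, so a script would typically not treat a non-zero exit code as fatal.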
3. If you launch Talend Studio for the first time, in the User License Agreement dialog box that opens, read and accept the terms
of the end user license agreement.
Results
A default project is created in the workspace folder under the installation directory of Talend Open Studio for Data Quality.
You can now start working with your project and items.
Talend Studio requires specific third-party Java libraries or database drivers (.jar files) to be able to connect to sources and targets.
These libraries or drivers, known as external modules, are required by some Talend components and/or connection wizards. Due to
license restrictions, Talend may not be able to ship certain external modules within Talend Studio. For more information, see
Installing external modules.
External modules may be required by some Talend components, by some connection wizards, or by both. Due to license restrictions,
Talend cannot ship some of these external modules within Talend Studio. You need to install them for Talend Studio to function
properly.
Warning: Make sure that the -Dtalend.disable.internet parameter is not present in the Studio .ini file or is set to false .
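The check described in this warning can be sketched in a few lines of Python. This is a minimal illustration, assuming the Studio .ini file lists one launcher or JVM argument per line, as Eclipse-based launchers usually do. The function name internet_disabled is hypothetical, and treating a bare flag without a value as "disabled" is a conservative assumption, not documented Talend behavior.

```python
def internet_disabled(ini_text):
    """Return True if the .ini text would block external module downloads.

    Assumption: one argument per line. A bare -Dtalend.disable.internet
    (no value) is conservatively reported as disabled.
    """
    prefix = "-Dtalend.disable.internet"
    disabled = False
    for raw in ini_text.splitlines():
        line = raw.strip()
        if not line.startswith(prefix):
            continue
        rest = line[len(prefix):]
        if rest == "":
            disabled = True  # bare flag, value unknown: flag it for review
        elif rest.startswith("="):
            disabled = rest[1:].strip().lower() != "false"
        # any other suffix belongs to a different property and is ignored
    return disabled
```

If the function returns True for your .ini file, remove the parameter or set it to false before installing external modules.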
Talend Studio lets you know when you need to install external modules and which external modules you need to install.
Talend Studio notifies you about required external modules in several ways.
On your design workspace, if a component requires the installation of external modules before it can work properly, a red
error indicator appears on the component. With your mouse pointer over the error indicator, you can see a tooltip message
showing which external modules are required for that component to work.
When you open the Basic settings or Advanced settings view of a component for which one or more external modules are
required, you will see a piece of highlighted information about external modules, followed by an Install button. Clicking the
Install button opens a wizard that will show you the external modules to be installed.
The Modules view lists all the modules required for the Studio to work properly, including those Java libraries and drivers
that you must install.
If the Modules view is not shown under your design workspace, go to Window > Show view... > Talend and then select
Modules from the list.
In this view:
Item Description
Filter text field Allows you to search external modules based on the status, the context, the
module file name, and the Maven URI.
The icon indicates that the module is not necessarily required for the
corresponding component or Metadata connection.
The icon indicates that the module is absolutely required for the
corresponding component or Metadata connection.
Context Gives the name of the component or Metadata connection using the
module. If this column is empty, the module is then required for the
general use of your Talend Studio.
You can customize the Maven URI of a module by clicking the Maven URI
field and then clicking [...] that appears. For more information, see
Customizing the Maven URI for external module deployment.
You can configure whether to share libraries at Talend Studio startup. For
more information, see Artifact repository for libraries preferences.
If you are using a local libraries repository configured with proxy, libraries
will not be shared when clicking this button. For more information about
configuring proxy for a local libraries repository, see Configuring a proxy
repository for libraries in Talend Studio.
Opens the Third-party Libraries wizard, which allows you to install all
required and/or optional libraries in one go. For more information, see
Installing all external modules in one go.
drop a component from the Palette if one or more external modules required for that component to work are missing
in the Studio.
click the Test connection button in a Metadata connection setup wizard in the Studio if one or more external modules
required for the connection are missing in the Studio.
click the Guess schema button in the Component view of a component if one or more external modules required for
that component to work are missing in the Studio.
click Install on the top of the Basic settings or Advanced settings view of a component for which one or more required
external modules are missing.
run a Job that involves components or Metadata connections for which one or more required external modules are
missing.
select one or more modules that are not integrated in the Studio and click the button in the Modules view.
This wizard:
lists the external modules to be installed and the licenses under which they are provided,
provides the default Maven URIs that identify the deployment of the modules,
provides the official websites where you can learn more about the modules,
lets you download and install automatically all the modules available in the Talend repository,
allows you to install those not available in the Talend repository manually.
When you drop a component, set up a connection, or guess the schema of a database that requires an external module for
which neither the Jar file nor its download URL is available on the Talend website, the Jar installation wizard does not
appear. Instead, the Error Log view displays an error message informing you that the download URL for that module is not
available. You can try to find and download the module yourself, and then install it manually into the Studio.
Tip: To show the Error Log view on the tab system, go to Window > Show views..., then expand General and select Error
Log.
In Talend Studio, each external module is given a default URI to identify its deployment in Maven. When needed, you can change the
Maven URI.
For example, when replacing an installed database driver with a new version, you need to specify another Maven URI for it.
Note:
Changing the Maven URI for an external module will affect all the components and metadata connections that use the module
within the project.
When working on a remote project, your custom Maven URI settings will be automatically synchronized to the Talend Artifact
Repository and will be used when other users working on the same project install the external module.
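To make the structure of a Maven URI concrete, here is a minimal parsing sketch. It assumes the URI follows the common mvn:groupId/artifactId/version/packaging layout (with the packaging segment optional); check the Maven URI column of the Modules view for the exact form used in your Studio. The class and function names, and the example URIs in the usage note, are hypothetical.

```python
from typing import NamedTuple


class MavenUri(NamedTuple):
    group_id: str
    artifact_id: str
    version: str
    packaging: str


def parse_maven_uri(uri, default_packaging="jar"):
    """Split an mvn: URI into its coordinates.

    Assumed layout: mvn:groupId/artifactId/version[/packaging].
    """
    if not uri.startswith("mvn:"):
        raise ValueError("not a Maven URI: " + uri)
    parts = uri[len("mvn:"):].split("/")
    if len(parts) not in (3, 4) or not all(parts):
        raise ValueError("unexpected Maven URI layout: " + uri)
    group_id, artifact_id, version = parts[:3]
    packaging = parts[3] if len(parts) == 4 else default_packaging
    return MavenUri(group_id, artifact_id, version, packaging)
```

For example, parsing the hypothetical URI mvn:com.example/ojdbc14-9i/1.0.0/jar yields the artifact ID ojdbc14-9i and the version 1.0.0, which are the parts you would typically change when pointing a module at another driver version.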
Procedure
1. In the Modules view, click the Maven URI you want to customize and then click the [...] button that appears.
The Install Module dialog box opens.
2. If you want to install another version of the external module, specify the full path to the module file in the Module File field,
or click the [...] button to browse in your local file system.
If the MVN URI of the library is within the jar file and is different from the default MVN URI, it is automatically detected and
filled in the Custom MVN URI field.
3. Select the Custom MVN URI check box and enter a new URI in the field.
4. Click Detect the module install status and then OK to validate the custom URI and close the dialog box.
Results
The new Maven URI takes effect and is displayed in the Modules view, from which you can export all your Maven URI changes into a
local JSON file.
You can download and install all required and/or optional external modules in one go automatically.
Procedure
2. Select the Required third-party libraries and/or Optional third-party libraries check box(es) according to your needs.
3. Click OK.
The Review Licenses dialog box displays.
4. Accept the license terms and start the download and installation process:
To download and install the external module(s) provided under a particular license, select that license from the
Licenses pane, review the license terms, select the I accept the terms of the selected license agreement option, and
click Finish.
To download and install all external modules provided under all the listed licenses, click Accept all.
When the installation process is completed, the chosen external module or modules are installed into your Talend Studio,
and you can use Talend Studio features that depend on these modules.
If you have already downloaded external modules, you can install them manually into your Talend Studio.
If you are going to install the JDBC driver for Oracle 9i into your Talend Studio, first rename the file from ojdbc14.jar to
ojdbc14-9i.jar.
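The renaming step can be scripted. The sketch below is one possible approach, not a Talend utility: it copies the driver under the new name instead of renaming it in place, so the original download is kept. The function name copy_with_oracle_9i_name is hypothetical.

```python
import shutil
from pathlib import Path


def copy_with_oracle_9i_name(driver_path):
    """Copy ojdbc14.jar to ojdbc14-9i.jar in the same folder and return the new path.

    Copying (rather than renaming) is a design choice of this sketch so that
    the original file stays available.
    """
    src = Path(driver_path)
    if src.name != "ojdbc14.jar":
        raise ValueError("expected a file named ojdbc14.jar, got " + src.name)
    dst = src.with_name("ojdbc14-9i.jar")
    shutil.copy2(src, dst)
    return dst
```

The returned path is the file to select when you install the module manually.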
Procedure
1. Click in the upper right corner of the Modules view or in the Jar installation wizard to browse your local file system.
2. In the Open dialog box of your file system, browse to the module you want to install, double-click the .jar file, or select it
and then click Open to install it into your Talend Studio.
The dialog box closes and the selected module is installed in the library folder of the current Talend Studio.
If you have different versions of a database driver, you can override the current version of the driver by customizing its Maven URI.
That is, you can upgrade to the latest version or roll back to a previous version. Overriding a driver allows your Jobs to use any
available driver version.
Make sure the desired versions of the driver (for example, ojdbc14.jar and ojdbc14-9i.jar in the previous section) are available.
Note:
To avoid overriding failures, make sure the dependencies of the desired driver are resolved.
If you override a driver using the up-to-date driver (for example, version 2021) and then Talend Studio upgrades with a
newer driver version (for example, version 2022), the newer version will be used.
If you override a driver using an earlier driver (for example, overriding version 2021 using version 2020) and then Talend
Studio upgrades with the latest driver version (for example, version 2022), the latest version (that is, version 2022) will be
used.
Procedure
1. In the Modules view, locate the driver you want to override, click the Maven URI of the driver you want to customize, and
then click the [...] button that appears.
The Install Module dialog box opens.
2. Enter the full path to the desired driver file in the Module File field or click the [...] button and navigate to the desired driver
file.
If the MVN URI of the library is within the jar file and is different from the default MVN URI, it is automatically detected and
filled in the Custom MVN URI field.
3. Select the Custom MVN URI check box and enter a new URI in the field.
4. Click Detect the module install status and then OK to validate the custom URI and close the dialog box.
Note: To revert to the original driver, clear the check box to the left of the Custom MVN URI field.
Data Profiling
The following sections introduce Talend Data Quality and list its key features.
This data profiling tool allows you to identify potential problems before beginning data-intensive projects such as data integration.
a data profiler;
a data explorer;
a pattern manager; for more information about the pattern manager, see Patterns and indicators;
a metadata manager; for more information about the metadata manager, see Metadata repository.
Core features
Metadata repository
Using Talend Data Quality, you can connect to data sources to analyze their structure (catalogs, schemas and tables), and store the
description of their metadata in the metadata repository. You can then use this metadata to set up metrics and indicators.
Patterns are sets of strings against which you can define the content, structure and quality of highly complex data. The Profiling
perspective of Talend Studio lists two types of patterns: regular expressions, which are predefined regular patterns, and SQL
patterns, which are the patterns you add using LIKE clauses.
Indicators are the results achieved through the implementation of different patterns. They can represent the results of data
matching and various other data-related operations. The Profiling perspective of Talend Studio lists two types of indicators: system
indicators, a list of predefined indicators, and user-defined indicators, a list of those defined by the user.
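The relationship between the two pattern types and a pattern-matching indicator can be illustrated with a short sketch. This is not how Talend Studio evaluates patterns internally; it only shows the idea, assuming the standard LIKE wildcards (% for any sequence of characters, _ for exactly one character). The function names and sample values are hypothetical, and case sensitivity of LIKE in a real database depends on its collation.

```python
import re


def like_to_regex(like_pattern):
    """Translate a SQL LIKE pattern into a compiled regular expression.

    % matches any sequence of characters, _ matches exactly one character,
    and every other character is matched literally.
    """
    pieces = []
    for ch in like_pattern:
        if ch == "%":
            pieces.append(".*")
        elif ch == "_":
            pieces.append(".")
        else:
            pieces.append(re.escape(ch))
    return re.compile("".join(pieces), re.DOTALL)


def match_counts(values, compiled):
    """Return (matching, non_matching) counts, as a pattern indicator would report."""
    matched = sum(1 for v in values if compiled.fullmatch(v))
    return matched, len(values) - matched


emails = ["a@b.com", "bad", "x@y.org"]
regex_result = match_counts(emails, re.compile(r"[^@\s]+@[^@\s]+\.[a-z]+"))
like_result = match_counts(emails, like_to_regex("%@%.com"))
```

With the sample values, the regular expression accepts two of the three strings, while the narrower LIKE pattern accepts only the one ending in .com.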
Functional architecture
The functional architecture of your Talend Studio is an architectural model that identifies the Talend Studio functions, interactions
and corresponding IT needs. The overall architecture has been described by isolating specific functionalities in functional blocks.
The chart below illustrates the main architectural functional blocks explored within the studio.
A Profiling perspective where you can use predefined or customized patterns and indicators to analyze data stored in
different data sources.
A Data Explorer perspective where you can browse and query the results of the profiling analyses done on data.
From the Profiling perspective of Talend Studio, you can examine the data available in different data sources and collect statistics
and information about this data.
A typical sequence of profiling data using Talend Studio involves the following steps:
1. Connecting to a data source including databases and delimited files in order to be able to access the tables and columns on
which you want to define and execute analyses. For more information, see Creating connections to data sources.
2. Defining any of the available data quality analyses including database content analysis, column analysis, table analysis,
redundancy analysis, correlation analysis, etc. These analyses will carry out data profiling processes that will define the
content, structure and quality of highly complex data structures. The analysis results will be displayed graphically next to
each of the analysis editors, or in more detail in the Analysis Results view.
Note: While you can use all analysis types to profile data in databases, you can only use Column Analysis and Column
Set Analysis to profile data in delimited files.
Talend provides you with different demo projects you can import into your Talend Studio. The available demos vary and may
include ready-to-use Jobs which help you understand the functionalities of different Talend components.
input files and databases necessary to run the demo Jobs and analyses are imported under the Documentation folder in the
Integration perspective of the studio.
profiling analyses are imported in the DQ Repository tree view of the Profiling perspective. These analyses run on the
databases and files you installed initially as pointed out by the data quality tutorials.
data quality Jobs are imported in the Repository tree view of the Integration perspective. These Jobs use different data
quality components to standardize, deduplicate and match data for example.
You can run most of these Jobs without any prerequisites. However, for a few Jobs, you must restore the databases
tbi, tutorials, cif and crm in your MySQL server, and download some files locally. You can find the databases and files
under the Documentation folder in the Repository tree view in the Integration perspective.
Note: As some of the demo Jobs are shared with the data quality tutorials, their names may be prefixed with A, B, C, etc.
or 1, 2, 3, etc. You must run these Jobs in the specified order.
You can import the demo project either from the login window of your studio as a separate project, or from the Integration
perspective into your current project.
Procedure
1. Launch your Talend Studio and from the login window select Import a demo project and then click Select.
2. In the dialog box that opens, select the demo project you want to import and click Finish.
Note: The demo projects available in the dialog box may vary.
3. In the dialog box that opens, type in a name for the demo project you want to import and click Finish.
A bar is displayed to show the progress of the operation.
4. On the login window, select from the project list the demo project you imported and then click Finish to open the demo
project in the studio.
All the samples of the demo project are imported into the studio under different folders in the repository tree view including
the input files and metadata connection necessary to run the demo samples.
Procedure
1. Launch your studio and in the Integration perspective, click the icon on the toolbar.
2. In the dialog box that opens, select the demo project to import and click Finish.
A bar is displayed to show the progress of the operation and then a confirmation message opens.
3. Click OK.
This section details some important information about analysis editors, the Error log view and the help context embedded in Talend
Studio.
From Talend Studio, you can control memory usage when using the Java engine to run two types of analyses: column analysis and
the analysis of a set of columns.
If you use column analysis or column set analysis to profile very large sets of data or data with many problems, you may run out of
memory and end up with a Java heap error. If you define a maximum memory threshold for these analyses, Talend Studio stops the
analysis execution when the memory limit is reached and provides you with the analysis results that were measured on the data
before the execution was terminated.
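The "stop at the limit and keep the partial results" behavior can be illustrated with a small sketch. It is not Talend Studio's implementation: the number of distinct values held in memory stands in for a real memory measurement, and the function name and the max_distinct parameter are hypothetical.

```python
from collections import Counter


def profile_with_budget(values, max_distinct):
    """Count value frequencies until the budget is exhausted.

    Returns (counts, rows_processed, stopped_early). The distinct-value
    budget is only a stand-in for a memory limit.
    """
    counts = Counter()
    processed = 0
    for value in values:
        if value not in counts and len(counts) >= max_distinct:
            # Budget reached: stop and return what was measured so far.
            return counts, processed, True
        counts[value] += 1
        processed += 1
    return counts, processed, False


partial, rows_seen, stopped = profile_with_budget(["a", "b", "a", "c", "d"], max_distinct=2)
```

In this run the analysis stops at the third distinct value, and the returned counts cover only the three rows processed before the interruption.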
Procedure
1. On the menu bar, select Window > Preferences to display the Preferences window.
3. In the Memory area, select the Enable analysis thread memory control check box.
4. Move the slider to the right to define the memory limit at which the analysis execution will be stopped.
Results
The execution of any column analysis or column set analysis will be stopped if it exceeds the allocated memory size. The analysis
results given in Talend Studio will cover the data analyzed before the interruption of the analysis execution.
You can decide once and for all which sections to fold by default when you open any of the connection or analysis editors. You can
also set up the display of all analysis results and choose whether to show or hide the graphical results in the different analysis
editors.
Procedure
1. On the menu bar, select Window > Preferences to display the Preferences window.
3. In the Folding area, select the check box(es) corresponding to the display mode you want to set for the different sections in
all the editors.
4. In the Analysis results folding area, select the check boxes corresponding to the display mode you want to set for the statistic
results in the Analysis Results view of the analysis editor.
5. In the Graphics area, select the Hide graphics in analysis results page option if you do not want to show the graphical results
of the executed analyses in the analysis editor. This optimizes system performance when there are many graphics to
generate.
6. In the Analyzed Items Per Page field, set the number for the analyzed items you want to group on each page.
7. In the Business Rules Per Page field, set the number for the business rules you want to group in each page.
Note: You can always click the Restore Defaults tab on the Preferences window to bring back the default values.
8. Click Apply and then OK to validate the changes and close the Preferences window.
Results
When you carry out different analyses, all corresponding editors open with the display mode you set in the Preferences window.
In the Profiling perspective, when viewing the results of an analysis, 10 results are shown in the frequency tables by default. From
the Preferences window of Talend Studio, you can edit the default value for frequency and low frequency tables.
locked analyses,
open analyses, and
analyses with frequency indicators that use the current default value.
Procedure
1. On the menu bar, select Window > Preferences to display the Preferences window.
3. In the Number of result shown fields, set the default value for Frequency table and Low frequency table.
5. In the Set the Frequency Table Parameters dialog box, select the analyses for which to apply the new frequency table
parameters, and click OK.
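To show what the "number of results shown" setting controls, here is a minimal sketch of how a frequency table and a low frequency table could be derived from a column of values. It is an illustration under the assumption that both tables are simple rankings by count; the function name and the top_n parameter are hypothetical, with top_n defaulting to the documented default of 10.

```python
from collections import Counter


def frequency_tables(values, top_n=10):
    """Return (frequency, low_frequency) as lists of (value, count) pairs.

    frequency lists the most frequent values first; low_frequency lists
    the least frequent values first. Each list holds at most top_n rows.
    """
    counts = Counter(values)
    frequency = counts.most_common(top_n)
    low_frequency = sorted(counts.items(), key=lambda item: item[1])[:top_n]
    return frequency, low_frequency


sample = ["a", "a", "a", "b", "b", "c"]
freq, low_freq = frequency_tables(sample, top_n=2)
```

Lowering top_n shortens both tables without changing the underlying counts, which is why the setting affects only how many rows are displayed.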
Talend Studio provides you with cheat sheets that you can use as a quick reference that guides you through all common tasks in
data profiling.
You can also have access to a help panel that is attached to all wizards used in Talend Studio to create the different types of analyses
or to set thresholds on indicators.
When you open Talend Studio for the first time, the Cheat Sheets view opens by default in the Profiling perspective.
If you close the Cheat Sheets view in the Profiling perspective of Talend Studio, it remains closed whenever you switch back to
this perspective until you open it manually.
Procedure
1. Press the Alt+Shift+Q and then H shortcut keys, or select Window > Show View from the menu bar.
Or,
4. Select Help > Cheat Sheets from the menu bar. The Cheat Sheet Selection dialog box opens.
You can also press the Alt+H shortcut keys to open the Help menu and then select Cheat Sheets.
5. Expand the Talend > Cheat Sheets folder, select the cheat sheet you want to open in Talend Studio and then click OK.
The selected cheat sheet opens in the Talend Studio main window. Use the local toolbar icons to manage the display of the
cheat sheets.
A help panel is attached to the wizards used in Talend Studio to create and manage profiling items. This help panel opens by default
in all wizards.
Procedure
1. Select Window > Preferences > Talend > Profiling > Web Browser.
The Web Browser view opens.
2. Select the Block browser help check box and then click OK.
From now on, all wizards in Talend Studio display without the help panel.
Talend Studio provides you with a Module view. This view shows whether a module is required for creating a connection to a
database.
Checking the Module view helps you verify which modules you have or should have to run your profiling analyses smoothly.
Procedure
Icon To
Talend Studio provides you with very comprehensive log files that maintain diagnostic information and record any errors that are
encountered in the data profiling process.
The Error Log view is the first place to look when a problem occurs while profiling data, since it will often contain details of what
went wrong and how to fix it.
Procedure
Note: The filter field at the top of the view enables dynamic filtering: as you type text in the field, the list shows only
the logs that match the filter.
You can use icons on the view toolbar to carry out different management options including exporting and importing the error
log files.
Each error log in the list is preceded by an icon that indicates the severity of the log: error, warning or
information.
4. Double-click any of the error log files to open the Event Detail dialog box.
5. If required, click the icon in the Event Detail dialog box to copy the event detail to the clipboard and then paste it
anywhere you like.
It is possible to open new analysis or SQL editors in the Profiling and Data Explorer perspectives respectively.
To be able to use Data Explorer in Talend Studio, you must install certain SQL explorer libraries that are required for data quality. If
you do not install these libraries, the Data Explorer perspective will be missing from Talend Studio and many features will not be
available.
For further information about identifying and installing external modules, see the Talend Installation and Upgrade Guide.
Procedure
To open an empty new SQL editor from the Data Explorer perspective, do the following:
3. In the Connections view of the Data Explorer perspective, right-click any connection in the list.
A contextual menu is displayed.
What to do next
To open an empty SQL editor from the Profiling perspective of Talend Studio, see the procedure outlined in Creating and storing SQL
queries.
When you create any analysis type in Talend Studio, a corresponding analysis item is listed under the Analyses folder in the
DQ Repository tree view.
Note: The number of the analyses created in the studio will be indicated next to this Analyses folder in the DQ Repository tree
view.
This analysis list will give you an idea about any problems in one or more of your analyses before even opening the analysis.
If an analysis fails to run, a small red cross icon is added to it. If an analysis runs correctly but has violated thresholds, a
warning icon is added to it.
The Profiling perspective of Talend Studio enables you to create connections to databases and to delimited files in order to profile
data in these data sources.
Connecting to a database
Before you analyze data in a specific database, you must first set up the connection to this database. From the Profiling
perspective of Talend Studio, you can create a connection to the database management system (DBMS) and show the database
content.
For more information about the supported databases for profiling data, see Talend Data Fabric Installation Guide.
Connections to different databases are reflected by different tree levels and different icons in the DQ Repository tree view because
the logical and physical structure of data differs from one relational database to another. The highest level structure "Catalog"
followed by "Schema" and finally by "Table" is not applicable to all database types.
Creating a connection
You have read What you need to know about some databases carefully.
Procedure
1. In the DQ Repository tree view, expand Metadata, right-click DB Connections and select Create DB Connection.
2. In the Name field, enter a name for this new database connection.
Do not use spaces in the connection name.
Note: Avoid using the following special characters in item names:
"~", "!", "`", "#", "^", "&", "*", "\\", "/", "?", ":", ";", "\"", ".", "(", ")", "'", "¥", "'", """, "«", "»", "<", ">".
These characters are all replaced with "_" in the file system and you may end up creating duplicate items.
3. If required, set other connection metadata (purpose, description and author name) in the corresponding fields and click Next
to proceed to the next step.
4. From the DB Type drop-down list, select the type of database to which you want to connect, for example MySQL.
For further information about supported databases, see the Talend Installation Guide.
Note: If you choose to connect to a database that is not supported in the studio (using the ODBC or JDBC methods), it is
recommended to use the Java engine to execute the column analyses created on the selected database. For more
information on column analyses, see Defining the columns to be analyzed and setting indicators, and for more
information on the Java engine, see Using the Java or the SQL engine.
6. In the DB Version field, select the version of the database to which you are creating the connection.
7. Enter your login, password, server and port information in their corresponding fields.
8. In the Database field, enter the name of the database you are connecting to. To connect to all of the catalogs within one
connection, if the database allows it, leave this field empty.
If you have not already installed the database driver (.jar file) necessary to use the database, a wizard prompts you to
install the relevant third-party module. Click Download and Install, and then close the wizard.
For further information about identifying and installing external modules, see the Talend Installation and Upgrade Guide or
click the How to install a driver link in the wizard.
For further information about the Module view, see Displaying the Module view.
11. If you created this connection in a reference project, expand Tables > table name > Columns.
Expanding the columns in a reference project allows you to select them from the main project.
Results
Once you have created the connection, you can open a preview of the data in a specific database table in Talend Studio. For further
information, see Previewing data in the SQL editor.
Click Connection information to show the connection parameters for the relevant database.
Click the Check button to check the status of your current connection.
Click the Edit... button to open the connection wizard and modify any of the connection information.
You can create a connection on a database catalog or schema directly from a database connection.
At least one database connection is set in the Profiling perspective of Talend Studio. For further information, see Connecting to a
database
Procedure
1. In the DQ Repository tree view, expand Metadata > DB Connections and browse to the catalog or schema on which you want
to create the connection.
3. Click OK.
Results
A new connection named after the selected connection and catalog is created under DB Connections.
The database connection wizard in Talend Studio lists the databases to which you can create a connection and on which you can run
profiling processes.
You can still use Talend Studio to connect to a custom "unsupported" database.
Procedure
What to do next
After creating the connection to a custom database, you can profile and monitor data in this database by using different analyses
and indicators, as you do with supported databases. But you may need to change, in the Indicator Settings editor, the SQL query
template for some indicators, such as the regex indicator where each database has a different function to call. For further
information, see Editing a system indicator and Editing a user-defined indicator.
Note: If you have a problem profiling a custom database even though you use a JDBC connection, the reason could be that some
JDBC functions are not implemented by the JDBC driver library. Please raise an issue or ask support via Talend Community at:
https://community.talend.com/
Google BigQuery
Profiling data from Google BigQuery requires a JDBC connection.
When you set up a JDBC connection, specify each jar file extracted from the zip file.
Hive
The Hive server requires sufficient memory to run correctly. Before connecting to a Hive database:
If you select to connect to the Hive database, you will be able to create and execute different analyses as with the other database
types.
In the connection wizard, you must select from the Distribution list the platform that hosts Hive. You must also set the Hive version
and model.
If you decide to change the user name in an embedded mode of a Hive connection, you must restart the studio before being able to
successfully run the profiling analyses that use the connection.
If the Hadoop distribution to be used is Hortonworks Data Platform V1.2 or Hortonworks Data Platform V1.3 , you must set
proper memory allocations for the map and reduce computations to be performed by the Hadoop system. In the second step in the
connection wizard:
1. Click the button next to Hadoop Properties and in the open dialog box click the [+] button to add two lines to the table.
If the Hadoop distribution to be used is Hortonworks Data Platform V2.0 (YARN), you must set the following parameter in the Hadoop
Properties table:
mapreduce/lib/
Note that one analysis type and a few indicators and functions are still not supported for Hive. See the table below for more detail:
If you select to connect to the Microsoft SQL Server database with Windows Authentication Mode, you can select Microsoft or JTDS
open source from the Db Version list.
MySQL
When creating a connection to MySQL via JDBC, it is not mandatory to include the database name in the JDBC URL. Regardless of
whether the database connection URL specified in the JDBC URL field includes the database name, all databases are retrieved.
For example, if you specify jdbc:mysql://192.168.33.41:3306/tbi?noDatetimeStringSync=true , where tbi is the database name,
or jdbc:mysql://192.168.33.41:3306/?noDatetimeStringSync=true , all databases are retrieved.
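As an illustration of the two URL forms above, the following minimal Python sketch splits a MySQL JDBC URL into its parts. The helper name parse_mysql_jdbc_url is hypothetical and is not part of Talend Studio; the sketch only shows that the database segment of the URL may be empty.

```python
from urllib.parse import urlsplit, parse_qs


def parse_mysql_jdbc_url(url):
    """Split a MySQL JDBC URL into host, port, database and parameters.

    Hypothetical helper for illustration only; not a Talend Studio API.
    """
    prefix = "jdbc:"
    if not url.startswith(prefix + "mysql://"):
        raise ValueError("not a MySQL JDBC URL: " + url)
    # Parse the "mysql://host:port/db?params" remainder as a generic URL.
    parts = urlsplit(url[len(prefix):])
    # The database name is optional: an empty path yields None.
    database = parts.path.lstrip("/") or None
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return {
        "host": parts.hostname,
        "port": parts.port,
        "database": database,
        "params": params,
    }


with_db = parse_mysql_jdbc_url(
    "jdbc:mysql://192.168.33.41:3306/tbi?noDatetimeStringSync=true")
without_db = parse_mysql_jdbc_url(
    "jdbc:mysql://192.168.33.41:3306/?noDatetimeStringSync=true")
```

Both URLs point to the same server and port; only the database entry differs (tbi in the first case, none in the second).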
To support surrogate pairs in data and metadata, you need to edit the following properties in the MySQL server configuration file:
[client]
default-character-set=utf8mb4
[mysql]
default-character-set=utf8mb4
[mysqld]
character-set-server=utf8mb4
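One possible way to check these properties is to read the configuration text and list every character-set option it contains. The sketch below uses Python's standard configparser module. The function name charset_settings and the sample text are assumptions made for illustration, and the sample places character-set-server under a [mysqld] section because it is a server option. Real MySQL configuration files may contain directives such as !includedir that this simple parser does not handle.

```python
import configparser

EXPECTED = "utf8mb4"


def charset_settings(cnf_text):
    """Return every character-set option found in a my.cnf-style text."""
    # allow_no_value accepts bare option names, which are common in my.cnf files.
    parser = configparser.ConfigParser(allow_no_value=True, strict=False)
    parser.read_string(cnf_text)
    found = {}
    for section in parser.sections():
        for key in ("default-character-set", "character-set-server"):
            if parser.has_option(section, key):
                found[(section, key)] = parser.get(section, key)
    return found


# Sample configuration text (assumption: server option under [mysqld]).
sample = """
[client]
default-character-set=utf8mb4
[mysql]
default-character-set=utf8mb4
[mysqld]
character-set-server=utf8mb4
"""

settings = charset_settings(sample)
all_utf8mb4 = all(value == EXPECTED for value in settings.values())
```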
Netezza
The Netezza database does not support regular expressions. If you want to use regular expressions with this database, you must:
Install the SQL Extensions Toolkit package on a Netezza system. Use the regexp_like function provided in this toolkit in the
SQL template as documented in
http://pic.dhe.ibm.com/infocenter/ntz/v7r0m3/topic/com.ibm.nz.sqltk.doc/r_sqlext_regexp_like.html.
Add the indicator definition for Netezza in the Pattern Matching folder in Talend Studio under Libraries > Indicators > System
Indicators.
The query template you need to define for Netezza is as follows: SELECT COUNT(CASE WHEN
REGEXP_LIKE(<%=COLUMN_NAMES%>,<%=PATTERN_EXPR%>) THEN 1 END), COUNT(*) FROM <%=TABLE_NAME%><%=WHERE_CLAUSE%> .
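To see what such a template produces once its placeholders are filled in, the following Python sketch performs a plain text substitution. It is an illustration only and does not reproduce how Talend Studio generates queries. The render function, the product table, the code column, the pattern and the WHERE clause are all hypothetical, and the second aggregate is assumed to be COUNT(*).

```python
import re

# Query template with <%=NAME%> placeholders (second aggregate assumed to be COUNT(*)).
TEMPLATE = (
    "SELECT COUNT(CASE WHEN REGEXP_LIKE(<%=COLUMN_NAMES%>,<%=PATTERN_EXPR%>) "
    "THEN 1 END), COUNT(*) FROM <%=TABLE_NAME%><%=WHERE_CLAUSE%>"
)


def render(template, values):
    """Replace each <%=NAME%> placeholder with its value from `values`."""
    def substitute(match):
        # Raises KeyError if the template uses a placeholder with no value.
        return values[match.group(1)]
    return re.sub(r"<%=(\w+)%>", substitute, template)


# Hypothetical table, column, pattern and filter.
sql = render(TEMPLATE, {
    "COLUMN_NAMES": "code",
    "PATTERN_EXPR": "'^[A-Z]{3}$'",
    "TABLE_NAME": "product",
    "WHERE_CLAUSE": " WHERE active = 1",
})
```

The first aggregate counts the rows whose value matches the pattern and the second counts all rows, so the two numbers together give the matching ratio.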
Oracle
To support surrogate pairs, the NLS_CHARACTERSET parameter of the database must be set to UTF8 or AL32UTF8 .
NLS_CHARACTERSET=WE8ISO8859P15
NLS_NCHAR_CHARACTERSET=AL16UTF16
Note: To check the database parameters, you can run the following SQL query: SQL> SELECT * FROM NLS_DATABASE_PARAMETERS;
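The check described above can be sketched in a few lines of Python. The dictionary stands for the PARAMETER and VALUE pairs returned by the query on NLS_DATABASE_PARAMETERS, and the function name supports_surrogate_pairs is hypothetical.

```python
# Character sets that the text above names as supporting surrogate pairs.
SUPPORTED = {"UTF8", "AL32UTF8"}


def supports_surrogate_pairs(nls_parameters):
    """Check NLS_CHARACTERSET in a {parameter: value} mapping."""
    value = nls_parameters.get("NLS_CHARACTERSET", "")
    return value.upper() in SUPPORTED


# Values as they appear in the two parameter lines above.
current = {
    "NLS_CHARACTERSET": "WE8ISO8859P15",
    "NLS_NCHAR_CHARACTERSET": "AL16UTF16",
}
needs_change = not supports_surrogate_pairs(current)
```

With the values shown, needs_change is True, which means NLS_CHARACTERSET would have to be changed before surrogate pairs are supported.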
PostgreSQL
When you connect to a PostgreSQL database via a JDBC connection, the INT4 and INT8 data types are replaced by a String data
type. As a consequence, if your analysis uses the T-Swoosh algorithm, the survivorship functions are for strings, not for numbers.
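The practical effect of treating integers as strings can be shown with a small Python comparison. This is a generic illustration of string versus numeric ordering, not a reproduction of the T-Swoosh algorithm or of any specific survivorship function; the sample values are made up.

```python
# Integer values as they would arrive when the column is read as a string type.
values_as_strings = ["9", "10", "200"]

# A "largest" rule applied to strings compares character by character.
largest_as_string = max(values_as_strings)

# The same rule applied to numbers compares magnitudes.
largest_as_number = max(values_as_strings, key=int)

# String-oriented rules such as "longest" still behave as expected.
longest = max(values_as_strings, key=len)
```

Here the string comparison selects "9" while the numeric comparison selects "200", which is why the choice of survivorship function matters for these columns.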
SAP Hana
Profiling data from SAP Hana is only possible for Table, View and Calculation View schemas.
The Soundex frequency statistics indicators support the English alphabet only.
Snowflake
Profiling data from Snowflake requires a JDBC connection.
Teradata
In the Teradata database, the regular expression function is installed by default only starting from version 14. If you want to use
regular expressions with older versions of this database, you must install a User Defined Function in Teradata and add the indicator
definition for Teradata in Talend Studio.
The structure of a database defines how objects are organized in the database. Different data storage structures are used to store
objects in databases. For example, the highest-level structure (such as "Catalog" followed by "Schema" and finally by "Table") is not
applicable to all database types.
AS/400 V5R4
DB2 -
DB2 ZOS -
Informix -
Ingres -
MySQL -
Netezza -
Oracle -
PointBase -
PostgreSQL -
SQLite -
Sybase -
Teradata -
Connecting to a file
Before proceeding to analyze data in a delimited file, you must first set up the connection to this file.
Procedure
1. Expand the Metadata folder.
2. Right-click FileDelimited connections and then select Create File Delimited Connection to open the New Delimited File
wizard.
3. Follow the steps defined in the wizard to create a connection to a delimited file.
You can then create a column analysis and drop the columns to analyze from the delimited file metadata in the DQ
Repository tree view to the open analysis editor.
Results
Several management options are available for each of the connections created in Talend Studio.
Many management options are available for database connections including editing and duplicating the connection or adding a task
to it.
You can edit the connection to a specific database and change the connection metadata and the connection information.
Procedure
2. Either:
3. Go through the steps in the wizard and modify the database connection settings as required.
5. Select the reload option if you want to reload the new database structure for the updated database connection.
Note: If you select the don't reload option, you will still be able to execute the analyses using the connection even after
you update it.
If the database connection is used by profiling analyses in the Studio, another dialog box is displayed to list all the analyses
that use the database connection. It alerts you that if you reload the new database structure, all the analyses using the
connection will become unusable, although they will still be listed in the DQ Repository tree view.
6. Click OK to accept reloading the database structure or Cancel to cancel the operation and close the dialog box.
A number of confirmation messages are displayed one after the other.
7. Click OK to close the messages and reload the structure of the new connection.
After setting a specific database connection in the studio, you may not want to view all databases in the DQ Repository tree view of
your Studio.
You can filter your database connections to list the databases that match the filter you set. This option is very helpful when a
specific connection contains a very large number of databases.
Procedure
2. Right-click the database connection you want to filter and select Package Filter to open the corresponding dialog box.
3. In the Package Filter field, enter the complete name of the database you want to view and then click Finish to close the
dialog box.
Only the database that matches the filter you set is listed under the database connection in the DQ Repository tree view.
To avoid creating a DB connection from scratch, you can duplicate an existing one in the DB Connections list and modify its
metadata to create a new connection.
Procedure
2. Right-click the connection you want to duplicate and select Duplicate from the contextual menu.
Results
The duplicated database connection shows under the connection list in the DQ Repository tree view as a copy of the original
connection. You can now open the duplicated connection and modify its metadata as needed.
You can add a task to a database connection to use it as a reminder to modify the connection or to flag a problem that needs to be
solved later, for example. You can also add a task to a catalog, a table or a column in the connection.
Procedure
1. Expand Metadata > DB connections.
2. Right-click the connection to which you want to add a task, and then select Add task... from the contextual menu.
The Properties dialog box opens showing the metadata of the selected connection.
3. In the Description field, enter a short description for the task you want to attach to the selected connection.
4. On the Priority list, select the priority level and then click OK to close the dialog box.
The created task is added to the Tasks list.
Results
What to do next
You can follow the same steps in the above procedure to add a task to a catalog, a table or a column in the connection. For further
information, see Adding a task to a column in a database connection.
For more information on how to access the task list, see Displaying the task list.
You can filter the tables/views to list under any database connection.
This option is very helpful when the database to which the studio is connecting contains a very large number of tables. In that
case, a message is displayed prompting you to set a table filter on the database connection in order to list only defined tables in
the DQ Repository tree view.
Procedure
2. Expand the database connection in which you want to filter tables/views and right-click the desired catalog/schema.
3. Select Table/View Filter from the list to display the corresponding dialog box.
4. Set a table and a view filter in the corresponding fields and click Finish to close the dialog box.
Results
Only tables/views that match the filter you set are listed in the DQ Repository tree view.
You can move a database connection to the studio recycle bin whether it is used by analyses or not.
A database connection is created in Talend Studio. For further information, see Connecting to a database.
Procedure
1. In the DQ Repository tree view, expand Metadata > DB Connections.
2. Right-click a database connection and select Delete in the contextual menu.
You can still run the analyses that use the connection in the recycle bin. However, an alert message will be displayed next to
the connection name in the analysis editor.
If the connection is used by one or more analyses in Talend Studio, a dialog box is displayed to list such analyses:
Either click OK to close the dialog box without deleting the database connection from the recycle bin, or
Select the Force to delete all the dependencies check box and then click OK to delete the database connection from
the Recycle Bin and to delete all the dependent analyses from the Data Profiling node.
You can also permanently delete the database connection by emptying the recycle bin.
If the connection is not used by any analysis in the current Studio, a confirmation dialog box is displayed.
c. Click OK to close the dialog box without removing the connection from the recycle bin.
You can restore the deleted database connection from the Talend Studio recycle bin.
Procedure
Results
The database connection is moved back to the Metadata node.
A few management options are available for file connections, including editing or deleting the connection, adding a task to it, or
importing and exporting the connection.
You can edit the connection to a specific file and change the connection metadata and the connection information.
Procedure
3. Go through the steps in the wizard and modify the file connection settings as required.
Other management procedures for file connection are the same as those for databases.
For further information on how to add a task to a file connection, see Adding a task to a database connection or any of its
elements.
For further information on how to delete or restore a file connection, see Deleting a database connection.
For further information on how to import or export a file connection, see Importing data profiling items.
You can analyze the content of a database to have an overview of the number of tables in the database, rows per table and indexes
and primary keys.
You can also analyze one specific catalog or schema in a database, if a catalog or schema is used in the physical structure of the
database.
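To make the kind of summary such an analysis returns more concrete, here is a minimal Python sketch that computes a similar overview for an in-memory SQLite database. It only illustrates the idea (tables, rows per table, keys and user-defined indexes) and is not how Talend Studio collects this information. The overview function and the customer table are hypothetical, and the queries rely on SQLite-specific catalog features.

```python
import sqlite3


def overview(conn):
    """Summarize each table: row count, key columns, user-defined indexes."""
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    summary = {}
    for (name,) in tables:
        rows = conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
        # table_info: column 5 is the position in the primary key (0 if none).
        key_columns = sum(
            1 for col in conn.execute(f'PRAGMA table_info("{name}")') if col[5] > 0
        )
        # index_list: column 3 is the origin, 'c' meaning CREATE INDEX.
        user_indexes = sum(
            1 for idx in conn.execute(f'PRAGMA index_list("{name}")') if idx[3] == "c"
        )
        summary[name] = {
            "rows": rows,
            "key_columns": key_columns,
            "user_indexes": user_indexes,
        }
    return summary


# Hypothetical sample database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO customer (name) VALUES (?)", [("a",), ("b",), ("c",)])
conn.execute("CREATE INDEX idx_customer_name ON customer (name)")

result = overview(conn)
```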
From the Profiling perspective of Talend Studio, you can create an analysis to examine the content of a given database.
Before you begin, you have defined at least one database connection in the Profiling perspective of Talend Studio.
To create a database content analysis, you must first define the relevant analysis and then select the database connection you want
to analyze.
Click a catalog or a schema to list all tables included in it along with a summary of their content: number of rows, keys and
user-defined indexes.
The selected catalog or schema is highlighted in blue. Catalogs or schemas highlighted in red indicate potential problems in
data.
Right-click a catalog or a schema and select Overview analysis to analyze the content of the selected item.
Right-click a table or a view and select Table analysis to create a table analysis on the selected item.
Click any column header in the analytical table to sort the data listed in catalogs or schemas alphabetically.
Procedure
3. In the filter field, start typing connection overview analysis , select Connection Overview Analysis from the list that is
displayed and click Next.
You can create a database content analysis in a shortcut procedure if you right-click the database under Metadata > DB
connections and select Overview analysis from the contextual menu.
Note: Avoid using the following special characters in item names:
"~", "!", "`", "#", "^", "&", "*", "\\", "/", "?", ":", ";", "\"", ".", "(", ")", "'", "¥", "'", """, "«", "»", "<", ">".
These characters are all replaced with "_" in the file system and you may end up creating duplicate items.
5. Set the analysis metadata (purpose, description and author name) in the corresponding fields and click Next.
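The duplicate-name risk mentioned in the note on special characters can be illustrated with a short Python sketch. The function file_system_name is hypothetical and only mimics the replacement described in the note; the character set below is a subset of the listed characters, and the item names are made up.

```python
# Subset of the special characters listed in the note (assumption: not exhaustive).
SPECIAL = set('~!`#^&*\\/?:;".()\'<>«»¥')


def file_system_name(item_name):
    """Replace every special character with an underscore."""
    return "".join("_" if ch in SPECIAL else ch for ch in item_name)


# Two different item names that differ only by a special character.
first = file_system_name("sales:2022")
second = file_system_name("sales/2022")
collision = first == second
```

Both names become sales_2022, so two distinct items would end up with the same name in the file system.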
Procedure
1. Expand DB Connections and select a database connection to analyze, if more than one exists.
2. Click Next.
3. Set filters on the tables and/or views you want to analyze in their corresponding fields according to your needs using the SQL
language.
By default, the analysis examines all tables and views in the database.
A folder for the newly created analysis is listed under the Analyses folder in the DQ Repository tree view, and the connection
editor opens with the defined metadata.
Note: The display of the connection editor depends on the parameters you set in the Preferences window. For more
information, see Setting preferences of analysis editors and analysis results.
6. In the Context Group Settings view, select from the list the context environment you want to use to run the analysis.
The table in this view lists all context environments and their values you define in the Contexts view in the analysis editor. For
further information, see Using context variables in analyses.
You can use the Profiling perspective of Talend Studio to analyze one specific catalog or schema in a database, if this entity is used in
the physical structure of the database.
The result of the analysis gives analytical information about the content of this schema, for example number of rows, number of
tables, number of rows per table and so on.
At least one database connection has been created to connect to a database that uses the "catalog" or "schema" entity. For further
information, see Connecting to a database.
Procedure
1. Under DB connections in the DQ Repository tree view, right-click the catalog or schema for which you want to create a content
analysis, and select Overview analysis from the contextual menu.
This example shows how to create a schema analysis.
2. In the wizard that opens, enter a name for the current analysis.
Note: Avoid using the following special characters in item names:
"~", "!", "`", "#", "^", "&", "*", "\\", "/", "?", ":", ";", "\"", ".", "(", ")", "'", "¥", "'", """, "«", "»", "<", ">".
These characters are all replaced with "_" in the file system and you may end up creating duplicate items.
3. If required, set the analysis metadata (purpose, description and author name) in the corresponding fields and click Next.
4. Set filters on the tables and/or views you want to analyze in their corresponding fields according to your needs using the SQL
language.
By default, the analysis examines all tables and views in the catalog.
5. Click Finish.
A folder for the newly created analysis is listed under Analysis in the DQ Repository tree view, and the analysis editor opens
with the defined metadata.
Click the schema to list all tables included in it along with a summary of their content: number of rows, keys and
user-defined indexes.
The selected schema is highlighted in blue. Schemas highlighted in red indicate potential problems in data.
Right-click a schema and select Overview analysis to analyze the content of the selected item.
Right-click a table or a view and select Table analysis to create a table analysis on the selected item. You can also view
the keys and indexes of a selected table. For further information, see Displaying keys and indexes of database tables.
Click any column header in the analytical table to sort the listed data alphabetically.
After you create a connection to a database, you can open a view in Talend Studio to see actual data in the database.
Procedure
1. In the DQ Repository tree view, expand Metadata > DB Connections.
Results
The query is listed under the Libraries > Source Files folder in the DQ Repository tree view.
After analyzing the content of a database, you can display the details of the key and user-defined index of a given table. This
information could be very interesting for the database administrator.
At least one database content analysis has been created and executed in the Profiling perspective of Talend Studio.
Procedure
1. In the Analysis Results view of the analysis editor, click a catalog or a schema under Statistical Information.
All the tables included in the selected catalog or schema are listed along with a summary of their content: number of rows,
keys and user-defined indexes.
2. In the table list, right-click the table and select View keys.
You cannot display the key details of tables in a Hive connection.
The Database Structure and the Database Detail views display the structure of the analyzed database and information about
the primary key of the selected table.
3. Optional: If one or both views do not show, select Window > Show View > Database Structure or Window > Show View >
Database Detail.
4. In the table list, right-click the table and select View indexes.
You cannot display the index details of tables in a Hive connection.
The Database Structure and the Database Detail views display the structure of the analyzed database and information about
the user-defined index of the selected table.
5. If required, click any of the tabs in the Database Detail view to display the relevant metadata about the selected table.
When the data in a source database is changed or updated, it is necessary that the relevant connection structure in Talend Studio
follows that change or update as well. Otherwise, errors may occur when trying to analyze a column that has been modified/deleted
in a database.
You can synchronize the connection structure displayed in the DQ Repository tree view with the database structures to eliminate any
inconsistency. You can perform synchronization at the following three different levels:
You can compare and match the catalog and schema lists in the DQ Repository tree view with those in the database.
Procedure
2. Right-click the database connection you want to synchronize with the database and select Reload database list.
A message will prompt you for confirmation as any change in the database structure may affect the analyses created on
these catalogs or schemas from Talend Studio.
Results
The selected database connection is updated with the new catalogs and schemas, if any.
You can compare and match the table lists in the DQ Repository tree view with those in the database.
Procedure
2. Browse through the entities in your database connection to reach the Table folder you want to synchronize with the
database.
A message will prompt you for confirmation as any change in the database structure may affect the analyses created on
these tables from Talend Studio.
You can compare and match the column lists in the DQ Repository tree view with those in the database.
Procedure
2. Browse through the entities in your database connection to reach the Columns folder you want to synchronize with the
database.
A message will prompt you for confirmation as any change in the database structure may affect the analyses created on
these columns from Talend Studio.
Results
The selected column list is updated with the new column in the database, if any.
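Conceptually, synchronizing a list amounts to comparing the names stored in the repository with the names currently in the database. The Python sketch below shows that comparison for a column list. It is a simplified illustration and not a description of how Talend Studio implements the reload; the function diff_columns and the column names are hypothetical.

```python
def diff_columns(repository_columns, database_columns):
    """Compare the stored column list with the current database column list."""
    repo, db = set(repository_columns), set(database_columns)
    return {
        "added": sorted(db - repo),    # in the database, not yet in the repository
        "removed": sorted(repo - db),  # in the repository, no longer in the database
    }


# Hypothetical column lists before synchronization.
changes = diff_columns(["id", "name", "phone"], ["id", "name", "email"])
```

An analysis that still refers to a column in the removed list is the kind of case where errors may occur after the structure changes.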
Redundancy analyses
What are redundancy analyses?
Redundancy analyses are column comparison analyses that better explore the relationships between tables through:
From Talend Studio, you can create an analysis that compares two identical sets of columns in two different tables. This redundancy
analysis supports only database tables.
Prerequisite(s): At least one database connection is set in the Profiling perspective of the studio. For further information, see
Connecting to a database.
Through this view, you can also access the actual analyzed data via the Data Explorer.
To access the analyzed data rows, right-click any of the lines in the table and select:
Option                 To...
View match rows        access a list of all rows that could be matched in the two identical column sets
View not match rows    access a list of all rows that could not be matched in the two identical column sets
View rows              access a list of all rows in the two identical column sets
Warning: The data explorer does not support connections that have an empty user name, such as Single sign-on of MS SQL Server.
If you analyze data using such a connection and you try to view data rows in the Data Explorer perspective, a warning message
prompts you to set your connection credentials to the SQL Server.
The figure below illustrates the data explorer list of all rows that could be matched in the two sets, eight in this example.
From the SQL editor, you can save the executed query and list it under the Libraries > Source Files folders in the DQ Repository tree
view if you click the save icon on the editor toolbar. For more information, see Saving the queries executed on indicators.
The figure below illustrates the data explorer list of all rows that could not be matched in the two sets, three in this example.
Procedure
3. In the filter field, start typing redundancy analysis , select Redundancy Analysis from the list and click Next.
Note: Avoid using the following special characters in item names:
"~", "!", "`", "#", "^", "&", "*", "\\", "/", "?", ":", ";", "\"", ".", "(", ")", "'", "¥", "'", """, "«", "»", "<", ">".
These characters are all replaced with "_" in the file system and you may end up creating duplicate items.
5. Set the analysis metadata (purpose, description and author name) in the corresponding fields and then click Next.
Procedure
1. Expand DB connections and in the desired database, browse to the columns you want to analyze, select them and then click
Finish to close the wizard.
A file for the newly created analysis is listed under the Analysis folder in the DQ Repository tree view. The analysis editor
opens with the defined analysis metadata.
The display of the analysis editor depends on the parameters you set in the Preferences window. For more information, see
Setting preferences of analysis editors and analysis results.
2. Click Analyzed Column Sets to open the view where you can set the columns or modify your selection.
In this example, you want to compare identical columns in the account and account_back tables.
3. From the Connection list, select the database connection relevant to the database to which you want to connect.
You can find in this list all the database connections you create and centralize in the Studio repository.
5. Browse the catalogs/schemas in your database connection to reach the table holding the columns you want to analyze.
You can filter the table or column lists by typing the desired text in the Table filter or Column filter fields respectively. The lists
will show only the tables/columns that correspond to the text you type in.
6. Click the table name to list all its columns in the right-hand panel of the Column Selection dialog box.
7. In the list to the right, select the check boxes of the column(s) you want to analyze and click OK to proceed to the next step.
You can drag the columns to be analyzed directly from the DQ Repository tree view to the editor.
If you right-click any of the listed columns in the Analyzed Columns view and select Show in DQ Repository view, the selected
column will be automatically located under the corresponding connection in the tree view.
8. Click B Column Set and follow the same steps to select the second set of columns or drag it to the right column panel.
9. Select the Compute only number of A rows not in B check box if you want to match the data from the A set against the data
from the B set and not vice versa.
10. Select the Ignore Null check box if you want to ignore the NULL values when matching.
This check box is available only if you have installed the R2021-05 Studio monthly update or a later one provided by Talend.
Procedure
1. In the Data Filter view, enter an SQL WHERE clause to filter the data on which to run the analysis, if required.
2. In the Analysis Parameter view and in the Number of connections per analysis field, set the number of concurrent
connections allowed per analysis to the selected database, if required.
You can set this number according to the database available resources, that is the number of concurrent connections each
database can support.
3. If you have defined context variables in the Contexts view in the analysis editor, complete the following steps:
a. Use the Data Filter and Analysis Parameter views to set/select context variables to filter data and to decide the
number of concurrent connections per analysis respectively.
b. In the Context Group Settings view, select from the list the context environment you want to use to run the analysis.
For further information about contexts and variables, see Using context variables in analyses.
Results
In this example, 72.73% of the data present in the columns in the account table could be matched with the same data in the columns
in the account_back table.
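The match percentage reported for the two column sets can be approximated outside the Studio. The following is a minimal sketch only, using Python and an in-memory SQLite database. The column names (account_id, account_type) and the rows are hypothetical, chosen so that 8 of 11 rows match; the query is an assumption about how such a comparison could be written, not the query the Studio generates.

```python
import sqlite3

def rows_matched(conn, table_a, table_b, columns):
    """Count rows of table_a whose values in `columns` also occur in table_b."""
    condition = " AND ".join(f"a.{c} = b.{c}" for c in columns)
    query = (
        f"SELECT COUNT(*) FROM {table_a} a "
        f"WHERE EXISTS (SELECT 1 FROM {table_b} b WHERE {condition})"
    )
    return conn.execute(query).fetchone()[0]

# Hypothetical demo data: 11 rows in account, only the first 8 copied to account_back.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (account_id INTEGER, account_type TEXT)")
conn.execute("CREATE TABLE account_back (account_id INTEGER, account_type TEXT)")
rows = [(i, "Asset" if i % 2 else "Liability") for i in range(1, 12)]
conn.executemany("INSERT INTO account VALUES (?, ?)", rows)
conn.executemany("INSERT INTO account_back VALUES (?, ?)", rows[:8])

columns = ["account_id", "account_type"]
total_a = conn.execute("SELECT COUNT(*) FROM account").fetchone()[0]
matched_a = rows_matched(conn, "account", "account_back", columns)
percent_a = round(100.0 * matched_a / total_a, 2)
print(f"{matched_a} of {total_a} rows in A found in B ({percent_a}%)")
```

Calling rows_matched with the two table names swapped gives the comparison in the other direction, which is what the analysis skips when only the A rows not in B are computed.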
You can create an analysis that matches foreign keys in one table to primary keys in the other table and vice versa. This redundancy
analysis supports only database tables.
Prerequisites
A database connection is created in the Profiling perspective of Talend Studio.
Through this view, you can also access the actual analyzed data via the data explorer.
To access the analyzed data rows, right-click any of the lines in the table and select:
Option                 To...
View match rows        access a list of all rows that could be matched in the two identical column sets
View not match rows    access a list of all rows that could not be matched in the two identical column sets
View rows              access a list of all rows in the two identical column sets
Warning: The data explorer does not support connections that have an empty user name, such as Single sign-on of MS SQL Server.
If you analyze data using such a connection and try to view data rows in the Data Explorer perspective, a warning message
prompts you to set your connection credentials to the SQL Server.
The figure below illustrates in the data explorer the list of all analyzed rows in the two columns.
From the SQL editor, you can save the executed query and list it under the Libraries > Source Files folders in the DQ Repository tree
view if you click the save icon on the editor toolbar. For more information, see Saving the queries executed on indicators.
Procedure
1. In the DQ Repository tree view, expand the Data Profiling folder.
3. In the filter field, start typing redundancy analysis, select Redundancy Analysis and then click Next.
Note: Avoid using special characters in the item names including:
"~", "!", "`", "#", "^", "&", "*", "\\", "/", "?", ":", ";", "\"", ".", "(", ")", "'", "¥", "'", """, "«", "»", "<", ">".
These characters are all replaced with "_" in the file system and you may end up creating duplicate items.
5. Set the analysis metadata (purpose, description and author name) in the corresponding fields and then click Finish.
A file for the newly created analysis is displayed under the Analysis folder in the DQ Repository tree view. The analysis editor
opens with the defined analysis metadata.
Procedure
1. Click Analyzed Column Sets to display the corresponding view.
In this example, you want to match the foreign keys in the customer_id column of the sales_fact_1998 table with the primary
keys in the customer_id column of the customer table, and vice versa. This explores the relationship between the two
tables to show, for example, whether every customer placed an order in 1998.
2. From the Connection list, select the database connection relevant to the database to which you want to connect.
This list contains all the connections you have created and centralized in the Talend Studio repository.
4. Browse the catalogs/schemas in your database connection to reach the table holding the column you want to match.
In this example, the column to be analyzed is customer_id that holds the foreign keys.
You can filter the table or column lists by typing the desired text in the Table filter or Column filter fields respectively. The lists
will show only the tables/columns that correspond to the text you type in.
5. Click the table name to display all its columns in the right-hand panel of the Column Selection dialog box.
6. In the list to the right, select the check box of the column holding the foreign keys and then click OK to proceed to the next
step.
You can drag the columns to be analyzed directly from the DQ Repository tree view to the editor.
If you right-click any of the listed columns in the Analyzed Columns view and select Show in DQ Repository view, the selected
column will be automatically located under the corresponding connection in the tree view.
7. Click B Column Set and follow the same steps to select the column holding the primary keys or drag it from the DQ
Repository to the right column panel.
If you select the Compute only number of A rows not in B check box, you will look for any missing primary keys in the column
in the B set.
8. Click Data Filter in the analysis editor to display the view where you can set a filter on each of the analyzed columns.
Results
In this example, every foreign key in the sales_fact_1998 table is identified with a primary key in the customer table. However,
98.22% of the primary keys in the customer table could not be identified with foreign keys in the sales_fact_1998 table. These
primary keys are for the customers who did not order anything in 1998.
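The two directions of this key comparison can be sketched with Python and an in-memory SQLite database. This is an illustration under stated assumptions: the rows are invented, the helper count_unmatched is hypothetical, and counting distinct key values is one possible reading of the indicator, not necessarily what the Studio computes.

```python
import sqlite3

# Hypothetical demo data: 10 customers, but only customers 1 and 2 have sales rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (customer_id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE sales_fact_1998 (customer_id INTEGER)")
conn.executemany("INSERT INTO customer VALUES (?)", [(i,) for i in range(1, 11)])
conn.executemany(
    "INSERT INTO sales_fact_1998 VALUES (?)", [(1,), (1,), (2,), (2,), (2,)]
)

def count_unmatched(conn, table_a, table_b, column):
    """Count distinct values of `column` in table_a that never appear in table_b."""
    query = (
        f"SELECT COUNT(DISTINCT a.{column}) FROM {table_a} a "
        f"WHERE NOT EXISTS (SELECT 1 FROM {table_b} b WHERE b.{column} = a.{column})"
    )
    return conn.execute(query).fetchone()[0]

# Foreign keys with no primary key (orphan sales rows).
fk_without_pk = count_unmatched(conn, "sales_fact_1998", "customer", "customer_id")
# Primary keys with no foreign key (customers without any sale).
pk_without_fk = count_unmatched(conn, "customer", "sales_fact_1998", "customer_id")
print(fk_without_pk, pk_without_fk)
```

With this made-up data, no foreign key is orphaned, while 8 of the 10 primary keys have no matching foreign key.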
Table analyses
Steps to analyze database tables
You can examine the data available in single tables of a database and collect information and statistics about this data.
The sequence of profiling data in one or multiple tables may involve the following steps:
Procedure
1. Defining one or more tables on which to carry out data profiling processes that will define the content, structure and quality
of the data included in the table(s).
2. Creating SQL business rules based on WHERE clauses and adding them as indicators to table analyses.
3. Creating column functional dependencies analyses to detect anomalies in the column dependencies of the defined table(s)
through defining columns as either "determinant" or "dependent".
What to do next
Check Analyzing tables in databases for information about the different options to analyze a table.
Table analyses can range from simple table analyses to table analyses that use SQL business rules or table analyses that detect
anomalies in the table columns.
Using Talend Studio, you can better explore the quality of data in a database table through either:
Creating a simple table analysis through analyzing all columns in the table using patterns.
Adding data quality rules as indicators to table analysis.
Detecting anomalies in column dependencies.
Comparing a set of columns and creating groups of similar records using blocking and matching keys and/or survivorship
rules.
You can analyze the content of a set of columns. This set can represent only some of the columns in the defined table or the table as
a whole.
The analysis of a set of columns focuses on a column set (full records) and not on separate columns, as is the case with the column
analysis. The statistics presented in the analysis results (row count, distinct count, unique count and duplicate count) are measured
against the values across the whole data set and thus do not analyze the values separately within each column.
With the Java engine, you may also apply patterns on each column and the result of the analysis will give the number of records
matching all the selected patterns together. For further information, see Adding patterns to the analyzed columns.
Note: When you use the Java engine to run a column set analysis on big sets or on data with many problems, it is advisable to
define a maximum memory size threshold to execute the analysis as you may end up with a Java heap error. For more
information, see Defining the maximum memory size threshold.
This type of analysis provides simple statistics on the full records of the analyzed column set and not on the values within each
column separately. For more information about simple statistic indicators, see Simple statistics.
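To make the difference between record-level and column-level statistics concrete, here is a minimal Python sketch. The sample records are invented, and the definitions used are assumptions: distinct count is the number of different records, unique count is the number of records that occur exactly once, and duplicate count is the number of records that occur more than once.

```python
from collections import Counter

def record_statistics(records):
    """Simple statistics computed on whole records (tuples), not per column."""
    counts = Counter(records)
    return {
        "row_count": len(records),
        "distinct_count": len(counts),
        "unique_count": sum(1 for n in counts.values() if n == 1),
        "duplicate_count": sum(1 for n in counts.values() if n > 1),
    }

# Hypothetical records: (gender, education, email)
records = [
    ("F", "Partial College", "a@example.com"),
    ("M", "Bachelors Degree", "b@example.com"),
    ("F", "Partial College", "a@example.com"),
    ("M", "High School Degree", "c@example.com"),
]
stats = record_statistics(records)
print(stats)
```

Note that the value "M" appears twice in the first column, yet both records containing it count as unique because the full records differ.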
With this analysis, you can use patterns to validate the full records against all patterns and have a single-bar result chart that shows
the number of the rows that match "all" the patterns.
Before you begin, you have defined at least one database connection in the Profiling perspective of Talend Studio.
Procedure
Set column analysis metadata (Purpose, Description and Author) in the corresponding fields and click Next to proceed to the next
step.
Procedure
1. Expand DB connections.
2. In the desired database, browse to the columns you want to analyze, select them and then click Finish to close this New
analysis wizard.
In this example, you want to analyze a set of six columns in the customer table: account number (account_num), education
(education), email (email), first name (fname), second name (lname) and gender (gender). The statistics presented in
the analysis results are the row count, distinct count, unique count and duplicate count, which all apply to records (values of
a set of columns).
The analysis editor opens with the defined analysis metadata, and a folder for the newly created analysis is displayed under
Analyses in the DQ Repository tree view.
Sample data is displayed in the Data Preview section and the selected columns are displayed in the Analyzed Column
section of the analysis editor.
Option                           To...
New Connection                   open a wizard and create or change the connection to the data source from within the editor.
Select Columns                   open the Column Selection dialog box where you can select the columns to analyze or change the selection of the columns listed in the table. From the open dialog box, you can filter the table or column lists by using the Table filter or Column filter fields respectively.
n first rows or n random rows    list in the table N first data records from the selected columns or list N random records from the selected columns.
Refresh Data                     display the data in the selected columns according to the criteria you set.
Run with sample data             run the analysis only on the sample data set in the Limit field.
4. In the Limit field, set the number for the data records you want to display in the table and use as sample data.
You can add patterns to one or more of the analyzed columns to validate the full record (all columns) against all the patterns, and
not to validate each column against a specific pattern, as is the case with the column analysis. The results chart is a single bar chart
for the totality of the used patterns. This chart shows the number of the rows that match "all" the patterns.
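The idea of a single result for all patterns together can be sketched in Python as follows. This is an illustration only: the patterns are simplified, hypothetical regular expressions written for Python's re module (the Studio uses its own pattern definitions), and requiring a full match of the value is an assumption.

```python
import re

def all_match_count(rows, patterns):
    """Count rows in which every column that has a pattern matches it fully."""
    compiled = {col: re.compile(rx) for col, rx in patterns.items()}
    matching = 0
    for row in rows:
        if all(
            row.get(col) is not None and rx.fullmatch(row[col])
            for col, rx in compiled.items()
        ):
            matching += 1
    return matching

# Hypothetical, simplified patterns: one per analyzed column.
patterns = {
    "email": r"[^@\s]+@[^@\s]+\.[a-zA-Z]{2,}",
    "gender": r"[FM]",
}
rows = [
    {"email": "ann@example.com", "gender": "F"},  # both patterns match
    {"email": "bob@example", "gender": "M"},      # email fails
    {"email": "cy@example.org", "gender": "X"},   # gender fails
    {"email": None, "gender": "M"},               # missing value fails
]
matches = all_match_count(rows, patterns)
print(matches, len(rows) - matches)
```

A row counts as a match only when every pattern is satisfied, so a single failing column is enough to move the whole record to the non-match side.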
Warning: Before being able to use a specific pattern with a set of columns analysis, you must manually set in the patterns
settings the pattern definition for Java, if it does not already exist. Otherwise, a warning message will display prompting you to
set the definition of the Java regular expression.
Procedure
Select the check box(es) of the expression(s) you want to add to the selected column, then click OK.
The added regular expression(s) display(s) under the analyzed column(s) in the Analyzed Columns view and the All Match
indicator is displayed in the Indicators list in the Indicators view.
Before you execute this column set analysis, you still need to define the indicator settings, data filter and analysis parameters.
A column set analysis has already been defined in the Profiling perspective of the Talend Studio.
Procedure
In the Number of connections per analysis field, set the number of concurrent connections allowed per analysis to
the selected database connection.
You can set this number according to the database available resources, that is the number of concurrent connections
each database can support.
From the Execution engine list, select the engine, Java or SQL, you want to use to execute the analysis.
If you select the Java engine, the Store data check box is selected by default and cannot be unselected. Once
the analysis is executed, the profiling results are always available locally to drill down through the Analysis
Results > Data view.
Executing the analysis with the Java engine uses disk space as all data is retrieved and stored locally. If you
want to free up some space, you may delete the data stored in the main Talend Studio directory, at
Talend-Studio/workspace/project_name/Work_MapDB.
If you select the SQL engine, you can use the Store data check box to decide whether to store locally the
analyzed data and access it in the Analysis Results > Data view.
Note: If the data you are analyzing is very big, it is advisable to leave the Store data check box unselected
in order not to store the results at the end of the analysis computation.
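Before deleting locally stored analysis data, it can help to check how much disk space it uses. The following Python sketch only measures the folder size and deletes nothing. The path is the placeholder path given above, so Talend-Studio and project_name must be replaced with your own folder names; the helper directory_size_bytes is hypothetical.

```python
from pathlib import Path

def directory_size_bytes(path):
    """Sum the sizes of all regular files below `path` (0 if it does not exist)."""
    root = Path(path)
    if not root.is_dir():
        return 0
    return sum(f.stat().st_size for f in root.rglob("*") if f.is_file())

# Placeholder location: replace Talend-Studio and project_name with your own names.
work_dir = Path("Talend-Studio") / "workspace" / "project_name" / "Work_MapDB"
print(f"{work_dir}: {directory_size_bytes(work_dir)} bytes")
```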
The analysis editor switches to the Analysis Results view where you can read the analysis results in tables and graphics. The
graphical result provides the simple statistics on the full records of the analyzed column set and not on the values within
each column separately.
When you use patterns to match the content of the set of columns, another graphic is displayed to illustrate the match and
non-match results against the totality of the used patterns.
3. In the Simple Statistics table, right-click an indicator result and select View Rows or View Values.
If you run the analysis with the Java engine, a list of the analyzed data is opened in the Profiling perspective.
If you run the analysis with the SQL engine, a list of the analyzed data is opened in the Data Explorer perspective.
4. In the Data view, click Filter Data to filter the valid/invalid data according to the used patterns.
You can filter data only when you run the analysis with the Java engine.
For further information, see Filtering data against patterns.
After analyzing a set of columns against a group of patterns and having the results of the rows that match or do not match "all" the
patterns, you can filter the valid/invalid data according to the used patterns.
An analysis of a set of columns is open in the analysis editor in the Profiling perspective of Talend Studio.
Procedure
1. In the analysis editor, click the Analysis Results tab at the bottom of the editor to open the detailed result view.
4. Select the check box(es) of the pattern(s) according to which you want to filter data.
Select         To...
matches        show only the data that matches the selected pattern.
non-matches    show the data that does not match the selected pattern(s).
Results
In this example, data is filtered against the Email Address pattern, and only the data that does not match is displayed.
All email addresses that do not match the selected pattern appear in red. Any data row that has a missing value appears with a red
background.
The Previous and Next buttons under the table help you to navigate back and forth through pages.
Numbered buttons are displayed under the table to access pages directly:
when you open the Data view for the first time after running the analysis,
if you did not select a pattern in the Filter Data dialog box, or
if you selected All data as the display option in the Filter Data dialog box.
You can create a column analysis on one or more columns defined in a simple table analysis (column set analysis).
A simple table analysis is defined in the analysis editor in the Profiling perspective of Talend Studio.
Procedure
1. In the Analyzed Columns view, right-click the column(s) you want to create a column analysis on.
2. Follow the steps outlined in Creating a basic analysis on a database column to continue creating the column analysis.
You can set up SQL business rules based on WHERE clauses and add them as indicators to table analyses. You can also define
expected thresholds on the SQL business rule indicator's value. The range defined is used for measuring the quality of the data in the
selected table.
It is also possible to create an analysis with SQL business rules on views in a database. The procedure is exactly the same as that for
tables.
For more information, see Creating a table analysis with an SQL business rule with a join condition.
Note: When you use the Java engine to run a column set analysis on big sets or on data with many problems, it is advisable to
define a maximum memory size threshold in the Talend Studio Preferences to execute the analysis as you may end up with a
Java heap error.
SQL business rules can be simple rules with WHERE clauses. They can also have join conditions in them to combine common values
between columns in database tables and give a result data set.
Procedure
2. Right-click SQL.
3. From the contextual menu, select New Business Rule to open the New Business Rule wizard.
Consider as an example that you want to create a business rule to match the age of all customers listed in the age column
of a defined table. You want to filter all the age records to identify those that fulfill the specified criterion.
4. In the Name field, enter a name for this new SQL business rule.
Note: Avoid using special characters in the item names including:
"~", "!", "`", "#", "^", "&", "*", "\\", "/", "?", ":", ";", "\"", ".", "(", ")", "'", "¥", "'", """, "«", "»", "<", ">".
These characters are all replaced with "_" in the file system and you may end up creating duplicate items.
5. Set other metadata (purpose, description and author name) in the corresponding fields and then click Next.
6. In the Where clause field, enter the WHERE clause to be used in the analysis.
In this example, the WHERE clause is used to match the records where customer age is greater than 18.
Note: In the SQL business rule editor, you can modify the WHERE clause or add a new one directly in the Data quality rule
view.
This will act as an indicator to measure the importance of the SQL business rule.
This step is optional. You can decide to create a business rule without a join condition and use it with only the WHERE clause in
the table analysis.
For an example of a table analysis with a simple business rule, see Creating a table analysis with a simple SQL business rule. For an
example of a table analysis with a business rule that has a join condition, see Creating a table analysis with an SQL business rule
with a join condition.
Procedure
1. In the SQL business rule editor, click Join Condition to open the corresponding view.
2. Click the [+] button to add a row in the Join Condition table.
3. Expand the Metadata folder in the DQ Repository tree view, and then browse to the columns in the tables on which you want
to create the join condition.
This join condition will define the relationship between a table A and a table B using a comparison operator on a specific
column in both tables. In this example, the join condition will compare the "name" value in the Person and Person_Ref
tables that have a common column called name .
Note: You must be careful when defining the join clause. In order to get an easy to understand result, it is advisable to
make sure that the joined tables do not have duplicate values. For further information, see Creating a table analysis with
an SQL business rule with a join condition.
4. Drop the columns from the DQ Repository tree view to the Join Condition table.
A dialog box is displayed prompting you to select where to place the column: in TableA or in TableB.
5. Select a comparison condition operator between the two columns in the tables and save your modifications.
In the analysis editor, you can now drop this newly created SQL business rule onto a table that has an "age" column. When
you run the analysis, the join to the second column is done automatically.
Warning: The table to which you add the business rule must contain at least one of the columns used in the SQL business
rule.
Procedure
1. In the DQ Repository tree view, expand Libraries > Rules > SQL.
2. Right-click the SQL business rule you want to open and select Open from the contextual menu.
The SQL business rule editor opens displaying the rule metadata.
4. Click the save icon on top of the editor to save your modifications.
The SQL business rule is modified as defined.
You can create analyses on either tables or views in a database using SQL business rules. The procedure for creating such an analysis is
the same for a table or a view.
Prerequisite(s):
At least one SQL business rule has been created in the Profiling perspective of Talend Studio.
At least one database connection is set in the Profiling perspective of Talend Studio.
In this example, you want to add the SQL business rule created in Creating an SQL business rule to a top_custom table that contains
an age column. This SQL business rule will match the customer ages to define those who are older than 18.
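The kind of counts such a rule produces can be sketched with Python and an in-memory SQLite database. The rows below are invented, the helper business_rule_counts is hypothetical, and treating every row that fails the WHERE clause as a non-match is an assumption; the Studio generates its own queries.

```python
import sqlite3

# Hypothetical demo data for a top_custom table with an age column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE top_custom (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO top_custom VALUES (?, ?)",
    [("Ann", 34), ("Bob", 17), ("Cy", 18), ("Dee", 52), ("Eve", 12)],
)

def business_rule_counts(conn, table, where_clause):
    """Return (row_count, match, no_match) for a WHERE-clause rule on one table."""
    row_count = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    match = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE {where_clause}"
    ).fetchone()[0]
    return row_count, match, row_count - match

row_count, match, no_match = business_rule_counts(conn, "top_custom", "age > 18")
print(row_count, match, no_match)
```

With this data, the customer aged exactly 18 falls on the non-match side, because the rule keeps only ages strictly greater than 18.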
Procedure
2. In the filter field, start typing business rule analysis, select Business Rule Analysis and click Next.
3. Set column analysis metadata (Purpose, Description and Author) in the corresponding fields and click Next to proceed to the
next step.
Procedure
Note: You can directly select the data quality rule you want to add to the current analysis by clicking the Next button in
the New Analysis wizard or you can do that at a later stage in the Analyzed Tables view as shown in the following steps.
The analysis editor opens with the defined analysis metadata, and a folder for the newly created analysis is displayed under
Analyses in the DQ Repository tree view.
3. If required:
Click Select Tables to open the Table Selection dialog box and select new table(s) to analyze.
You can filter the table or column lists by typing the desired text in the Table filter or Column filter fields respectively.
The lists will show only the tables/columns that correspond to the text you type in.
Select another connection from the Connection list to connect to a different database. This list has all the
connections created in Talend Studio. If the tables listed in the Analyzed Tables view do not exist in the new database
connection you want to set, you receive a warning message that enables you to continue or cancel the operation.
4. Right-click any of the listed columns in the Analyzed Columns view and select Show in DQ Repository view to locate the selected
column under the corresponding connection in the tree view.
Procedure
1. Click the icon next to the table name where you want to add the SQL business rule.
The Business Rule Selector dialog box is displayed.
2. Expand the Rules folder and select the check box(es) of the predefined SQL business rule(s) you want to use on the
corresponding table(s).
3. Click OK.
The selected business rule is listed below the table name in the Analyzed Tables view.
You can also drag the business rule directly from the DQ Repository tree view to the table in the analysis editor.
4. If required, right-click the business rule and select View executed query.
The SQL editor opens in the Studio to display the query.
An information pop-up opens to confirm that the operation is in progress and the analysis editor switches to the Analysis
Results view.
All age records in the selected table are evaluated against the defined SQL business rule. The analysis results have two bar
charts: the first is a row count indicator that shows the number of rows in the analyzed table, and the second is a match and
non-match indicator that indicates in red the age records from the "analyzed result set" that do not match the criteria (age
18 or below).
6. Right-click the business rule results in the second table, or right-click the result bar in the chart itself and select:
You can carry out a table analysis in a direct and more simplified way. For further information, see Creating a table analysis
with an SQL business rule in a shortcut procedure.
Creating a table analysis with an SQL business rule with a join condition
In some cases, you may need to analyze database tables or views using an SQL business rule that has a join clause that combines
records from two tables in a database. This join clause will compare common values between two columns and give a result data
set. Then the data in this set will be analyzed against the business rule.
Depending on the analyzed data and the join clause itself, several different results of the join are possible, for example #match + #no
match > #row count, #match + #no match < #row count or #match + #no match = #row count.
The example below explains in detail the case where the data set in the join result is bigger than the row count (#match + #no match
> #row count) which indicates duplicates in the processed data.
At least one SQL business rule has been created in the Profiling perspective of Talend Studio.
At least one database connection is set in the Profiling perspective of Talend Studio.
In this example, you want to add the SQL business rule created in Creating an SQL business rule to a Person table that contains the
age and name columns. This SQL business rule will match the customer ages to define those who are older than 18. The business
rule also has a join condition that compares the "name" value between the Person table and another table called Person_Ref
through analyzing a common column called name .
Below is a capture of the result of the join condition between these two tables:
The result set may contain duplicate rows, as is the case here. Thus the results of the analysis may become a bit harder to understand.
The analysis here will not analyze the rows of the table that match the business rule but it will run on the result set given by the
business rule.
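The effect of duplicate rows in the join can be reproduced with Python and an in-memory SQLite database. The rows below are invented for illustration and are not the data of this example; the age > 18 rule and the join on name follow the description above, while the exact SQL is an assumption.

```python
import sqlite3

# Hypothetical demo data: Person has 6 rows; Person_Ref repeats some names
# and does not contain "Fay" at all.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Person (name TEXT, age INTEGER)")
conn.execute("CREATE TABLE Person_Ref (name TEXT)")
conn.executemany(
    "INSERT INTO Person VALUES (?, ?)",
    [("Ann", 30), ("Bob", 15), ("Cy", 40), ("Dee", 25), ("Eve", 17), ("Fay", 50)],
)
conn.executemany(
    "INSERT INTO Person_Ref VALUES (?)",
    [("Ann",), ("Ann",), ("Bob",), ("Cy",), ("Cy",), ("Dee",), ("Eve",)],
)

row_count = conn.execute("SELECT COUNT(*) FROM Person").fetchone()[0]

# The rule is evaluated on the join result set, not on the Person table itself.
join_sql = "FROM Person p JOIN Person_Ref r ON p.name = r.name"
joined = conn.execute(f"SELECT COUNT(*) {join_sql}").fetchone()[0]
match = conn.execute(f"SELECT COUNT(*) {join_sql} WHERE p.age > 18").fetchone()[0]
no_match = joined - match
print(row_count, match, no_match)
```

Here the duplicated names in Person_Ref make Ann and Cy count twice, and Fay is never evaluated because the join excludes her, so the match and non-match counts add up to more than the row count of the analyzed table.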
1. Define the table analysis and select the table you want to analyze.
2. Add the business rule with the join condition to the selected table through clicking the icon next to the table name.
This business rule has a join condition that compares the "name" value between two different tables through analyzing a
common column.
An information pop-up opens to confirm that the operation is in progress and the analysis editor switches to the Analysis
Results view.
All age records in the selected table are evaluated against the defined SQL business rule. The analysis results have two bar
charts: the first is a row count indicator that shows the number of rows in the analyzed table, and the second is a match and
non-match indicator that indicates the age records from the "analyzed result set" that do not match the criteria (age 18 or
below).
Note: If a join condition is used in the SQL business rule, the number of rows in the join result (#match + #no match) can be
different from the number of analyzed rows (row count).
4. Right-click the Row Count row in the first table and select View rows.
The SQL editor opens in the Studio to display a list of the analyzed rows.
5. Right-click the business rule results in the second table, or right-click the result bar in the chart itself and select:
6. In the SQL editor, click the save icon on the toolbar to save the executed query on the SQL business rule and list it under the
Libraries > Source Files folder in the DQ Repository tree view.
To better understand the Business Rule Statistics bar chart in the analysis results, do the following:
1. In the analysis editor, right-click the business rule and select View executed query.
2. Modify the query in the top part of the editor to read as follows: SELECT * FROM `my_person_joins`.`person` Person
JOIN `my_person_joins`.`Person_ref` Person_ref ON (Person.`name`=Person_ref.`Name`).
This will list the result data set of the join condition in the editor.
3. In the top left corner of the editor, click the icon to execute the query.
The query result, that is the analyzed result set, is listed in the bottom part of the editor.
4. Go back to the analysis editor and click the Analysis Results tab at the bottom of the editor to open a detail view of the
analysis results.
The analyzed result set may contain more or fewer rows than the analyzed table. In this example, the number of match and
non-match records (5 + 2 = 7) exceeds the number of analyzed records (6) because the join of the two tables generates more
rows than expected.
Here 5 rows (71.43%) match the business rule and 2 rows do not match. Because the join generates duplicate rows, this
result does not mean that 5 rows of the analyzed table match the business rule. It only means that 5 rows among the 7 rows
of the result set match the business rule. In fact, some rows of the analyzed table may not even be analyzed against the
business rule. This happens when the join excludes these rows. For this reason, it is advised to check for duplicates on the
columns used in the join of the business rule in order to make sure that the join does not remove or add rows in the analyzed
result set. Otherwise the interpretation of the result is more complex.
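The effect described above can be reproduced outside the Studio. The following is a minimal sketch that uses Python's built-in sqlite3 module rather than the database of the example. The table contents and the rule condition age >= 18 are assumptions, chosen only so that the counts mirror the example (6 analyzed rows, 7 joined rows, 5 matches, 2 non-matches).

```python
import sqlite3

# Hypothetical data, not the data set shown in the Studio.
# In person_ref, 'Alice' and 'Carol' are duplicated and 'Frank' is missing.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (name TEXT, age INTEGER);
    CREATE TABLE person_ref (name TEXT);
    INSERT INTO person VALUES
        ('Alice', 30), ('Bob', 17), ('Carol', 45),
        ('Dave', 16), ('Eve', 22), ('Frank', 50);
    INSERT INTO person_ref VALUES
        ('Alice'), ('Alice'), ('Bob'), ('Carol'), ('Carol'), ('Dave'), ('Eve');
""")

# Row count indicator: rows of the analyzed table.
row_count = conn.execute("SELECT COUNT(*) FROM person").fetchone()[0]

# The business rule is evaluated on the join result, not on the table itself.
join_count, match_count = conn.execute("""
    SELECT COUNT(*), SUM(CASE WHEN p.age >= 18 THEN 1 ELSE 0 END)
    FROM person p JOIN person_ref r ON p.name = r.name
""").fetchone()
non_match_count = join_count - match_count
match_pct = round(100 * match_count / join_count, 2)

print(row_count, join_count, match_count, non_match_count, match_pct)
```

In this sketch the duplicated names add rows to the join while 'Frank' is dropped from it, so the match percentage describes the 7 joined rows and not the 6 rows of the analyzed table.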
In the Analysis Results view, if the number of match and non-match records exceeds the number of analyzed records, you can
generate a ready-to-use analysis that will analyze the duplicates in the selected table.
In some cases, when you use an SQL business rule with a join clause to analyze database tables that have duplicate records,
the join results show that there are more rows in the join than in the analyzed table.
You can generate a ready-to-use analysis to analyze these duplicate records. The results of this analysis help you to better
understand why there are more records in the join results than in the table.
A table analysis with an SQL business rule, that has a join condition, is defined and executed in the Profiling perspective of Talend
Studio. The join results must show that there are duplicates in the table.
For more information, see Creating a table analysis with an SQL business rule with a join condition.
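To see why duplicates in a join column inflate the join, you can count the occurrences of each value on both sides. The sketch below is an illustration in plain Python, not what the generated analysis executes; the two name lists are hypothetical.

```python
from collections import Counter

# Hypothetical values of the join column in the two joined tables.
person_names = ["Alice", "Bob", "Carol", "Dave", "Eve", "Frank"]
person_ref_names = ["Alice", "Alice", "Bob", "Carol", "Carol", "Dave", "Eve"]


def duplicated_values(values):
    """Return {value: occurrences} for the values that appear more than once."""
    return {value: n for value, n in Counter(values).items() if n > 1}


dup_person = duplicated_values(person_names)
dup_ref = duplicated_values(person_ref_names)

# An inner join on the column yields, for each value, the product of its
# occurrences on both sides.
ref_counts = Counter(person_ref_names)
join_rows = sum(n * ref_counts[value] for value, n in Counter(person_names).items())

print(dup_person, dup_ref, join_rows)
```

Here the analyzed column has no duplicates, but the two duplicated values in the reference column are enough to produce 7 joined rows from 6 analyzed rows.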
Procedure
1. After creating and executing an analysis on a table that has duplicate records as outlined in Creating a table analysis with an
SQL business rule with a join condition, click the Analysis Results tab at the bottom of the analysis editor.
2. Right-click the join results in the second table and select Analyze duplicates.
The Column Selection dialog box opens with the analyzed tables selected by default.
3. Modify the selection in the dialog box if needed and then click OK.
Two column analyses are generated and listed under the Analyses folder in the DQ Repository tree view and are open in the
analysis editor.
5. Click Analysis Results at the bottom of the analysis editor to access the detail result view.
6. Right-click the row count or duplicate count results in the table, or right-click the result bar in the chart itself and select:
Option        To...
View rows     open a view on a list of all data rows or duplicate rows in the analyzed column.
View values   open a view on a list of the duplicate data values of the analyzed column.
You can use a simplified way to create a table analysis with a predefined business rule. All you need to do is start from the
table name under the relevant DB Connection folder.
Procedure
1. In the DQ Repository tree view, expand Metadata > DB Connections, and then browse to the table you want to analyze.
2. Right-click the table name and select Table analysis from the list.
The New Table Analysis wizard is displayed.
3. Enter the metadata for the new analysis in the corresponding fields and then click Next to proceed to the next step.
4. Expand Rules > SQL and then select the check box(es) of the predefined SQL business rule(s) you want to use on the
corresponding table(s).
An information pop-up opens to confirm that the operation is in progress and the analysis editor switches to the Analysis
Results view.
This type of analysis helps you to detect anomalies in column dependencies by defining columns as either "determinant" or
"dependent" and then analyzing values in dependent columns against those in determinant columns. This analysis supports only
database tables.
It detects to what extent a value in a determinant column functionally determines another value in a dependent
column.
This can help you identify problems in your data, such as values that are not valid. For example, if you analyze the dependency
between a column that contains United States Zip Codes and a column that contains states in the United States, the same Zip Code
should always have the same state. Running the functional dependency analysis on these two columns will show if there are any
violations of this dependency.
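As an illustration of what a violation looks like, the following sketch groups hypothetical (Zip Code, state) records by Zip Code and reports every Zip Code that appears with more than one state. It only illustrates the idea and is not the query that the Studio generates.

```python
from collections import defaultdict

# Hypothetical (zip_code, state) records; 10001 appears with two different states.
records = [
    ("10001", "NY"), ("10001", "NY"), ("94105", "CA"),
    ("60601", "IL"), ("10001", "NJ"), ("94105", "CA"),
]

# Collect the distinct dependent values seen for each determinant value.
states_by_zip = defaultdict(set)
for zip_code, state in records:
    states_by_zip[zip_code].add(state)

# A determinant value with more than one dependent value violates the dependency.
violations = {z: sorted(states) for z, states in states_by_zip.items() if len(states) > 1}

print(violations)
```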
Procedure
2. In the filter field, start typing functional dependency analysis, select Functional Dependency Analysis from the list and
click Next.
3. Set column analysis metadata (Purpose, Description and Author) in the corresponding fields and click Next to proceed to the
next step.
Procedure
1. Expand DB connections, browse to the database you want to analyze, select it and then click Finish to close the New Analysis
wizard.
The analysis editor opens with the defined analysis metadata, and a folder for the newly created analysis is displayed under
Analyses in the DQ Repository tree view.
The Data Preview section shows sample data for all the table columns.
2. In the Left Column panel, click A Columns Set to open the Column Selection dialog box.
Here you can select the first set of columns against which you want to analyze the values in the dependent columns. You can
also drag the columns directly from the DQ Repository tree view to the Left Columns panel.
In this example, you want to evaluate the records present in the city column and those present in the state_province
column against each other to see if state names match the listed city names and vice versa.
3. In the Column Selection dialog box, expand DB Connections and browse to the column(s) you want to define as determinant
columns.
You can filter the table or column lists by typing the desired text in the Table filter or Column filter fields respectively. The lists
will show only the tables/columns that correspond to the text you type in.
4. Select the check box(es) next to the column(s) you want to analyze and click OK to proceed to the next step.
The selected column(s) are displayed in the Left Columns panel of the Analyzed Columns Set view. In this example, we select
the city column as the determinant column.
5. Do the same to select the dependent column(s) or drag it/them from the DQ Repository tree view to the Right Columns
panel. In this example, we select the state_province column as the dependent column. This relation will show if the state
names match the listed city names.
If you right-click any of the listed columns in the Analyzed Columns view and select Show in DQ Repository view, the selected
column is automatically located under the corresponding connection in the tree view.
6. Click the Reverse columns tab to automatically reverse the defined columns and thus evaluate the reverse relation, that is,
which city names match the listed state names.
You can select to connect to a different database by selecting another connection from the Connection list in the Data
Preview section. This list shows all the connections created in the Studio. If the columns listed in the Analyzed Columns Set
view do not exist in the new database connection you want to set, you will receive a warning message that enables you to
continue or cancel the operation.
Procedure
1. In the Analysis Parameter view and in the Number of connections per analysis field, set the number of concurrent
connections allowed per analysis to the selected database, if required.
You can set this number according to the resources available to the database, that is, the number of concurrent connections
each database can support.
An information pop-up opens to confirm that the operation is in progress and the analysis editor switches to the Analysis
Results view.
This functional dependency analysis evaluated the records present in the city column and those present in the
state_province column against each other to see if the city names match the listed state names and vice versa. The
returned results, in the %Match column, indicate the functional dependency strength for each determinant column. The
records that do not match are indicated in red.
The #Match column in the result table lists the numbers of the distinct determinant values in each of the analyzed columns.
The #row column in the analysis results lists the actual relations between the determinant attribute and the dependent
attribute. In this example, #Match in the first row of the result table represents the number of distinct cities, and #row
represents the number of distinct pairs (city, state_province). Since these two numbers are not equal, the functional
dependency relationship here is only partial and the ratio of the numbers (%Match) measures the actual dependency
strength. When these numbers are equal, you have a "strict" functional dependency relationship, that is to say, each city
appears with only one state.
Note: The presence of null values in either of the two analyzed columns will lessen the "dependency strength". The
system does not ignore null values, but rather counts them as values that violate the functional dependency.
3. In the Analysis Results view, right-click any of the dependency lines and select:
Option                        To...
View valid/invalid rows       access a list in the SQL editor of all valid/invalid rows measured according to
                              the functional dependencies analysis
View valid/invalid values     access a list in the SQL editor of all valid/invalid values measured according to
                              the functional dependencies analysis
View detailed valid/detailed  access a detailed list in the SQL editor of all valid/invalid values measured
invalid values                according to the functional dependencies analysis
From the SQL editor, you can save the executed query and list it under the Libraries > Source Files folders in the DQ
Repository tree view if you click the save icon on the editor toolbar. For more information, see Saving the queries executed
on indicators.
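The relation between #Match, #row and %Match described above can be sketched as follows. The records are hypothetical and the function follows the description given in this section (distinct determinant values divided by distinct pairs). It does not model the handling of null values, and it should not be read as the exact computation performed by the Studio.

```python
# Hypothetical (city, state_province) records.
records = [
    ("Portland", "OR"), ("Portland", "ME"), ("Salem", "OR"),
    ("Seattle", "WA"), ("Seattle", "WA"), ("Tacoma", "WA"),
]


def dependency_strength(pairs):
    """Return (#Match, #row, %Match) for (determinant, dependent) pairs."""
    distinct_determinants = len({determinant for determinant, _ in pairs})
    distinct_pairs = len(set(pairs))
    return distinct_determinants, distinct_pairs, 100 * distinct_determinants / distinct_pairs


# city determines state_province
city_to_state = dependency_strength(records)
# Reverse relation, as obtained with Reverse columns.
state_to_city = dependency_strength([(state, city) for city, state in records])

print(city_to_state, state_to_city)
```

In this sample, 4 distinct cities form 5 distinct pairs (80%), while 3 distinct states form the same 5 pairs (60%), so the city column is the stronger determinant of the two.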
From the Profiling perspective, you can better explore the quality of data in tables in delimited files through either:
Creating a simple table analysis through analyzing all columns in the table using patterns.
Comparing a set of columns and creating groups of similar records using blocking and matching keys and/or survivorship
rules.
This type of analysis provides simple statistics on the number of records falling in certain categories, including the number of rows,
the number of null values, the number of distinct and unique values, the number of duplicates, or the number of blank fields. For
more information about these indicators, see Simple statistics.
It is also possible to add patterns to this type of analysis and have a single-bar result chart that shows the number of the rows that
match "all" the patterns.
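The simple statistics can be sketched for a set of columns by treating each row as one combined value. The rows below are hypothetical, and the definitions used are assumptions: a unique row occurs exactly once, and a duplicate is a distinct row that occurs more than once.

```python
from collections import Counter

# Hypothetical rows from a set of three columns.
rows = [
    ("1001", "Ann", "Lee"),
    ("1002", "Bob", "Ray"),
    ("1001", "Ann", "Lee"),
    ("1003", "Cid", "Fox"),
    ("1002", "Bob", "Ray"),
    ("1001", "Ann", "Lee"),
]

# Number of occurrences of each distinct row.
occurrences = Counter(rows)

stats = {
    "row_count": len(rows),
    "distinct_count": len(occurrences),
    "unique_count": sum(1 for n in occurrences.values() if n == 1),
    "duplicate_count": sum(1 for n in occurrences.values() if n > 1),
}

print(stats)
```

With these definitions, the unique count and the duplicate count add up to the distinct count.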
You can analyze the content of a set of columns in a delimited file. This set can represent only some of the columns in the defined
table or the table as a whole.
You can then execute the created analysis using the Java engine.
Prerequisite(s): At least one connection to a delimited file is set in the Profiling perspective of the studio. For further information, see
Connecting to a database.
Warning: When carrying out this type of analysis, the set of columns to be analyzed must not include a primary key column.
Procedure
Set column analysis metadata (Purpose, Description and Author) in the corresponding fields and click Next to proceed to the next
step.
Selecting the set of columns you want to analyze in the delimited file
Procedure
1. Expand the FileDelimited connection and browse to the set of columns you want to analyze.
2. Select the columns to be analyzed, and then click Finish to close this New analysis wizard.
The analysis editor opens with the defined analysis metadata, and a folder for the newly created analysis is displayed under
Analyses in the DQ Repository tree view.
Sample data is displayed in the Data Preview section and the selected columns are displayed in the Analyzed Column
section of the analysis editor.
3. If required, select another connection from the Connection box in the Analyzed Columns view. This box lists all the
connections created in the Studio with the corresponding database names.
By default, the delimited file connection you have selected in the previous step is displayed in the Connection box.
4. If required, click the Select columns to analyze link to open a dialog box where you can modify your column selection.
Note: You can filter the table or column lists by typing the desired text in the Table filter or Column filter fields
respectively. The lists will show only the tables/columns that correspond to the text you type in.
5. In the column list, select the check boxes of the column(s) you want to analyze and click OK to proceed to the next step.
In this example, you want to analyze a set of six columns in the delimited file: account number (account_num), education
(education), email (email), first name (fname), last name (lname) and gender (gender). You want to identify the number of
rows, the number of distinct and unique values and the number of duplicates.
You can add patterns to one or more of the analyzed columns to validate the full record (all columns) against all the patterns,
rather than validating each column against a specific pattern as is the case with the column analysis. The results chart is a single
bar chart for the totality of the used patterns. This chart shows the number of the rows that match "all" the patterns.
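The "all match" logic can be sketched as follows: a row counts as a match only if every column that has a pattern satisfies it. The patterns and rows are hypothetical and are written for Python's re module; they are not patterns shipped with the Studio, and Java regular expressions may differ in details.

```python
import re

# Hypothetical patterns, one per analyzed column.
patterns = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[A-Za-z]{2,}"),
    "gender": re.compile(r"[MF]"),
}

rows = [
    {"email": "ann@example.com", "gender": "F"},
    {"email": "bob(at)example.com", "gender": "M"},
    {"email": "cid@example.org", "gender": "X"},
    {"email": "dee@example.net", "gender": "M"},
]


def matches_all(row):
    """True only if every patterned column of the row fully matches its pattern."""
    return all(pattern.fullmatch(row[column]) for column, pattern in patterns.items())


all_match = sum(1 for row in rows if matches_all(row))
not_all_match = len(rows) - all_match

print(all_match, not_all_match)
```

The second row fails on the email pattern and the third on the gender pattern, so only two of the four rows are counted in the single "all match" bar.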
Warning: Before being able to use a specific pattern with a set of columns analysis, you must manually set in the patterns
settings the pattern definition for Java, if it does not already exist. Otherwise, a warning message will display prompting you to
set the definition of the Java regular expression.
An analysis of a set of columns is open in the analysis editor in the Profiling perspective of Talend Studio.
Procedure
Select the check box(es) of the expression(s) you want to add to the selected column, then click OK.
The added regular expression(s) display(s) under the analyzed column(s) in the Analyzed Columns view and the All Match
indicator is displayed in the Indicators list in the Indicators view.
What is left before executing this set of columns analysis is to define the indicator settings, data filter and analysis parameters.
Procedure
1. In the Analysis Parameters view, select the Allow drill down check box to store locally the data that will be analyzed by the
current analysis.
2. In the Max number of rows kept per indicator field enter the number of the data rows you want to make accessible.
Note: The Allow drill down check box is selected by default, and the maximum analyzed data rows to be shown per
indicator is set to 50.
Results
The editor switches to the Analysis Results view and displays the graphical result corresponding to the Simple Statistics indicators
used to analyze the defined set of columns.
When you use patterns to match the content of the set of columns, another graphic is displayed to illustrate the match and non-
match results against the totality of the used patterns.
The procedure to filter the data of the analysis of a delimited file is the same as that for the database analysis. For further
information, see Filtering data against patterns.
You can create a column analysis on one or more columns defined in the set of columns analysis.
A simple table analysis is defined in the analysis editor in the Profiling perspective of Talend Studio.
Procedure
1. In the Analyzed Columns view, right-click the column(s) you want to create a column analysis on.
2. Follow the steps outlined in Creating a basic column analysis on a file to continue creating the column analysis on a
delimited file.
Analyzing duplicates
You can use the match analysis in the Profiling perspective of Talend Studio to compare columns in databases or delimited files and
create groups of similar records using the VSR or the T-Swoosh algorithm.
This analysis provides you with a simple way to create match rules, test them on a set of columns and see the results directly in the
editor.
You can also use the Profiling perspective to define match rules in a match rule editor and save them in the Talend Studio repository.
The match analysis enables you to compare a set of columns in databases or in delimited files and create groups of similar records
using blocking and matching keys and/or survivorship rules.
This analysis enables you to create match rules and test them on data to assess the number of duplicates. Currently, you can test
match rules only on columns in the same table.
Prerequisite(s): At least one database or file connection is defined under the Metadata node.
Procedure
1. Creating the connection to a data source from inside the editor if no connection has been defined under the Metadata folder
in the Studio tree view.
For further information, see Configuring the match analysis.
2. Defining the table or the group of columns you want to search for similar records using match processes.
For further information, see Defining a match analysis from the Analysis folder or Defining a match analysis from the
Metadata folder.
3. Defining blocking keys to reduce the number of pairs that need to be compared.
For further information, see Defining a match rule.
4. Defining match keys, the match methods according to which similar records are grouped together. For further information,
see Defining a match rule.
5. Exporting the match rules from the match analysis editor and centralizing them in the Studio repository.
For further information, see Importing or exporting match rules.
Procedure
3. Start typing match in the filter field, select Match Analysis and then click Next to open a wizard.
4. Set the analysis name and metadata and then click Next.
Note: Avoid using special characters in item names, including:
"~", "!", "`", "#", "^", "&", "*", "\\", "/", "?", ":", ";", "\"", ".", "(", ")", "'", "¥", "'", """, "«", "»", "<", ">".
These characters are all replaced with "_" in the file system and you may end up creating duplicate items.
5. Expand DB connections or FileDelimited connections depending on whether the columns you want to match are in a database or a
delimited file.
6. Browse to the columns you want to match, select them and then click Finish.
The columns you select should be in the same table. Currently, the match analysis does not work on columns in different
tables.
The match analysis editor opens listing the selected columns.
You can also define a match analysis starting from the table or columns you want to match. For further information, see
Defining a match analysis from the Metadata folder.
7. Modify the parameters in the match analysis editor according to your needs.
For further information, see Configuring the match analysis.
Procedure
Browse the database or the file connection to the table you want to match, right-click it and select Match Analysis, or
Browse the database or the file connection to the columns you want to match, right-click them and select Analyze
matches.
The columns you select should be in the same table. Currently, the match analysis does not work on columns in different
tables.
The match analysis editor opens listing all columns in the table or the group of selected columns.
3. Set the analysis name and metadata and click Next to open the analysis editor.
Note: Avoid using special characters in item names, including:
"~", "!", "`", "#", "^", "&", "*", "\\", "/", "?", ":", ";", "\"", ".", "(", ")", "'", "¥", "'", """, "«", "»", "<", ">".
These characters are all replaced with "_" in the file system and you may end up creating duplicate items.
4. Modify the parameters in the match analysis editor according to your needs.
For further information, see Configuring the match analysis.
Procedure
1. In the Limit field in the match analysis editor, set the number for the data records you want to use as a data sample.
Data is displayed in the Data Preview table.
2. If required, click any column name in the table to sort the sample data in an ascending or descending order.
Option                Description (To...)
                      locate the selected table under the Metadata node in the tree view.
New Connection        create a connection to a database or to a file from inside the match analysis editor,
                      where you can expand this new connection and select the columns on which to do the match.
Select Data           update the selection of the columns listed in the table.
                      If you change the data set for an analysis, the charts that display the match results of
                      the sample data will be cleared automatically. You must click Chart to compute the match
                      results for the new data set you have defined.
n first rows          list in the table the first N data records from the selected columns, or list N random
or                    records from the selected columns.
n random rows
Select Blocking Key   define the column(s) from the input flow according to which you want to partition the
                      processed data in blocks.
Select Matching Key   define the match rules and the column(s) from the input flow on which you want to apply
                      the match algorithm.
Results
The Data Preview table has additional columns that show the results of matching data. These columns indicate the
following:
Column           Description
GRP_SIZE         counts the number of records in the group, computed only on the master record.
                 Each input record is compared to the master record; if they match, the input record
                 is added to the group.
SCORE            measures the distance between the input record and the master record according to the
                 matching algorithm used.
GRP_QUALITY      only the master record has a quality score, which is the minimal value in the group.
ATTRIBUTE_SCORE  lists the match score and the names of the columns used as key attributes in the
                 applied rules.
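The way these result columns relate to each other can be sketched with a simplified grouping loop. This is an illustration only: it uses difflib.SequenceMatcher from the Python standard library as a stand-in similarity measure, a hypothetical threshold of 0.8, and a first-match grouping strategy, none of which are taken from the Studio's matching algorithms.

```python
from difflib import SequenceMatcher

THRESHOLD = 0.8  # hypothetical match threshold


def similarity(a, b):
    """Stand-in similarity score between 0 and 1 (not a Studio algorithm)."""
    return SequenceMatcher(None, a, b).ratio()


records = ["John", "Jon", "Mary", "John"]
groups = []  # each group: {"master": value, "members": [(record, score), ...]}

for record in records:
    for group in groups:
        # Each input record is compared to the master record of the group.
        score = similarity(record, group["master"])
        if score >= THRESHOLD:
            group["members"].append((record, score))
            break
    else:
        # No group matched: the record becomes the master of a new group.
        groups.append({"master": record, "members": [(record, 1.0)]})

summary = [
    {
        "master": group["master"],
        "GRP_SIZE": len(group["members"]),
        "GRP_QUALITY": min(score for _, score in group["members"]),
    }
    for group in groups
]

print(summary)
```

In this sketch the group size and the group quality are reported once per group, on the master record, and the quality is the lowest score found in the group.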
You can define match rules from the match analysis editor by defining:
blocking keys, the column(s) from the input flow according to which you want to partition the processed data in blocks,
matching keys and survivorship rules, the match algorithms you want to apply on columns from the input flow.
Defining a blocking key is not mandatory but strongly advisable. Using a blocking key to partition data in blocks reduces the number
of records that need to be examined as comparisons are restricted to record pairs within each block. Using blocking column(s) is
very useful when you are processing a big data set.
Procedure
1. In the Data section, click the Select Blocking Key tab and then click the name of the column(s) you want to use to partition
the processed data in blocks.
Blocking keys that have the exact name of the selected columns are listed in the Blocking Key table.
You can define more than one column in the table, but only one blocking key will be generated and listed in the BLOCK_KEY
column in the Data table.
For example, if you use an algorithm on the country and lname columns to process records that have the same first
character, data records that have the same first letter in the country and last names are grouped together in the same block.
Comparison is restricted to records within each block.
To remove a column from the Blocking key table, right-click it and select Delete or click on its name in the Data table.
2. Select an algorithm for the blocking key, and set the other parameters in the Blocking Key table as needed.
In this example, only one blocking key is used. The first character of each word in the country column is retrieved and listed
in the BLOCK_KEY column.
3. Click Chart to compute the generated key, group the sample records in the Data table and display the results in a chart.
This chart allows you to visualize the statistics regarding the number of blocks and to adapt the blocking parameters
according to the results you want to get.
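The benefit of a blocking key can be illustrated by counting the record pairs that need to be compared with and without blocks. The sketch below assumes a key built from the first character of the country and lname columns, concatenated into a single value; the records and the way the key is built are hypothetical.

```python
from collections import defaultdict

# Hypothetical input records.
records = [
    {"country": "France", "lname": "Martin"},
    {"country": "France", "lname": "Moreau"},
    {"country": "Finland", "lname": "Makinen"},
    {"country": "Germany", "lname": "Schmidt"},
    {"country": "Greece", "lname": "Samaras"},
    {"country": "Spain", "lname": "Garcia"},
]


def block_key(record):
    """First character of each blocking column, concatenated into one key."""
    return record["country"][0].upper() + record["lname"][0].upper()


def pair_count(n):
    """Number of distinct pairs among n records."""
    return n * (n - 1) // 2


blocks = defaultdict(list)
for record in records:
    blocks[block_key(record)].append(record)

pairs_without_blocking = pair_count(len(records))
pairs_with_blocking = sum(pair_count(len(block)) for block in blocks.values())

print(dict((key, len(block)) for key, block in blocks.items()))
print(pairs_without_blocking, pairs_with_blocking)
```

With three blocks of 3, 2 and 1 records, only 4 pairs are compared instead of 15. A key that is too coarse leaves most pairs in place, and a key that is too fine may separate records that should match.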
Procedure
1. In the Record linkage algorithm section, select Simple VSR Matcher if it is not selected by default.
2. In the Data section, click the Select Matching Key tab and then click the name of the column(s) on which you want to apply
the match algorithm.
Matching keys that have the exact names of the selected input columns are listed in the Matching Key table.
To remove a column from this table, right-click it and select Delete or click on its name in the Data table.
3. Select the match algorithms you want to use from the Matching Function column and the null operator from the Handle Null
column.
In this example, two match keys are defined: you want to use the Levenshtein and Jaro-Winkler match methods on first
names and last names respectively to get the duplicate records.
If you want to use an external user-defined matching algorithm, select Custom and use the Custom Matcher column to load
the Jar file of the user-defined algorithm.
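To give an idea of what a matching function computes, the following sketch implements the classic Levenshtein edit distance and turns it into a score between 0 and 1. The normalization by the length of the longer string and the threshold of 0.7 are assumptions made for the illustration; this is not the Studio's implementation, and Jaro-Winkler is left out for brevity.

```python
def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Normalize the distance to a score between 0 and 1 (1.0 means identical)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


THRESHOLD = 0.7  # hypothetical match threshold
is_match = similarity("Jon", "John") >= THRESHOLD

print(levenshtein("Jon", "John"), similarity("Jon", "John"), is_match)
```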
Procedure
Make sure first to select the column(s) on which to apply the match algorithm either from the Data section by using the Select
Matching Key tab, or directly from the Matching Key table.
Procedure
2. In the Match and Survivor section, you define the criteria to use when matching data records. Click the [+] button to add a
new rule, and then set the following criteria.
Match Key Name: Enter the name of your choice for the match key.
Matching Function: Select the type of matching you want to perform from the drop-down list. Select Custom if you
want to use an external user-defined matching algorithm.
Custom Matcher: This item is only used with the Custom matching function. Browse and select the Jar file of the user-
defined algorithm.
Threshold: Specify the match score (between 0 and 1) above which two values should be considered a match.
Confidence Weight: Set a numerical weight (between 1 and 10) to the column you want to use as a match key. This
value is used to give greater or lesser importance to certain columns when performing the match.
Handle Null: Specify how to deal with data records which contain null values.
nullMatch None: If one record contains a null, do not consider this a match.
Survivorship Function: Select how two similar records will be merged from the drop-down list.
Concatenate: It adds the content of the first record and the content of the second record together - for
example, Bill and William will be merged into BillWilliam. In the Parameter field, you can specify a separator to
be used to separate values.
Prefer True (for booleans): It always sets booleans to True in the merged record, unless all booleans in the
source records are False.
Prefer False (for booleans): It always sets booleans to False in the merged record, unless all booleans in the
source records are True.
Most common: It validates the most frequently-occurring field value in each duplicates group.
Most recent or Most ancient: The former validates the latest date value and the latter the earliest date value in
each duplicates group. The relevant Reference column must be of the Date type.
Longest or Shortest: The former validates the longest field value and the latter the shortest field value in each
duplicates group.
Largest or Smallest: The former validates the largest numerical value and the latter the smallest numerical
value in a duplicates group.
Warning: Make sure you select Largest or Smallest as the survivorship function when the match key is of
numeric type.
Reference column: If you set Survivorship Function to Most recent or Most ancient, this item is used to select the reference
column.
Parameter: For the Concatenate survivorship function, this item is used to specify a separator you want to use for
concatenating data.
3. In the Match threshold field, enter the match probability threshold. Two data records match when the probability is above
this value.
In the Confident match threshold field, set a numerical value between the current Match threshold and 1.
4. In the Survivorship Rules For Columns section, define how data records survive for certain columns. Click the [+] button to
add a new rule, and then set the following criteria:
Input Column: Enter the column to which you want to apply the survivorship rule.
Survivorship Function: Select how two similar records will be merged from the drop-down list.
Parameter: For the Most trusted source survivorship function, this item is used to set the name of the data source
you want to use as a base for the master record. For the Concatenate survivorship function, this item is used to
specify a separator to use for concatenating data.
If you specify the survivorship function for a match key in the Match And Survivor section and also specify the survivorship
function for the match key as an input column in the Survivorship Rules For Columns section, the survivorship function
selected in the Match And Survivor section is applied to the column.
5. In the Default Survivorship Rules section, define how matched records are merged for certain data types: Boolean, Date,
Number and String.
a. Click the [+] button to add a new row for each data type.
b. In the Data Type column, select the relevant data type from the drop-down list.
c. In the Survivorship Function column, select how two similar records will be merged from the drop-down list. Note
that, depending on the data type, only certain choices may be relevant.
Warning: Make sure you select Largest or Smallest as the survivorship function when the match key is of numeric
type.
d. Parameter: For the Most trusted source survivorship function, this item is used to set the name of the data source
you want to use as a base for the master record. For the Concatenate survivorship function, this item is used to
specify a separator you want to use for concatenating data.
If you specify the survivorship function for a column in the Survivorship Rules For Columns section and also specify the
survivorship function for the data type of the column in the Default Survivorship Rules section, the survivorship function
selected in the Survivorship Rules For Columns section is applied to the column.
If you do not specify the behavior for any or all data types, the default behavior (the Most common survivorship function) will
be applied, that is, the most frequently-occurring field value in each duplicates group will be validated.
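The survivorship functions listed in this procedure can be pictured as small reducers over the values of one column in a duplicates group. The following is a minimal sketch in Python, not the Talend Studio implementation: the function names are hypothetical, and the tie-breaking behavior (first value seen wins) is an assumption.

```python
from collections import Counter
from datetime import date

def concatenate(values, separator=""):
    """Join the values in group order, e.g. Bill + William -> BillWilliam."""
    return separator.join(str(v) for v in values)

def prefer_true(values):
    """True unless every boolean in the group is False."""
    return any(values)

def prefer_false(values):
    """False unless every boolean in the group is True."""
    return all(values)

def most_common(values):
    """Most frequent value; on a tie, the value seen first is kept (assumption)."""
    return Counter(values).most_common(1)[0][0]

def longest(values):
    return max(values, key=len)

def shortest(values):
    return min(values, key=len)

def largest(values):
    return max(values)

def smallest(values):
    return min(values)

def most_recent(values, reference_dates):
    """Value whose entry in the reference (Date) column is the latest."""
    return max(zip(reference_dates, values), key=lambda pair: pair[0])[1]

def most_ancient(values, reference_dates):
    """Value whose entry in the reference (Date) column is the earliest."""
    return min(zip(reference_dates, values), key=lambda pair: pair[0])[1]

# Hypothetical duplicates group: one column of names, one reference date column.
names = ["Bill", "William", "Bill"]
updated = [date(2020, 1, 1), date(2022, 7, 26), date(2021, 3, 15)]
merged_name = most_recent(names, updated)  # the name from the latest record
```

Each function takes every value of a column within one group and returns the single value kept in the merged record, which is why the Reference column is only needed by the two date-based functions.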
Procedure
1. To define a second match rule, put your cursor on the top right corner of the Matching Key table and click the [+] button to
create a new rule.
Follow the steps outlined in Defining a match rule to define matching keys.
When you define multiple rules in the match rule editor, an OR match operation is conducted on the analyzed data.
Records are evaluated against the first rule; records that match it are not evaluated against the second rule, and so on.
2. Click the button at the top right corner of the Matching Key or Match and Survivor section and replace the default name
of the rule with a name of your choice.
If you define more than one rule in the match analysis, you can use the up and down arrows in the dialog box to change the
rule order and thus decide what rule to execute first.
3. Click OK.
The rules are named and ordered accordingly in the section.
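The OR evaluation and the effect of rule order described in this procedure can be sketched as a first-match loop. This is only an illustration in Python: the rule names, the record fields and the predicate functions are hypothetical, and real match rules compute scores rather than exact comparisons.

```python
def evaluate_rules(pair, rules):
    """Return the name of the first rule that accepts the pair, or None.

    Rules are tried in order; once one matches, later rules are skipped.
    """
    for name, rule in rules:
        if rule(pair):
            return name
    return None

def same_email(pair):
    a, b = pair
    return a["email"] is not None and a["email"] == b["email"]

def same_last_name(pair):
    a, b = pair
    return a["last"].lower() == b["last"].lower()

rules = [("email rule", same_email), ("last name rule", same_last_name)]

pair = (
    {"last": "Smith", "email": "bill@example.com"},
    {"last": "SMITH", "email": "bill@example.com"},
)
matched_by = evaluate_rules(pair, rules)
```

Moving a rule up or down in the list changes which rule is reported for pairs that satisfy several rules, which is what the up and down arrows control.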
5. Click Chart to compute the groups according to the blocking key and match rule you defined in the editor and display the
results of the sample data in a chart.
This chart shows a global picture of the duplicates in the analyzed data. The Hide groups less than parameter is set to 2
by default. This parameter enables you to decide which groups to show in the chart; you usually want to hide small
groups.
The chart in the above image indicates that out of the 1000 sample records you examined and after excluding items that are
unique, by having the Hide groups less than parameter set to 2 :
49 groups have 2 items each. In each group, the 2 items are duplicates of each other.
7 groups have 3 duplicate items and the last group has 4 duplicate items.
Also, the Data table indicates the match details of items in each group and colors the groups in accordance with their colors
in the match chart.
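The counts read from the chart can be reproduced from the group identifier assigned to each record. The sketch below, in Python, rebuilds the example figures (1000 records, 49 groups of 2, 7 groups of 3, 1 group of 4); the function name and the way group identifiers are generated are assumptions made for illustration.

```python
from collections import Counter

def group_size_distribution(group_ids, hide_groups_less_than=2):
    """Map each group size to the number of groups of that size.

    Groups smaller than hide_groups_less_than are left out, as in the chart.
    """
    sizes = Counter(group_ids)              # group id -> records in the group
    distribution = Counter(sizes.values())  # group size -> number of groups
    return {
        size: count
        for size, count in sorted(distribution.items())
        if size >= hide_groups_less_than
    }

# Rebuild the example: 877 unique records plus the duplicate groups.
group_ids = []
next_id = 0
for size, count in [(1, 877), (2, 49), (3, 7), (4, 1)]:
    for _ in range(count):
        group_ids.extend([next_id] * size)
        next_id += 1

chart = group_size_distribution(group_ids)
```

Setting hide_groups_less_than to 1 would also report the 877 unique records as groups of size 1.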
Procedure
Save the settings in the match analysis editor and press F6.
The analysis is executed. The match rule and blocking key are computed against the whole data set and the Analysis Results view
opens in the editor.
In this view, the charts give a global picture about the duplicates in the analyzed data. In the first tables, you can read statistics about
the count of processed records, distinct records with only one occurrence, duplicate records (matched records) and suspect records
that did not match the rule. Duplicate records represent the records that matched with a good score - above the confidence
threshold. One record of the matched pair is a duplicate that should be discarded and the other is the survivor record.
In the second table, you can read statistics about the number of groups and the number of records in each group. You can click any
column header in the table to sort the results accordingly.
You can import match rules from the studio repository and use them in the match editor to test them on your data. You can also
export match rules from the match editor and save them in the studio repository.
Procedure
2. In the Match Rule Selector wizard, select the match rule you want to import into the match analysis editor and use on the
analyzed data.
A warning message displays in the wizard if the match rule you want to import is defined on columns that do not exist in the
analyzed data. Ignore the message as you can define input columns later in the match analysis editor.
3. Select the Overwrite current Match Rule in the analysis check box if you want to replace the rule in the editor with the rule
you import. Otherwise, leave the box unselected.
4. Click OK.
The match rule is imported and the matching and blocking keys and/or survivorship rules are listed in the Matching Key and
Blocking Key tables respectively.
5. Click in the Input column and select from the list the column on which you want to apply the imported blocking and
matching keys.
If the analyzed data contains a column that matches the input column in the imported keys, it is automatically defined
in the Input column; you do not need to define it yourself.
When you analyze data with multiple conditions, the match results will list data records that meet any of the defined rules.
When you execute the match analysis, an OR match operation is conducted on data and data records are evaluated against
the first rule and the records that match are not evaluated against the other rules.
Procedure
2. In the open wizard, enter a name for the rule and set other metadata, if needed.
3. Click Finish.
The rule editor opens on the rule settings and the rule is saved and listed under Libraries > Rules > Match in the DQ
Repository tree view.
In data quality, match rules are used to compare a set of columns and create groups of similar records using blocking and matching
keys and/or survivorship functions.
From the studio, you can create match rules with the VSR or the T-Swoosh algorithm and save them in the studio repository. Once
centralized in the repository, you can import them in the match analysis editor and test them on your data to group duplicate
records. For further information about the match analysis, see Creating a match analysis.
The two algorithms produce different match results for two reasons:
First, with the VSR algorithm, the master record is simply the first input record. Therefore, the list of match
groups may depend on the order of the input records.
Second, the output records do not change with the VSR algorithm, whereas the T-Swoosh algorithm creates new records.
Procedure
3. In the New Match Rule wizard, enter a name and set other metadata, if required.
Note: Avoid using the following special characters in item names:
"~", "!", "`", "#", "^", "&", "*", "\\", "/", "?", ":", ";", "\"", ".", "(", ")", "'", "¥", "'", """, "«", "»", "<", ">".
These characters are all replaced with "_" in the file system and you may end up creating duplicate items.
Consider as an example that you want to create a rule to match customer full names.
4. Click Finish.
A match rule editor opens in the studio and the new match rule is listed under Libraries > Rules > Match in the DQ Repository
tree view.
In the Record Linkage algorithm view, the Simple VSR Matcher algorithm is selected by default.
5. Start defining the match rule items as described in Rules with the VSR algorithm and Rules with the T-Swoosh algorithm.
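The note on special characters in item names can be checked before you create an item. The following Python sketch assumes that each listed character is replaced one-for-one with "_"; the helper names are hypothetical, and the character set leaves out the two quotation-mark entries that are not legible in the list above.

```python
# Characters from the note; backslash and quotes are escaped for Python.
FORBIDDEN = set('~!`#^&*\\/?:;".()\'¥«»<>')

def file_system_name(item_name):
    """Replace each special character with an underscore."""
    return "".join("_" if ch in FORBIDDEN else ch for ch in item_name)

def find_collisions(item_names):
    """Group item names that end up with the same file system name."""
    seen = {}
    for name in item_names:
        seen.setdefault(file_system_name(name), []).append(name)
    return {fs_name: names for fs_name, names in seen.items() if len(names) > 1}

collisions = find_collisions(["customer.names", "customer:names", "orders"])
```

Here two distinct item names map to the same file system name, which is how duplicate items can appear.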
Duplicating a rule
To avoid creating a match rule from scratch, you can duplicate an existing one and modify its metadata and definition to
obtain a new rule.
Procedure
1. In the DQ Repository tree view, expand Libraries > Rules > Match.
2. Browse through the match rule list to reach the rule you want to duplicate.
4. Double-click the duplicated rule to open it and modify its metadata and/or definition as needed.
The VSR algorithm takes a set of records as input and groups similar encountered duplicates together according to defined match
rules. It compares pairs of records and assigns them to groups. The first processed record of each group is the master record of the
group. So, the order of the records is important and can have an impact on the creation process of the master records.
The VSR algorithm compares each record with the master of each group and uses the computed distances, from master records, to
decide to what group the record should go.
In the match analysis, the matching results of the VSR algorithm may vary depending on the order of the input records. If possible,
put the records in which you have more confidence first in the input flow, to have better algorithm accuracy.
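The order dependence described above follows from comparing each record only with the master (first record) of each existing group. The Python sketch below illustrates the idea; it is not the VSR implementation. The similarity measure from difflib stands in for the matching functions of the studio, and the threshold value is an assumption.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Stand-in similarity in [0, 1]; not one of the studio's matching functions."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def group_by_first_master(records, threshold=0.8):
    """Assign each record to the group whose master it resembles most.

    The first record placed in a group is its master and never changes.
    """
    groups = []  # each group is a list; element 0 is the master
    for record in records:
        best_group, best_score = None, threshold
        for group in groups:
            score = similarity(record, group[0])
            if score >= best_score:
                best_group, best_score = group, score
        if best_group is None:
            groups.append([record])    # the record starts a new group as master
        else:
            best_group.append(record)
    return groups

groups = group_by_first_master(["Jon Smith", "John Smith", "Mary Jones"])
reordered = group_by_first_master(["John Smith", "Jon Smith", "Mary Jones"])
```

Both runs find the same two groups, but the master of the first group differs, which is why records you trust most should come first in the input flow.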
You can import and test the rule on your data in the match analysis editor. For further information, see Importing or exporting match
rules.
Procedure
1. In the rule editor and in the Generation of Blocking Key section, click the [+] button to add a row to the table.
Blocking Key Name: Enter a name for the column you want to use to reduce the number of record pairs that need to
be compared.
Pre-algorithm: Select from the drop-down list an algorithm and set its value where necessary.
Defining a pre-algorithm is not mandatory. This algorithm is used to clean or standardize data before processing it
with the match algorithm and thus improve the results of data matching.
Algorithm: Select from the drop-down list the match algorithm you want to use and set its value where necessary.
Post-algorithm: Select from the drop-down list an algorithm and set its value where necessary.
Defining a post-algorithm is not mandatory. This algorithm is used to clean or standardize data after processing it
with the match algorithm and thus improve the outcome of data matching.
3. If required, follow the same steps to add as many blocking keys as needed.
When you import a rule with many blocking keys into the match analysis editor, only one blocking key will be generated and
listed in the BLOCK_KEY column in the Data table.
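The chain of pre-algorithm, algorithm and post-algorithm can be sketched as three small steps that turn a column value into a blocking key, after which only records sharing a key are compared. In this Python sketch the three chosen steps (removing diacritics, keeping the first characters, upper-casing) and all function names are assumptions for illustration, not the studio's algorithm list.

```python
import unicodedata
from collections import defaultdict
from itertools import combinations

def remove_diacritics(text):
    """Pre-algorithm (assumed): strip accents so that 'É' and 'E' compare equal."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def blocking_key(value, n=1):
    cleaned = remove_diacritics(value).strip()  # pre-algorithm
    key = cleaned[:n]                           # algorithm: first n characters
    return key.upper()                          # post-algorithm (assumed)

def candidate_pairs(records, n=1):
    """Return only the pairs of records that share a blocking key."""
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record, n)].append(record)
    pairs = []
    for block in blocks.values():
        pairs.extend(combinations(block, 2))
    return pairs

names = ["Élodie", "elodie", "Eric", "Bob", "bill"]
pairs_with_blocking = candidate_pairs(names)
pairs_without_blocking = list(combinations(names, 2))
```

With five records, blocking on the first character cuts the comparisons from 10 pairs to 4; the saving grows quickly with the size of the data set.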
Procedure
1. In the rule editor and in the Matching Key table, click the [+] button to add a row to the table.
Match Key Name: Enter the name of your choice for the match key.
Matching Function: Select the type of matching you want to perform from the drop-down list. Select Custom if you
want to use an external user-defined matching algorithm.
In this example, two match keys are defined: you want to use the Levenshtein and Jaro-Winkler match methods on
first names and last names respectively and get the duplicate records.
Custom Matcher: This item is only used with the Custom matching function. Browse and select the Jar file of the user-
defined algorithm.
Confidence Weight: Set a numerical weight (between 1 and 10) to the column you want to use as a match key. This
value is used to give greater or lesser importance to certain columns when performing the match.
Handle Null: Specify how to deal with data records which contain null values.
3. In the Match threshold field, enter the match probability threshold. Two data records match when the probability is above
this value.
In the Confident match threshold field, set a numerical value between the current Match threshold and 1 . Above this
threshold, you can be confident about the quality of the group.
4. To define a second match rule, place your cursor on the top right corner of the Matching Key table and then click the [+]
button.
Follow the steps to create a match rule.
When you define multiple conditions in the match rule editor, an OR match operation is conducted on the analyzed data.
Records are evaluated against the first rule and the records that match are not evaluated against the second rule.
5. If required, put your cursor on the top right corner of the table and click the button then replace the default names of
the rules with names of your choice.
You can also use the up and down arrows in the dialog box to change the rule order and thus decide what rule to execute
first.
6. Click OK.
The rules are named and ordered accordingly in the Matching Key table.
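The interplay of confidence weights, the match threshold and the confident match threshold set in this procedure can be sketched as a weighted average of per-key similarities. This Python sketch makes several assumptions: the weighted average is one plausible way to combine the weights, Levenshtein similarity is used for both keys for brevity (the example in the procedure uses Jaro-Winkler on last names), and the threshold values are arbitrary.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def similarity(a, b):
    """Normalize the edit distance to a score between 0 and 1."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def weighted_score(record_a, record_b, weights):
    """Weighted average of the per-key similarities (assumed combination)."""
    total = sum(weights.values())
    return sum(w * similarity(record_a[k], record_b[k])
               for k, w in weights.items()) / total

def classify(score, match_threshold=0.8, confident_threshold=0.9):
    if score > confident_threshold:
        return "confident match"
    if score > match_threshold:
        return "match"
    return "no match"

weights = {"first": 1, "last": 3}  # last name counts three times as much
a = {"first": "Bill", "last": "Smith"}
b = {"first": "Bil", "last": "Smith"}
score = weighted_score(a, b, weights)
```

Raising the weight of a column makes disagreements on that column cost more, so the same pair can move from a confident match to a plain match or to no match.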
You can use the T-Swoosh algorithm to find duplicates and to define how two similar records are merged to create a master record,
using a survivorship function. These new merged records are used to find new duplicates.
The differences between the T-Swoosh and the VSR algorithms are the following:
When using the T-Swoosh algorithm, the master record is in general a new record that does not exist in the list of input
records.
When using the T-Swoosh algorithm, you can define a survivorship function for each column to create a master record.
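The idea that merged records are new records which are then used to find further duplicates can be sketched with a generic Swoosh-style loop. The Python code below is not the T-Swoosh implementation of the studio: the match test (same non-null email or phone), the merge policy (longest name, first non-null value elsewhere) and the record fields are all hypothetical.

```python
def is_match(a, b):
    """Assumed rule: records match when they share a non-null email or phone."""
    return any(a[k] is not None and a[k] == b[k] for k in ("email", "phone"))

def merge(a, b):
    """Assumed survivorship: longest name, first non-null value for other fields."""
    merged = {}
    for key in a:
        candidates = [v for v in (a[key], b[key]) if v is not None]
        if not candidates:
            merged[key] = None
        elif key == "name":
            merged[key] = max(candidates, key=len)
        else:
            merged[key] = candidates[0]
    return merged

def swoosh(records):
    """Merge matching records; each merged record is compared again."""
    pending = list(records)
    masters = []
    while pending:
        current = pending.pop(0)
        for i, master in enumerate(masters):
            if is_match(current, master):
                del masters[i]
                pending.append(merge(master, current))  # re-examine the new record
                break
        else:
            masters.append(current)
    return masters

records = [
    {"name": "Bill", "email": "b@x.com", "phone": None},
    {"name": "W. Smith", "email": None, "phone": "555"},
    {"name": "William", "email": "b@x.com", "phone": "555"},
]
masters = swoosh(records)
```

The first two records do not match each other directly. Once the third record is merged with the first, the merged record carries both the email and the phone number, so it also absorbs the second record and a single master record remains.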
Procedure
2. In the Match and Survivor section, you define the criteria to use when matching data records. Click the [+] button to add a
new rule, and then set the following criteria.
Match Key Name: Enter the name of your choice for the match key.
Matching Function: Select the type of matching you want to perform from the drop-down list. Select Custom if you
want to use an external user-defined matching algorithm.
Custom Matcher: This item is only used with the Custom matching function. Browse and select the Jar file of the user-
defined algorithm.
Threshold: Specify the match score (between 0 and 1) above which two values should be considered a match.
Confidence Weight: Set a numerical weight (between 1 and 10) to the column you want to use as a match key. This
value is used to give greater or lesser importance to certain columns when performing the match.
Handle Null: Specify how to deal with data records which contain null values.
nullMatch None: If one record contains a null, do not consider this a match.
Survivorship Function: Select how two similar records will be merged from the drop-down list.
Concatenate: It adds the content of the first record and the content of the second record together - for
example, Bill and William will be merged into BillWilliam. In the Parameter field, you can specify a separator to
be used to separate values.
Prefer True (for booleans): It always sets booleans to True in the merged record, unless all booleans in the
source records are False.
Prefer False (for booleans): It always sets booleans to False in the merged record, unless all booleans in the
source records are True .
Most common: It validates the most frequently-occurring field value in each duplicates group.
Most recent or Most ancient: The former validates the latest date value and the latter the earliest date value in
each duplicates group. The relevant Reference column must be of the Date type.
Longest or Shortest: The former validates the longest field value and the latter the shortest field value in each
duplicates group.
Largest or Smallest: The former validates the largest numerical value and the latter the smallest numerical
value in a duplicates group.
Warning: Make sure you select Largest or Smallest as the survivorship function when the match key is of
numeric type.
Reference column: If you set Survivorship Function to Most recent or Most ancient, this item is used to select the
reference column.
Parameter: For the Concatenate survivorship function, this item is used to specify a separator you want to use for
concatenating data.
3. In the Match threshold field, enter the match probability threshold. Two data records match when the probability is above
this value.
In the Confident match threshold field, set a numerical value between the current Match threshold and 1.
4. In the Survivorship Rules For Columns section, define how data records survive for certain columns. Click the [+] button to
add a new rule, and then set the following criteria:
Input Column: Enter the column to which you want to apply the survivorship rule.
Survivorship Function: Select how two similar records will be merged from the drop-down list.
Parameter: For the Most trusted source survivorship function, this item is used to set the name of the data source
you want to use as a base for the master record. For the Concatenate survivorship function, this item is used to
specify a separator to use for concatenating data.
If you specify the survivorship function for a match key in the Match And Survivor section and also specify the survivorship
function for the match key as an input column in the Survivorship Rules For Columns section, the survivorship function
selected in the Match And Survivor section is applied to the column.
5. In the Default Survivorship Rules section, define how matched records are merged for certain data types: Boolean, Date,
Number and String.
a. Click the [+] button to add a new row for each data type.
b. In the Data Type column, select the relevant data type from the drop-down list.
c. In the Survivorship Function column, select how two similar records will be merged from the drop-down list. Note
that, depending on the data type, only certain choices may be relevant.
Warning: Make sure you select Largest or Smallest as the survivorship function when the match key is of numeric
type.
d. Parameter: For the Most trusted source survivorship function, this item is used to set the name of the data source
you want to use as a base for the master record. For the Concatenate survivorship function, this item is used to
specify a separator you want to use for concatenating data.
If you specify the survivorship function for a column in the Survivorship Rules For Columns section and also specify the
survivorship function for the data type of the column in the Default Survivorship Rules section, the survivorship function
selected in the Survivorship Rules For Columns section is applied to the column.
If you do not specify the behavior for any or all data types, the default behavior (the Most common survivorship function) will
be applied, that is, the most frequently-occurring field value in each duplicates group will be validated.
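The order of precedence between the three sections described in this procedure can be summarized as a simple lookup chain. The Python sketch below uses hypothetical dictionaries to stand for the settings of each section; only the order of the checks and the Most common fallback come from the text above.

```python
def resolve_survivorship(column, data_type, match_key_rules, column_rules, default_rules):
    """Pick the survivorship function that applies to a column.

    Precedence: Match And Survivor, then Survivorship Rules For Columns,
    then Default Survivorship Rules, then Most common.
    """
    if column in match_key_rules:
        return match_key_rules[column]
    if column in column_rules:
        return column_rules[column]
    return default_rules.get(data_type, "Most common")

# Hypothetical settings for illustration.
match_key_rules = {"last_name": "Longest"}
column_rules = {"last_name": "Concatenate", "amount": "Largest"}
default_rules = {"Number": "Smallest", "Boolean": "Prefer True"}
```

For example, last_name resolves to Longest because the match key setting wins over the column rule, and a String column with no rule at all falls back to Most common.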
Column analyses
Where to start?
Talend Studio enables you to examine and collect statistics and information about the data available in database columns and in
delimited files.
design a column analysis from scratch and define the analysis settings manually,
create column analyses automatically preconfigured with the indicators appropriate to the type you select,
Procedure
Option: To...
Basic Column Analysis: generate an empty column analysis where you can select the columns to analyze and manually
assign the indicators on each column.
Discrete Data Analysis: create a column analysis on numerical data preconfigured with the Bin Frequency and simple
statistics indicators. You can then configure the analysis further or modify it in order to convert continuous data
into discrete bins (ranges) according to your needs.
Nominal Values Analysis: create a column analysis on nominal data preconfigured with indicators
For example results about these statistics, see Finalizing and executing the column analysis.
Pattern Frequency Analysis: create a column analysis preconfigured with the Pattern Frequency, Pattern Low Frequency
and the row and null count indicators.
This analysis can discover patterns in your data. It shows frequent patterns and rare patterns so that you can
identify quality issues more easily.
For example results about these statistics, see Finalizing and executing the column analysis.
Summary Statistics Analysis: create a column analysis on numerical data preconfigured with the Summary Statistics and
the row and null count indicators.
This helps you to get a good idea of the shape of your numeric data by computing the range, the inter quartile range
and the mean and median values.
2. Usually, the sequence of profiling data in one or multiple columns involves the following steps:
a. Connecting to the data source. For more information, see Creating connections to data sources.
b. Defining one or more columns on which to carry out data profiling processes that will define the content, structure
and quality of the data included in the column(s).
c. Setting predefined system indicators or indicators defined by the user on the column(s) that need to be analyzed or
monitored. These indicators will represent the results achieved through the implementation of different patterns.
d. Adding to the column analyses the patterns against which you can define the content, structure and quality of the
data.
For more information, see Adding a regular expression or an SQL pattern to a column analysis.
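To make the preconfigured analyses listed in the table above more concrete, the following Python sketch computes the kind of values that summary statistics, pattern frequency and bin frequency indicators report. It is an illustration only: the function names are hypothetical, the pattern convention (A for uppercase, a for lowercase, 9 for digits) and the half-open bins are assumptions, and the quartile method of the statistics module may differ from the one used by the studio.

```python
import statistics
from bisect import bisect_right
from collections import Counter

def summary_statistics(values):
    """Range, inter quartile range, mean and median of numeric data."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "range": max(values) - min(values),
        "iqr": q3 - q1,
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }

def to_pattern(value):
    """Replace letters and digits with placeholder symbols (assumed convention)."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A" if ch.isupper() else "a")
        else:
            out.append(ch)
    return "".join(out)

def pattern_frequency(values):
    """Count how often each pattern occurs; rare patterns hint at quality issues."""
    return Counter(to_pattern(v) for v in values)

def bin_frequency(values, edges):
    """Count values per half-open bin [edges[i], edges[i + 1])."""
    counts = Counter()
    for v in values:
        i = bisect_right(edges, v) - 1
        if 0 <= i < len(edges) - 1:
            counts[(edges[i], edges[i + 1])] += 1
    return dict(counts)

stats = summary_statistics([10, 20, 30, 40, 50])
patterns = pattern_frequency(["AB-1234", "CD-5678", "ef 90"])
bins = bin_frequency([5, 15, 25, 95], [0, 10, 20, 100])
```

In the pattern example, two values share the pattern AA-9999 while the third stands out with a different pattern, which is the kind of outlier a Pattern Low Frequency indicator is meant to surface.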
What to do next
The Creating a basic analysis on a database column section explains the procedures to analyze the content of one or multiple
columns in a database.
The Creating a basic column analysis on a file section explains the procedures to analyze columns in delimited files.
You can build your analysis from scratch, analyze the content of one or multiple columns and execute the created analyses using the
Java or the SQL engine. This type of analysis provides statistics about the values within each column.
When you use the Java engine to run a column analysis, you can view the analyzed data according to parameters you set yourself.
For more information, see Using the Java or the SQL engine.
Note: When you use the Java engine to run a column analysis on big sets or on data with many problems, it is advisable to define
a maximum memory size threshold in Talend Studio Preferences to execute the analysis as you may end up with a Java heap
error.
You can also analyze a set of columns. This type of analysis provides statistics on the values across all the data set (full records).
The sequence of creating a basic column analysis involves the following steps:
Procedure
1. Defining the column(s) to be analyzed.
2. Setting predefined system indicators or indicators defined by the user for the column(s).
3. Adding the patterns against which to define the content, structure and quality of the data.
The first step in analyzing the content of one or multiple columns is to define the column(s) to be analyzed. The analysis results
provide statistics about the values within each column.
Before you begin, you have defined at least one database connection in the Profiling perspective of Talend Studio.
When you analyze Date columns and run the analysis with the Java engine, the date information is stored in Talend Studio
as regular date/time values, in the format YYYY-MM-DD HH:mm:ss.SSS for date/timestamp and HH:mm:ss.SSS for time. The
date and time formats are slightly different when you run the analysis with the SQL engine.
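If you post-process analysis results outside the studio, the two formats above can be reproduced as follows. This is a Python sketch with hypothetical helper names; it truncates microseconds to milliseconds, which is an assumption about how the fractional part should be handled.

```python
from datetime import datetime, time

def format_timestamp(value):
    """Render a datetime as YYYY-MM-DD HH:mm:ss.SSS (milliseconds, truncated)."""
    return value.strftime("%Y-%m-%d %H:%M:%S.%f")[:-3]

def format_time(value):
    """Render a time as HH:mm:ss.SSS (milliseconds, truncated)."""
    return value.strftime("%H:%M:%S.%f")[:-3]

stamp = format_timestamp(datetime(2022, 7, 26, 9, 49, 0, 123000))
clock = format_time(time(9, 49, 0, 500000))
```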
Procedure
3. In the filter field, start typing basic column analysis, select Basic Column Analysis and click Next.
4. In the Name field, enter a name for the current column analysis.
Note: Avoid using the following special characters in item names:
"~", "!", "`", "#", "^", "&", "*", "\\", "/", "?", ":", ";", "\"", ".", "(", ")", "'", "¥", "'", """, "«", "»", "<", ">".
These characters are all replaced with "_" in the file system and you may end up creating duplicate items.
5. Set column analysis metadata (purpose, description and author name) in the corresponding fields and click Next to proceed
to the next step.
Procedure
1. Expand DB connections and in the desired database, browse to the columns you want to analyze.
Note: When profiling a DB2 database, if double quotes exist in the column names of a table, the double quotation marks
cannot be retrieved when retrieving the column. Therefore, it is recommended not to use double quotes in column
names in a DB2 database table.
2. Select the columns and then click Finish to close the wizard.
A file for the newly created column analysis is listed under the Analysis node in the DQ Repository tree view, and the analysis
editor opens with the analysis metadata.
This example analyzes full names, email addresses and sales figures.
Option: To...
New Connection: open a wizard and create a connection to the data source from within the editor.
The Connection field on top of this section lists all the connections created in Talend Studio.
Select Columns: open the Column Selection dialog box where you can select the columns to analyze or change the
selection of the columns listed in the table. From the open dialog box, you can filter the table or column lists by
using the Table filter or Column filter fields respectively.
Select Indicators: open the Indicator Selection dialog box where you can select the indicators to use for profiling
columns.
n first rows or n random rows: list in the table the first N data records from the selected columns, or N random
records from the selected columns.
Refresh Data: display the data in the selected columns according to the criteria you set.
Run with sample data: run the analysis only on the sample data set in the Limit field.
5. In the Limit field, set the number for the data records you want to display in the table and use as sample data.
6. In the Analyzed Columns view, use the arrows in the top right corner to open different pages in the view if you analyze a
large number of columns.
You can also drag the columns to be analyzed directly from the DQ Repository tree view to the Analyzed Columns list in the
analysis editor.
If one of the columns you want to analyze is a primary or a foreign key, its data mining type automatically becomes Nominal
when you list it in the Analyzed Columns view.
For more information, see Data mining types.
7. If required, right-click any of the listed columns in the Analyzed Columns view and select Show in DQ Repository view to
locate it in the database connection in the DQ Repository tree view.
The second step after defining the columns to be analyzed is to set either system or user-defined indicators for each of the defined
columns.
A column analysis is open in the analysis editor in the Profiling perspective of Talend Studio.
Procedure
1. From the Data preview view in the analysis editor, click Select indicators to open the Indicator Selection dialog box.
Note:
There is no point in using Pattern Frequency Statistics on a column of a Date type in databases when executing the analysis
with the SQL engine. This indicator returns no data quality issues because all dates are displayed in a single
format.
If you attach the Date Pattern Frequency to a date column in your analysis, you can generate a date regular expression
from the analysis results.
3. Click OK.
The selected indicators are attached to the analyzed columns in the Analyzed Columns view.
The analysis in this example provides/computes the following:
Prerequisite(s): A column analysis is open in the analysis editor in the Profiling perspective of the studio. For more information, see
Defining the columns to be analyzed.
For more information about setting indicators, see Setting system or user-defined indicators.
Procedure
1. In the Analyzed Columns view in the analysis editor, click the option icon next to the indicator.
2. In the dialog box that opens, set the parameters for the given indicator.
For example, if you want to flag if there are null values in the column you analyze, you can set 0 in the Upper threshold field
for the Null Count indicator.
Indicators settings dialog boxes differ according to the parameters specific for each indicator. For more information about
different indicator parameters, see Indicator parameters.
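The Null Count example in this procedure amounts to comparing an indicator value with its upper threshold. The Python sketch below illustrates this with hypothetical function names; it assumes that a value strictly above the upper threshold is what gets flagged.

```python
def null_count(values):
    """Number of null (None) entries in a column."""
    return sum(1 for v in values if v is None)

def exceeds_upper_threshold(indicator_value, upper_threshold):
    """True when the indicator value is above the upper threshold (assumed strict)."""
    return indicator_value > upper_threshold

emails = ["a@example.com", None, "c@example.com", None]
flagged = exceeds_upper_threshold(null_count(emails), upper_threshold=0)
```

With the upper threshold set to 0, a single null value is enough to flag the column.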
Prerequisite(s):
A column analysis is open in the analysis editor in the Profiling perspective of the studio. For more information, see Defining
the columns to be analyzed.
A user-defined indicator is created in the Profiling perspective of the studio. For more information, see Creating SQL user-
defined indicators.
To set user-defined indicators from the analysis editor for the columns to be analyzed, do the following:
Procedure
1. Either:
a. In the analysis editor and from the Analyzed Columns view, click the icon next to the name of the column for which you
want to define the indicator.
The UDI selector dialog box opens.
2. Or:
a. In the DQ Repository tree view, expand Libraries > Indicators.
b. From the User Defined Indicator folder, drop the user-defined indicator(s) against which you want to analyze the
column content to the column name in the Analyzed Columns view.
The user-defined indicator is listed under the column name.
c. If required, set a threshold for the user-defined indicator.
For further information, see Setting options for system or user-defined indicators.
d. Save the analysis.
After defining the column(s) to be analyzed and setting indicators, you may want to filter the data that you want to analyze and
decide what engine to use to execute the column analysis.
The column analysis is open in the analysis editor in the Profiling perspective of Talend Studio.
You have set system or predefined indicators for the column analysis.
You have installed in the studio the SQL explorer libraries that are required for data quality.
Procedure
1. In the Data Filter view, enter an SQL WHERE clause to filter the data on which to run the analysis, if required.
a. In the Number of connections per analysis field, set the number of concurrent connections allowed per analysis to
the selected database connection.
You can set this number according to the resources available on the database, that is, the number of concurrent
connections the database can support.
Note: Connection concurrency is not supported when using a connection to a SQLite database or a Hive database
on Spark. Connection concurrency is supported when using a connection to a Hive2 server.
b. From the Execution engine list, select the engine, Java or SQL, you want to use to execute the analysis.
If you select the Java engine:
Select the Allow drill down check box to be able to drill down, in the Analysis Results view, into the results of all
indicators except Row Count.
In the Max number of rows kept per indicator field, set the number of data rows you want to drill down.
3. If you have defined context variables in the Contexts view in the analysis editor, do the following:
a. Use the Data Filter and Analysis Parameter views to set or select context variables that filter the data and set the
number of concurrent connections per analysis, respectively.
b. In the Context Group Settings view, select from the list the context environment you want to use to run the analysis.
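To illustrate what a data filter does, the following sketch applies a WHERE clause to a row count query on an in-memory SQLite table. The table name, columns, filter and query text are hypothetical, and the queries the studio actually generates may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (email TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [("a@example.com", "FR"), (None, "FR"), (None, "US"), ("d@example.com", "US")],
)


def row_count(where_clause: str = "") -> int:
    """Count rows, restricted by the data filter when one is given."""
    sql = "SELECT COUNT(*) FROM customer"
    if where_clause:
        sql += f" WHERE ({where_clause})"
    return conn.execute(sql).fetchone()[0]


unfiltered = row_count()
filtered = row_count("country = 'FR'")
print(unfiltered, filtered)
```

Only the rows that satisfy the filter are taken into account by the indicators.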
For further information about the Frequency and Text Statistics, see Advanced statistics and Text statistics respectively.
Below are the graphics representing the Pattern Frequency and Pattern Low Frequency statistics for the email column.
The patterns in the table use a and A to represent the email values. Each pattern can have up to 30 characters. If the total
number of characters exceeds 30, the pattern is represented as follows: aaaaaAAAAAaaaaaAAAAAaaaaaAAAAA...<total
number of characters>. You can place your pointer on the pattern in the table to get the original value.
For further information about these indicators, see Pattern frequency statistics.
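The way values are turned into patterns can be sketched as follows. The mapping of lowercase letters to a and uppercase letters to A, and the 30-character limit, come from the description above. Mapping digits to 9 and the exact rendering of the truncated suffix are assumptions made for this illustration.

```python
from collections import Counter

MAX_PATTERN_LENGTH = 30  # limit stated in the text above


def to_pattern(value: str) -> str:
    """Replace letters (and, by assumption, digits) with pattern characters."""
    chars = []
    for ch in value:
        if ch.islower():
            chars.append("a")
        elif ch.isupper():
            chars.append("A")
        elif ch.isdigit():
            chars.append("9")  # assumption: digits are shown as 9
        else:
            chars.append(ch)  # punctuation is kept as-is
    pattern = "".join(chars)
    if len(pattern) > MAX_PATTERN_LENGTH:
        # Assumed rendering of "...<total number of characters>".
        return f"{pattern[:MAX_PATTERN_LENGTH]}...{len(pattern)}"
    return pattern


# Hypothetical sample values for the email column.
emails = ["john@example.com", "mary@example.com", "Bob@example.com"]
pattern_frequency = Counter(to_pattern(e) for e in emails)
print(pattern_frequency.most_common())
```

Patterns that occur rarely, such as the one starting with an uppercase letter here, are the ones a Pattern Low Frequency view would bring forward.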
Below are the graphics representing the Summary Statistics for the total_sales column.
For further information about Benford's law statistics, which are commonly used as an indicator of accounting and expense
fraud in lists or tables, see Fraud Detection.
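As a reminder of what Benford's law predicts, the expected frequency of a leading digit d is log10(1 + 1/d). The sketch below compares that expectation with the leading digits observed in a list of amounts. The helper names and sample amounts are hypothetical, and the way the studio extracts leading digits may differ.

```python
import math
from collections import Counter


def benford_expected(digit: int) -> float:
    """Expected frequency of a leading digit (1 to 9) under Benford's law."""
    return math.log10(1 + 1 / digit)


def leading_digit_frequencies(amounts):
    """Observed frequency of each leading non-zero digit; zeros are skipped."""
    digits = []
    for amount in amounts:
        text = str(abs(amount)).lstrip("0.")  # drop sign, leading zeros and dot
        if text:
            digits.append(int(text[0]))
    total = len(digits)
    return {d: n / total for d, n in Counter(digits).items()}


# Hypothetical total_sales values.
observed = leading_digit_frequencies([120, 0.034, 9, 0, -15])
for d in range(1, 10):
    print(d, round(benford_expected(d), 3), observed.get(d, 0.0))
```

A large gap between the observed and expected frequencies is what makes a column worth a closer look.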
Results
If you execute this analysis using the Java engine with the Allow drill down check box selected in the Analysis parameters view,
the analyzed data is stored locally and you can access it via Analysis Results > Data view. Use the Max number of rows kept
per indicator field to decide how many data rows you want to make accessible.
When you select the Java engine, the system looks for Java regular expressions first. If none is found, it looks for SQL regular
expressions.
If you execute this analysis using the SQL engine, you can view the executed query for each of the attached indicators if you right-
click an indicator and then select the View executed query option from the list. However, when you use the Java engine, SQL queries
will not be accessible and thus clicking this option will open a warning message.
For more information on the Java engine, see Using the Java or the SQL engine.
After setting the analysis parameters in the analysis editor, you can use either the Java or the SQL engine to execute your analysis.
The choice of the engine can sometimes slightly change analysis results, for example when you select the summary statistics
indicators to profile a DB2 database. This is because indicators are computed differently depending on the database type, and also
because Talend uses special functions when working with Java.
SQL engine:
an SQL query is generated for each indicator used in the column analysis,
the analysis runs multiple indicators in parallel, and results are refreshed in the charts while the analysis is still in progress.
Using this engine gives better system performance. You can also access valid/invalid data in the data explorer.
Java engine:
only one query is generated for all indicators used in the column analysis,
you can set the parameters to decide whether to access the analyzed data and how many data rows to show per indicator.
This will help to avoid memory limitation issues since it is impossible to store all analyzed data.
When you execute the column analysis with the Java engine, you do not need different query templates specific for each database.
However, system performance is significantly reduced in comparison with the SQL engine. Executing the analysis with the Java
engine uses disk space as all data is retrieved and stored locally. If you want to free up some space, you can delete the data stored in
the main studio directory, at Talend-Studio>workspace>project_name>Work_MapDB.
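The difference between the two engines can be pictured with a rough sketch: one query per indicator on the database side, versus a single query whose rows are processed locally with a cap on the rows kept per indicator. The table, queries and variable names are hypothetical, and this is not the studio's actual implementation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (email TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?)",
    [("a@example.com",), (None,), ("a@example.com",), ("b@example.com",)],
)

# SQL-engine style: one query per indicator, computed by the database.
SQL_INDICATORS = {
    "row_count": "SELECT COUNT(*) FROM customer",
    "null_count": "SELECT COUNT(*) FROM customer WHERE email IS NULL",
    "distinct_count": "SELECT COUNT(DISTINCT email) FROM customer",
}
sql_results = {
    name: conn.execute(query).fetchone()[0]
    for name, query in SQL_INDICATORS.items()
}

# Java-engine style: a single query, with indicators computed locally.
MAX_ROWS_KEPT = 50  # default of "Max number of rows kept per indicator"
rows = [r[0] for r in conn.execute("SELECT email FROM customer")]
local_results = {
    "row_count": len(rows),
    "null_count": sum(v is None for v in rows),
    # NULL is excluded here so that both branches of the sketch agree.
    "distinct_count": len({v for v in rows if v is not None}),
}
# Only a bounded sample is kept for drill-down, to limit memory and disk use.
kept_for_drill_down = {"null_count": [v for v in rows if v is None][:MAX_ROWS_KEPT]}

print(sql_results, local_results)
```

In this sketch both approaches agree. Details such as how null values are counted are the kind of place where results from the two engines could differ slightly.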
To set the parameters to access analyzed data when using the Java engine, do the following:
Procedure
1. In the Analysis Parameter view of the column analysis editor, select Java from the Execution engine list.
2. Select the Allow drill down check box to store locally the data that will be analyzed by the current analysis.
This check box is usually selected by default.
3. In the Max number of rows kept per indicator field, enter the number of data rows you want to make accessible.
This field is set to 50 by default.
Results
You can now run your analysis and then have access to the analyzed data according to the set parameters.
To access a more detailed view of the analysis results of the procedures outlined in Defining the columns to be analyzed and setting
indicators and Finalizing and executing the column analysis, do the following:
Procedure
1. Click the Analysis Results tab at the bottom of the analysis editor to open the corresponding view.
2. Click the Analysis Result tab in the view and then the name of the analyzed column for which you want to open the detailed
results.
Note: The display of the Analysis Results view depends on the parameters you set in the Preferences window. For more
information, see Setting preferences of analysis editors and analysis results.
The detailed analysis results view shows the generated graphics for the analyzed columns accompanied by tables that
detail the statistical results.
Below are the tables that accompany the Frequency and Simple Statistics graphics in the Analysis Results view for the
analyzed email column.
In the Simple Statistics table, if an indicator value is displayed in red, this means that a threshold has been set on the
indicator in the column analysis editor and that this threshold has been violated. For further information about data
thresholds, see Setting options for system or user-defined indicators.
Below are the tables and the graphics representing the order of magnitude and the Benford's law statistics in the Analysis
Results view for the analyzed total_sales column.
For further information about Benford's law statistics, which are commonly used as an indicator of accounting and expense
fraud in lists or tables, see Fraud Detection.
3. Right-click any data row in the result tables and select View rows to access a view of the analyzed data.
After running your analysis using the SQL or the Java engine and from the Analysis Results view of the analysis editor, you can right-
click any of the rows in the statistic result tables and access a view of the actual data.
After running your analysis using the Java engine, you can use the analysis results to access a view of the actual data.
After running your analysis using the SQL engine, you can use the analysis results to open the Data Explorer perspective and access a
view of the actual data.
Prerequisite(s): You have selected the Profiling perspective in the studio. A column analysis has been created and executed.
1. At the bottom of the analysis editor, click the Analysis Results tab to open a detailed view of the analysis results.
2. Right-click a data row in the statistic results of any of the analyzed columns and select one of the following options:
Option        Operation
View rows     Open a view listing all data rows in the analyzed column.
              Note: For the Duplicate Count indicator, the View rows option lists
              all the rows that are duplicated. So if the duplicate count is 12, for
              example, this option lists 24 rows.
View values   Open a view listing the actual data values of the analyzed column.
Options other than the above listed ones are available when using regular expressions and SQL patterns in a column analysis.
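The relation between the Duplicate Count value and the number of rows listed by View rows can be illustrated with a small sketch. It assumes that Duplicate Count is the number of values that appear more than once, Unique Count the number of values that appear exactly once, and Distinct Count the number of different values. The sample data is hypothetical.

```python
from collections import Counter

# Hypothetical column content.
values = ["a", "b", "a", "c", "b", "d"]
occurrences = Counter(values)

distinct_count = len(occurrences)
unique_count = sum(1 for n in occurrences.values() if n == 1)
duplicate_count = sum(1 for n in occurrences.values() if n > 1)

# Rows a "View rows" on Duplicate Count would list: every occurrence
# of every duplicated value.
duplicated_rows = [v for v in values if occurrences[v] > 1]

print(distinct_count, unique_count, duplicate_count, len(duplicated_rows))
```

Here two values are duplicated and each occurs twice, so four rows are listed, in the same way that a duplicate count of 12 leads to 24 rows in the example above.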
When using the SQL engine, the view opens in the Data Explorer perspective listing the rows or the values of the analyzed data
according to the limits set in the data explorer.
If the Data Explorer perspective is missing from the studio, you must install certain SQL explorer libraries that are required for data
quality to work correctly, otherwise you may get an error message.
For further information about identifying and installing external modules, see the Talend Installation and Upgrade Guide.
This explorer view also gives some basic information about the analysis itself. Such information is of great help when working
with multiple analyses at the same time.
Warning: The data explorer does not support connections that have an empty user name, such as Single sign-on of MS SQL Server.
If you analyze data using such a connection and you try to view data rows and values in the Data Explorer perspective, a warning
message prompts you to set your connection credentials to the SQL Server.
When using the Java engine, the view opens in the studio listing the number of the analyzed data rows you set in the Analysis
parameters view of the analysis editor.
From this view, you can export the analyzed data into a csv file. To do that:
1. Click the icon in the upper left corner of the view. A dialog box opens.
2. Click Choose... and browse to where you want to store the csv file and give it a name.
3. Click OK to close the dialog box. A csv file is created in the specified place holding all the analyzed data rows listed in the
view.
You can use regular expressions or SQL patterns in column analyses. These expressions and patterns will help you define the
content, structure and quality of the data included in the analyzed columns.
For more information on regular expressions and SQL patterns, see Patterns and indicators and Steps to analyze database tables.
You can add to any column analysis one or more regular expressions or SQL patterns against which you can match the content of the
column to be analyzed.
Warning:
If the database you are using does not support regular expressions or if the query template is not defined in the studio, you must
first declare the user-defined function and define the query template before you can add any of the specified patterns to
the column analysis.
Procedure
1. In the Analyzed Columns view in the analysis editor, click the icon next to the name of the column to which you want to add a
regular expression or an SQL pattern, the email column in this example.
The Pattern Selector dialog box opens.
2. Expand Patterns and browse to the regular expressions and/or SQL patterns you want to add to the column analysis.
3. Select the check box(es) of the expression(s) or pattern(s) you want to add to the selected column.
Results
Prerequisite(s): You have selected the Profiling perspective in the studio. A column analysis is open in the analysis editor.
Procedure
1. In the Analyzed Columns view in the analysis editor, right-click the pattern you want to edit and select Edit pattern from the
contextual menu.
2. In the Pattern Definition view, edit the pattern definition, or change the selected database, or add other patterns specific to
available databases using the [+] button.
If the regular expression is simple enough to be used in all databases, select Default in the list.
When you edit a pattern through the analysis editor, you modify the pattern in the studio repository. Make sure that your
modifications are suitable for all other analyses that may be using the modified pattern.
When you add one or more patterns to an analyzed column, you check all existing data in the column against the specified
pattern(s). After the execution of the column analysis, using the Java or the SQL engine, you can access a list of all the valid/invalid
data in the analyzed column.
When you use the Java engine to run the analysis, the view of the actual data opens in the studio. If you use the SQL engine
to execute the analysis, the view of the actual data opens in the Data Explorer perspective.
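Checking a column against a pattern amounts to splitting its content into valid and invalid data. The sketch below shows the idea with a deliberately simple, hypothetical email expression and sample rows. It also assumes that a view of values lists each distinct value once, whereas a view of rows lists every matching row.

```python
import re

# Hypothetical, simplified email expression for illustration only.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Hypothetical rows: (id, email).
rows = [
    (1, "ann@example.com"),
    (2, "bob_at_example.com"),
    (3, "ann@example.com"),
    (4, "carl@example"),
]

valid_rows = [r for r in rows if EMAIL_PATTERN.fullmatch(r[1])]
invalid_rows = [r for r in rows if not EMAIL_PATTERN.fullmatch(r[1])]

# Assumption: the values views show each distinct value once.
valid_values = sorted({email for _, email in valid_rows})
invalid_values = sorted({email for _, email in invalid_rows})

print(len(valid_rows), len(invalid_rows), valid_values, invalid_values)
```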
Prerequisite(s):
You have installed in the studio the SQL explorer libraries that are required for data quality.
If you do not install these libraries, the Data Explorer perspective will be missing from the studio and many features will not be
available. For further information about identifying and installing external modules, see the Talend Installation and Upgrade Guide.
To view the actual data in the column analyzed against a specific pattern, do the following:
Procedure
1. Follow the steps outlined in Defining the columns to be analyzed and Adding a regular expression or an SQL pattern to a
column analysis to create a column analysis that uses a pattern.
4. Right-click the pattern line in the Pattern Matching table and select:
Option                      To...
View valid/invalid values   open a view of all valid/invalid values measured against the pattern used
                            on the selected column
View valid/invalid rows     open a view of all valid/invalid rows measured against the pattern used on
                            the selected column
Results
When using the SQL engine, the view opens in the Data Explorer perspective listing valid/invalid rows or values of the analyzed data
according to the limits set in the data explorer.
This explorer view also gives some basic information about the analysis itself. Such information is of great help when working
with multiple analyses at the same time.
The data explorer does not support connections that have an empty user name, such as Single sign-on of MS SQL Server. If you analyze
data using such a connection and you try to view data rows and values in the Data Explorer perspective, a warning message prompts
you to set your connection credentials to the SQL Server.
When using the Java engine, the view opens in the Profiling perspective of the studio listing the number of valid/invalid data
according to the row limit you set in the Analysis parameters view of the analysis editor. For more information, see Using the Java or
the SQL engine.
You can save the executed query and list it under the Libraries > Source Files folders in the DQ Repository tree view if you click the
save icon on the SQL editor toolbar. For more information, see Saving the queries executed on indicators.
From the studio and in the Data Explorer perspective, you can view the queries executed on different indicators used in an analysis.
From the data explorer, you will be able to save the query and list it under the Libraries > Source Files folders in the DQ Repository
tree view.
Prerequisite(s): You have selected the Profiling perspective in the studio. At least one analysis with indicators has been created.
To save any of the queries executed on an indicator set in a column analysis, do the following:
Procedure
1. In the column analysis editor, right-click any of the used indicators to open a contextual menu.
2. Select View executed query to open the data explorer on the query executed on the selected indicator.
If the Data Explorer perspective is missing from the studio, you must install certain SQL explorer libraries that are required for
data quality to work correctly, otherwise you may get an error message.
For further information about identifying and installing external modules, see the Talend Installation and Upgrade Guide.
Note: The data explorer does not support connections that have an empty user name, such as Single sign-on of MS SQL
Server. If you analyze data using such a connection and you try to view the executed queries in the Data Explorer
perspective, a warning message prompts you to set your connection credentials to the SQL Server.
3. Click the save icon on the editor toolbar to open the Select folder dialog box.
4. Select the Source Files folder or any sub-folder under it and enter in the Name field a name for the open query.
In the studio, you can use simplified ways to create one or multiple column analyses. All you need to do is start from the
table name or the column name under the relevant DB Connection folder in the DQ Repository tree view.
However, the options you have to create column analyses if you start from the table name are different from those you have if you
start from the column name.
To create a column analysis directly from the relevant table name in the DB Connection, do the following:
2. Browse to the table that holds the column(s) you want to analyze and right-click it.
Item                         To...
Match Analysis               open the match analysis editor where you can define match rules and
                             select the columns on which you want to use the match rules.
Table Analysis               analyze the selected table using SQL business rules.
Column Analysis              analyze all the columns included in the selected table using the Simple
                             Statistics indicators.
Pattern Frequency Analysis   analyze all the columns included in the selected table using the Pattern
                             Frequency Statistics along with the Row Count and the Null Count
                             indicators.
The above steps replace the procedures outlined in Defining the columns to be analyzed and setting indicators. You then proceed
with the steps outlined in Finalizing and executing the column analysis.
To create a column analysis directly from the column name in the DB Connection, do the following:
Item                         To...
                             you must later set the indicators you want to use to analyze the selected
                             column.
Nominal Value Analysis       create a column analysis on nominal data preconfigured with indicators
                             appropriate for nominal data, namely Value Frequency, Simple
Simple Analysis              analyze the selected column using the Simple Statistics indicators.
Pattern Frequency Analysis   analyze the selected column using the Pattern Frequency Statistics along
                             with the Row Count and the Null Count indicators.
Analyze Column Set           perform an analysis on the content of a set of columns. This analysis
                             focuses on a column set (full records) and not on separate columns as is
                             the case with the column analysis.
                             For more information, see Creating a simple table analysis (Column Set
                             Analysis).
Analyze Correlation          perform column correlation analyses between nominal and interval
                             columns or nominal and date columns in database tables.
Analyze matches              open the match analysis editor where you can define match rules and
                             select the columns on which you want to use the match rules.
The above steps replace one or both of the procedures outlined in Defining the columns to be analyzed and setting indicators.
You then proceed with the same steps outlined in Finalizing and executing the column analysis.
You can create a column analysis on a delimited file and execute the created analyses using the Java engine.
Procedure
The first step in analyzing the content of one or multiple columns is to define the column(s) to be analyzed. The analysis results
provide statistics about the values within each column.
When you analyze Date columns and run the analysis with the Java engine, the date information is stored in Talend
Studio as a regular date/time in the format YYYY-MM-DD HH:mm:ss.SSS for date/timestamp values and in the format HH:mm:ss.SSS
for time values. The date and time formats are slightly different when you run the analysis with the SQL engine.
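To make the two formats concrete, the following sketch renders a timestamp and a time with millisecond precision. The helper names and sample values are hypothetical. The sketch only illustrates the layout of the formats, not how the studio stores the data.

```python
from datetime import datetime, time


def format_timestamp(value: datetime) -> str:
    """Render as YYYY-MM-DD HH:mm:ss.SSS (milliseconds, not microseconds)."""
    return value.strftime("%Y-%m-%d %H:%M:%S.") + f"{value.microsecond // 1000:03d}"


def format_time(value: time) -> str:
    """Render as HH:mm:ss.SSS."""
    return value.strftime("%H:%M:%S.") + f"{value.microsecond // 1000:03d}"


stamp = format_timestamp(datetime(2022, 7, 26, 9, 49, 5, 123000))
clock = format_time(time(9, 49, 5, 7000))
print(stamp)
print(clock)
```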
Before you begin, make sure you have defined at least one connection to a delimited file in the Profiling perspective of Talend Studio.
Procedure
3. In the filter field, start typing basic column analysis, select Basic Column Analysis and click Next.
4. In the Name field, enter a name for the current column analysis.
Note:
Avoid using the following special characters in item names:
"~", "!", "`", "#", "^", "&", "*", "\\", "/", "?", ":", ";", "\"", ".", "(", ")", "'", "¥", "'", """, "«", "»", "<", ">".
These characters are all replaced with "_" in the file system and you may end up creating duplicate items.
5. Set column analysis metadata (purpose, description and author name) in the corresponding fields and click Next to proceed
to the next step.
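The risk of duplicate items mentioned in the note for the Name field can be shown with a short sketch. The function name is hypothetical, and the character set below is limited to the characters from the note that are unambiguous in this export.

```python
# Subset of the special characters listed in the note (assumption: the
# remaining quote characters in the list are handled the same way).
SPECIAL_CHARACTERS = set('~!`#^&*\\/?:;".()\'¥«»<>')


def file_system_name(item_name: str) -> str:
    """Replace every special character with an underscore."""
    return "".join("_" if ch in SPECIAL_CHARACTERS else ch for ch in item_name)


first = file_system_name("sales.v1")
second = file_system_name("sales:v1")
print(first, second, first == second)
```

Two different item names end up with the same file system name, which is how duplicate items can appear.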
Procedure
1. Expand FileDelimited connections and in the desired file, browse to the columns you want to analyze.
In this example, you want to analyze the id, first_name and age columns from the selected connection.
2. Select the columns and then click Finish to close the wizard.
A file for the newly created column analysis is listed under the Analysis node in the DQ Repository tree view, and the analysis
editor opens with the analysis metadata.
4. In the Limit field, set the number for the data records you want to display in the table and use as sample data.
For example, 50 records.
5. Select n first rows to list the first 50 records from the selected columns.
6. In the Analyzed Columns view, use the arrows in the top right corner to open different pages in the view if you analyze a large
number of columns.
You can also drag the columns to be analyzed directly from the DQ Repository tree view to the Analyzed Columns list in the analysis editor.
7. Use the delete, move up or move down buttons to manage the analyzed columns.
8. If required, right-click any of the listed columns in the Analyzed Columns view and select Show in DQ Repository view to
locate it in the database connection in the DQ Repository tree view.
The second step after defining the columns to be analyzed is to set statistics indicators for each of the defined columns.
Note:
You can also use Java user-defined indicators when analyzing columns in a delimited file on the condition that a Java user-
defined indicator is already created.
Procedure
2. From the Data preview view in the analysis editor, click Select indicators to open the Indicator Selection dialog box.
Note: You can set the text statistics indicators on a column only if its data mining type is set to nominal. Otherwise, these
indicators will be grayed out in the Indicator Selection dialog box.
The selected indicators are attached to the analyzed columns in the Analyzed Columns view.
Procedure
1. Follow the procedures outlined in Defining the columns to be analyzed and Setting indicators on columns.
2. In the Analyzed Columns view in the analysis editor, click the option icon next to the indicator.
3. In the dialog box that opens, set the parameters for the given indicator.
Indicator settings dialog boxes differ according to the parameters specific to each indicator. For more information about
the different indicator parameters, see Indicator parameters.
You can add one or more regular expressions to one or more of the analyzed columns.
An analysis of a delimited file is open in the analysis editor in the Profiling perspective of Talend Studio.
Procedure
1. Define the regular expression you want to add to the analyzed column.
In this example, the regular expression checks for all words that start with uppercase.
2. Add the regular expression to the analyzed column in the open analysis editor, the first_name column in this example.
After running your file analysis using the Java engine and from the Analysis Results view of the analysis editor, you can right-click
any of the rows in the statistic result tables and access a view of the actual data.
1. At the bottom of the analysis editor, click the Analysis Results tab to open a detailed view of the analysis results.
2. Right-click a data row in the statistic results of any of the analyzed columns and select one of the following options:
Option        Operation
View rows     Open a view listing all data rows in the analyzed column.
              Note: For the Duplicate Count indicator, the View rows option lists
              all the rows that are duplicated. So if the duplicate count is 12, for
              example, this option lists 24 rows.
View values   Open a view listing the actual data values of the analyzed column.
The options View rows and View values are not available for Row Count. You can preview the data in Analysis Settings > Data
Preview.
Option                      Operation
View valid/invalid rows     open a view on a list of all valid/invalid rows measured against a pattern.
View valid/invalid values   open a view on a list of all valid/invalid values measured against a
                            pattern.
From this view, you can export the analyzed data into a csv file. To do that:
1. Click the icon in the upper left corner of the view. A dialog box opens.
2. Click Choose... and browse to where you want to store the csv file and give it a name.
3. Click OK to close the dialog box. A csv file is created in the specified place holding all the analyzed data rows listed in the
view.
You can profile data in a delimited file in a simplified way. All you need to do is start from the column name under
Metadata > FileDelimited folders in the DQ Repository tree view.
For further information, see Creating analyses from table or column names.
This analysis enables you to analyze numerical data. It creates a column analysis in which indicators, appropriate for numeric data,
are assigned to the column by default.
Discrete data can take only particular values, of which there may be an infinite number. Continuous data is the opposite of discrete
data: it is not restricted to defined separate values, but can take any value over a continuous range.
This analysis uses the Bin Frequency indicator that you must configure further in order to convert continuous data into discrete bins
(ranges) according to your needs.
Prerequisite(s): At least one database connection is set in the Profiling perspective of the studio. For further information, see
Connecting to a database.
Procedure
1. In the DQ Repository tree view, expand Metadata and browse to the numerical column you want to analyze.
2. Right-click the numerical column and select Column Analysis > Discrete data Analysis.
In this example, you want to convert customer age into a number of discrete bins, or range of age values.
The New Analysis wizard opens.
Note:
Avoid using the following special characters in item names:
"~", "!", "`", "#", "^", "&", "*", "\\", "/", "?", ":", ";", "\"", ".", "(", ")", "'", "¥", "'", """, "«", "»", "<", ">".
These characters are all replaced with "_" in the file system and you may end up creating duplicate items.
5. Double-click the Bin Frequency indicator to open the Indicator settings dialog box.
6. Set the minimum and maximum values of the bins and the number of bins in the corresponding fields.
If you set the number of bins to 0 , no bin is created. The indicator computes the frequency of each value of the
column.
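The effect of these three settings can be sketched in a few lines of Python. This is only an illustration, not the studio's implementation: the function name bin_frequency, the equal-width bins, the half-open intervals and the handling of out-of-range values are all assumptions made for the example.

```python
from collections import Counter


def bin_frequency(values, minimum, maximum, num_bins):
    """Count values per equal-width bin; with num_bins == 0, count each distinct value."""
    if num_bins == 0:
        # No bin is created: return the frequency of each value of the column.
        return dict(Counter(values))

    width = (maximum - minimum) / num_bins
    counts = {}
    for i in range(num_bins):
        low = minimum + i * width
        counts[(low, low + width)] = 0

    for v in values:
        if v < minimum or v > maximum:
            continue  # assumption: values outside [minimum, maximum] are ignored
        # The maximum itself is placed in the last bin.
        index = min(int((v - minimum) / width), num_bins - 1)
        low = minimum + index * width
        counts[(low, low + width)] += 1
    return counts


ages = [28, 31, 39, 40, 52, 52, 64, 75]  # hypothetical customer ages
by_bin = bin_frequency(ages, minimum=28, maximum=76, num_bins=4)
by_value = bin_frequency(ages, minimum=28, maximum=76, num_bins=0)
```

With four bins between 28 and 76, each bin is 12 years wide, so the ages 28, 31 and 39 fall into the first bin. With the number of bins set to 0, the result is a plain frequency count per age.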
Procedure
The analysis creates age ranges with a limited and discrete set of possible values out of an unlimited, continuous range of age
values.
2. Right-click any data row in the result tables or in the charts, the first age range in this example, and select View rows to
access a view of the analyzed data.
The SQL Editor opens listing all customers whose age is between 28 and 39.
When you create a column analysis in the studio, you can see a Datamining Type box next to each of the columns you want to
analyze. The selected type in the box represents the data mining type of the associated column.
These data mining types help the studio to choose the appropriate metrics for the associated column since not all indicators (or
metrics) can be computed on all data types.
Available data mining types are: Nominal, Interval, Unstructured Text and Other. The sections below describe these data mining
types.
Nominal
Nominal data is categorical data whose values/observations can be assigned a code in the form of a number, where the numbers are
simply labels. You can count, but not order or measure nominal data.
In the studio, the mining type of textual data is set to nominal. For example, a column called WEATHER with the values: sun , cloud
and rain is nominal.
A column called POSTAL_CODE that has the values 52200 and 75014 is nominal as well, in spite of the numerical values. Such
data is of nominal type because it identifies a postal code in France. Computing mathematical quantities such as the average on
such data makes no sense. In such a case, you should set the data mining type of the column to Nominal yourself, because there is
currently no way in the studio to automatically guess the correct type of data.
The same is true for primary or foreign-key data. Keys are most of the time represented by numerical data, but their data mining
type is Nominal.
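To see why the average is meaningless for such a column, consider this small Python sketch. The list of postal codes is made up for the illustration; only the two code values come from the example above.

```python
from collections import Counter
from statistics import mean

postal_codes = [52200, 75014, 75014]  # hypothetical column content

# Counting is meaningful for nominal data: how often does each code occur?
frequencies = Counter(postal_codes)

# The average can be computed, but the result does not identify any postal code.
average = mean(postal_codes)
```

The frequency count tells you that 75014 occurs twice, which is useful information. The average is a number that matches none of the codes in the column.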
Interval
This data mining type is used for numerical data and time data. Averages can be computed on this kind of data. In databases,
sometimes numerical quantities are stored in textual fields.
In the studio, it is possible to declare the data mining type of a textual column (e.g. a column of type VARCHAR ) as Interval. In that
case, the data should be treated as numerical data and summary statistics should be available.
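As an illustration of what treating a textual column as Interval implies, the following Python sketch parses text values into numbers before computing summary statistics. The column content, the to_number helper and the choice to skip unparsable values are assumptions for the example and do not describe how the studio handles such values.

```python
from statistics import mean

# Hypothetical content of a VARCHAR column that stores numerical quantities as text.
raw_amounts = ["12.5", "7", " 30.25 ", None, "n/a"]


def to_number(text):
    """Return the numeric value of a text field, or None if it cannot be parsed."""
    if text is None:
        return None
    try:
        return float(text.strip())
    except ValueError:
        return None


numbers = [n for n in map(to_number, raw_amounts) if n is not None]
summary = {
    "count": len(numbers),
    "min": min(numbers),
    "max": max(numbers),
    "mean": mean(numbers),
}
```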
Unstructured text
This is a new data mining type introduced by the studio. This data mining type is dedicated to handling unstructured textual data.
For example, the data mining type of a column called COMMENT that contains commentary text cannot be Nominal, since the text in
it is unstructured. Still, you may be interested in seeing the duplicate values of such a column, hence the need for this
new data mining type.
Other
This is another new data mining type introduced in the studio. This type designates the data that the studio does not yet know
how to handle.
When masking data using Talend Data Preparation or the tDataMasking component, each of the characters in the input data is
masked to a character from the same character type, within the supported Unicode ranges.
When creating column analyses in Talend Studio, you can use the East Asia Pattern Frequency or East Asia Pattern Low Frequency
indicators for Asian characters, to define the content, structure and quality of your data.
The following table describes the supported character types and the related Unicode ranges (version 11.0).
For more information, see the documentation for the Unicode Standard (http://unicode.org/standard/standard.html) and the
character code charts (http://www.unicode.org/charts/).
Character type        Unicode range                 Characters
Full-width Katakana   [30A1-30FA] 30FC 30FD 30FE    [ァ-ヺ] ー ヽ ヾ
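The Full-width Katakana row can be read as a membership test on Unicode code points. The Python sketch below checks whether a character belongs to that character type. The function name is hypothetical and the code only covers the one row shown here, not the other character types.

```python
# Code points from the Full-width Katakana row: range 30A1-30FA plus 30FC, 30FD and 30FE.
KATAKANA_RANGE = range(0x30A1, 0x30FA + 1)
KATAKANA_EXTRA = {0x30FC, 0x30FD, 0x30FE}


def is_full_width_katakana(char):
    """Return True if the single character falls in the Full-width Katakana ranges."""
    code = ord(char)
    return code in KATAKANA_RANGE or code in KATAKANA_EXTRA
```

For example, is_full_width_katakana("ァ") is true, while a Hiragana character such as "あ" or the middle dot "・" (30FB) falls outside the listed ranges.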
Different profiling results when running column analyses with the Java and the SQL engines
The profiling results of column analyses that use the Week Frequency and Week Low Frequency indicators may be different between
the Java and the SQL engines.
Environment
Talend Open Studio for Data Quality and all platform Studios with data quality.
Description
You may get different results when you run column analysis with Week Frequency and Week Low Frequency indicators using the
Java or the SQL engine.
This is because date functions may differ between database management systems (DBMS), or even between different
installations of the same DBMS.
Take the WEEK(date [,mode]) function of MySQL as an example; for further information, see Date and Time Functions . This
function returns the week number for date . Its two-argument form enables you to specify whether:
the week starts on Sunday or Monday, and
the return value is in the range from 0 to 53 or from 1 to 53.
With the SQL engine, Talend Studio uses the function WEEK(date) . The mode argument is omitted and thus the studio uses the
default mode value as set in the DBMS configuration, in general mode=0 , but that depends on your MySQL installation.
If you need to change this default behavior, you can create a UDI (User Defined Indicator) in which you specify the mode
you want to use in the SQL query template.
With the Java engine, Talend Studio uses Locale.getDefault() to determine these two settings and gets
the results from the Java API. This means it uses the locale of the studio and not the locale of the DBMS.
This explains why you may get different profiling results between the Java and the SQL engines for the Week Frequency and Week
Low Frequency indicators and for other date functions.
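The following Python sketch shows how the same date gets a different week number under two common conventions. It uses neither MySQL nor the Java API; it only illustrates, with Python's standard library, the kind of difference that the week start day and the numbering range can produce.

```python
from datetime import date

d = date(2022, 1, 1)  # a Saturday

# Convention 1: Sunday is the first day of the week, and the days of a new year
# that precede the first Sunday are in week 0 (range 0 to 53).
week_sunday_first = int(d.strftime("%U"))

# Convention 2: ISO 8601, Monday is the first day of the week and weeks are
# numbered from 1 to 53, so early January can belong to the last week of the previous year.
week_iso = d.isocalendar()[1]
```

Under the first convention the date falls in week 0, under the second in week 52 of the previous year. A Week Frequency count grouped by these numbers would therefore put the same row in different groups.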
Correlation analyses
Talend Studio enables you to explore relationships and correlations between two or more columns. These
relationships and correlations give a new interpretation of the data by describing how data values are correlated at different
positions.
Note: Column correlation analyses are possible only on database columns for the time being. You cannot use these analyses on
file connections.
It is very important to make the distinction between column correlation analyses and all other types of data quality analyses.
Column correlation analyses are usually used to explore relationships and correlations in data and not to provide statistics about the
quality of data.
Several types of column correlation analysis are possible. For more information, see Creating a numerical correlation analysis,
Creating a time correlation analysis and Creating a nominal correlation analysis.
For more information about the use of data mining types in Talend Studio, see Data mining types.
This type of analysis analyzes the correlation between nominal and interval columns and gives the result in a bubble chart.
A bubble chart is created for each selected numeric column. In a bubble chart, each bubble represents a distinct value of the
nominal column. For example, a nominal column called outlook with 3 distinct nominal instances: sunny (11 records), rainy (16
records) and overcast (4 records) will generate a bubble chart with 3 bubbles.
The second column in this example is the temperature column where temperature is in degrees Celsius. The analysis in this example
will show the correlation between the outlook and the temperature columns and will give the result in a bubble chart. The vertical
axis represents the average of the numeric column and the horizontal axis represents the number of records of each nominal
instance. The average temperature would be 23.273 for the "sunny" instances, 7.5 for the "rainy" instances and 18.5 for the
"overcast" instances.
The two things to pay attention to in such a chart are the position of the bubble and its size.
Usually, outlier bubbles must be further investigated. The nearer the bubble is to the left axis, the less confident we are in the
average of the numeric column. For example, the overcast nominal instance here has only 4 records, hence the bubble is near the
left axis. We cannot be confident in the average with only 4 records. When looking for data quality issues, these bubbles could
indicate problematic values.
The bubbles near the top of the chart and those near the bottom of the chart may suggest data quality issues too. An average
temperature that is too high or too low could indicate a faulty temperature measurement.
The size of the bubble represents the number of null numeric values. The more null values there are in the interval column, the
bigger the bubble.
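The three quantities behind each bubble can be sketched in Python as follows. The rows are hypothetical, the function and key names are made up for the illustration, and the assumption that the record count includes rows with a null numeric value is not confirmed by this guide.

```python
from collections import defaultdict

# Hypothetical (outlook, temperature) rows; None stands for a null temperature.
rows = [
    ("sunny", 25.0), ("sunny", 21.0), ("sunny", None),
    ("rainy", 8.0), ("rainy", 7.0),
    ("overcast", 18.5),
]


def bubble_data(rows):
    """Compute position and size of one bubble per distinct nominal value."""
    groups = defaultdict(list)
    for nominal, numeric in rows:
        groups[nominal].append(numeric)

    bubbles = {}
    for nominal, values in groups.items():
        present = [v for v in values if v is not None]
        bubbles[nominal] = {
            # Horizontal axis: number of records of the nominal instance.
            "x_record_count": len(values),
            # Vertical axis: average of the numeric column, nulls excluded.
            "y_average": sum(present) / len(present) if present else None,
            # Bubble size: number of null numeric values.
            "size_null_count": len(values) - len(present),
        }
    return bubbles


bubbles = bubble_data(rows)
```

In this made-up data set, the overcast bubble sits nearest to the left axis because it is backed by a single record, and the sunny bubble is the largest because it has one null temperature.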
When several nominal columns are selected, the order of the columns plays a crucial role in this analysis. A series of bubbles (with
one color) is displayed for the average temperature and the weather. Another series of bubbles is displayed for the average
In the example below, you want to create a numerical correlation analysis to compute the average age of the personnel of different
enterprises located in different states. Three database columns are used for the analysis: STATE, AGE and COMPANY.
Note: The numerical correlation analysis is possible only on database columns for the time being. You cannot use this analysis
on file connections.
Procedure
1. In the DQ Repository tree view, expand Data Profiling.
3. Start typing numerical correlation analysis in the filter field, select Numerical Correlation Analysis and then click Next.
Note: Avoid using the following special characters in item names:
"~", "!", "`", "#", "^", "&", "*", "\\", "/", "?", ":", ";", "\"", ".", "(", ")", "'", "¥", "‘", "”", "«", "»", "<", ">".
These characters are all replaced with "_" in the file system and you may end up creating duplicate items.
5. Set the analysis metadata (purpose, description and author name) in the corresponding fields and then click Finish.
Results
A folder for the newly created analysis is listed under Analysis in the DQ Repository tree view, and the analysis editor opens on the
analysis metadata.
Selecting the columns you want to analyze and setting analysis parameters
Procedure
1. In the analysis editor and from the Connection list, select the database connection on which to run the analysis.
The numerical correlation analysis is possible only on database columns for the time being. You can change your database
connection by selecting another connection from the Connection list. If the analyzed columns do not exist in the new
database connection you want to set, you receive a warning message that enables you to continue or cancel the operation.
3. Browse the catalogs/schemas in your database connection to the column(s) you want to analyze.
You can filter the table or column lists by typing the desired text in the Table filter or Column filter fields respectively. The lists
will show only the tables/columns that correspond to the text you type in.
4. Click the table name to list all its columns in the right-hand panel of the Column Selection dialog box.
5. In the column list, select the check boxes of the column(s) you want to analyze and click OK.
In this example, you want to compute the age average of the personnel of different enterprises located in different states.
Then the columns to be analyzed are AGE, COMPANY and STATE.
You can drag the columns to be analyzed directly from the corresponding database connection in the DQ Repository tree
view into the Analyzed Columns area.
If you right-click any of the listed columns in the Analyzed Columns view and select Show in DQ Repository view, the selected
column will be automatically located under the corresponding connection in the tree view.
The selected columns are displayed in the Analyzed Columns view of the analysis editor.
6. In the Indicators view, click to open a dialog box where you can set thresholds for each indicator.
The indicators representing the simple statistics are attached by default to this type of analysis.
7. In the Data Filter view, enter an SQL WHERE clause to filter the data on which to run the analysis, if required.
8. In the Analysis Parameter view and in the Number of connections per analysis field, set the number of concurrent
connections allowed per analysis to the selected database connection, if required.
You can set this number according to the resources available to the database, that is, the number of concurrent connections
each database can support.
9. If you have defined context variables in the Contexts view in the analysis editor, complete the following steps:
a. Use the Data Filter and Analysis Parameter views to set/select context variables to filter data and to decide the
number of concurrent connections per analysis respectively.
b. In the Context Group Settings view, select from the list the context environment you want to use to run the analysis.
For more information about contexts and variables, see Using context variables in analyses.
Results
The editor switches to the Analysis Results view showing the results.
For more information about the analysis results, see Exploring the results of the numerical correlation analysis.
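To get a feel for what a Data Filter clause does to such an analysis, the Python sketch below runs a grouped average on an in-memory SQLite table. The table name personnel, the sample rows and the query itself are assumptions for the illustration; this is not the query that the studio generates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE personnel (STATE TEXT, COMPANY TEXT, AGE INTEGER)")
conn.executemany(
    "INSERT INTO personnel VALUES (?, ?, ?)",
    [
        ("CA", "Acme", 30), ("CA", "Acme", 40),
        ("CA", "Globex", 50), ("NY", "Acme", 20),
        ("NY", None, 70), ("NY", "Acme", None),
    ],
)

# The kind of condition you could type in the Data Filter view (without the WHERE keyword).
data_filter = "AGE IS NOT NULL AND COMPANY IS NOT NULL"

query = (
    "SELECT STATE, COMPANY, COUNT(*), AVG(AGE) FROM personnel "
    f"WHERE {data_filter} GROUP BY STATE, COMPANY ORDER BY STATE, COMPANY"
)
result = conn.execute(query).fetchall()
conn.close()
```

Here the filter removes the row without a company name and the row without an age before the averages are computed, so only three groups remain in the result.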
Prerequisite(s): A numerical correlation analysis is defined and executed in the Profiling perspective of Talend Studio.
Procedure
1. In the Analysis Results view of the analysis editor, click Graphics, Simple Statistics or Data to show the generated graphic, the
number of the analyzed records or the actual analyzed data respectively.
In the Graphics view, the data plotted in the bubble chart have different colors, and the legend indicates which color
refers to which data.
The nearer the bubble is to the left axis, the less confident we are in the average of the numeric column. For the selected
bubble in the above example, the company name is missing and there are only two data records, hence the bubble is near
the left axis. We cannot be confident about the average age with only two records. When looking for data quality issues, these
bubbles could indicate problematic values.
The bubbles near the top of the chart and those near the bottom of the chart may suggest data quality issues too: in the
above example, an average age that is too high or too low.
2. From the generated graphic, you can perform the following actions:
clear the check box of the value(s) you want to hide in the bubble chart,
place the pointer on any of the bubbles to see the correlated data values at that position,
right-click any of the bubbles and select:
Option To...
Results
The figure below illustrates an example of the SQL editor listing the correlated data values at the selected position.
From the SQL editor, you can save the executed query and list it under the Libraries > Source Files folders in the DQ Repository tree
view if you click the save icon on the editor toolbar. For more information, see Saving the queries executed on indicators.
The Simple Statistics view shows the number of the analyzed records falling in certain categories, including the number of rows, the
number of distinct and unique values and the number of duplicates.
You can sort the data listed in the result table by simply clicking any column header in the table.
This type of analysis analyzes the correlation between nominal and date columns and gives the result in a Gantt chart that illustrates
the start and finish dates of each value of the nominal column.
In the example below, you want to create a time correlation analysis to compute the minimal and maximal birth dates for each listed
country in the selected nominal column. Two columns are used for the analysis: birthdate and country.
Note: The time correlation analysis is possible only on database columns for the time being. You cannot use this analysis on file
connections.
Procedure
3. Start typing time correlation analysis in the filter field, select Time Correlation Analysis and click Next.
Note: Avoid using the following special characters in item names:
"~", "!", "`", "#", "^", "&", "*", "\\", "/", "?", ":", ";", "\"", ".", "(", ")", "'", "¥", "‘", "”", "«", "»", "<", ">".
These characters are all replaced with "_" in the file system and you may end up creating duplicate items.
5. Set the analysis metadata (purpose, description and author name) in the corresponding fields and then click Finish.
A folder for the newly created analysis is listed under Analysis in the DQ Repository tree view, and the analysis editor opens
on the analysis metadata.
Selecting the columns for the time correlation analysis and setting analysis parameters
Procedure
1. In the analysis editor and from the Connection list, select the database connection on which to run the analysis.
The time correlation analysis is possible only on database columns for the time being. You can change your database
connection by selecting another connection from the Connection list. If the analyzed columns do not exist in the new
database connection you want to set, you receive a warning message that enables you to continue or cancel the operation.
2. Click Select Columns to open the Column Selection dialog box and select the columns, or drag them directly from the DQ
Repository tree view into the Analyzed Columns view.
If you right-click any of the listed columns in the Analyzed Columns view and select Show in DQ Repository view, the selected
column will be automatically located under the corresponding connection in the tree view.
3. If required, click in the Indicators view to open a dialog box where you can set thresholds for each indicator.
The indicators representing the simple statistics are attached by default to this type of analysis.
4. In the Data Filter view, enter an SQL WHERE clause to filter the data on which to run the analysis, if required.
5. In the Analysis Parameter view and in the Number of connections per analysis field, set the number of concurrent
connections allowed per analysis to the selected database connection, if required.
You can set this number according to the resources available to the database, that is, the number of concurrent connections
each database can support.
For a detailed explanation of the analysis results, see Exploring the results of the time correlation analysis.
Prerequisite(s): A time correlation analysis is defined and executed in the Profiling perspective of the studio.
Procedure
In the Analysis Results view of the analysis editor, click Graphics, Simple Statistics or Data to show the generated graphic, the
number of the analyzed records or the actual analyzed data respectively.
What to do next
In the Graphics view, you can clear the check box of the value(s) you want to hide in the chart.
The Gantt chart displays a range showing the minimal and maximal birth dates for each country listed in the selected nominal
column. It also highlights the range bars that contain null values for birth dates.
For example, in the above chart, the minimal birth date for Mexico is 1910 and the maximal is 2000. Of all the data records where
the country is Mexico, 41 records have a null value as birth date.
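The values behind each range bar can be sketched in Python as a minimum, a maximum and a null count per nominal value. The rows and the function name date_ranges are hypothetical; the sketch only mirrors what the chart displays, not how the studio computes it.

```python
from datetime import date

# Hypothetical (country, birthdate) rows; None stands for a null birth date.
rows = [
    ("Mexico", date(1910, 5, 1)), ("Mexico", date(2000, 3, 9)), ("Mexico", None),
    ("Canada", date(1950, 1, 15)), ("Canada", date(1975, 8, 30)),
]


def date_ranges(rows):
    """Return the minimal date, maximal date and null count for each country."""
    ranges = {}
    for country, birthdate in rows:
        entry = ranges.setdefault(country, {"min": None, "max": None, "nulls": 0})
        if birthdate is None:
            entry["nulls"] += 1
            continue
        if entry["min"] is None or birthdate < entry["min"]:
            entry["min"] = birthdate
        if entry["max"] is None or birthdate > entry["max"]:
            entry["max"] = birthdate
    return ranges


ranges = date_ranges(rows)
```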
You can also select a specific birth date range to show: place the pointer at the start nominal value you want to show and drag it
to the end nominal value you want to show.
clear the check box of the value(s) you want to hide in the chart,
place the pointer on any of the range bars to display the correlated data values at that position,
Option To...
The figure below illustrates an example of the SQL editor listing the correlated data values at the selected range bar.
From the SQL editor, you can save the executed query and list it under the Libraries > Source Files folders in the DQ Repository tree
view if you click the save icon on the editor toolbar. For more information, see Saving the queries executed on indicators.
The Simple Statistics view shows the number of the analyzed records falling in certain categories, including the number of rows, the
number of distinct and unique values and the number of duplicates.
You can sort the data listed in the result table by simply clicking any column header in the table.
This type of analysis analyzes minimal correlations between nominal columns in the same table and gives the result in a chart.
In the chart, each column will be represented by a node that has a given color. The correlations between the nominal values are
represented by lines. The thicker the line is, the weaker the association is. Thicker lines can identify problems or correlations that
need special attention. However, you can always invert the edge weight, that is, give larger edge thickness to higher correlation, by
selecting the Inverse Edge Weight check box below the nominal correlation chart.
The correlations in the chart are always pairwise correlations: they show associations between pairs of columns.
In the example below, you want to create a nominal correlation analysis to see the relationship between the number of married or
single people and the country they live in. Two columns are used for the analysis: country and marital-status.
Note: The nominal correlation analysis is possible only on database columns for the time being. You cannot use this analysis on
file connections.
Procedure
3. Start typing nominal correlation analysis in the filter field, select Nominal Correlation Analysis and then click Next.
Note: Avoid using the following special characters in item names:
"~", "!", "`", "#", "^", "&", "*", "\\", "/", "?", ":", ";", "\"", ".", "(", ")", "'", "¥", "‘", "”", "«", "»", "<", ">".
These characters are all replaced with "_" in the file system and you may end up creating duplicate items.
5. Set the analysis metadata (purpose, description and author name) in the corresponding fields and then click Finish.
A folder for the newly created analysis is listed under Analysis in the DQ Repository tree view, and the analysis editor opens
on the analysis metadata.
Procedure
1. In the analysis editor and from the Connection list, select the database connection on which to run the analysis.
The nominal correlation analysis is possible only on database columns for the time being. You can change your database
connection by selecting another connection from the Connection list. If the analyzed columns do not exist in the new
database connection you want to set, you receive a warning message that enables you to continue or cancel the operation.
2. Click Select Columns to open the Column Selection dialog box and select the columns you want to analyze, or drag them
directly from the DQ Repository tree view.
If you select too many columns, the analysis result chart will be very difficult to read.
You can right-click any of the listed columns in the Analyzed Columns view and select Show in DQ Repository view to locate
the selected column under the corresponding connection in the tree view.
3. If required, click in the Indicators view to open a dialog box where you can set thresholds for each indicator.
The indicators representing the simple statistics are attached by default to this type of analysis.
4. In the Data Filter view, enter an SQL WHERE clause to filter the data on which to run the analysis, if required.
5. In the Analysis Parameter view and in the Number of connections per analysis field, set the number of concurrent
connections allowed per analysis to the selected database connection, if required.
You can set this number according to the resources available to the database, that is, the number of concurrent connections
each database can support.
The editor switches to the Analysis Results view showing the results.
For a detailed explanation of the analysis results, see Exploring the results of the nominal correlation analysis.
Prerequisite(s): A nominal correlation analysis is defined and executed in the Profiling perspective of the studio.
Procedure
Click Graphics, Simple Statistics or Data to show the generated graphic, the number of the analyzed records or the actual data
respectively.
Results
The Graphics view shows the generated graphic for the analyzed columns.
To have a better view of the graphical result of the nominal correlation analysis, right-click the graphic in the Graphics panel and
select Show in full screen.
In the above chart, each value in the country and marital-status columns is represented by a node that has a given color.
Nominal correlation analysis is carried out to see the relationship between the number of married or single people and the country
they live in. Correlations are represented by lines: the thicker the line is, the higher the association is, provided that the Inverse
Edge Weight check box is selected.
The buttons below the chart help you manage the chart display. The following table describes these buttons and their usage:
Button                Description
Filter Edge Weight    Move the slider to the right to filter out edges with small weight and visualize the more important edges.
plus and minus        Click the [+] or [-] button to zoom in or zoom out of the chart, respectively.
Inverse Edge Weight   By default, the thicker the line is, the weaker the correlation is. Select this check box to invert the
                      current edge weight, that is, give larger edge thickness to higher correlation.
Picking               Select this check box to be able to pick any node and drag it anywhere in the chart.
Restore Layout        Click this button to restore the chart to its previously saved layout.
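The idea of weighted edges between nominal values, and of filtering out the edges with small weight, can be sketched in Python as follows. How the studio actually computes the edge weight is not described in this guide; the sketch assumes, purely for illustration, that the weight is the number of rows in which two values occur together. The rows and the function name filter_edges are hypothetical.

```python
from collections import Counter

# Hypothetical (country, marital-status) rows.
rows = [
    ("France", "married"), ("France", "married"), ("France", "single"),
    ("Mexico", "single"), ("Mexico", "single"), ("Mexico", "single"),
    ("Mexico", "married"),
]

# Assumption: the edge weight is the number of rows where both values occur together.
edges = Counter(rows)


def filter_edges(edges, minimum_weight):
    """Keep only the edges whose weight is at least minimum_weight."""
    return {pair: weight for pair, weight in edges.items() if weight >= minimum_weight}


important = filter_edges(edges, minimum_weight=2)
```

Raising minimum_weight plays the role of moving the Filter Edge Weight slider to the right: only the stronger co-occurrences, here France with married and Mexico with single, remain.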
The Simple Statistics view shows the number of the analyzed records falling in certain categories, including the number of rows, the
number of distinct and unique values and the number of duplicates.
You can sort the data listed in the result table by simply clicking any column header in the table.
Patterns are sets of strings against which you can match the content of the columns to be analyzed.
Pattern types
Two types of patterns are listed under the Patterns folder in the DQ Repository tree view: regular expressions and SQL patterns.
Regular expressions (regex) are predefined patterns that you can use to search and manipulate text in the databases to which you
connect. You can also create your own regular expressions and use them to analyze columns.
When selecting a pattern in a Job, the regular expression for the current database type is used:
If the regular expression does not exist for this database type, the default regular expression in the selected pattern is used.
If you remove the regular expression for this database type in a pattern that is used in Jobs, the Jobs are updated with the
default regular expression in the selected pattern.
SQL patterns are a kind of personalized patterns that are used in SQL queries. These patterns usually contain the percent sign (%).
For more information on SQL wildcards, see http://www.w3schools.com/SQL/sql_wildcards.asp.
You can use any of the above two pattern types either with column analyses or with the analyses of a set of columns (simple table
analyses). These pattern-based analyses illustrate the frequencies of various data patterns found in the values of the analyzed
columns. For more information, see Creating a basic analysis on a database column and Creating an analysis of a set of columns
using patterns.
From Talend Studio, you can generate graphs to represent the results of analyses using patterns. You can also view tables in the
Analysis Results view that describe the generated graphs in words. From those graphs and analysis results you can easily determine
the percentage of invalid values based on the listed patterns.
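The difference between the two pattern types, and the percentage of invalid values they lead to, can be illustrated with a short Python sketch. The sample values, the two patterns and the helper names like_to_regex and percent_invalid are assumptions for the example. The sketch uses Python regular expressions, not the regular expression function of any particular database, and it treats the SQL pattern as case sensitive.

```python
import re


def like_to_regex(sql_pattern):
    """Translate a SQL pattern with % and _ wildcards into a Python regex."""
    parts = []
    for ch in sql_pattern:
        if ch == "%":
            parts.append(".*")   # % matches any sequence of characters
        elif ch == "_":
            parts.append(".")    # _ matches exactly one character
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def percent_invalid(values, compiled):
    """Return the percentage of values that do not fully match the pattern."""
    invalid = sum(1 for v in values if not compiled.fullmatch(v))
    return 100.0 * invalid / len(values)


emails = ["a@example.com", "b@example.org", "not-an-email", "c@example.com"]

# A regular expression describing the general shape of an email address.
regex_invalid = percent_invalid(emails, re.compile(r"[^@\s]+@[^@\s]+\.[a-z]+"))

# A SQL pattern that only accepts addresses ending in @example.com.
like_invalid = percent_invalid(emails, like_to_regex("%@example.com"))
```

With these sample values, the regular expression rejects one value out of four, while the stricter SQL pattern rejects two.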
Management processes for SQL patterns and regular expressions, including those for Java, are the same. For more information, see
Managing regular expressions and SQL patterns.
Note: Some databases do not support regular expressions. To work with such databases, some configuration is necessary before
being able to use regular expressions. For more information, see Managing User-Defined Functions in databases.
The following table shows the patterns that you can select in any database:
SQL
Patterns
Regex
Patterns
The regular expression function is built into several databases, but many other databases do not support it. The databases that
natively support regular expressions include MySQL, PostgreSQL, Oracle 10g and Ingres, while Microsoft SQL Server and Netezza,
for example, do not.
A different case is when the regular expression function is built into the database but the query template of the regular expression
indicator is not defined.
extend the functionality of certain database servers to support the regular expression function; and
define the query template for a database that supports the regular expression function.
The regular expression function is not built into every database environment. If you want to use Talend Studio to analyze
columns against regular expressions in a database that does not natively support them, you can either declare a regular
expression function in that database and define the matching query template, or:
1. Execute the column analysis using the Java engine. In this case, the system will use the Java regular expressions to analyze
the specified column(s) and not SQL regular expressions. For more information on the Java engine, see Using the Java or the
SQL engine.
A query template defines the query logic required to analyze columns against regular expressions.
The steps to define a query template in Talend Studio include the following:
The below example shows how to define a query template specific for the Ingres database.
Procedure
3. Double-click Regular Expression Matching, or right-click it and select Open from the contextual menu.
The corresponding view is displayed to show the indicator metadata and its definition.
You now need to add the database for which you want to define a query template to the list of databases. This query
template will compute the regular expression matching.
4. Click the [+] button at the bottom of the Indicator Definition view to add a field for the new template.
5. In the new field, click the arrow and select the database for which you want to define the template. In this example, select
Ingres.
8. Paste the indicator definition (template) in the Expression box and then modify the text after WHEN in order to adapt the
template to the selected database. In this example, replace the text after WHEN with WHEN REGEX.
10. Click the save icon on top of the editor to save your changes.
Results
You have finalized creating the query template specific for the Ingres database. You can now start analyzing the columns in this
database against regular expressions.
If the regular expression you want to use to analyze data on this server is simple enough to be used with all databases, you can start
your column analyses immediately. If not, you must edit the definition of the regular expression to work with this specific database,
Ingres in this example.
If an analysis with a user-defined indicator runs successfully at least one time and later the indicator definition template for the
database is deleted, the analysis does not fail. It keeps running successfully because it uses the previously generated SQL query.
For more information on how to set the database-specific regular expression definition, see Editing a regular expression or an SQL
pattern.
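To make the role of a database-specific query template more concrete, here is a minimal sketch of how the matching clause differs between databases. It is not the template text used by Talend Studio: the function name, the query shape and the table and column names are assumptions made for illustration. Only the operators themselves (REGEXP in MySQL, ~ in PostgreSQL, REGEXP_LIKE in Oracle) are standard for those databases.

```python
# Hypothetical per-database matching clauses; the keys and the template
# shape are illustrative, not the text stored in Talend Studio.
MATCH_CLAUSE = {
    "MySQL": "{column} REGEXP {pattern}",
    "PostgreSQL": "{column} ~ {pattern}",
    "Oracle": "REGEXP_LIKE({column}, {pattern})",
}


def build_matching_query(database, table, column, pattern):
    """Build a query returning (matching row count, total row count).

    `pattern` is the regular expression definition, already surrounded
    by single quotes.
    """
    clause = MATCH_CLAUSE[database].format(column=column, pattern=pattern)
    return f"SELECT COUNT(CASE WHEN {clause} THEN 1 END), COUNT(*) FROM {table}"


postgres_query = build_matching_query(
    "PostgreSQL", "customer", "zip", "'^[0-9]{5}$'"
)
oracle_query = build_matching_query(
    "Oracle", "customer", "zip", "'^[0-9]{5}$'"
)
```

Adding a database to the list of templates corresponds, in this sketch, to adding one more entry to MATCH_CLAUSE.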
You can edit the query template you create for a specific database.
Procedure
3. Double-click Regular Expression Matching, or right-click it and select Open from the contextual menu.
The corresponding view is displayed to show the indicator metadata and its definition.
4. Click the button next to the database for which you want to edit the query template.
5. In the Expression area, edit the regular expression template as required and then click OK to close the dialog box and
proceed to the next step.
The regular expression template is modified accordingly.
You can delete the query template you create for a specific database.
Procedure
3. Double-click Regular Expression Matching, or right-click it and select Open from the contextual menu.
The corresponding view is displayed to show the indicator metadata and its definition.
4. Click the button next to the database for which you want to delete the query template.
Results
The selected query template is deleted from the list in the Indicator definition view.
You can use regular expressions and SQL patterns in column analyses in order to check all existing data in the analyzed columns
against these expressions and patterns. For more information, see Adding a regular expression or an SQL pattern to a column
analysis.
You can also edit the regular expression or SQL pattern parameters after attaching it to a column analysis. For more information, see
Editing a pattern in the column analysis.
After the execution of the column analysis that uses a specific expression or pattern, you can:
access a list of all valid/invalid data in the analyzed column. For more information, see Viewing the data analyzed against
patterns.
The management procedures of regular expressions and SQL patterns include operations like creating, testing, duplicating,
importing and exporting.
The sections below explain in detail each of the management options for regular expressions and SQL patterns. Management
processes for both types of patterns are exactly the same.
You can create new regular expressions or SQL patterns, including those for Java, to be used in column analyses.
Management processes for regular expressions and SQL patterns are the same. The procedure below, with all the included screen
captures, reflects the steps to create a regular expression. You can follow the same steps to create an SQL pattern.
Procedure
1. In the DQ Repository tree view, expand Libraries > Patterns, and then right-click Regex.
2. From the contextual menu, select New Regex Pattern to open the corresponding wizard.
When you open the wizard, a help panel automatically opens with the wizard. This help panel guides you through the steps
of creating new regular patterns.
3. In the Name field, enter a name for this new regular expression.
Note: Avoid using special characters in item names, including:
"~", "!", "`", "#", "^", "&", "*", "\\", "/", "?", ":", ";", "\"", ".", "(", ")", "'", "¥", "‘", "”", "«", "»", "<", ">".
These characters are all replaced with "_" in the file system, so you may end up creating duplicate items.
4. If required, set other metadata (purpose, description and author name) in the corresponding fields and click Next.
5. In the Regular expression field, enter the definition of the regular expression to be created. The regular expression must be
surrounded by single quotes.
Note:
For the PostgreSQL database, regular expressions are not compatible across the different versions of the database.
If you want to use the regular expression with PostgreSQL version 9.1 or greater, you must either:
in the PostgreSQL database configuration, set the standard_conforming_strings parameter to off and write
double backslashes in the definition, or
in the Regular expression field in the wizard, use a single backslash in the expression definition.
For further information about PostgreSQL regular expressions, select Window > Show View, expand Help and then select
Bookmarks.
6. From the Language Selection list, select the language (a specific database or Java).
8. In the Pattern Definition view, click the [+] button and add as many regular expressions as necessary in the new pattern.
You can define the regular expressions specific to any of the available databases or specific to Java.
Note: If the regular expression is simple enough to be used in all databases, select Default from the list.
Subfolders labeled with the specified database types or Java are listed below the name of the new pattern under the
Patterns folder in the DQ Repository tree view.
10. Optional: Click the pattern name to display its detail in the Detail View in Talend Studio.
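The note about special characters in item names in the procedure above can be illustrated with a short sketch. It assumes that each listed character is simply replaced with "_" one for one, as the note describes. The function name is hypothetical, and the character set only contains the characters from the note that are unambiguous in this text.

```python
# Characters taken from the note on item names; the set is limited to the
# ones that are unambiguous in this text.
SPECIAL_CHARACTERS = set('~!`#^&*\\/?:;".()\'¥«»<>')


def file_system_name(item_name):
    """Replace each special character with "_" (hypothetical helper)."""
    return "".join(
        "_" if ch in SPECIAL_CHARACTERS else ch for ch in item_name
    )


first = file_system_name("Zip code (US)")
second = file_system_name("Zip code <US>")
# Two different item names end up with the same name on disk.
collision = first == second
```

Both names become "Zip code _US_", which is how two distinct items can turn into duplicates.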
It is possible to test character sequences against a predefined or newly created regular expression.
At least one database connection is set in the Profiling perspective of Talend Studio.
Procedure
1. Follow the steps outlined in Creating a new regular expression or SQL pattern to create a new regular expression.
2. In the open pattern editor, click Pattern Definition to open the relevant view.
3. Click the Test button next to the definition against which you want to test a character sequence to proceed to the next step.
The test view is displayed in the Studio showing the selected regular expression.
4. In the Test Area, enter the character sequence you want to check against the regular expression.
5. From the DB Connection list, select the database in which you want to use the regular expression.
Note: If you choose to test a regular expression in Java, the Java option is selected by default and the DB Connections
option and list are unavailable in the test view.
6. Click Test.
An icon is displayed in the upper left corner of the view to indicate if the character sequence matches or does not match the
selected pattern definition.
7. If required, modify the regular expression according to your needs and then click Save to save your modifications.
The pattern definition is modified accordingly in the pattern editor.
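The check performed by the test view can be approximated outside Talend Studio. The sketch below assumes a definition written as in the pattern editor, that is, surrounded by single quotes, and uses Python's re module rather than a database or Java regular expression engine, so syntax support may differ. The function name is hypothetical.

```python
import re


def sequence_matches(pattern_definition, sequence):
    """Test a character sequence against a single-quoted definition."""
    quoted = (
        len(pattern_definition) >= 2
        and pattern_definition[0] == "'"
        and pattern_definition[-1] == "'"
    )
    if not quoted:
        raise ValueError("the definition must be surrounded by single quotes")
    # Strip the enclosing quotes before handing the pattern to the engine.
    return re.search(pattern_definition[1:-1], sequence) is not None


valid = sequence_matches("'^[0-9]{5}$'", "75012")
invalid = sequence_matches("'^[0-9]{5}$'", "7501a")
```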
You can create your own customized patterns from the Pattern Test View. The advantage of creating a pattern from this view is that
you can create your customized pattern based on an already tested regular expression.
All you need to do is to customize the expression definition according to your needs before saving the new pattern.
Procedure
1. In the DQ Repository tree view, expand Libraries > Patterns > Regex and double-click the pattern you want to use to create
your customized pattern.
The pattern editor opens in Talend Studio.
2. Click Test next to the definition you want to use as a base to create the new pattern.
The Pattern Test View is opened on the definition of the selected regular expression.
3. Optional: Enter a test string in the Test Area, to test the regular expression.
5. In the Name field, enter a name for this new regular expression.
Note: Avoid using special characters in item names, including:
"~", "!", "`", "#", "^", "&", "*", "\\", "/", "?", ":", ";", "\"", ".", "(", ")", "'", "¥", "‘", "”", "«", "»", "<", ">".
These characters are all replaced with "_" in the file system, so you may end up creating duplicate items.
6. If required, set other metadata (purpose, description and author name) in the corresponding fields and click Next to proceed
to the next step.
The definition of the initial regular expression is already listed in the Regular expression field.
7. Customize the syntax of the initial regular expression according to your needs. The regular expression definition must be
surrounded by single quotes.
Note:
For the PostgreSQL database, regular expressions are not compatible across the different versions of the database.
If you want to use the regular expression with PostgreSQL version 9.1 or greater, you must either:
in the PostgreSQL database configuration, set the standard_conforming_strings parameter to off and write
double backslashes in the definition, or
in the Regular expression field in the wizard, use a single backslash in the expression definition.
For further information about PostgreSQL regular expressions, select Window > Show View, expand Help and then select
Bookmarks.
8. From the Language Selection list, select the database in which you want to use the new regular expression.
Results
A subfolder for the new pattern is listed under the Regex folder, in the same folder as the initial regular expression. The pattern
editor opens on the pattern metadata and pattern definition.
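The PostgreSQL note in the procedure above comes down to how backslashes in a string literal are interpreted. The following is a simplified model, not PostgreSQL itself: it assumes that with standard_conforming_strings set to on (the default since version 9.1) backslashes are kept as written, and that with the parameter set to off a doubled backslash collapses to a single one. Other escape sequences are ignored, and the function name is hypothetical.

```python
import re


def pattern_seen_by_regex_engine(literal_body, standard_conforming_strings):
    """Simplified model of string-literal backslash handling."""
    if standard_conforming_strings:
        # Backslashes are ordinary characters: the text is kept as written.
        return literal_body
    # Backslash is an escape character: "\\" in the literal becomes "\".
    return literal_body.replace("\\\\", "\\")


# Parameter on: write a single backslash in the definition.
with_on = pattern_seen_by_regex_engine(r"^\d+$", True)
# Parameter off: write double backslashes in the definition.
with_off = pattern_seen_by_regex_engine(r"^\\d+$", False)

same_pattern = with_on == with_off
matches_digits = re.search(with_on, "2022") is not None
```

In both configurations the regular expression engine receives the same pattern, which is why the note offers the two options as alternatives.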
You can generate a regular pattern from the results of an analysis that uses the Date Pattern Frequency indicator on a date column.
In the Profiling perspective of Talend Studio, a column analysis is created on a date column using the Date Pattern Frequency
indicator.
Note: To be able to use the Date Pattern Frequency indicator on date columns, you must set the execution engine to Java in the
Analysis Parameter view of the column analysis editor. For more information on execution engines, see Using the Java or the
SQL engine.
For more information on how to create a column analysis, see Creating a basic analysis on a database column.
Procedure
1. In the DQ Repository tree view, right-click the column analysis that uses the date indicator on a date column.
2. Select Open from the contextual menu to open the corresponding analysis editor.
3. Press F6 to execute the analysis and display the analysis results in the Analysis Results view of the editor.
In this example, 100.00% of the date values follow the pattern yyyy MM dd and 39.41% follow the pattern yyyy dd MM.
4. Right-click the date value for which you want to generate a regular expression and select Generate Regex Pattern from the
contextual menu.
The New Regex Pattern dialog box is displayed.
5. Click Next.
The pattern editor opens with the defined metadata and the generated pattern definition.
The new regular expression is listed under Patterns > Regex in the DQ Repository tree view. You can drag it onto any date
column in the analysis editor.
7. Optional: If required, click the Test button to test a character sequence against this date regular expression as outlined in
Testing a regular expression in the Pattern Test View.
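The percentages in the example results of this procedure add up to more than 100%, which is possible when a date value is compatible with more than one pattern. The sketch below illustrates this reading with Python's datetime parsing. It is an assumption about how such frequencies can arise, not the implementation of the Date Pattern Frequency indicator, and the sample values are made up.

```python
from datetime import datetime

# Date patterns from the example, mapped to strptime formats.
PATTERNS = {"yyyy MM dd": "%Y %m %d", "yyyy dd MM": "%Y %d %m"}


def follows(value, strptime_format):
    """Return True when the value can be parsed with the given format."""
    try:
        datetime.strptime(value, strptime_format)
        return True
    except ValueError:
        return False


def date_pattern_frequency(values):
    """Percentage of values compatible with each date pattern."""
    return {
        label: 100.0 * sum(follows(v, fmt) for v in values) / len(values)
        for label, fmt in PATTERNS.items()
    }


values = ["2022 03 05", "2022 07 26", "2021 12 31", "2020 01 02"]
frequencies = date_pattern_frequency(values)
```

Here every value follows yyyy MM dd, and the two values whose last part is 12 or lower also follow yyyy dd MM.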
You can open the editor of any regular expression or SQL pattern to check its settings and/or edit its definition in order to adapt it to
a specific database type, or adapt it to a specific use.
Procedure
2. Browse through the regular expression or SQL pattern lists to reach the expression or pattern you want to open/edit.
3. Right-click its name and select Open from the contextual menu.
The pattern editor opens displaying the regular expression or SQL pattern settings.
4. Modify the pattern metadata, if required, and then click Pattern Definition to display the relevant view.
In this view, you can edit the pattern definition, change the selected database, and add other patterns specific to the
available databases by clicking the [+] button.
Note:
For the PostgreSQL database, regular expressions are not compatible across the different versions of the database.
If you want to use the regular expression with PostgreSQL version 9.1 or greater, you must either:
in the PostgreSQL database configuration, set the standard_conforming_strings parameter to off and write
double backslashes in the definition, or
in the Regular expression field in the wizard, use a single backslash in the expression definition.
For further information about PostgreSQL regular expressions, select Window > Show View, expand Help and then select
Bookmarks.
5. If the regular expression or SQL pattern is simple enough to be used in all databases, select Default in the list.
6. Click the save icon on top of the editor to save your changes.
Note: When you edit a regular expression or an SQL pattern, make sure that your modifications are suitable for all the
analyses that may be using this regular expression or SQL pattern.
You can export regular expressions or SQL patterns from your current version of the studio to Talend Exchange where they are saved
as .xmi files. Other users can then import these patterns from the exchange server into their Talend Studio and use them in their
analyses.
You can also export regular expressions and SQL patterns and store them locally in a csv file. For more information about the
content layout of the csv file, see Importing regular expressions or SQL patterns.
Note: Management processes for regular expressions and SQL patterns are the same. The procedure below, with all the included
screen captures, reflects the steps to export regular expressions. You can follow the same steps to export SQL patterns.
You can export regular expressions or SQL patterns from your current version of studio to Talend Exchange where you can share
them with other users. The exported patterns are saved as .xmi files on the exchange server.
Patterns will be exported with the exact path they have in the initial repository. When users import them from Talend Exchange into
the repository of their Talend Studio, these patterns will be imported under the same folder or subfolders they had in the initial
repository.
The below procedure uses regular expressions as an example. You can follow the same steps to export SQL patterns.
Procedure
4. Click Select All to select all the regular expressions in the list or select the check boxes of the regular expressions you want to
export to the specified folder.
Select the Browse check box to list only the patterns you want to export.
5. Click Finish.
The .xmi file of each selected pattern is saved as a zip file in the defined folder.
Note: When users import these .xmi files from Talend Exchange into the repository of their Talend Studio, the patterns
are imported under the same family subfolder, and thus have the path they had in the initial repository.
You can export regular expressions or SQL patterns from your current version of studio to Talend Exchange where you can share
them with other users. The exported patterns are saved as .xmi files on the exchange server.
Patterns will be exported with the exact path they have in the initial repository. When users import them from Talend Exchange into
the repository of their Talend Studio, these patterns will be imported under the same folder or subfolders they had in the initial
repository.
The below procedure uses regular expressions as an example. You can follow the same steps to export SQL patterns.
Procedure
1. In the DQ Repository tree view, expand Libraries > Patterns, and then browse to the regular expression you want to export.
2. Right-click it and then select Export for Talend Exchange from the contextual menu.
3. Click Select All to select all the regular expressions in the list, or select the check boxes of the regular expressions or SQL
patterns you want to export to the folder.
The .xmi file of each selected pattern is saved as a zip file in the defined folder.
Note: When users import these .xmi files from Talend Exchange into the repository of their Talend Studio, the patterns
are imported under the same family subfolder, and thus have the path they had in the initial repository.
Procedure
1. In the DQ Repository tree view, expand Libraries > Patterns, and then right-click Regex.
4. Click Select All to select all listed regular expressions or select the check boxes of the regular expressions you want to export
to the csv file.
What to do next
1. In the DQ Repository tree view, expand Libraries > Patterns, and then browse to the regular expression family you want to
export.
3. Click Select All to select all the check boxes of the regular expressions or select the check boxes of the regular expressions
you want to export to the csv file.
4. Click Finish to close the wizard.
All exported regular expressions are saved in the defined csv file.
When users import these regular expressions from the csv file into the repository of their Talend Studio, the regular expressions
are imported under the same subfolder, and thus have the same path in the new repository as in the initial repository.
You can import regular expressions or SQL patterns from Talend Exchange into your Talend Studio and use them on analyses. This
way you can share all the patterns created by other users and stored on the exchange server.
You can also import the regular expressions or SQL patterns stored locally in a csv file. The csv file uses specific columns,
including the ones listed in the table below, but it does not have to contain all of them.
Relative Path The relative path to the root folder (can be empty)
All_DB_Regexp The regular expression applicable to all databases (can be empty)
<database name>_Regexp The regular expression applicable to this specific database (can be empty)
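As an illustration of the columns listed above, the following sketch writes a small csv file in memory. Only the three column names come from the table. The comma delimiter, the spelling MySQL_Regexp for a database-specific column and the sample pattern are assumptions, so check a file exported from your own Talend Studio for the exact layout.

```python
import csv
import io

# Column names from the table above; "MySQL_Regexp" is an assumed
# instance of the "<database name>_Regexp" column.
FIELDS = ["Relative Path", "All_DB_Regexp", "MySQL_Regexp"]

rows = [
    {
        "Relative Path": "",                # may be empty
        "All_DB_Regexp": "'^[0-9]{5}$'",   # enclosed in single quotes
        "MySQL_Regexp": "",                 # may be empty
    }
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
```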
You can import the .xmi files of regular expressions or SQL patterns from Talend Exchange into the repository of your current Talend
Studio and use them to analyze columns.
You can import only versions that are compatible with the version of your current Talend Studio.
The below procedure uses SQL patterns as an example. You can follow the same steps to import regular expressions.
Procedure
2. Under Exchange, expand SQL and right-click the name of the pattern you want to import.
4. Select the Overwrite existing items check box if some error and warning messages are listed in the Error and Warning area.
This means that a pattern with the same name already exists in the current studio. The imported pattern will replace the one
in Talend Studio.
5. Click Finish.
A progress information bar is displayed. The pattern is listed under the Patterns > SQL folder in the DQ Repository tree view.
The patterns you import in your studio will be imported with the structure they had in the initial repository. They will be
imported under the same folder or subfolder they had initially.
Procedure
Option Description
skip existing patterns Import only the regular expressions that do not exist in
the corresponding lists in the DQ Repository tree view. A
warning message is displayed if the imported patterns
already exist under the Patterns folder.
rename new patterns with suffix Identify each of the imported regular expressions with a
suffix. All regular expressions will be imported even if they
already exist under the Patterns folder.
5. Click Finish.
A confirmation message is displayed.
6. Click OK.
All imported regular expressions are listed under the Regex folder in the DQ Repository tree view.
Results
The regular expressions are imported under the same folders or subfolders they had in the initial repository.
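The difference between the two import options described in the procedure above can be sketched as follows. This is an interpretation of the option descriptions, not the import code of Talend Studio: the function name, the option keywords and the "_imported" suffix are hypothetical.

```python
def import_patterns(existing, incoming, option, suffix="_imported"):
    """Merge imported patterns into the existing ones.

    option "skip":   import only the patterns that do not exist yet.
    option "rename": import every pattern under its name plus a suffix.
    Returns the merged patterns and the names that were skipped.
    """
    merged = dict(existing)
    skipped = []
    for name, definition in incoming.items():
        if option == "skip":
            if name in merged:
                skipped.append(name)  # would trigger a warning message
            else:
                merged[name] = definition
        elif option == "rename":
            merged[name + suffix] = definition
        else:
            raise ValueError(f"unknown option: {option}")
    return merged, skipped


existing = {"Zip": "'^[0-9]{5}$'"}
incoming = {"Zip": "'^[0-9]{4}$'", "Phone": "'^[0-9]{10}$'"}

skip_result, skip_warnings = import_patterns(existing, incoming, "skip")
rename_result, rename_warnings = import_patterns(existing, incoming, "rename")
```

With the first option the existing Zip pattern is kept and reported, and with the second option both imported patterns arrive under suffixed names.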
Note: A warning icon next to the name of an imported regular expression or SQL pattern in the tree view indicates that it is
not correct. Open the expression or the pattern to find out what is wrong. Problems usually come from missing quotes: check
that your regular expressions and SQL patterns are enclosed in single quotes.
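Before importing a csv file, you can look for definitions that lack their enclosing single quotes. The following is a minimal sketch with hypothetical pattern names and definitions. It only checks the quoting and does not validate the regular expression itself.

```python
def is_single_quoted(definition):
    """Return True when the definition is enclosed in single quotes."""
    text = definition.strip()
    return len(text) >= 2 and text.startswith("'") and text.endswith("'")


definitions = {
    "Zip": "'^[0-9]{5}$'",
    "Phone": "^[0-9]{10}$",    # quotes missing
    "Email": "'^.+@.+$",       # closing quote missing
}

suspect = [
    name for name, text in definitions.items() if not is_single_quoted(text)
]
```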
Indicators
Indicators can be the results achieved through the implementation of different patterns that are used to define the content,
structure and quality of your data.
Indicators also represent the results of highly complex analyses related not only to data matching, but also to other
data-related operations.
Indicator types
Two types of indicators are listed under the Indicators folder in the DQ Repository tree view: system indicators and user-defined
indicators.
User-defined indicators, as their name indicates, are indicators created by the user. You can use them through a simple drag-and-
drop operation from the User Defined Indicators folder in the tree view. User-defined indicators are used only with column analyses.
For more information on how to set user-defined indicators for columns, see Setting user-defined indicators from the analysis editor.
System indicators are predefined indicators grouped under different categories in the System Indicators folder in the DQ Repository
tree view. Each category of the system indicators is used with a corresponding analysis type.
You cannot create a system indicator or drag it directly from the DQ Repository tree view to an analysis. However, you can open and
modify the parameters of a system indicator to adapt it to a specific database, for example. For further information, see Editing a
system indicator.
Only the system indicators that you can modify are listed under the System Indicators folder in the DQ Repository tree view. However, the
Indicator Selection dialog box lists all indicators, including the system indicators you cannot modify, such as Date pattern frequency.
Several management options including editing, duplicating, importing and exporting are possible for system and user-defined
indicators.
The following sections describe the system indicators used only on column analyses. These system indicators range
from simple or advanced statistics to text string analysis, including summary data and statistical distributions of records.
Advanced statistics
Advanced statistics indicators determine the most probable and the most frequent values and build frequency tables. The main
advanced statistics include the following values:
Mode: computes the most probable value. For numerical or continuous data, you can set bins in the parameters of this
indicator. It is different from the "average" and the "median". It is well suited to categorical attributes.
Value Frequency: computes the number of records for each distinct value and lists the most frequent values.
All other Value Frequency indicators are available to aggregate date and numerical data (with respect to "date", "week",
"month", "quarter", "year" and "bin").
Value Low Frequency: computes the number of records for each distinct value and lists the least frequent values.
All other Value Low Frequency indicators are available to aggregate date and numerical data (with respect to "date", "week",
"month", "quarter", "year" and "bin"), where "bin" is the aggregation of numerical data by intervals.
The following table shows the indicators that you can select in any database:
Mode
Value (Low) Frequency
Date (Low) Frequency * *
Week (Low) Frequency * *
Month (Low) Frequency * *
Quarter (Low) Frequency * *
Year (Low) Frequency * *
Bin (Low) Frequency
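The advanced statistics described above can be approximated on a small in-memory column. This is a minimal sketch of the computations only, assuming plain counting and fixed-width bins. The function names and sample data are hypothetical, and the indicators themselves run in the database or in the Java engine.

```python
from collections import Counter


def mode(values):
    """Most frequent value of the column."""
    return Counter(values).most_common(1)[0][0]


def value_frequency(values):
    """(value, count) pairs, most frequent first."""
    return Counter(values).most_common()


def value_low_frequency(values):
    """(value, count) pairs, least frequent first."""
    return sorted(Counter(values).items(), key=lambda pair: pair[1])


def bin_frequency(numbers, width):
    """Count numerical values per interval, keyed by the lower bound."""
    return Counter((n // width) * width for n in numbers)


colors = ["red", "blue", "red", "green", "red", "blue"]
amounts = [3, 12, 15, 27, 29, 8]

most_probable = mode(colors)
frequent = value_frequency(colors)
rare = value_low_frequency(colors)
bins = bin_frequency(amounts, 10)
```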
Fraud Detection
The Benford Law indicator (first-digit law) is based on examining the actual frequency of the digits 1 through 9 in numerical data.
It is usually used as an indicator of accounting and expenses fraud in lists or tables.
Benford's law states that in lists and tables the digit 1 tends to occur as a leading digit about 30% of the time. Larger digits occur as
the leading digits with lower frequency, for example the digit 2 about 17.6%, the digit 3 about 12.5% and so on. Valid, unaltered data
will follow this expected frequency. A simple comparison of the first-digit frequency distribution from the data you analyze with the
expected distribution according to Benford's law ought to show up any anomalous results.
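The expected frequencies come from the usual statement of Benford's law, in which the leading digit d occurs with probability log10(1 + 1/d). The sketch below computes that expected distribution and the observed first-digit distribution of a list of amounts. The function names and sample amounts are hypothetical, and this is not the code of the Benford Law indicator.

```python
import math
from collections import Counter


def benford_expected():
    """Expected percentage of each leading digit 1 through 9."""
    return {d: 100.0 * math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit_frequency(amounts):
    """Observed percentage of each leading digit, ignoring zeros."""
    digits = [
        int(str(abs(amount)).lstrip("0.")[0])
        for amount in amounts
        if amount != 0
    ]
    counts = Counter(digits)
    return {d: 100.0 * counts[d] / len(digits) for d in range(1, 10)}


expected = benford_expected()
# Made-up payments in which the leading digit 2 is over-represented.
observed = first_digit_frequency([29, 232, 2187, 150, 47, 2, 910, 21])
```

Comparing observed[2] with expected[2] shows the kind of gap that the bars and dots of the result chart make visible.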
For example, let's assume an employee has committed fraud by creating and sending payments to a fictitious vendor. Since the
amounts of these fictitious payments are made up rather than occurring naturally, the leading digit distribution of all fictitious and
valid transactions (mixed together) will no longer follow Benford's law. Furthermore, assume many of these fraudulent payments
have 2 as the leading digit, such as 29, 232 or 2,187. By using the Benford Law indicator to analyze such data, you should see the
amounts that have the leading digit 2 occur more frequently than the usual occurrence pattern of about 17.6%.
When you use the Benford Law indicator, it is advisable to:
make sure that the numerical data you analyze do not start with 0, as Benford's law expects the leading digit to range only
from 1 to 9. You can verify this by using the number > Integer values pattern on the column you analyze.
check the order of magnitude of the data, either by selecting the min and max value indicators or by using the Order of
Magnitude indicator that you can import from Talend Exchange. Benford's law tends to be most accurate when
values are distributed across multiple orders of magnitude.
For more information about importing indicators from Talend Exchange, see Importing user-defined indicators from Talend
Exchange.
In the result chart of the Benford Law indicator, digits 1 through 9 are represented by bars and the height of the bar is the
percentage of the first-digit frequency distribution of the analyzed data. The dots represent the expected first-digit frequency
distribution according to Benford's law.
Below is an example of the results of an analysis after using the Benford Law indicator and the Order of Magnitude user-defined
indicator on a total_sales column.
The first chart shows that the analyzed data varies over 5 orders of magnitude, that is, there are 5 digits between the minimal
value and the maximal value of the numerical column.
The second chart shows that the actual distribution of the data (height of bars) does not follow Benford's law (dot values). The
frequency distribution of the sales figures differs considerably from the expected distribution according to Benford's
law. For example, sales figures that start with 1 are expected to occur about 30% of the time, but they represent only 20% of
the analyzed data. Some fraud could be suspected here: sales figures may have been modified by someone, or some data may be
missing.
Below is another example of the result chart of a column analysis after using the Benford Law indicator.
The red bar labeled invalid shows the percentage of the analyzed data that does not start with a digit, and the 0 bar represents
the percentage of data that starts with 0. Neither case is expected when analyzing columns using the Benford Law indicator, which
is why both are represented in red.
The following table shows the indicators that you can select in any database:
Benford Law
Indicators in this group determine the most and least frequent patterns.
Note:
When running an analysis with the SQL engine, percentage values do not appear in the analysis results if you did not select the
Row Count indicator.
Date Pattern Frequency supports 30 types of date patterns. If the user-defined pattern is not included, results will be empty. To
be able to add a user-defined pattern, create a user-defined indicator.
Pattern frequency indicators include Pattern Frequency and Pattern Low Frequency.
Indicator                Purpose
Pattern Frequency        Computes the number of most frequent records for each distinct pattern.
Pattern Low Frequency    Computes the number of least frequent records for each distinct pattern.
The above two indicators give patterns by converting alpha characters to a and numerics to 9.
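The conversion can be sketched as follows. This is a hedged illustration rather than the Studio's code: it maps letters to a and digits to 9 as stated above, and additionally assumes that upper-case letters map to A and that all other characters are kept unchanged. The class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class PatternFrequencySketch {

    /** Letters become 'a' (assumed 'A' for upper case), digits become '9', the rest is kept. */
    static String toPattern(String value) {
        StringBuilder sb = new StringBuilder(value.length());
        for (char c : value.toCharArray()) {
            if (Character.isDigit(c)) {
                sb.append('9');
            } else if (Character.isUpperCase(c)) {
                sb.append('A');
            } else if (Character.isLetter(c)) {
                sb.append('a');
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Counts how many non-null values share each pattern, in order of first appearance. */
    static Map<String, Integer> frequencies(String[] values) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String v : values) {
            if (v != null) {
                counts.merge(toPattern(v), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up postal codes: two share the pattern 99999, one has the pattern "99 999".
        System.out.println(frequencies(new String[] {"75001", "94 220", "13005"}));
    }
}
```

Patterns with a low count in such a map are the candidates that the low-frequency indicator brings to attention.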
The following table shows the indicators that you can select in any database:
Pattern Frequency
Pattern Low Frequency
East Asia pattern frequency indicators include East Asia Pattern Frequency and East Asia Pattern Low Frequency.
Indicator                          Purpose
East Asia Pattern Frequency        Computes the number of most frequent records for each distinct pattern.
East Asia Pattern Low Frequency    Computes the number of least frequent records for each distinct pattern.
The above two indicators work only with Latin characters and are available only with the Java engine. They are useful when you
want to identify patterns in Asian data.
The above two indicators give patterns by converting Asian characters to letters such as H, K, C and G, following the rules
described in the following table:
Below is an example of a column analysis using the East Asia Pattern Frequency and East Asia Pattern Low Frequency indicators on
an address column.
The analysis results of the East Asia Pattern Low Frequency indicator will look like the following:
These results give the number of the least frequent records for each distinct pattern. Some patterns have characters and numbers,
while others have only characters. Patterns also have different lengths, so this shows that the address is not consistent and you may
need to correct and clean it.
The following table shows the indicators that you can select in any database:
East Asia Pattern Frequency
East Asia Pattern Low Frequency
Indicator                 Purpose
Date Pattern Frequency    Counts the number of records for each distinct date pattern.
The following table shows the indicators that you can select in any database:
Date Pattern Frequency
Word-based pattern indicators include case-sensitive and case-insensitive indicators.
Word-based pattern indicators count the number of records for each distinct pattern and are available only with the Java engine.
You can use these indicators with the String data type only.
Case-sensitive indicators
Indicator                        Purpose
CS Word Pattern Low Frequency    Evaluates the least frequent word patterns.
Pattern Description
When using the CS Word Pattern Frequency and CS Word Pattern Low Frequency indicators, the following strings are replaced with
the following patterns:
String               Pattern
someWordsINwORDS     [word][Word][WORD][char][WORD]
[email protected]    [Word][number]@[word].[word]
[email protected]    [word][Word][digit]@[word].[word]
Latin2中文            [Word][digit][IdeogramSeq]
Latin3フランス         [Word][digit][kataSeq]
Latin4とうきょう       [Word][digit][hiraSeq]
Case-insensitive indicators
Indicator                        Purpose
CI Word Pattern Low Frequency    Evaluates the least frequent word patterns.
Pattern Description
When using the CI Word Pattern Frequency and CI Word Pattern Low Frequency indicators, the following strings are replaced with
the following patterns:
String               Pattern
someWordsINwORDS     [word]
[email protected]    [alnum]@[word].[word]
[email protected]    [alnum]@[word].[word]
Latin2中文            [word][digit][IdeogramSeq]
Latin3フランス         [word][digit][kataSeq]
Latin4とうきょう       [word][digit][hiraSeq]
The following table shows the indicators that you can select in any database:
CS Word Pattern Frequency
CS Word Pattern Low Frequency
CI Word Pattern Frequency
CI Word Pattern Low Frequency
List of engines used and database types supported when using Pattern Frequency Statistics indicators
When creating a column analysis from the Profiling perspective of Talend Studio, you can profile a database using the Pattern
Frequency Statistics indicators. To execute the analysis, you can use the Java or the SQL engine depending on the type of the
database you want to profile.
Ingres Yes No
Sybase Yes No
Teradata Yes No
Indicators in this group count phone numbers. They return the count for each phone number format. They validate the phone
formats using the org.talend.libraries.google.libphonumber library.
The following table shows the indicators that you can select in any database:
Valid Phone Number Count 1
Valid Region Code Count
Invalid Region Code Count
Valid Phone Number for Region Count 1
Possible Phone Number Count
Well Formed International Phone Number Count
Well Formed National Phone Number Count
Well Formed E164 Phone Number Count
Format Phone Number Frequency
1 As Valid Phone Number for Region Count was added from 7.3 onwards, when you run a Job created in a previous version, the
results in Valid Phone Number Count may differ from those of previous versions.
The input data can be the same as for Valid Phone Number Count while the results differ. For example, if a phone number is
valid but its country code is wrong, Valid Phone Number Count returns true and Valid Phone Number for Region Count
returns false.
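To give an idea of what a "well formed" check looks at, here is a minimal Java sketch that tests the E.164 shape only: a plus sign followed by at most 15 digits, the first of which is not 0. It is a simplified stand-in and does not reproduce the validation performed by the phone number library used by the Studio. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class E164Sketch {

    // E.164 shape: '+' followed by 2 to 15 digits, the first digit being 1..9.
    private static final Pattern E164 = Pattern.compile("\\+[1-9]\\d{1,14}");

    /** True when the value has the E.164 shape; says nothing about whether the number exists. */
    static boolean isWellFormedE164(String value) {
        return value != null && E164.matcher(value).matches();
    }

    /** Counts the values of a column that have the E.164 shape. */
    static long countWellFormed(String[] values) {
        long count = 0;
        for (String v : values) {
            if (isWellFormedE164(v)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String[] phones = {"+33140000000", "0140000000", "+1 650 253 0000", null}; // made-up values
        System.out.println(countWellFormed(phones));
    }
}
```

A shape check like this one accepts numbers that are not assigned in any region, which is why separate validity indicators exist in this group.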
Simple statistics
They provide simple statistics on the number of records falling in certain categories including the number of rows, the number of
null values, the number of distinct and unique values, the number of duplicates, or the number of blank fields.
Blank Count: counts the number of blank rows. A "blank" is non-null textual data that contains only white space. Note that
Oracle does not distinguish between the empty string and the null value.
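The categories counted by simple statistics can be illustrated with a short Java sketch. It rests on assumptions that are not spelled out in this section: a distinct value is counted once however often it occurs, a unique value occurs exactly once, a duplicate value occurs more than once, null values are left out of these three counts, and an empty string is not treated as a blank. The class name is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class SimpleStatsSketch {

    final long rowCount;
    final long nullCount;
    final long blankCount;
    final long distinctCount;
    final long uniqueCount;
    final long duplicateCount;

    SimpleStatsSketch(String[] column) {
        Map<String, Integer> occurrences = new HashMap<>();
        long nulls = 0;
        long blanks = 0;
        for (String v : column) {
            if (v == null) {
                nulls++;
                continue; // assumption: nulls are not counted as a distinct value
            }
            if (!v.isEmpty() && v.trim().isEmpty()) {
                blanks++; // non-null, non-empty, white space only
            }
            occurrences.merge(v, 1, Integer::sum);
        }
        long unique = 0;
        long duplicate = 0;
        for (int n : occurrences.values()) {
            if (n == 1) {
                unique++;
            } else {
                duplicate++;
            }
        }
        rowCount = column.length;
        nullCount = nulls;
        blankCount = blanks;
        distinctCount = occurrences.size();
        uniqueCount = unique;
        duplicateCount = duplicate;
    }

    public static void main(String[] args) {
        SimpleStatsSketch s = new SimpleStatsSketch(new String[] {"a", "b", "a", null, " ", "c"});
        System.out.println(s.rowCount + " rows, " + s.distinctCount + " distinct, " + s.uniqueCount + " unique");
    }
}
```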
The following table shows the indicators that you can select in any database:
Row Count
Null Count
Distinct Count
Unique Count
Duplicate Count
Blank Count
Default Value Count * * * * * * * *
Indicators in this group use the Soundex algorithm built into the DBMS.
Those indicators index records by sounds. This way, records with the same English pronunciation are encoded to the same
representation so that they can be matched despite minor differences in spelling.
Soundex Frequency: computes the number of most frequent distinct records relative to the total number of records having
the same pronunciation.
Soundex Low Frequency: computes the number of less frequent distinct records relative to the total number of records
having the same pronunciation.
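To show how records with the same pronunciation end up with the same code, here is a minimal Java sketch of the classic American Soundex encoding. Each DBMS ships its own Soundex function and details may differ between them, so this is an illustration and not the code that the indicators run. The class name is hypothetical.

```java
import java.util.Locale;

class SoundexSketch {

    // Soundex digit for each letter A..Z; '0' marks letters that are not coded (vowels, H, W, Y).
    private static final String CODES = "01230120022455012623010202";

    /** Returns the 4-character Soundex code; characters outside A-Z are ignored. */
    static String soundex(String name) {
        StringBuilder out = new StringBuilder();
        char previous = '0';
        for (char ch : name.toUpperCase(Locale.ROOT).toCharArray()) {
            if (ch < 'A' || ch > 'Z') {
                continue;
            }
            char code = CODES.charAt(ch - 'A');
            if (out.length() == 0) {
                out.append(ch); // the first letter is kept as is
                previous = code;
                continue;
            }
            if (ch == 'H' || ch == 'W') {
                continue; // H and W do not separate letters with the same code
            }
            if (code != '0' && code != previous) {
                out.append(code);
            }
            previous = code;
            if (out.length() == 4) {
                break;
            }
        }
        while (out.length() > 0 && out.length() < 4) {
            out.append('0'); // pad short codes with zeros
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(soundex("Robert") + " " + soundex("Rupert"));
        System.out.println(soundex("Smith") + " " + soundex("Smyth"));
    }
}
```

Grouping the values of a column by such a code, then counting the distinct spellings in each group, gives the kind of figure these indicators report.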
To be able to use Soundex frequency statistics indicators on PostgreSQL, Amazon for PostgreSQL and Amazon Redshift, install an
extension into the PostgreSQL database using the CREATE EXTENSION fuzzystrmatch; query.
To be able to use Soundex frequency statistics indicators on Amazon Redshift, you can also create a custom user-defined function.
You can only use Soundex frequency statistics indicators on Snowflake with the Java engine.
Due to a limitation in the Teradata soundex implementation, you may not be able to drill down the results of profiling Teradata
with this indicator.
The following table shows the indicators that you can select in any database:
Soundex Frequency Table
Soundex Low Frequency Table
You will get an error message when using Talend Studio to profile the Teradata database using the Soundex Frequency Table
indicator because your input is invalid.
From the Profiling perspective of Talend Studio, try to profile a column in Teradata, first_name for example, using the Soundex
Frequency Table indicator. Run the column analysis with the SQL engine. The analysis runs successfully.
Try to drill down data on the result page: in the Frequency Statistics table in the Analysis Results view, right-click a row and select
View Rows. You will get an error in the SQL Editor about the generated SQL query.
This limitation is due to the Teradata soundex implementation. The Teradata database requires that a character string or
expression that contains a surname is evaluated in simple Latin characters.
A simple Latin character is one that does not have diacritical marks such as tilde (~) or acute accent (´). There are 26 uppercase
simple Latin characters and 26 lowercase simple Latin characters. Even a simple call to SOUNDEX('Sébastien') cannot be
executed on Teradata. Therefore, it is not possible to drill down into all rows that sound like 'Sébastien'.
Summary statistics
They perform statistical analyses on numeric data, including the computation of location measures such as the median and the
average, and the computation of statistical dispersions such as the inter quartile range and the range.
When using the summary statistics indicators to profile a DB2 database, analysis results could be slightly different between Java and
SQL engines. This is because indicators are computed differently depending on the database type, and because Talend uses special
functions when working with Java.
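The measures named above can be sketched in plain Java as follows. Several conventions exist for quartiles; this sketch takes the quartiles as the medians of the lower and upper halves of the sorted data, which is an assumption and may not match what either engine computes. Small differences between engines, as mentioned above, can come from exactly this kind of choice. The class name is hypothetical.

```java
import java.util.Arrays;

class SummaryStatsSketch {

    static double mean(double[] values) {
        return Arrays.stream(values).average().orElse(Double.NaN);
    }

    /** Median of the slice [from, to) of an already sorted array; the slice must not be empty. */
    static double medianOfSorted(double[] sorted, int from, int to) {
        int n = to - from;
        int mid = from + n / 2;
        return n % 2 == 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    static double median(double[] values) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        return medianOfSorted(sorted, 0, sorted.length);
    }

    /** Maximum minus minimum. */
    static double range(double[] values) {
        return Arrays.stream(values).max().getAsDouble() - Arrays.stream(values).min().getAsDouble();
    }

    /** Upper quartile minus lower quartile; needs at least two values. */
    static double interQuartileRange(double[] values) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int half = sorted.length / 2;
        double lower = medianOfSorted(sorted, 0, half);
        double upper = medianOfSorted(sorted, sorted.length - half, sorted.length);
        return upper - lower;
    }

    public static void main(String[] args) {
        double[] data = {1, 2, 3, 4, 5, 6, 7, 8}; // made-up values
        System.out.println(mean(data) + " " + median(data) + " " + range(data) + " " + interQuartileRange(data));
    }
}
```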
The following table shows the indicators that you can select in any database:
Mean
Median
Inter Quartile Range
Upper Quartile
Range
Minimum
Maximum
Text statistics
You can use the text statistics indicators to analyze columns only if their data mining type is set to nominal in the analysis editor.
Otherwise, these statistics are grayed out in the Indicator Selection dialog box. For further information on the available data mining
Text statistics analyze the characteristics of textual fields in the columns, including minimum, maximum and average length.
Minimal Length: computes the minimal length of a text field. It excludes null and blank values.
Maximal Length: computes the maximal length of a text field. It excludes null and blank values.
Average Length: computes the average length of a field. It excludes null and blank values.
Other text indicators are available to count each of the above indicators with null values, with blank values or with null and blank
values.
Null values are counted as data of 0 length, that is to say the minimal length of null values is 0. This means that the
Minimal Length With Null and the Maximal Length With Null compute the minimal/maximal length of a text field including null
values, which are considered to be 0-length text.
Blank values are counted as regular data of 1 length. Empty values are counted as data of 0 length, that is to say the minimal length
of blank values is 0. This means that the Minimal Length With Blank and the Maximal Length With Blank compute the
minimal/maximal length of a text field including blank values.
The same applies to all average indicators. Empty values are also counted as data of 0 length.
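The counting rules above can be sketched as follows. The sample column mirrors the example values given in this section, assuming that the underscores shown there stand for spaces. The sketch also assumes that a blank is an empty or white-space-only string and that a blank contributes its actual length when it is included; the Studio may count blanks differently. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class TextLengthSketch {

    /** Assumption: a blank is a non-null string that is empty or contains only white space. */
    static boolean isBlank(String s) {
        return s != null && s.trim().isEmpty();
    }

    /** Lengths taken into account, optionally including nulls (as 0) and blanks. */
    static List<Integer> lengths(String[] values, boolean withNull, boolean withBlank) {
        List<Integer> out = new ArrayList<>();
        for (String v : values) {
            if (v == null) {
                if (withNull) {
                    out.add(0);
                }
            } else if (isBlank(v)) {
                if (withBlank) {
                    out.add(v.length());
                }
            } else {
                out.add(v.length());
            }
        }
        return out;
    }

    static int min(List<Integer> lengths) {
        return Collections.min(lengths);
    }

    static int max(List<Integer> lengths) {
        return Collections.max(lengths);
    }

    static double average(List<Integer> lengths) {
        double sum = 0;
        for (int n : lengths) {
            sum += n;
        }
        return sum / lengths.size();
    }

    public static void main(String[] args) {
        String[] column = {"Brayan", "Ava", " ", "", null, "          "};
        System.out.println(min(lengths(column, false, false)));  // Minimal Length
        System.out.println(min(lengths(column, true, false)));   // Minimal Length With Null
        System.out.println(max(lengths(column, false, true)));   // Maximal Length With Blank
    }
}
```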
For example, compute the length of textual fields in a column containing the following values, using all different types of text
statistic indicators:
"Brayan" 6
"Ava" 3
"_" 1
"" 0
<null> <null>
"__________" 10
The following table shows the indicators that you can select in any database:
Data type: Number, Text, Date, Others
Analysis engine type (for each data type): Java, SQL
Minimal Length
Minimal Length With Null
Minimal Length With Blank
Minimal Length With Blank And Null
Maximal Length
Maximal Length With Null
Maximal Length With Blank
Maximal Length With Blank And Null
Average Length
Average Length With Null
Average Length With Blank
Average Length With Blank And Null
System indicators are predefined but editable indicators that you can use in analyses.
Although system indicators are predefined, you can open their editors to check or edit their settings and definitions in order to
adapt them to a specific database version or a specific need, for example. However, you cannot edit the name of a system
indicator.
Note: When you edit an indicator, you modify the indicator listed in the DQ Repository tree view. Make sure that your
modifications are suitable for all analyses that may be using the modified indicator.
Procedure
1. In the DQ Repository tree view, expand Libraries > Indicators, and then browse through the indicator lists to reach the
indicator you want to modify.
2. Right-click the indicator name and select Open from the contextual menu.
5. Click the save icon on top of the editor to save your changes.
If the indicator is simple enough to be used in all databases, select Default in the database list.
You can define system indicators and indicator parameters for columns of database tables that need to be analyzed or monitored.
For more information, see Setting indicators on columns and Setting options for system or user-defined indicators.
You can export system indicators to folders or archive files and import them again in the studio on the condition that the export and
import operations are done in compatible versions of the Studio. For further information, see Exporting data profiling items and
Importing data profiling items.
To avoid creating a system indicator from scratch, you can duplicate an existing one in the indicator list. Once the copy is created,
you can edit its metadata to obtain a new indicator and use it in data profiling analyses.
Procedure
2. Browse through the indicator lists to reach the indicator you want to duplicate, right-click its name and select Duplicate from
the contextual menu.
Results
The indicator is duplicated as a user-defined indicator and is displayed under the User Defined Indicators folder in the DQ Repository
tree view.
What to do next
The indicator category of the duplicated Fraud Detection and Soundex indicators is User Defined Frequency. Thus, two columns are
expected in the result set of the analysis that uses these duplicated indicators, whereas four columns are expected when using the
original system indicators.
To be able to use the duplicated Fraud Detection and Soundex indicators in data profiling analyses, you need to edit the indicator
definition or create new indicators. For more information on editing user-defined indicators, see Editing a user-defined indicator.
User-defined indicators, as their name indicates, are indicators created by the user. You can use these indicators to analyze
columns through a simple drag-and-drop operation from the DQ Repository tree view to the columns listed in the editor.
The management options available for user-defined indicators include: create, export and import, edit and duplicate.
You can create your own personalized indicators from the studio.
Note: Management processes for user-defined indicators are the same as those for system indicators.
Procedure
4. In the Name field, enter a name for the indicator you want to create.
Note: Avoid using the following special characters in the indicator name:
"~", "!", "`", "#", "^", "&", "*", "\\", "/", "?", ":", ";", "\"", ".", "(", ")", "'", "¥", "'", """, "«", "»", "<", ">".
These characters are all replaced with "_" in the file system and you may end up creating duplicate items.
5. Optional: Set other metadata (purpose, description and author name) in the corresponding fields and click Finish.
Results
The indicator editor opens displaying the metadata of the user-defined indicator.
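The note about special characters in the naming step explains why two different indicator names can collide on disk. The sketch below illustrates the effect with a subset of the listed characters; how the Studio actually performs the replacement is not described here, so the class and method names are hypothetical.

```java
class ItemNameSketch {

    // Subset of the special characters listed in the note: ~ ! ` # ^ & * \ / ? : ; " . ( ) ' < >
    private static final String SPECIAL = "~!`#^&*\\/?:;\".()'<>";

    /** Replaces every listed special character with an underscore. */
    static String toFileName(String itemName) {
        StringBuilder sb = new StringBuilder(itemName.length());
        for (char c : itemName.toCharArray()) {
            sb.append(SPECIAL.indexOf(c) >= 0 ? '_' : c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Two distinct names that map to the same file name.
        System.out.println(toFileName("max:length"));
        System.out.println(toFileName("max?length"));
    }
}
```

Both names above become max_length, which is the duplicate-item situation the note warns about.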
Procedure
1. Click Indicator Category and select from the list a category for the indicator.
The selected category determines the columns expected in the result set of the analysis that uses the user-defined indicator.
The table below explains available categories.
User Defined Match
    Evaluates the number of data matching a condition.
    The result set should have one row and two columns. The first column contains the number of values that match and
    the second column the total count.
User Defined Frequency
    Evaluates the frequency of records using user-defined indicators for each distinct record.
    The result set should have 0 or more rows and two columns. The first column contains a value and the second the
    frequency (count) of this value.
User Defined Real Value
    Evaluates real function of data.
    The result set should have one row and one column that contains a real value.
User Defined Count (by-default category)
    Analyzes the quantity of records and returns a row count.
    The result set should have one row and one column that contains the row count.
3. From the Database list, select a database on which to use the indicator.
If the indicator is simple enough to be used in all databases, select Default in the database list.
5. Define the SQL statement for the indicator you want to create:
a. Click the Edit... button next to the SQL Template field.
b. In the Indicator Definition view, enter the SQL expression(s) you want to use in matching and analyzing data. You can
drop templates from the templates list to complete the expression.
Example
Set the expression to measure the maximal length of the values in a column as shown in the above capture.
This view may have several input fields, one for each column expected by indicator category. For example, if you
select the User Defined Count category, you will have only a Where Expression field; while if you select the User
Defined Match category, you will have two fields: Matching Expression and Where Expression.
The SQL expressions are automatically transformed into a complete SQL template in the Full SQL Template view.
Also, the SQL expressions are automatically transformed into templates to view rows/values. Different tabs are
available in the dialog box depending on what indicator category is selected.
If you edit the SQL expression(s) in the Indicator Definition view, the templates will be updated accordingly in the
other tabs.
c. Use the Reset button to revert all templates according to the content of the Indicator Definition tab.
d. Click OK.
The dialog box is closed and the SQL template is displayed in the indicator editor.
e. Use the [+] button and follow the same steps to add as many indicator definitions as needed.
Note: You do not need to define any parameters in the Indicator Parameters view when the user-defined indicator
contains only SQL templates. These parameters are used only when indicators have Java implementation. For
further information, see Defining Java user-defined indicators.
Results
The indicator is listed under the User Defined Indicators folder in the DQ Repository tree view. You can use this indicator to analyze
columns through a simple drag-and-drop operation from the DQ Repository tree view to the columns listed in the editor.
If an analysis with a user-defined indicator runs successfully at least one time and later the indicator definition template for the
database is deleted, the analysis does not fail. It keeps running successfully because it uses the previously generated SQL query.
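The result-set shapes expected by the four indicator categories described earlier in this procedure can be summarized in a small Java sketch. It only encodes the row and column counts stated in the category table; the enum and its method are hypothetical and are not part of the Studio.

```java
class UdiCategorySketch {

    enum Category {
        USER_DEFINED_MATCH(2, true),      // one row: matching count, total count
        USER_DEFINED_FREQUENCY(2, false), // 0 or more rows: value, frequency
        USER_DEFINED_REAL_VALUE(1, true), // one row: a real value
        USER_DEFINED_COUNT(1, true);      // one row: the row count

        final int expectedColumns;
        final boolean singleRow;

        Category(int expectedColumns, boolean singleRow) {
            this.expectedColumns = expectedColumns;
            this.singleRow = singleRow;
        }

        /** True when a result set with the given dimensions fits this category. */
        boolean accepts(int rows, int columns) {
            if (columns != expectedColumns || rows < 0) {
                return false;
            }
            return !singleRow || rows == 1;
        }
    }

    public static void main(String[] args) {
        System.out.println(Category.USER_DEFINED_MATCH.accepts(1, 2));     // fits
        System.out.println(Category.USER_DEFINED_FREQUENCY.accepts(0, 2)); // an empty frequency table fits
        System.out.println(Category.USER_DEFINED_COUNT.accepts(3, 1));     // too many rows
    }
}
```

A check of this kind is a quick way to test a query by hand before attaching it to an indicator of a given category.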
You can create your own personalized Java indicators from Talend Studio.
Management processes for Java user-defined indicators are the same as those for system indicators.
You can also import a ready-to-use Java user-defined indicator from the Exchange folder in the DQ Repository tree view. This Java
user-defined indicator connects to the mail server and checks if the email exists.
Procedure
4. In the Name field, enter a name for the Java indicator you want to create.
Note: Avoid using the following special characters in the indicator name:
"~", "!", "`", "#", "^", "&", "*", "\\", "/", "?", ":", ";", "\"", ".", "(", ")", "'", "¥", "'", """, "«", "»", "<", ">".
These characters are all replaced with "_" in the file system and you may end up creating duplicate items.
5. If required, set other metadata (purpose, description and author name) in the corresponding fields and click Finish.
The indicator editor opens displaying the metadata of the user-defined indicator.
Procedure
1. Click Indicator Category and select from the list a category for the Java indicator.
The selected category determines the columns expected in the result set of the analysis that uses this indicator.
The table below explains available categories.
Note: Make sure that the class name includes the package path. If this string parameter is not correctly specified, an error
message is displayed when you try to save the Java user-defined indicator.
In the Select libraries view, select the check box of the archive holding the Java class and then select the class in the
bottom panel of the wizard.
Click OK.
The dialog box is closed and the Java archive is displayed in the indicator editor.
You can add or delete Java archives from the Manage Libraries view of this dialog box.
For more information on creating a Java archive, see Creating a Java archive for the user-defined indicator.
6. Click Indicator Parameters to open the view where you can define parameters to retrieve parameter values while coding the
Java indicator.
You can retrieve parameter values with code similar to this excerpt, which retrieves the parameter of EMAIL_PARAM:
    // Check prerequisite
    if (param == null) {
        log.error("No parameter set in the user defined indicator " + this.getName()); //$NON-NLS-1$
        return false;
    }
    if (indicatorValidDomain == null) {
        log.error("No parameter set in the user defined indicator " + this.getName()); //$NON-NLS-1$
        return false;
    }
    // p is one of the parameters defined on the indicator; the loop over them is not shown in this excerpt.
    if (EMAIL_PARAM.equalsIgnoreCase(p.getKey())) {
        // use the value of this parameter here (body not shown in this excerpt)
    }
For a more detailed sample of the use of parameters in a Java user-defined indicator, check
https://github.com/Talend/tdq-studio-se/tree/master/sample/test.myudi/src/main/java/org/talend/dataquality/indicator/userdefine/email.
7. Click the [+] button at the bottom of the table and define in the new line the parameter key and value.
You can edit these default parameters or even add new parameters any time you use the indicator in a column analysis. To
do this, click the indicator option icon in the analysis editor to open a dialog box where you can edit the default parameters
according to your needs or add new parameters.
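The key/value lookup used for indicator parameters can be tried outside the Studio with a plain map. The sketch below is a simplified stand-in for the Studio's parameter objects: it matches keys without regard to case, as the excerpt above does with equalsIgnoreCase, and falls back to a default when the key is absent. The class name, the key strings and the default value are assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class UdiParametersSketch {

    static final String EMAIL_PARAM = "EMAIL"; // assumed key name

    /** Returns the value whose key matches ignoring case, or defaultValue when there is none. */
    static String find(Map<String, String> parameters, String key, String defaultValue) {
        for (Map.Entry<String, String> p : parameters.entrySet()) {
            if (key.equalsIgnoreCase(p.getKey())) {
                return p.getValue();
            }
        }
        return defaultValue;
    }

    public static void main(String[] args) {
        Map<String, String> parameters = new LinkedHashMap<>();
        parameters.put("email", "sender@example.com"); // made-up value
        System.out.println(find(parameters, EMAIL_PARAM, null));
        System.out.println(find(parameters, "BUFFER SIZE", "100"));
    }
}
```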
Before creating a Java archive for the user defined indicator, you must define, in Eclipse, the target platform against which the
workspace plugins will be compiled and tested.
Procedure
b. Expand Plug-in Development and select Target Platform then click Add... to open a view where you can create the
target definition.
c. Select the Nothing: Start with an empty target definition option and then click Next to proceed to the next step.
d. In the Name field, enter a name for the new target definition and then click the Add... button to proceed to the next
step.
e. Select Installation from the Add Content list and then click Next to proceed to the next step.
f. Use the Browse... button to set the path of the installation directory and then click Next to proceed to the next step.
In this Java project, you can find four Java classes that correspond to the four indicator categories listed in the
Indicator Category view in the indicator editor.
Each one of these Java classes extends the UserDefIndicatorImpl indicator. The figure below illustrates an example
using the MyAvgLength Java class.
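package test.udi;

import org.talend.dataquality.indicators.sql.impl.UserDefIndicatorImpl;

/**
 * @author mzhao
 * A very simple example of a java implementation of a user defined indicator. This indicator returns a user defined
 */
// The class declaration, the field and the method signatures below are reconstructed; this excerpt lost them.
public class MyAvgLength extends UserDefIndicatorImpl {

    // Sum of the lengths handled so far.
    private double length = 0;

    @Override
    public boolean reset() {
        super.reset();
        length = 0;
        return true;
    }

    @Override
    public boolean handle(Object data) {
        super.handle(data);
        // an indicator which computes the average text length on data which are more than 2 characters (this means that
        // text values with less than 2 characters are not taken into account).
        // dataLength is not defined in this excerpt; the next line is an assumed reconstruction.
        int dataLength = (data != null) ? data.toString().length() : 0;
        if (dataLength > 2) {
            length += dataLength;
        }
        return true;
    }

    /*
     * (non-Javadoc)
     * @see org.talend.dataquality.indicators.impl.IndicatorImpl#finalizeComputation()
     */
    @Override
    public boolean finalizeComputation() {
        return super.finalizeComputation();
    }
}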
b. Modify the code of the methods that follow each @Override according to your needs.
c. Optional: If required, use the following methods in your code to retrieve the indicator parameters:
Method Description
Results
The Java archive is now ready to be attached to any Java indicator you want to create from the Profiling perspective of Talend Studio.
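Before packaging the archive, the logic of an indicator such as MyAvgLength can be tried in isolation. The stand-alone sketch below imitates the reset, handle and final computation steps without depending on any Studio class. It is an approximation of the lifecycle, with a hypothetical class name, and it adds a counter so that an average can actually be returned.

```java
class AvgLengthLifecycleSketch {

    private double length = 0; // sum of the lengths taken into account
    private long count = 0;    // number of values taken into account

    /** Clears the state before a new run. */
    boolean reset() {
        length = 0;
        count = 0;
        return true;
    }

    /** Called once per value; only values longer than 2 characters are taken into account. */
    boolean handle(Object data) {
        int dataLength = (data == null) ? 0 : data.toString().length();
        if (dataLength > 2) {
            length += dataLength;
            count++;
        }
        return true;
    }

    /** Average length of the values taken into account, or 0 when there are none. */
    double average() {
        return count == 0 ? 0.0 : length / count;
    }

    /** Runs the whole lifecycle on a column of values. */
    static double run(Object[] column) {
        AvgLengthLifecycleSketch indicator = new AvgLengthLifecycleSketch();
        indicator.reset();
        for (Object value : column) {
            indicator.handle(value);
        }
        return indicator.average();
    }

    public static void main(String[] args) {
        System.out.println(run(new Object[] {"Brayan", "Ava", "ab", null}));
    }
}
```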
You can export user-defined indicators to archive files or to Talend Exchange to be shared with other users.
You can export user-defined indicators and store them locally in an archive file using the Export Item option on the studio toolbar.
For further information on how to export indicators, see Exporting data profiling items.
You can export user-defined indicators from your current Talend Studio to Talend Exchange where you can share them with other
users.
The exported indicators are saved as .xmi files on the exchange server.
At least one user-defined indicator is created in the Profiling perspective of Talend Studio.
Procedure
2. Right-click the User Defined Indicator folder and select Export for Talend Exchange.
4. Select the check boxes of the indicators you want to export to the specified folder.
Select the Browse check box to list only the indicators you want to export.
5. Click Finish.
The .xmi file of each selected indicator is saved as a zip file in the defined folder.
6. Upload the zip files to Talend Exchange at https://exchange.talend.com/. Please create an account on the Talend Exchange
server if you do not have one already.
You can import user-defined indicators from a local archive file or from Talend Exchange into your studio and use them on your
column analyses.
Using the Import Item option on the studio toolbar, you can import user-defined indicators stored locally in an archive file and use
them on your column analyses. For further information on how to import indicators, see Importing data profiling items.
A warning icon next to the name of the imported indicator in the tree view identifies that it is not correct. You must open the
indicator and try to figure out what is wrong.
Deprecated
This section describes a deprecated feature that is still available for use to provide backward compatibility.
You can import indicators stored locally in a csv file to use them on your column analyses.
Procedure
Option                               To...
skip existing indicators             import only the indicators that do not exist in the corresponding lists in the
                                     DQ Repository tree view. A warning message is displayed if the imported indicators
                                     already exist under the Indicators folder.
rename new indicators with suffix    identify each of the imported indicators with a suffix. All indicators will be
                                     imported even if they already exist under the Indicators folder.
5. Click Finish.
All imported indicators are listed under the User Defined Indicators folder in the DQ Repository tree view.
Results
Note: A warning icon next to the name of the imported user-defined indicator in the tree view identifies that it is not correct.
You must open the indicator and try to figure out what is wrong.
You can import the .xmi files of user-defined indicators created by other users and stored on the Talend Exchange server into your
current Talend Studio and use them on your column analyses.
You can import only versions that are compatible with the version of your current Talend Studio.
The indicators you can import from Talend Exchange include for example:
Order of Magnitude: It computes the number of digits between the minimal value and maximal value of a numerical column.
Email validation via mail server: This Java user-defined indicator connects to a mail server and checks if the email exists.
The procedure below shows how to import the Email validation via mail server indicator from the Talend Exchange server into Talend
Studio.
Procedure
4. Select the Overwrite existing items check box if some error and warning messages are listed in the Error and Warning area.
This means that an indicator with the same name already exists in the current studio. The imported indicator will replace the
one in the Talend Studio.
5. Click Finish.
The user-defined indicator is imported from Talend Exchange and listed under the User Defined Indicators folder in the DQ
Repository tree view.
Before you can use this indicator on column analyses to check emails by sending an SMTP request to the mail server, you must
define the following indicator parameters:
a buffer size that gives the number of invalid email addresses stored in memory before they are saved in a file.
a file path to the list of invalid email addresses.
the email of the sender.
7. Modify the values of the BUFFER SIZE and INVALID DATA FILE according to your needs.
8. In the Parameters Value column, set the value for the EMAIL parameter, that is the address of the sender on the SMTP server.
9. Save the indicator.
Note: If you get an error message when running a column analysis with this indicator, check your email server
configuration.
Depending on the data types used, you can select some user-defined indicators.
The following table shows the indicators that you can select in any database:
User Defined Real Value * * * *
User Defined Match * * * *
User Defined Count * * * *
User Defined Frequency * * * *
You can open the editor of any system or user-defined indicator to check its settings and/or edit its definition and metadata in order
to adapt it to a specific database type or need, if required.
Note: When you edit an indicator, make sure that your modifications are suitable for all analyses that may be using the modified
indicator.
At least one user-defined indicator is created in the Profiling perspective of Talend Studio.
Procedure
1. In the DQ Repository tree view, expand Libraries > Indicators > User Defined Indicators, and then browse through the
indicator lists to reach the indicator whose definition you want to modify.
2. Right-click the indicator name and select Open from the contextual menu.
4. Click Indicator Category to open the view and select an indicator category from the list:
User Defined Match (default category): Uses user-defined indicators to evaluate the number of data records that match a regular expression or an SQL pattern. The analysis results show the record matching count and the record total count.
User Defined Frequency: Uses user-defined indicators to evaluate, for each distinct data record, the frequency of the records that match a regular expression or an SQL pattern. The analysis results show the distinct count giving a label and a label-related count.
User Defined Real Value: Uses user-defined indicators that return a real value to evaluate any real function of the data.
User Defined Count: Uses user-defined indicators that return a row count.
If you change the indicator category, you can keep the existing indicator definition or remove the indicator definitions for all
databases but the one for Java.
5. Click Indicator Definition to open the view. Set the indicator definition for one or more databases using the [+] button.
If the indicator is simple enough to be used in all databases, select Default in the list.
7. To add new default parameters, click the [+] button at the bottom of the table, click in the lines and define the parameter
keys and values.
8. Click the save icon on top of the editor to save your changes.
Indicator parameters
This section describes indicator parameters displayed in the different Indicators Settings dialog boxes.
Bins Designer
Blank Options
Aggregate nulls with blanks: When selected, null data is counted as zero length text field. This means that null data is treated as an empty string. When not selected, null data is treated as any other text data.
Aggregate blanks: When selected, blank texts (e.g. " ") are all grouped together and considered as an empty string. When not selected, blank texts are treated as any other text data.
Note: In Oracle, empty strings and null strings are the same objects. Therefore, you must select or clear both check boxes in order to get consistent results.
Data Thresholds
Upper threshold: Data greater than this value should not exist.
Indicator Thresholds
Lower threshold(%): Lower threshold of matching indicator values in percentage relative to the total row count.
Upper threshold(%): Higher threshold of matching indicator values in percentage relative to the total row count.
Expected value: Only for the Mode indicator in the Advanced Statistics. Most probable value that should exist in the selected column.
Java Options
Replacement characters: List of the characters that will take the place of the replaced characters.
Note: Each character of the first field will be replaced by the character at the same position from the second field. For example, with the values "abc0123ABC,;.:" in the first field and "aaa9999AAApppp" in the second field, any "a", "b" or "c" will be replaced by "a" and any "0", "1", "2" or "3" will be replaced by "9".
Phone number
Text Parameters
Text Length
Count nulls: When selected, null data is counted as zero length text field.
Count blanks: When selected, blank texts (e.g. " ") are counted as zero length text fields.
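The position-based substitution described for the Replacement characters parameter in the Java Options can be illustrated with a short Python sketch. It only mirrors the mapping rule stated in the note; the helper name replace_characters is hypothetical and this is not the code that Talend Studio runs.

```python
def replace_characters(text, characters_to_replace, replacement_characters):
    """Replace each character of the first field with the character at the
    same position in the second field (hypothetical helper, for illustration)."""
    if len(characters_to_replace) != len(replacement_characters):
        raise ValueError("both fields must have the same length")
    # Build a one-to-one translation table from the two fields.
    table = str.maketrans(characters_to_replace, replacement_characters)
    return text.translate(table)


# Field values taken from the note in the Java Options.
first_field = "abc0123ABC,;.:"
second_field = "aaa9999AAApppp"

# Characters that are not listed in the first field are left unchanged.
print(replace_characters("cab-2031", first_field, second_field))
```

Both fields need the same length for the positions to line up, which is why the sketch rejects fields of different lengths.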
When profiling a column of a Date type in an Oracle database using Pattern Frequency Statistics, the result of the column analysis is
99-AAA-99 with the SQL engine, and not 9999-99-99 99:99:99.9 as expected. If you run the analysis with the Java engine, you will
get 9999-99-99 99:99:99.9 .
In Oracle, dates are stored as numbers. Talend uses the Cast function with Pattern Frequency Statistics. When you run the analysis
with the SQL engine, the query casts the Date type to a text type and the original date format is changed from 9999-99-99 99:99:99
to 99-AAA-99 .
As all dates are date objects in the Date column, dates will always be displayed using one single format in Oracle ( 99-AAA-99 ) and
using another single format such as 9999-99-99 99:99:99.9 in non-Oracle and non-SQL cases. This is why no data quality issue can
be found using this indicator. It is therefore not advisable to use Pattern Frequency Statistics on a Date type column in databases.
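To see why a Date column yields a single pattern, the following Python sketch approximates how a value is turned into a pattern. It assumes that digits become 9, uppercase letters become A and lowercase letters become a; the function name to_pattern and the sample date strings are hypothetical.

```python
from collections import Counter


def to_pattern(value):
    """Approximate a frequency pattern: digits -> 9, uppercase -> A,
    lowercase -> a, any other character kept as-is (assumed rule)."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isupper():
            out.append("A")
        elif ch.islower():
            out.append("a")
        else:
            out.append(ch)
    return "".join(out)


# Hypothetical text forms of the same kind of date values.
cast_by_oracle = ["01-JAN-22", "15-MAR-21", "30-NOV-20"]
read_by_java = ["2022-01-01 00:00:00.0", "2021-03-15 00:00:00.0"]

# Every value of a given rendering collapses to one pattern.
print(Counter(to_pattern(v) for v in cast_by_oracle))
print(Counter(to_pattern(v) for v in read_by_java))
```

Because every value in each list maps to the same pattern, the frequency table has a single row and cannot reveal badly formatted dates.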
From Talend Studio, you can query and browse a selected database using the SQL Editor and then store these SQL queries under
the Source Files folder in the DQ Repository tree view.
You can then open the SQL Editor on any of these stored queries to rename, edit or execute the query.
Procedure
2. Right-click Source Files and select New SQL File from the contextual menu. The New SQL File dialog box is displayed.
3. In the Name field, enter a name for the SQL query you want to create and then click Finish to proceed to the next step.
The SQL Editor opens on the new SQL query.
If the Connections view is not open, select Window > Show View > Data Explorer > Connections to open it.
5. From the Choose Connection list, select the database you want to run the query on.
6. On the SQL Editor toolbar, click the run icon to execute the query on the defined base table(s).
Data rows are retrieved from the defined base table(s) and displayed in the editor.
A file for the new SQL query is listed under Source Files in the DQ Repository tree view.
Option: To...
Rename SQL File: open a dialog box where you can edit the name of the query file
Open in Data Explorer: open in the data explorer the SQL editor on the selected query file
Note:
When you open a query file in the SQL Editor, make sure to select the database connection from the Choose
Connection list before executing the query. Otherwise the run icon on the editor toolbar will be unavailable.
When you create or modify a query in a query file in the SQL Editor and try to close the editor, you will be
prompted to save the modifications. The modifications will not be taken into account unless you click the save
icon on the editor toolbar.
You can use context variables in the analysis editor to filter the data on which to run the analysis, or to decide the number of
concurrent connections to the database allowed by the analysis.
The Profiling perspective of Talend Studio allows you to define multiple contexts for profiling analyses and select specific context
values with which to execute an analysis. You can define context parameters in the analysis editor to:
filter data using context variables with WHERE clauses and decide the data or tables/columns on which to run the analysis,
decide the number of concurrent connections allowed by the analysis.
You usually set this number according to the resources available to the database, that is the number of concurrent connections
each database can support.
Note: You can use context variables only when you profile databases. Context variables cannot be used when profiling files.
You can create one or several contexts for a database analysis and select specific context values with which to execute the analysis.
Defining context variables in analysis will enable you to run the analysis on different data using the same connection.
Procedure
2. Select Window > Show View > Profiling > Contexts to open the context view in the Profiling perspective.
3. Select the default context and click Edit to rename it, Prod in this example. Click OK.
Procedure
1. Click the [+] button at the bottom of the Contexts view to add lines in the table.
Suppose that you want to profile a postal_code column and analyze the postal codes that start with 15 in the
development environment and with 75 in the production environment, and that you want to allow a different number of
concurrent connections per analysis in the two environments.
2. Click in the Name field and enter the name of the variable you are creating.
Name the first variable where_clause for this example.
3. Repeat the above steps to define all the variables for the different contexts.
In this example, set the value of the where_clause variable in the Test context to postal_code like '15%' , and in the
Prod context to postal_code like '75%' . Then set the number of concurrent connections per analysis in the
development and production environments to three and five respectively.
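The effect of these context values can be sketched in Python as a simple substitution of context.<variable> references. This is only an illustration of the idea: the connections variable name, the address table and the resolve helper are hypothetical, and the studio's own substitution mechanism may differ.

```python
# Hypothetical context definitions mirroring the example values.
CONTEXTS = {
    "Test": {"where_clause": "postal_code like '15%'", "connections": 3},
    "Prod": {"where_clause": "postal_code like '75%'", "connections": 5},
}


def resolve(text, context_name):
    """Replace each context.<name> reference with its value in the chosen context."""
    for name, value in CONTEXTS[context_name].items():
        text = text.replace("context." + name, str(value))
    return text


# Hypothetical query; the table name is an assumption.
query = "SELECT postal_code FROM address WHERE context.where_clause"

print(resolve(query, "Test"))
print(resolve(query, "Prod"))
print(resolve("context.connections", "Prod"))
```

The same analysis definition therefore targets different rows, and uses a different connection limit, depending only on the context selected at run time.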
At least one context has been created in the analysis editor. For further information, see Creating one or multiple contexts for the
same analysis.
Procedure
enter context. followed by the name of the context variable you want to use to run the analysis, or
press Ctrl+Space to open the list of context variables created in the Profiling perspective and select a variable
of your choice.
This list will be empty until you define context variables in the Contexts view in the analysis editor.
3. In the Analysis Parameters view and in the Number of connection per analysis field, either:
enter context. followed by the name of the context variable you want to use to run the analysis, or
press Ctrl+Space to open the list of context variables.
This list will be empty until you define context variables in the Contexts view in the analysis editor.
For further information about defining context variables, see Creating one or multiple contexts for the same analysis.
4. In the Context Group Settings view, select from the list the context environment you want to use to run the analysis, Prod
in this example.
All context environments you create in the Contexts view in the analysis editor will be accessible from this list.
The SQL Editor opens, listing only the data rows where the postal code starts with 75.
You can import data profiling items including analyses, database connections, patterns and indicators, etc. into your current Talend
Studio from various projects or different versions of Talend Studio.
You cannot import an item without all its dependencies. When you try to import an analysis for example, all its dependencies such
as a metadata connection and the patterns and indicators used in this analysis will be selected by default and imported with the
analysis.
Warning: You cannot import into your current Talend Studio data profiling items created in versions older than 4.0.0. To use such
items in your current Talend Studio, you must carry out an upgrade operation. For more information, see Upgrading project
items from older versions.
You have access to the root directory of another studio version in which data profiling items have been created.
Procedure
1. Right-click anywhere in the DQ Repository tree view and select Import Items.
All editors which are open in the studio are automatically closed. The Import Item wizard is displayed.
2. Select the root directory or the archive file option according to whether the data profiling items you want to import are in the
workspace file within the studio directory or are already exported into a zip file.
If you select the root directory option, click Browse and set the path to the project folder containing the items to be
imported within the workspace file of the Talend Studio directory.
All items and their dependencies that do not exist in your current Talend Studio are selected by default in the dialog
box.
If you select the archive file option, click Browse and set the path to the archive file that holds the data profiling items
you want to import.
All items and their dependencies that do not exist in your current Talend Studio are selected by default in the dialog
box.
3. Select the Overwrite existing items check box if some error and warning messages are listed in the Error and Warning area.
This means that items with the same names already exist in the current studio.
4. Select or clear the check boxes of the data profiling items you want or do not want to import according to your needs.
All dependencies for the selected item are selected by default. When you clear the check box of an item, the check boxes of
the dependencies of this item are automatically cleared as well. Also, an error message will display on top of the dialog box if
you clear the check box of any of the dependencies of the selected item.
You can export data profiling items including analyses, database connections, patterns and indicators, etc. from the current instance
of the studio to the root directory of another studio or to archive files.
At least one data profiling item has been created in Talend Studio.
Procedure
1. Complete one of the following steps:
Right-click anywhere in the DQ Repository tree view and select Export Items; or
2. Select the root directory or archive file option and then click Browse... and browse to the directory/archive where you want
to export the data profiling items.
3. Select the check boxes of the data profiling items you want to export or use the Select All or Deselect All tabs.
When you select and export items, specific behaviors are applied:
Selecting an analysis check box: All analysis dependencies including the metadata connection, the contexts and any patterns or indicators used in this analysis are selected by default. If you have an error message on top of the dialog box that indicates any missing dependencies, click the Include dependencies button. It automatically selects the check boxes of all items necessary to the selected data profiling analysis.
Exporting analyses from a reference project: All their items will be exported.
Selecting items from a main project: Items from the reference projects are available. Select them to export them. For example, when an analysis from a main project depends on a connection in the reference project, the connection is not selected in the Export item wizard. You must select it. Clicking the Include dependencies button has no effect in this case.
For further information about reference projects, see Working with referenced projects.
4. If required, select the Browse check box to have in the export list only the selected data profiling items.
The procedure below concerns only the migration of data profiling items from versions older than 4.0.0. To migrate your data
profiling items from version 4.0.0 onward, you simply need to import them into your current Talend Studio.
Procedure
1. From the folder of the old version of Talend Studio, copy the workspace file and paste it in the folder of your current Talend
Studio. Confirm that you want to replace the current workspace file with the old file.
Results
The upgrade operation is completed once Talend Studio is completely launched, and you should have access to all your data
profiling items.
Regarding system indicators during migration, please pay attention to the following:
When you upgrade the repository items to version 4.2 from a prior version, the migration process overwrites any changes you
made to the system indicators.
When you upgrade the repository items from version 4.2 to version 5.0 onward, you do not lose any changes you made to the
system indicators; the changes are merged.
Tasks
Working with tasks
In the studio, you can add tasks to different items, display the task list and delete any completed task from the task list.
You can add tasks:
in the DQ Repository tree view, on connections, catalogs, schemas, tables, columns and created analyses, or
directly in the current analysis editor, on columns or on patterns and indicators set on columns.
For example, you can add a general task to any item in a database connection via the Metadata node in the DQ Repository tree view.
You can add a more specific task to the same item defined in the context of an analysis through the Analyses node. And finally, you
can add a task to a column in an analysis context (also to a pattern or an indicator set on this column) directly in the current analysis
editor.
The procedure to add a task to any of these items is exactly the same. Tasks added to such items are listed in the Tasks list,
which you can open through Window > Show view.... Later, you can open the editor corresponding to the relevant item by
double-clicking the appropriate task in the Tasks list.
Procedure
2. Navigate to the column you want to add a task to, account_id in this example.
3. Right-click the account_id and select Add task... from the contextual menu.
The Properties dialog box opens showing the metadata of the selected column.
4. In the Description field, enter a short description for the task you want to carry out on the selected item.
5. In the Priority list, select the priority level and then click OK to close the dialog box.
The created task is added to the Tasks list. For more information on how to access the task list, see Displaying the task list.
double-click a task to open the editor where this task has been set.
select the task check box once the task is completed in order to be able to delete it.
filter the task view according to your needs using the options in a menu accessible through the drop-down arrow on
the top-right corner of the Tasks view. For further information about filtering the task list, see Filtering the task list.
The procedure below gives an example of adding a task to a column in an analysis. You can follow the same steps to add tasks to
other elements in the created analyses.
The analysis has been created in the Profiling perspective of Talend Studio.
Procedure
2. Expand an analysis and navigate to the item you want to add a task to, the account_id column in this example.
3. Right-click account_id and select Add task... from the contextual menu.
4. Follow the steps outlined in Adding a task to a column in a database connection to add a task to account_id in the selected
analysis.
For more information on how to access the task list, see Displaying the task list.
In the open analysis editor, you can add a task to the indicators set on columns. This task can be used, for example, as a reminder to
modify the indicator or to flag a problem that needs to be solved later.
A column analysis is open in the analysis editor in the Profiling perspective of Talend Studio.
At least one indicator is set for the columns to be analyzed.
Procedure
1. In the Analyzed Columns list, right-click the indicator name and select Add task... from the contextual menu.
The Properties dialog box opens showing the metadata of the selected indicator.
2. In the Description field, enter a short description for the task you want to attach to the selected indicator.
3. In the Priority list, select the priority level and then click OK to close the dialog box. The created task is added to the Tasks
list.
For more information on how to access the task list, see Displaying the task list.
Adding tasks to items will list these tasks in the Tasks list.
Procedure
1. On the menu bar of Talend Studio, select Window > Show view....
The Show View dialog box is displayed.
2. Start typing task in the filter field and then select Tasks from the list.
3. Click OK.
The Tasks view opens in the Profiling perspective of Talend Studio listing the added task(s).
4. If required, double-click any task in the Tasks list to open the editor corresponding to the item to which the task is attached.
In the Profiling perspective of Talend Studio, the Tasks view lists all the tasks you create in Talend Studio.
You can create filters to decide what to list in the task view.
Procedure
2. Click the drop-down arrow in the top right corner of the view, and then select Configure contents....
The Configure contents... dialog box is displayed showing the default configuration.
3. Click New to open a dialog box and then enter a name for the new filter.
5. Set the different options for the new filter as follows:
a. From the Scope list, select a filter scope option, and then click Select... to open a dialog box where you can select a
working set for your filter.
b. Select whether you want to display completed or not completed tasks or both of them.
c. Select to display tasks according to their priority or according to the text they have.
d. Finally, select the check boxes of the task types you want to list.
When a task goal is met, you can delete this task from the Tasks list after labeling it as completed.
At least one task is added to an item in the Profiling perspective of Talend Studio.
Procedure
1. Follow the steps outlined in Displaying the task list to access the Tasks list.
2. Select the check boxes next to each of the tasks and right-click anywhere in the list.
3. From the contextual menu, select Delete Completed Tasks. A confirmation message is displayed to validate the operation.
Appendices
Regular expressions
Using regular expressions on SQL Server
Main concept
The regular expression function is not built into all database environments. This is why, when using some
databases, you need to create a User-Defined Function (UDF) to extend the functionality of the database server.
For example, the following databases natively support regular expressions: MySQL, PostgreSQL, Oracle 10g, Ingres, etc., while
Microsoft SQL Server does not.
After you create the regular expression function, you should use the studio to declare that function in a specific database before
being able to use regular expressions on analyzed columns.
Prerequisite(s): You should have Visual Studio 2008, which is the version used for the example in this documentation. The Visual
Studio main window is open.
To create a regular expression function in SQL Server, follow the steps outlined in the sections below.
Procedure
1. On the menu bar, select File > New > Project to open the New Project window.
2. In the Project types tree view, expand Visual C# and select Database.
3. In the Templates area to the right, select SQL Server Project and then enter a name in the Name field for the project you want
to create, UDF function in this example.
5. From the Available References list, select the database in which you want to create the project and then click OK to close the
dialog box.
Note: If the database you want to create the project in is not listed, you can add it to the Available References list through
the Add New Reference tab.
The project is created and listed in the Data Explorer panel to the right of the Visual Studio main window.
Procedure
1. In the project list in the Solution Explorer panel, expand the node of the project you created and right-click the Test Scripts
node.
3. From the Templates list, select Class and then in the Name field, enter a name for the user-defined function you want to add
to the project, RegExMatch in this example.
The added function is listed under the created project node in the Solution Explorer panel to the right.
4. Click Add to validate your changes and close the dialog box.
5. In the code space to the left, enter the instructions corresponding to the regular expression function you already added to
the created project.
Below is the code for the regular expression function we use in this example.
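using System;
using Microsoft.SqlServer.Server;
using System.Text.RegularExpressions;
// The class and method declarations were lost in this export; the class name
// and the pattern parameter name below are assumptions.
public partial class RegExBase
{
    [SqlFunction(IsDeterministic = true, IsPrecise = true)]
    public static int RegExMatch(string matchString, string pattern)
    {
        Regex r1 = new Regex(pattern.TrimEnd(null));
        // Return 1 when the trimmed input matches the pattern, 0 otherwise.
        if (r1.Match(matchString.TrimEnd(null)).Success)
            return 1;
        else
            return 0;
    }
};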
6. Press Ctrl+S to save your changes and then on the menu bar, click Build and in the contextual menu select the corresponding
item to build the project you created, Build UDF function in this example.
The lower pane of the window displays a message indicating whether the build operation was successful.
7. On the menu bar, click Build and in the contextual menu select the corresponding item to deploy the project you created,
Deploy UDF function in this example.
The lower pane of the window displays a message indicating whether the deploy operation was successful.
If required:
a. launch SQL Server and check if the created function exists in the function list,
b. check if the function works well. For more information, see Testing the created function via the SQL Server editor.
Before being able to use regular expressions on analyzed columns in a database, you must first declare the created regular
expression function, RegExMatch in this example, in the specified database via the studio. To do that:
Procedure
3. Double-click Regular Expression Matching, or right-click it and select Open from the contextual menu.
The corresponding view displays the indicator metadata and its definition.
You now need to add to the list of databases the database for which you want to define a query template. This query
template will compute the regular expression matching.
4. Click the [+] button at the bottom of the Indicator Definition view to add a field for the new template.
5. In the new field, click the arrow and select the database for which you want to define the template, Microsoft SQL Server.
8. Paste the indicator definition (template) in the Expression box and then modify the text after WHEN in order to adapt the
template to the selected database.
9. Click OK to proceed to the next step. The new template is displayed in the field.
10. Click the save icon on top of the editor to save your changes.
Results
For more detailed information on how to declare a regular expression function in the studio, see Defining a query template for a
specific database and Declaring a User-Defined Function in a specific database.
Procedure
FirstName nvarchar(30),
LastName nvarchar(30),
(dbo.RegExMatch('[a-zA-Z0-9_\-]+@([a-zA-Z0-9_\-]+\.)+(com|org|edu|nz)',
EmailAddress)=1),
(dbo.RegExMatch('\([1-9][0-9][0-9]\) [0-9][0-9][0-9]\-[0-9][0-9][0-9][0-9]',
UsPhoneNo)=1))
([FirstName]
, [LastName]
, [EmailAddress]
, [USPhoneNo])
VALUES
('Hallam'
, 'Amine'
, '0129-2090-1092')
, ( 'encoremoi'
, 'nimportequoi'
, '(122) 190-9090')
GO
2. To search for the rows that match the expression, use the following code:
SELECT [FirstName]
, [LastName]
, [EmailAddress]
, [USPhoneNo]
FROM [talend].[dbo].[Contacts]
where [talend].[dbo].RegExMatch([EmailAddress],
'[a-zA-Z0-9_\-]+@([a-zA-Z0-9_\-]+\.)+(com|org|edu|nz|au)')
= 1
3. To search for the rows that do not match the expression, use the following code:
SELECT [FirstName]
, [LastName]
, [EmailAddress]
, [USPhoneNo]
FROM [talend].[dbo].[Contacts]
where [talend].[dbo].RegExMatch([EmailAddress],
'[a-zA-Z0-9_\-]+@([a-zA-Z0-9_\-]+\.)+(com|org|edu|nz|au)')
= 0
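The two queries above split the rows into those that match the pattern (= 1) and those that do not (= 0). As a rough way to predict which values fall on which side before running the queries, the Python sketch below applies the same patterns with the standard re module. It assumes the UDF reports a match when the pattern is found anywhere in the value; regex_match is a hypothetical stand-in for RegExMatch, and the e-mail values are invented for illustration because this section does not show the ones that were inserted. The phone numbers are the ones from the INSERT statement.

```python
import re

# Patterns copied from the procedure above.
EMAIL_PATTERN = r'[a-zA-Z0-9_\-]+@([a-zA-Z0-9_\-]+\.)+(com|org|edu|nz|au)'
PHONE_PATTERN = r'\([1-9][0-9][0-9]\) [0-9][0-9][0-9]\-[0-9][0-9][0-9][0-9]'


def regex_match(value: str, pattern: str) -> int:
    """Hypothetical stand-in for RegExMatch.

    Returns 1 if the pattern is found anywhere in the value, else 0.
    The unanchored-search behaviour is an assumption about the UDF.
    """
    return 1 if re.search(pattern, value) else 0


# Phone numbers taken from the INSERT statement above.
phones = ['0129-2090-1092', '(122) 190-9090']
phone_results = {p: regex_match(p, PHONE_PATTERN) for p in phones}

# Hypothetical e-mail values, invented for illustration only.
emails = ['someone@example.com', 'not-an-email']
matching = [e for e in emails if regex_match(e, EMAIL_PATTERN) == 1]
non_matching = [e for e in emails if regex_match(e, EMAIL_PATTERN) == 0]

print(phone_results)
print(matching)
print(non_matching)
```

Under these assumptions, only the second phone number satisfies the US phone pattern, which suggests that a CHECK constraint built on the same pattern would reject a row containing the first one.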
To identify incorrect data, you may want to use Pattern indicators. These indicators rely on regular expressions. In Teradata, a
regular expression function is installed by default only from version 14 onward.
To use the regular expression function in a Teradata version prior to 14, you need to install a User Defined Function (UDF).
You must create a Teradata user with CREATE FUNCTION and EXECUTE FUNCTION access rights.
Procedure
3. Grant the user you created at least the CREATE FUNCTION and EXECUTE FUNCTION access rights.
You can create a User Defined Function (UDF) from a C program and install it on the Teradata database in order to use
regular expressions.
Retrieve the Regex_INSTR.c file from the Downloads tab in the left panel of this page.
Procedure
4. Create the UDF. For example, you can use the following command:
RETURNS INTEGER
LANGUAGE C
NO SQL
select Regex_INSTR('A','[A-Z]');
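The test query above calls Regex_INSTR with a source string and a pattern. This section does not state what the function returns, so the Python sketch below is only an illustration under one assumption: that it follows the usual INSTR convention of returning the 1-based position of the first match, or 0 when there is no match. The name regex_instr is hypothetical and is not the code of the Regex_INSTR.c file.

```python
import re


def regex_instr(source: str, pattern: str) -> int:
    """Illustrative INSTR-style lookup (assumed convention).

    Returns the 1-based position of the first match of pattern
    in source, or 0 if the pattern does not match anywhere.
    """
    found = re.search(pattern, source)
    return found.start() + 1 if found else 0


# Same arguments as the test query above.
print(regex_instr('A', '[A-Z]'))

# A lowercase letter is outside the [A-Z] class, so nothing matches.
print(regex_instr('a', '[A-Z]'))
```

If the installed UDF behaves differently, for example by returning only 1 or 0, the test query is still a quick check that the function is callable.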
Before you can use regular expressions on analyzed columns in a database, you must first declare the regular
expression function you created by adding the SQL instruction and the pattern.
Procedure
1. In the Profiling perspective, expand Libraries > Indicators > System Indicators.
4. In the new indicator line, select Teradata from the Database list.
5. Click Edit... next to the new field to display the Edit expression dialog box.
9. In the Pattern Matching table, right-click the pattern results and select View invalid rows, for example. The SQL editor opens
listing invalid data and the SQL expression looks like the following:
In the Pattern Test View, you can test a string of text against a regular expression.
Procedure
1. On the menu bar of Talend Studio, click Window > Show View.
4. Select the DB Connections option then select your Teradata connection from the list.
5. In the Test Area and Regex fields, enter your test string and your regular expression respectively.
Results