26/07/2022 09:49 PDF Export

Welcome to Talend Help Center

Talend Open Studio for Data Quality

User Guide
2022-07-26

1. Launching Talend Studio

2. Configuring Talend Studio
1. Installing external modules
1. When to install external modules
2. Customizing the Maven URI for external module deployment
3. Installing all external modules in one go
4. Installing external modules manually using the Modules view
5. Overriding a database driver by customizing the Maven URI
3. Data Profiling
1. Data profiling: concepts and principles
1. About Talend Data Quality
1. What is Talend Data Quality?
2. Core features
1. Metadata repository
2. Patterns and indicators
2. Functional architecture
2. Getting started with Talend Data Quality
1. Working principles of data quality
2. Importing a data quality demo project
1. Importing a demo project to be a separate data quality project
2. Importing a demo project in the current data quality project
3. Important features and configuration options
1. Defining the maximum memory size threshold
2. Setting preferences of analysis editors and analysis results
3. Setting the default frequency table parameters
4. Displaying and hiding the help content in Talend Studio
1. Displaying the cheat sheets
2. Hiding the help panel
5. Displaying the Module view
6. Displaying the error log view and managing log files
7. Opening new editors
4. Icons appended on analyses names in the DQ Repository
3. Setting up connections to data sources

https://help.talend.com/internal/api/webapp/print/7c156cc9-9c1d-4bdf-a254-62d723661df1 1/288
1. Creating connections to data sources
1. Connecting to a database
1. Creating a connection
2. Creating a connection from a catalog or a schema
3. Creating a connection to a custom database
4. What you need to know about some databases
5. Catalogs and schemas in database systems
2. Connecting to a file
2. Managing connections to data sources
1. Managing database connections
1. Opening or editing a database connection
2. Filtering a database connection
3. Duplicating a database connection
4. Adding a task to a database connection or any of its elements
5. Filtering tables/views in a database connection
6. Deleting a database connection
7. Restoring a database connection
2. Managing file connections
4. Profiling database content
1. Analyzing databases
1. Creating a database content analysis
1. Defining the connection overview analysis
2. Selecting the database connection you want to analyze
2. Creating a catalog or schema analysis
2. Previewing data in the SQL editor
3. Displaying keys and indexes of database tables
4. Synchronizing metadata connections and database structures
1. Synchronizing and reloading catalog and schema lists
2. Synchronizing and reloading table lists
3. Synchronizing and reloading column lists
5. Redundancy analyses
1. What are redundancy analyses?
2. Comparing identical columns in different tables
1. Defining the redundancy analysis
2. Selecting the identical columns you want to compare
3. Finalizing and executing the analysis
3. Matching primary and foreign keys
1. Defining the analysis to match primary and foreign keys in tables
2. Selecting the primary and foreign keys
6. Table analyses
1. Steps to analyze database tables
2. Analyzing tables in databases
1. Creating a simple table analysis (Column Set Analysis)
1. Creating an analysis of a set of columns using patterns
1. Defining the set of columns to be analyzed
1. Defining the analysis
2. Selecting the set of columns you want to analyze
2. Adding patterns to the analyzed columns
3. Finalizing and executing the analysis of a set of columns
4. Filtering data against patterns
2. Creating a column analysis from a simple table analysis
2. Creating a table analysis with SQL business rules
1. Creating an SQL business rule
1. Creating the business rule
2. Creating a join condition
2. Editing an SQL business rule
3. Creating a table analysis with a simple SQL business rule
1. Defining the table analysis
2. Selecting the table you want to analyze
3. Selecting the business rule
4. Creating a table analysis with an SQL business rule with a join condition
5. Generating an analysis on the join results to analyze duplicates
6. Creating a table analysis with an SQL business rule in a shortcut procedure
3. Detecting anomalies in columns (Functional Dependency Analysis)
1. Defining the analysis to detect anomalies in columns
2. Selecting the columns as either "determinant" or "dependent"
3. Finalizing and executing the functional dependency analysis
3. Analyzing tables in delimited files
1. Creating a column set analysis on a delimited file using patterns
1. Defining the set of columns to be analyzed in a delimited file
1. Defining the column set analysis
2. Selecting the set of columns you want to analyze in the delimited file
2. Adding patterns to the analyzed columns in the delimited file
3. Finalizing and executing the column set analysis on a delimited file
4. Filtering analysis data against patterns
2. Creating a column analysis from the analysis of a set of columns
4. Analyzing duplicates
1. Creating a match analysis
1. Defining a match analysis from the Analysis folder
2. Defining a match analysis from the Metadata folder
3. Configuring the match analysis
4. Defining a match rule
1. Defining a blocking key
2. Defining a matching key with the VSR algorithm
3. Defining a matching key with the T-Swoosh algorithm
1. Creating a match key
4. Editing rules and displaying sample results
5. How to show the match results
6. Importing or exporting match rules
1. Importing match rules from the repository
2. Exporting match rules to the repository
2. Creating a match rule
1. Defining the rule
2. Duplicating a rule
3. Rules with the VSR algorithm
1. Defining a blocking key from the match analysis
2. Defining a matching key
4. Rules with the T-Swoosh algorithm
1. Creating a match key
7. Column analyses
1. Where to start?
2. Creating a basic analysis on a database column
1. Defining the columns to be analyzed and setting indicators
1. Defining the columns to be analyzed
1. Defining the basic column analysis
2. Selecting the database columns and setting sample data
2. Setting indicators on columns
1. Setting system or user-defined indicators
2. Setting options for system or user-defined indicators
3. Setting user-defined indicators from the analysis editor
2. Finalizing and executing the column analysis
3. Using the Java or the SQL engine
4. Accessing the detailed view of the database column analysis
5. Viewing and exporting analyzed data
6. Using regular expressions and SQL patterns in a column analysis
1. Adding a regular expression or an SQL pattern to a column analysis
2. Editing a pattern in the column analysis
3. Viewing the data analyzed against patterns
7. Saving the queries executed on indicators
8. Creating analyses from table or column names
3. Creating a basic column analysis on a file
1. Defining the columns to be analyzed in a file
1. Defining the column analysis
2. Selecting the file columns and setting sample data
2. Setting system and user-defined indicators
3. Setting options for system indicators
4. Setting regular expressions and finalizing the analysis
5. Viewing and exporting the analyzed data in a file
6. Analyzing delimited data in shortcut procedures
4. Analyzing discrete data
1. Defining the analysis of discrete data
2. Running the analysis and accessing the detail analysis results
5. Data mining types
1. Nominal
2. Interval
3. Unstructured text
4. Other
6. Supported character types in column analyses and data masking operations
7. Different profiling results when running column analyses with the Java and the SQL engines
8. Correlation analyses
1. What are column correlation analyses?
2. Numerical correlation analyses
1. Creating a numerical correlation analysis
1. Defining the numerical correlation analysis
2. Selecting the columns you want to analyze and setting analysis parameters
2. Exploring the results of the numerical correlation analysis
3. Time correlation analyses
1. Creating a time correlation analysis
1. Defining the time correlation analysis
2. Selecting the columns for the time correlation analysis and setting analysis parameters
2. Exploring the results of the time correlation analysis
4. Nominal correlation analyses
1. Creating a nominal correlation analysis
1. Defining the nominal correlation analysis
2. Selecting the columns you want to analyze
2. Exploring the results of the nominal correlation analysis
9. Patterns and indicators
1. Patterns
1. Pattern types
2. Managing User-Defined Functions in databases
1. Declaring a User-Defined Function in a specific database
2. Defining a query template for a specific database
3. Editing a query template
4. Deleting a query template
3. Adding regular expressions and SQL patterns to column analyses
4. Managing regular expressions and SQL patterns
1. Creating a new regular expression or SQL pattern
2. Testing a regular expression in the Pattern Test View
3. Creating a new pattern from the Pattern Test View
4. Generating a regular expression from the Date Pattern Frequency indicator
5. Editing a regular expression or an SQL pattern
6. Exporting regular expressions or SQL patterns
1. Exporting regular expressions or SQL patterns to Talend Exchange
2. Exporting a family of regular expressions or SQL patterns to Talend Exchange
3. Exporting regular expressions or SQL patterns to a csv file
7. Importing regular expressions or SQL patterns
1. Importing regular expressions or SQL patterns from Talend Exchange
2. Importing regular expressions or SQL patterns from a csv file
2. Indicators
1. Indicator types
1. Advanced statistics
2. Fraud Detection
3. Pattern frequency statistics
1. Pattern frequency indicators
2. East Asia pattern frequency indicators
3. Date pattern frequency indicator
4. Word-based pattern indicators
5. List of engines used and database types supported when using Pattern Frequency Statistics indicators
4. Phone number statistics
5. Simple statistics
6. Soundex frequency statistics
1. Teradata error: "Invalid Input: only Latin letters allowed"
7. Summary statistics
8. Text statistics
2. Managing system indicators
1. Editing a system indicator
2. Setting system indicators and indicator options to column analyses
3. Exporting or importing system indicators
4. Duplicating a system indicator
3. Managing user-defined indicators
1. Creating SQL user-defined indicators
1. Defining the indicator
2. Setting the indicator definition and category
2. Defining Java user-defined indicators
1. Creating Java user-defined indicators
1. Defining the custom indicator
2. Setting the definition and category of the custom indicator
2. Creating a Java archive for the user-defined indicator
3. Exporting user-defined indicators
1. Exporting user-defined indicators to an archive file
2. Exporting user-defined indicators to Talend Exchange
4. Importing user-defined indicators
1. Importing user-defined indicators from an archive file
2. Importing user-defined indicators from a csv file (deprecated feature)
3. Importing user-defined indicators from Talend Exchange
5. Selecting a user-defined indicator
6. Editing a user-defined indicator
4. Indicator parameters
5. Date handling when profiling columns in Oracle
10. Other management procedures
1. Creating and storing SQL queries
2. Using context variables in analyses
1. Creating one or multiple contexts for the same analysis
1. Defining contexts in analyses
2. Defining variables in analyses
2. Selecting the context with which to run the analysis
3. Importing data profiling items
4. Exporting data profiling items
5. Upgrading project items from older versions
11. Tasks
1. Working with tasks
1. Adding a task to a column in a database connection
2. Adding a task to an item in a specific analysis
3. Adding a task to an indicator in a column analysis
4. Displaying the task list
5. Filtering the task list
6. Deleting a completed task
4. Appendices
1. Regular expressions
1. Using regular expressions on SQLServer
1. Main concept
2. Creating a regular expression function on SQL Server
1. Creating a project in Visual Studio
2. Deploying the regular expression function to the SQL server
3. Setting up the studio
3. Testing the created function via the SQL Server editor
2. Using regular expressions on Teradata
1. Creating a user on Teradata
2. Creating a User Defined Function using a C program
3. Editing the pattern indicator and using it in a column analysis
4. Using the Pattern Test view

Launching Talend Studio

You can launch your Talend Studio by following the procedure below.

Procedure

1. Go to your Talend Studio installation directory.

The Talend Studio installation directory contains binaries for several platforms including Windows, Linux, and MacOS.

2. Double-click the executable file corresponding to your operating system.

Operating System    Executable file

Windows             TOS_DQ-win-x86_64.exe
Linux on x86        TOS_DQ-linux-gtk-x86_64
Linux on ARM        TOS_DQ-gtk-aarch64
MacOS on x86        TOS_DQ-macosx-cocoa.app
MacOS on ARM        TOS_DQ-macosx-cocoa-aarch64.app
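
As an illustration only, the table above can be expressed as a small lookup. This is a minimal Python sketch and not part of Talend Studio: the function names and the mapping of machine strings such as x86_64 or arm64 to "x86" and "arm" are assumptions made for the example.

```python
import platform

# Executable names copied from the table above, keyed by (OS name, CPU family).
# The OS names are the values returned by platform.system().
EXECUTABLES = {
    ("Windows", "x86"): "TOS_DQ-win-x86_64.exe",
    ("Linux", "x86"): "TOS_DQ-linux-gtk-x86_64",
    ("Linux", "arm"): "TOS_DQ-gtk-aarch64",
    ("Darwin", "x86"): "TOS_DQ-macosx-cocoa.app",
    ("Darwin", "arm"): "TOS_DQ-macosx-cocoa-aarch64.app",
}


def cpu_family(machine):
    """Reduce a machine string such as 'AMD64' or 'arm64' to 'x86' or 'arm'."""
    name = machine.lower()
    if name in ("x86_64", "amd64"):
        return "x86"
    if name in ("aarch64", "arm64"):
        return "arm"
    raise ValueError(f"unsupported CPU architecture: {machine}")


def executable_for(system, machine):
    """Return the launcher listed in the table for an OS name and machine string."""
    key = (system, cpu_family(machine))
    if key not in EXECUTABLES:
        raise ValueError(f"no executable listed for {system} on {machine}")
    return EXECUTABLES[key]
```

Calling executable_for(platform.system(), platform.machine()) would return the launcher name for the machine the script runs on.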

Tip: If you work on MacOS and installed Talend Studio manually from the zip archive file, you might get one of two messages when trying to launch Talend Studio for the first time. To fix the issue, open the terminal, go to the directory above the Talend Studio top directory, execute the command xattr -d com.apple.quarantine <Talend-Studio>/*, where <Talend-Studio> is the root folder name of your Talend Studio, and then relaunch your Talend Studio.
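
If you prefer to script this step, the following Python sketch builds the same xattr call for each entry directly under the Studio root folder. It is an illustration under stated assumptions: the function names are hypothetical, and it issues one command per entry instead of the single shell command with a wildcard given in the tip.

```python
import glob
import os
import subprocess


def quarantine_commands(studio_root):
    """Build one 'xattr -d com.apple.quarantine <entry>' command per entry in studio_root."""
    # glob expands the '*' that the shell would expand in the command from the tip.
    entries = sorted(glob.glob(os.path.join(studio_root, "*")))
    return [["xattr", "-d", "com.apple.quarantine", entry] for entry in entries]


def clear_quarantine(studio_root):
    """Run the commands. Only meaningful on MacOS, where the xattr tool exists."""
    for command in quarantine_commands(studio_root):
        # check=False: xattr may exit with an error for entries
        # that do not carry the quarantine attribute.
        subprocess.run(command, check=False)
```

Here studio_root stands for the <Talend-Studio> folder mentioned in the tip.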

3. If you launch Talend Studio for the first time, in the User License Agreement dialog box that opens, read and accept the terms
of the end user license agreement.

Results

A default project is created in the workspace folder under the installation directory of Talend Open Studio for Data Quality.

You can now start working with your project and items.

Talend Studio requires specific third-party Java libraries or database drivers (.jar files) to be able to connect to sources and targets.
These libraries or drivers, known as external modules, are required by some Talend components and/or connection wizards. Due to
license restrictions, Talend may not be able to ship certain external modules within Talend Studio. For more information, see
Installing external modules.

Configuring Talend Studio

Installing external modules

Talend Studio requires specific third-party Java libraries or database drivers to be installed to connect to sources and targets.

Those libraries or database drivers, known as external modules, may be required by some Talend components, by some connection wizards, or by both. Due to license restrictions, Talend cannot ship some of these external modules within Talend Studio. You need to install them for Talend Studio to function properly.

Warning: Make sure that the -Dtalend.disable.internet parameter is not present in the Studio .ini file or is set to false.
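
One way to verify this condition is to scan the content of the .ini file. The Python sketch below assumes that the file holds one argument per line and that the value is written as -Dtalend.disable.internet=<value>. Both points are assumptions about the file layout, and the function name is hypothetical.

```python
def internet_disabled(ini_text):
    """Return True if -Dtalend.disable.internet is present and not set to false."""
    prefix = "-Dtalend.disable.internet"
    disabled = False
    for line in ini_text.splitlines():
        arg = line.strip()
        if arg == prefix:
            # Bare flag without a value: treated as enabled here (assumption).
            disabled = True
        elif arg.startswith(prefix + "="):
            value = arg.split("=", 1)[1].strip().lower()
            # Any value other than "false" is treated as disabling Internet access.
            disabled = value != "false"
    return disabled
```

If the function returns True for the content of your Studio .ini file, remove the parameter or set it to false before installing external modules.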

When to install external modules

Talend Studio will let you know when you need to install external modules and what external modules you need to install.

Talend Studio notifies you about required external modules in several ways.

On your design workspace, if a component requires the installation of external modules before it can work properly, a red
error indicator appears on the component. With your mouse pointer over the error indicator, you can see a tooltip message
showing which external modules are required for that component to work.
When you open the Basic settings or Advanced settings view of a component for which one or more external modules are
required, you will see a piece of highlighted information about external modules, followed by an Install button. Clicking the
Install button opens a wizard that will show you the external modules to be installed.
The Modules view lists all the modules required for the Studio to work properly, including those Java libraries and drivers
that you must install.

If the Modules view is not shown under your design workspace, go to Window > Show view... > Talend and then select
Modules from the list.

In this view:

Item Description

Filter text field Allows you to search external modules based on the status, the context, the
module file name, and the Maven URI.

Status Points out if a module is installed or not installed on your system.

The icon indicates that the module is not necessarily required for the
corresponding component or Metadata connection.

The icon indicates that the module is absolutely required for the
corresponding component or Metadata connection.

Context Gives the name of the component or Metadata connection using the
module. If this column is empty, the module is then required for the
general use of your Talend Studio.

Module Gives the exact name of the module.

Maven URI Uniquely identifies the module deployment in Maven.

You can customize the Maven URI of a module by clicking the Maven URI
field and then clicking [...] that appears. For more information, see
Customizing the Maven URI for external module deployment.

Refreshes this view to reflect the latest module installation status.

In case of collaborative work, once a required module is installed in one user's Studio, the other users can simply refresh their Modules view to add this module to their own Talend Studio.

Imports custom Maven settings from a local file.

Exports custom Maven settings into a local file.

Allows you to install an already downloaded external module into your Studio. For details, see Installing external modules manually using the Modules view.

Shares libraries to the local libraries repository.

This button is available only if the connection to the local libraries repository is successful.

You can configure whether to share libraries at Talend Studio startup. For
more information, see Artifact repository for libraries preferences.

If you are using a local libraries repository configured with proxy, libraries
will not be shared when clicking this button. For more information about
configuring proxy for a local libraries repository, see Configuring a proxy
repository for libraries in Talend Studio.

Opens the Third-party Libraries wizard, which allows you to install all
required and/or optional libraries in one go. For more information, see
Installing all external modules in one go.

A Jar installation wizard appears when you:

drop a component from the Palette if one or more external modules required for that component to work are missing
in the Studio.
click the Test connection button in a Metadata connection setup wizard in the Studio if one or more external modules
required for the connection are missing in the Studio.
click the Guess schema button in the Component view of a component if one or more external modules required for
that component to work are missing in the Studio.
click Install on the top of the Basic settings or Advanced settings view of a component for which one or more required
external modules are missing.
run a Job that involves components or Metadata connections for which one or more required external modules are
missing.
select one or more modules that are not integrated in the Studio and click the button in the Modules view.

This wizard:

lists the external modules to be installed and the licenses under which they are provided,
provides the default Maven URIs that identify the deployment of the modules,
provides the official websites where you can learn more about the modules,
lets you download and install automatically all the modules available in the Talend repository,
allows you to install those not available in the Talend repository manually.

When you drop a component, set up a connection, or guess the schema of a database that requires an external module for which neither the Jar file nor its download URL information is available on the Talend website, the Jar installation wizard does not appear, but the Error Log view presents an error message informing you that the download URL for that module is not available. You can try to find and download the module yourself, and then install it manually into the Studio.

Tip: To show the Error Log view on the tab system, go to Window > Show views..., then expand General and select Error
Log.

Customizing the Maven URI for external module deployment

In Talend Studio, each external module is given a default URI to identify its deployment in Maven. When needed, you can change the
Maven URI.

For example, when replacing an installed database driver with a new version, you need to specify another Maven URI for it.

Note:

Changing the Maven URI for an external module will affect all the components and metadata connections that use the module
within the project.

When working on a remote project, your custom Maven URI settings will be automatically synchronized to the Talend Artifact
Repository and will be used when other users working on the same project install the external module.
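
To make the idea of a custom Maven URI more concrete, the following Python sketch splits a URI into its parts and swaps the version. It assumes URIs of the form mvn:groupId/artifactId/version/type, which this guide does not spell out, and the URI used in the usage note is made up for the example. Check the value shown in your own Modules view before relying on this layout.

```python
from typing import NamedTuple


class MavenUri(NamedTuple):
    group_id: str
    artifact_id: str
    version: str
    packaging: str


def parse_maven_uri(uri):
    """Split a URI assumed to look like mvn:groupId/artifactId/version/type."""
    if not uri.startswith("mvn:"):
        raise ValueError(f"not a mvn: URI: {uri}")
    parts = uri[len("mvn:"):].split("/")
    if len(parts) != 4 or not all(parts):
        raise ValueError(f"expected four non-empty parts in: {uri}")
    return MavenUri(*parts)


def with_version(uri, new_version):
    """Return the same URI with only the version part replaced."""
    parsed = parse_maven_uri(uri)
    return "mvn:" + "/".join(parsed._replace(version=new_version))
```

For example, with_version("mvn:com.example/sample-driver/1.0.0/jar", "2.0.0") yields a URI that differs only in its version part, which is the kind of value you would enter as a custom URI when replacing a driver with another version.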

Procedure

1. In the Modules view, click the Maven URI you want to customize and then click the [...] button that appears.
The Install Module dialog box opens.

2. If you want to install another version of the external module, specify the full path to the module file in the Module File field,
or click the [...] button to browse in your local file system.
If the MVN URI of the library is within the jar file and is different from the default MVN URI, it is automatically detected and
filled in the Custom MVN URI field.

3. Select the Custom MVN URI check box and enter a new URI in the field.

4. Click Detect the module install status and then OK to validate the custom URI and close the dialog box.

Results

The new Maven URI takes effect and is displayed in the Modules view, from which you can export all your Maven URI changes into a
local JSON file.

Installing all external modules in one go

You can download and install all required and/or optional external modules in one go automatically.

Before you begin

Make sure your Talend Studio has a secure Internet connection.

Procedure

1. In the upper right corner of the Modules view, click the button that opens the Third-party Libraries wizard.

The Third-party Libraries wizard opens. The number of all required/optional third-party libraries is displayed next to the
corresponding option.

2. Select the Required third-party libraries and/or Optional third-party libraries check box(es) according to your needs.

3. Click OK.
The Review Licenses dialog box opens.

4. Accept the license terms and start the download and installation process:

To download and install the external module(s) provided under a particular license, select that license from the
Licenses pane, review the license terms, select the I accept the terms of the selected license agreement option, and
click Finish.
To download and install all external modules provided under all the listed licenses, click Accept all.

When the installation process is completed, the chosen external module or modules are installed into your Talend Studio,
and you can use Talend Studio features that depend on these modules.

Installing external modules manually using the Modules view

If you have already downloaded external modules, you can install them manually into your Talend Studio.

Before you begin

If you are going to install the JDBC driver for Oracle 9i into your Talend Studio, change the file name from ojdbc14.jar to ojdbc14-9i.jar first.
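
The renaming step can also be scripted. The Python sketch below is one possible way to do it. The function name is hypothetical, and it copies the file instead of renaming it so that the downloaded ojdbc14.jar is left untouched, which is a design choice of the example and not a requirement of Talend Studio.

```python
import shutil
from pathlib import Path


def prepare_oracle_9i_driver(downloaded_jar):
    """Create ojdbc14-9i.jar next to a downloaded ojdbc14.jar and return its path."""
    source = Path(downloaded_jar)
    if source.name != "ojdbc14.jar":
        raise ValueError(f"expected a file named ojdbc14.jar, got {source.name}")
    target = source.with_name("ojdbc14-9i.jar")
    # Copy rather than rename, so the original download stays available.
    shutil.copyfile(source, target)
    return target
```

The returned path is the file to select in the procedure below.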

Procedure

1. Click in the upper right corner of the Modules view or in the Jar installation wizard to browse your local file system.

2. In the Open dialog box of your file system, browse to the module you want to install, double-click the .jar file, or select it
and then click Open to install it into your Talend Studio.
The dialog box closes and the selected module is installed in the library folder of the current Talend Studio.

Overriding a database driver by customizing the Maven URI

If you have different versions of a database driver, you can override the current version of the driver by customizing its Maven URI, either to upgrade to the latest version or to roll back to a previous version. Overriding a driver allows your Jobs to use any available driver version.

Before you begin

Make sure the desired versions of the driver (for example, ojdbc14.jar and ojdbc14-9i.jar in the previous section) are available.

Note:

To avoid overriding failures, make sure the dependencies of the desired driver are resolved.
If you override a driver using the up-to-date driver (for example, version 2021) and then Talend Studio upgrades with a
newer driver version (for example, version 2022), the newer version will be used.
If you override a driver using an earlier driver (for example, overriding version 2021 using version 2020) and then Talend
Studio upgrades with the latest driver version (for example, version 2022), the latest version (that is, version 2022) will be
used.

Procedure

1. In the Modules view, locate the driver you want to override, click its Maven URI, and then click the [...] button that appears.
The Install Module dialog box opens.

2. Enter the full path to the desired driver file in the Module File field or click the [...] button and navigate to the desired driver
file.
If the MVN URI of the library is within the jar file and is different from the default MVN URI, it is automatically detected and
filled in the Custom MVN URI field.

3. Select the Custom MVN URI check box and enter a new URI in the field.

4. Click Detect the module install status and then OK to validate the custom URI and close the dialog box.

Note: To revert to the original driver, clear the check box to the left of the Custom MVN URI field.

Data Profiling

Data profiling: concepts and principles

About Talend Data Quality

The following sections introduce Talend Data Quality and list its key features.

What is Talend Data Quality?

This data profiling tool allows you to identify potential problems before beginning data-intensive projects such as data integration.

The data profiler centralizes several elements including:

a data profiler;
a data explorer;
a pattern manager; for more information about the pattern manager, see Patterns and indicators;
a metadata manager; for more information about the metadata manager, see Metadata repository.

Core features

This section describes the basic features of the Talend data profiling solution.

Metadata repository

Using Talend data quality, you can connect to data sources to analyze their structure (catalogs, schemas and tables) and store the description of their metadata in the metadata repository. You can then use this metadata to set up metrics and indicators.

For more information, see Creating connections to data sources.

Patterns and indicators

Patterns are sets of strings against which you can define the content, structure and quality of highly complex data. The Profiling perspective of Talend Studio lists two types of patterns: regular expressions, which are predefined regular patterns, and SQL patterns, which are the patterns you add using LIKE clauses.

For more information about patterns, see Patterns.
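
The difference between the two pattern types can be illustrated outside the Studio. The Python sketch below counts how many values match a regular expression and how many match a LIKE-style pattern. It is a simplified model, not how Talend Studio evaluates patterns: the function names are hypothetical, LIKE escape characters are not handled, and case sensitivity of LIKE varies by database.

```python
import re


def like_to_regex(like_pattern):
    """Translate a SQL LIKE pattern: % means any run of characters, _ means one character."""
    parts = []
    for ch in like_pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def match_counts(values, compiled):
    """Return (matching, non_matching) counts for values checked against a whole-value pattern."""
    matching = sum(1 for value in values if compiled.fullmatch(value))
    return matching, len(values) - matching


values = ["75001", "7500A", "13008"]
five_digits = re.compile(r"[0-9]{5}")      # a regular expression
starts_with_75 = like_to_regex("75%")      # the equivalent of LIKE '75%'
print(match_counts(values, five_digits))    # (2, 1)
print(match_counts(values, starts_with_75)) # (2, 1)
```

Both patterns report two matching values here, but not the same two: the regular expression rejects 7500A, while the LIKE pattern rejects 13008.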

Indicators are the results achieved through the implementation of different patterns. They can represent the results of data matching and various other data-related operations. The Profiling perspective of Talend Studio lists two types of indicators: system indicators, a list of predefined indicators, and user-defined indicators, a list of those defined by the user.

For more information about indicators, see Indicators.
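
To give a feel for what an indicator computes, the Python sketch below derives a few simple counts for one column. The names and definitions used here (distinct, unique and duplicate counts computed over non-null values) are assumptions chosen for the example and may differ from the exact definitions of the system indicators in Talend Studio.

```python
from collections import Counter


def simple_statistics(values):
    """Compute basic counts for a column; None stands for a NULL value."""
    non_null = [value for value in values if value is not None]
    occurrences = Counter(non_null)
    return {
        "row_count": len(values),
        "null_count": len(values) - len(non_null),
        # Number of different non-null values.
        "distinct_count": len(occurrences),
        # Values that appear exactly once.
        "unique_count": sum(1 for n in occurrences.values() if n == 1),
        # Values that appear more than once.
        "duplicate_count": sum(1 for n in occurrences.values() if n > 1),
    }


print(simple_statistics(["a", "b", "a", None, "c"]))
# {'row_count': 5, 'null_count': 1, 'distinct_count': 3, 'unique_count': 2, 'duplicate_count': 1}
```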

Functional architecture

The functional architecture of your Talend Studio is an architectural model that identifies the Talend Studio functions, interactions
and corresponding IT needs. The overall architecture has been described by isolating specific functionalities in functional blocks.

The chart below illustrates the main architectural functional blocks explored within the studio.

The different types of functional blocks are:

A Profiling perspective where you can use predefined or customized patterns and indicators to analyze data stored in
different data sources.
A Data Explorer perspective where you can browse and query the results of the profiling analyses done on data.

Getting started with Talend Data Quality

Working principles of data quality

From the Profiling perspective of Talend Studio, you can examine the data available in different data sources and collect statistics
and information about this data.

A typical sequence of profiling data using Talend Studio involves the following steps:

1. Connecting to a data source including databases and delimited files in order to be able to access the tables and columns on
which you want to define and execute analyses. For more information, see Creating connections to data sources.
2. Defining any of the available data quality analyses including database content analysis, column analysis, table analysis,
redundancy analysis, correlation analysis, etc. These analyses will carry out data profiling processes that will define the
content, structure and quality of highly complex data structures. The analysis results will be displayed graphically next to
each of the analysis editors, or in more detail in the Analysis Results view.

Note: While you can use all analysis types to profile data in databases, you can only use Column Analysis and Column Set Analysis to profile data in delimited files.
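
The two steps above, applied to a delimited file, can be pictured with a short Python sketch: read the file, pick a column, and summarize the shape of its values. The replacement scheme used here (digits become 9, letters become a or A) is an assumption modeled on a pattern frequency style of analysis, and the function names, column names and sample data are made up for the example.

```python
import csv
import io
from collections import Counter


def value_pattern(value):
    """Replace each digit with 9 and each letter with a or A; keep other characters."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A" if ch.isupper() else "a")
        else:
            out.append(ch)
    return "".join(out)


def column_pattern_frequency(delimited_text, column, delimiter=";"):
    """Count how often each value pattern occurs in one column of delimited text."""
    reader = csv.DictReader(io.StringIO(delimited_text), delimiter=delimiter)
    return Counter(value_pattern(row[column]) for row in reader)


sample = "id;phone\n1;01-23\n2;45-67\n3;ab-cd\n"
print(column_pattern_frequency(sample, "phone"))
# Counter({'99-99': 2, 'aa-aa': 1})
```

A summary like this makes irregular values stand out: the third row does not follow the numeric shape of the other two.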

Importing a data quality demo project

Talend provides you with different demo projects you can import into your Talend Studio. The available demos vary and may
include ready-to-use Jobs that help you understand the functionalities of different Talend components.

When you import the data quality demo project:

input files and databases necessary to run the demo Jobs and analyses are imported under the Documentation folder in the
Integration perspective of the studio.

profiling analyses are imported in the DQ Repository tree view of the Profiling perspective. These analyses run on the
databases and files you installed initially as pointed out by the data quality tutorials.
data quality Jobs are imported in the Repository tree view of the Integration perspective. These Jobs use different data
quality components to standardize, deduplicate and match data for example.

You can run most of these Jobs without any prerequisites. However, for a few Jobs, you must restore the tbi, tutorials,
cif and crm databases in your MySQL server and download some files locally. You can find the databases and files under
the Documentation folder in the Repository tree view in the Integration perspective.

Note: As some of the demo Jobs are shared with the data quality tutorials, their names may be prefixed with A, B, C, etc. or
1, 2, 3, etc. You must run these Jobs in the specified order.

You can import the demo project either from the login window of your studio as a separate project, or from the Integration
perspective into your current project.

Importing a demo project to be a separate data quality project

Procedure

1. Launch your Talend Studio and from the login window select Import a demo project and then click Select.

2. In the dialog box that opens, select the demo project you want to import and click Finish.

Note: The demo projects available in the dialog box may vary.

3. In the dialog box that opens, type in a name for the demo project you want to import and click Finish.
A bar is displayed to show the progress of the operation.

4. On the login window, select from the project list the demo project you imported and then click Finish to open the demo
project in the studio.
All the samples of the demo project are imported into the studio under different folders in the repository tree view including
the input files and metadata connection necessary to run the demo samples.

Importing a demo project in the current data quality project

Procedure

1. Launch your studio and in the Integration perspective, click the icon on the toolbar.

2. In the dialog box that opens, select the demo project to import and click Finish.
A bar is displayed to show the progress of the operation and then a confirmation message opens.

3. Click OK.

Important features and configuration options

This section details some important information about analysis editors, the Error Log view and the help content embedded in Talend
Studio.

Defining the maximum memory size threshold

From Talend Studio, you can control memory usage when using the Java engine to run two types of analyses: column analysis and
the analysis of a set of columns.

If you use column analysis or column set analysis to profile very large sets of data or data with many problems, you may run out of
memory and end up with a Java heap error. If you define a maximum memory size threshold for these analyses, Talend Studio
stops the analysis execution when the memory limit is reached and provides the results measured on the data analyzed up to
that point.

Procedure

1. On the menu bar, select Window > Preferences to display the Preferences window.

2. Perform one of the following steps:

Expand Talend > Profiling and select Analysis tuning, or,

start typing analysis tuning in the dynamic filter field.

The Analysis tuning view is displayed.

3. In the Memory area, select the Enable analysis thread memory control check box.

4. Move the slider to the right to define the memory limit at which the analysis execution will be stopped.

Results

The execution of any column analysis or column set analysis will be stopped if it exceeds the allocated memory size. The analysis
results given in Talend Studio will cover the data analyzed before the interruption of the analysis execution.
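
The behavior described above, stopping at a memory limit and keeping the results gathered so far, can be sketched in Python as follows. This is only an illustration of the idea: the `frequency_with_memory_limit` function and its rough size estimate are assumptions for this example, not the mechanism used by the Java engine.

```python
import sys
from collections import Counter


def frequency_with_memory_limit(values, limit_bytes):
    """Count value frequencies and stop early when a rough memory estimate
    exceeds limit_bytes. Returns (counts, rows_analyzed, interrupted)."""
    counts = Counter()
    key_bytes = 0  # running estimate of the memory used by distinct keys
    analyzed = 0
    for value in values:
        if value not in counts:
            key_bytes += sys.getsizeof(value)
        counts[value] += 1
        analyzed += 1
        # Rough estimate only: the container itself plus its distinct keys.
        if sys.getsizeof(counts) + key_bytes > limit_bytes:
            return counts, analyzed, True
    return counts, analyzed, False


data = ["FR", "US", "FR", "DE", "US", "FR"]
counts, analyzed, interrupted = frequency_with_memory_limit(data, 10**9)
print(analyzed, interrupted)
```

When the limit is hit, the partial counts cover only the rows read before the interruption, which mirrors the Results statement above.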

Setting preferences of analysis editors and analysis results

You can decide once and for all which sections to fold by default when you open any of the connection or analysis editors. You can
also set up the display of all analysis results and choose whether to show or hide the graphical results in the different analysis
editors.

Procedure
1. On the menu bar, select Window > Preferences to display the Preferences window.

2. Expand Talend > Profiling and select Editor.

3. In the Folding area, select the check box(es) corresponding to the display mode you want to set for the different sections in
all the editors.

4. In the Analysis results folding area, select the check boxes corresponding to the display mode you want to set for the statistic
results in the Analysis Results view of the analysis editor.

5. In the Graphics area, select the Hide graphics in analysis results page option if you do not want to show the graphical results
of the executed analyses in the analysis editor. This improves system performance when there are many graphics to
generate.

6. In the Analyzed Items Per Page field, set the number for the analyzed items you want to group on each page.

7. In the Business Rules Per Page field, set the number for the business rules you want to group in each page.

Note: You can always click the Restore Defaults button in the Preferences window to bring back the default values.

8. Click Apply and then OK to validate the changes and close the Preferences window.

Results
When you work on different analyses, all corresponding editors open with the display mode you set in the Preferences window.

Setting the default frequency table parameters

In the Profiling perspective, when viewing the results of an analysis, 10 results are shown in the frequency tables by default. From
the Preferences window of Talend Studio, you can edit the default value for frequency and low frequency tables.

You cannot update frequency table parameters for:

locked analyses,
open analyses, and
analyses with frequency indicators that use the current default value.

Procedure

1. On the menu bar, select Window > Preferences to display the Preferences window.

2. Expand Talend > Profiling and select Indicator settings.

3. In the Number of result shown fields, set the default value for Frequency table and Low frequency table.

4. Click Apply to analyses to apply the parameters to existing analyses.

5. In the Set the Frequency Table Parameters dialog box, select the analyses for which to apply the new frequency table
parameters, and click OK.

6. Click OK to save your changes.
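
To make the role of the Number of result shown setting concrete, here is a minimal Python sketch of a frequency table and a low frequency table limited to `n` entries. The `frequency_tables` function and its tie-breaking rule are assumptions for illustration; they are not taken from Talend Studio.

```python
from collections import Counter


def frequency_tables(values, n=10):
    """Return the n most frequent and the n least frequent values with counts."""
    counts = Counter(values)
    # Ties are broken alphabetically so that the output is deterministic.
    most = sorted(counts.items(), key=lambda kv: (-kv[1], str(kv[0])))[:n]
    least = sorted(counts.items(), key=lambda kv: (kv[1], str(kv[0])))[:n]
    return most, least


values = ["a"] * 3 + ["b"] * 2 + ["c"]
most, least = frequency_tables(values, n=2)
print(most)
print(least)
```

Raising or lowering `n` here plays the same role as changing the default value for the Frequency table and Low frequency table fields.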

Displaying and hiding the help content in Talend Studio

Talend Studio provides you with cheat sheets that you can use as a quick reference that guides you through all common tasks in
data profiling.

You can also have access to a help panel that is attached to all wizards used in Talend Studio to create the different types of analyses
or to set thresholds on indicators.

Displaying the cheat sheets

When you open Talend Studio for the first time, the Cheat Sheets view opens by default in the Profiling perspective.

If you close the Cheat Sheets view in the Profiling perspective of Talend Studio, it stays closed whenever you switch back to
this perspective, until you open it manually.

Procedure

To display the cheat sheets, either:

1. Press the Alt+Shift+Q and then H shortcut keys, or select Window > Show View from the menu bar.

The Show View dialog box opens.

2. Expand the Help folder and then select Cheat Sheets.

3. Click OK to close the dialog box.

Or,

4. Select Help > Cheat Sheets from the menu bar. The Cheat Sheet Selection dialog box opens.

You can also press the Alt+H shortcut keys to open the Help menu and then select Cheat Sheets.

5. Expand the Talend > Cheat Sheets folder, select the cheat sheet you want to open in Talend Studio and then click OK.

The selected cheat sheet opens in the Talend Studio main window. Use the local toolbar icons to manage the display of the
cheat sheets.

Hiding the help panel

A help panel is attached to the wizards used in Talend Studio to create and manage profiling items. This help panel opens by default
in all wizards.

Procedure

1. Select Window > Preferences > Talend > Profiling > Web Browser.
The Web Browser view opens.

2. Select the Block browser help check box and then click OK.
From now on, all wizards in Talend Studio display without the help panel.

Displaying the Module view

Talend Studio provides you with a Module view. This view shows whether a module is required for creating a connection to a
database.

Checking the Module view helps you verify which modules you have or should have to run your profiling analyses smoothly.

Procedure

1. Select Window > Show View from the menu bar.

The Show View dialog box opens.

2. Start typing Module in the filter field.

3. Select Modules from the list and then click OK.

The Module view opens in Talend Studio.

4. From the toolbar of the Module view, select:

Icon To

browse your local system to the module you want to install

open a list of all required external modules that are not integrated in Talend Studio

For further information, see the Talend Installation Guide.

Displaying the error log view and managing log files

Talend Studio provides you with very comprehensive log files that maintain diagnostic information and record any errors that are
encountered in the data profiling process.

The Error Log view is the first place to look when a problem occurs while profiling data, since it will often contain details of what
went wrong and how to fix it.

Procedure

1. Perform one of the following steps:

press the Alt+Shift+Q and then L shortcut keys, or,

select Window > Show View from the menu bar.

The Show View dialog box opens.

2. Expand the General folder and select Error Log.

3. Click OK to close the dialog box.

The Error Log view opens in Talend Studio.

Note: The filter field at the top of the view enables dynamic filtering: as you type your text in the
field, the list shows only the logs that match the filter.

You can use icons on the view toolbar to carry out different management options including exporting and importing the error
log files.
Each error log in the list is preceded by an icon that indicates the severity of the log: error, warning or
information.

4. Double-click any of the error log files to open the Event Detail dialog box.

5. If required, click the icon in the Event Detail dialog box to copy the event detail to the clipboard and then paste it
anywhere you like.

Opening new editors

It is possible to open new analysis or SQL editors in the Profiling and Data Explorer perspectives respectively.

Before you begin

To be able to use Data Explorer in Talend Studio, you must install certain SQL explorer libraries that are required for data quality. If
you do not install these libraries, the Data Explorer perspective will be missing from Talend Studio and many features will not be
available.

For further information about identifying and installing external modules, see the Talend Installation and Upgrade Guide.

Procedure

To open an empty new analysis editor, do the following:

1. In the DQ Repository tree view, expand the Data Profiling folder.

2. Right-click the Analysis folder and select New Analysis.

To open an empty new SQL editor from the Data Explorer perspective, do the following:

3. In the Connections view of the Data Explorer perspective, right-click any connection in the list.
A contextual menu is displayed.

4. Select New SQL Editor.

A new empty SQL editor opens in the Data Explorer perspective.

What to do next
To open an empty SQL editor from the Profiling perspective of Talend Studio, see the procedure outlined in Creating and storing SQL
queries.

Icons appended on analyses names in the DQ Repository

When you create an analysis of any type in Talend Studio, a corresponding analysis item is listed under the Analyses folder in the
DQ Repository tree view.

Note: The number of the analyses created in the studio will be indicated next to this Analyses folder in the DQ Repository tree
view.

This analysis list gives you an idea of any problems in one or more of your analyses before you even open them.

If an analysis fails to run, a small red-cross icon is appended to it. If an analysis runs correctly but has violated thresholds, a
warning icon is appended to it.
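
The rule for these markers can be summarized in a few lines of Python. The `analysis_marker` function and its return values are hypothetical names used only to restate the rule; they do not correspond to any Talend API.

```python
from typing import Optional


def analysis_marker(ran_successfully: bool, violated_thresholds: int) -> Optional[str]:
    """Return the kind of marker shown on an analysis item, or None."""
    if not ran_successfully:
        return "error"    # red-cross icon: the analysis failed to run
    if violated_thresholds > 0:
        return "warning"  # warning icon: ran correctly, thresholds violated
    return None           # no marker: ran correctly, no violation


print(analysis_marker(True, 2))
```

A failed run takes precedence over threshold violations in this sketch, since thresholds can only be evaluated on an analysis that ran.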

Setting up connections to data sources

Creating connections to data sources

The Profiling perspective of Talend Studio enables you to create connections to databases and to delimited files in order to profile
data in these data sources.

Connecting to a database

Before proceeding to analyze data in a specific database, you must first set up the connection to this database. From the Profiling
perspective of Talend Studio, you can create a connection to the database management system (DBMS) and show the database
content.

For more information about the supported databases for profiling data, see Talend Data Fabric Installation Guide.

Connections to different databases are reflected by different tree levels and different icons in the DQ Repository tree view because
the logical and physical structure of data differs from one relational database to another. The highest level structure "Catalog"
followed by "Schema" and finally by "Table" is not applicable to all database types.

For further information, see Catalogs and schemas in database systems.
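
The following Python sketch illustrates why the tree levels differ: a table may sit under a catalog, a schema, both, or neither, depending on the database. The `qualified_table_name` function and the sample names are hypothetical and only model the hierarchy described above.

```python
from typing import Optional


def qualified_table_name(table: str,
                         schema: Optional[str] = None,
                         catalog: Optional[str] = None) -> str:
    """Build a dotted name from whichever structural levels are present."""
    parts = [part for part in (catalog, schema, table) if part]
    return ".".join(parts)


# Catalog only, schema only, and catalog plus schema (hypothetical names).
print(qualified_table_name("customer", catalog="tbi"))
print(qualified_table_name("customer", schema="public"))
print(qualified_table_name("customer", schema="dbo", catalog="crm"))
```

Each missing level simply disappears from the path, which is what the different tree depths in the DQ Repository tree view reflect.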

Creating a connection

Before you begin

You have read What you need to know about some databases carefully.

Procedure

1. In the DQ Repository tree view, expand Metadata, right-click DB Connections and select Create DB Connection.

The Database Connection wizard opens.

2. In the Name field, enter a name for this new database connection.
Do not use spaces in the connection name.

Note:

Avoid using special characters in the item names including:

"~", "!", "`", "#", "^", "&", "*", "\\", "/", "?", ":", ";", "\"", ".", "(", ")", "'", "¥", "‘", "”", "«", "»", "<", ">".

These characters are all replaced with "_" in the file system and you may end up creating duplicate items.

3. If required, set other connection metadata (purpose, description and author name) in the corresponding fields and click Next
to proceed to the next step.

4. In the DB Type field, select from the drop-down list the type of database to which you want to connect, for example
MySQL.
For further information about supported databases, see the Talend Installation Guide.

Note: If you select to connect to a database that is not supported in the studio (using the ODBC or JDBC methods), it is
recommended to use the Java engine to execute the column analyses created on the selected database. For more
information on column analyses, see Defining the columns to be analyzed and setting indicators, and for more
information on the Java engine, see Using the Java or the SQL engine.

5. In the DB Version field, select the version of the database to which you are creating the connection.

6. Enter your login, password, server and port information in their corresponding fields.

7. In the Database field, enter the name of the database you are connecting to. To connect to all of the catalogs within one
connection, if the database allows it, leave this field empty.

8. Click the Check button to verify that your connection is successful.

If you have not already installed the database driver (.jar file) necessary to use the database, a wizard
prompts you to install the relevant third-party module. Click Download and Install and then close the wizard.
For further information about identifying and installing external modules, see the Talend Installation and Upgrade Guide or
click the How to install a driver link in the wizard.
For further information about the Module view, see Displaying the Module view.

9. Click Finish to close the Database Connection wizard.

A folder for the created database connection is displayed under DB Connection in the DQ Repository tree view. The
connection editor opens with the defined metadata in Talend Studio.

10. If you created this connection in a reference project, expand Tables > table name > Columns.
Expanding the columns in a reference project allows you to select them from the main project.
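
The note in the naming step above warns that special characters are replaced with "_" in the file system and can lead to duplicate items. A minimal Python sketch of that effect follows. The `file_system_name` function is hypothetical, and the character set covers only the ASCII characters from the list in the note.

```python
# ASCII subset of the special characters listed in the note (an assumption:
# the non-ASCII characters of the list are left out of this sketch).
SPECIAL_CHARS = set("~!`#^&*\\/?:;\".()'<>")


def file_system_name(item_name: str) -> str:
    """Replace each special character with an underscore."""
    return "".join("_" if ch in SPECIAL_CHARS else ch for ch in item_name)


# Two different item names that end up with the same name on disk.
first = file_system_name("crm.db")
second = file_system_name("crm:db")
print(first, second, first == second)
```

Because both names map to the same result, the second item would collide with the first, which is why such characters are best avoided.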

Results

Once you have created the connection, you can open a preview of the data in a specific database table in Talend Studio. For further
information, see Previewing data in the SQL editor.

From the connection editor, you can:

Click Connection information to show the connection parameters for the relevant database.

Click the Check button to check the status of your current connection.

Click the Edit... button to open the connection wizard and modify any of the connection information.

For information on how to set up a connection to a file, see Connecting to a file.

Creating a connection from a catalog or a schema

You can create a connection on a database catalog or schema directly from a database connection.

Before you begin

At least one database connection is set in the Profiling perspective of Talend Studio. For further information, see Connecting to a
database.

Procedure

1. In the DQ Repository tree view, expand Metadata > DB Connections and browse to the catalog or schema on which you want
to create the connection.

2. Right-click a catalog or schema and select Create a new connection.

A confirmation message is displayed.

3. Click OK.

Results

A new connection named after the selected connection and catalog is created under DB Connections.

Creating a connection to a custom database

The database connection wizard in Talend Studio lists the databases that you can connect to and profile.

You can still use Talend Studio to connect to a custom "unsupported" database.

Procedure

1. Choose JDBC as the database type in the connection wizard.

2. Fill in the connection parameters.

What to do next

After creating the connection to a custom database, you can profile and monitor data in this database by using different analyses
and indicators, as you do with supported databases. But you may need to change, in the Indicator Settings editor, the SQL query
template for some indicators, such as the regex indicator where each database has a different function to call. For further
information, see Editing a system indicator and Editing a user-defined indicator.

Note: If you have a problem profiling a custom database even though you use a JDBC connection, the reason could be that some
JDBC functions are not implemented by the JDBC driver library. In that case, raise an issue or ask for support via Talend Community at:
https://community.talend.com/

What you need to know about some databases

Google BigQuery
Profiling data from Google BigQuery requires a JDBC connection.

For more information, see how to build the Connection URL.

The RECORD data type is not supported.

When you set up a JDBC connection, specify each jar file extracted from the zip file.

Hive
The Hive server requires sufficient memory to run correctly. Before connecting to a Hive database:

1. Go to the Hive server configuration.

2. Set the HiveServer2 Java Heap Size parameter to at least 1 GB.

Once you are connected to a Hive database, you can create and execute different analyses as with the other database
types.

In the connection wizard, you must select from the Distribution list the platform that hosts Hive. You must also set the Hive version
and model.

For more information, see http://hadoop.apache.org/.

If you decide to change the user name in an embedded mode of a Hive connection, you must restart the studio before being able to
successfully run the profiling analyses that use the connection.

If the Hadoop distribution to be used is Hortonworks Data Platform V1.2 or Hortonworks Data Platform V1.3, you must set
proper memory allocations for the map and reduce computations to be performed by the Hadoop system. In the second step in the
connection wizard:

1. Click the button next to Hadoop Properties and in the open dialog box click the [+] button to add two lines to the table.

2. Enter the parameter names as mapred.job.map.memory.mb and mapred.job.reduce.memory.mb.

3. Set their values to the default value, 1000.

This value is normally appropriate for running the computations.

If the Hadoop distribution to be used is Hortonworks Data Platform V2.0 (YARN), you must set the following parameter in the Hadoop
Properties table:

The parameter is yarn.application.classpath

The value is
/etc/hadoop/conf,/usr/lib/hadoop/,/usr/lib/hadoop/lib/,/usr/lib/hadoop-hdfs/,/usr/lib/hadoop-hdfs/lib/,/usr/lib/hadoop-yarn/,/usr/lib/hadoop-yarn/lib/,/usr/lib/hadoop-mapreduce/,/usr/lib/hadoop-mapreduce/lib/
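
Because the classpath value is a single comma-separated string with no spaces or line breaks, it can help to assemble it from its entries. The Python sketch below does this, using the parameter names and values given above. The `hadoop_properties` function and the distribution keys `HDP_1_2`, `HDP_1_3` and `HDP_2_0` are hypothetical labels for this example.

```python
YARN_CLASSPATH_ENTRIES = [
    "/etc/hadoop/conf",
    "/usr/lib/hadoop/",
    "/usr/lib/hadoop/lib/",
    "/usr/lib/hadoop-hdfs/",
    "/usr/lib/hadoop-hdfs/lib/",
    "/usr/lib/hadoop-yarn/",
    "/usr/lib/hadoop-yarn/lib/",
    "/usr/lib/hadoop-mapreduce/",
    "/usr/lib/hadoop-mapreduce/lib/",
]


def hadoop_properties(distribution: str) -> dict:
    """Return the rows to add to the Hadoop Properties table."""
    if distribution in ("HDP_1_2", "HDP_1_3"):
        return {
            "mapred.job.map.memory.mb": "1000",
            "mapred.job.reduce.memory.mb": "1000",
        }
    if distribution == "HDP_2_0":
        # One comma-separated value, without spaces or line breaks.
        return {"yarn.application.classpath": ",".join(YARN_CLASSPATH_ENTRIES)}
    return {}


props = hadoop_properties("HDP_2_0")
print(props["yarn.application.classpath"])
```

The printed string is the value to paste into the Hadoop Properties table for the YARN case.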

Note that one analysis type and a few indicators and functions are not yet supported for Hive:

Unsupported indicators (with the SQL engine):

Soundex Low Frequency
Pattern(Low) Frequency
Upper Quartile and Lower Quartile
Median
All Date Frequency indicators

Unsupported functions:

The View rows contextual menu for column analyses with unique, duplicate and all textual indicators
The View match rows contextual menu for column analyses with unique, duplicate and all textual indicators
All contextual menus on the analysis results of functional dependency analysis

Unsupported analyses:

The only analysis that is not supported for Hive is Time Correlation Analysis, as the Date data type does not exist in
Hive. For further information on this analysis type, see Time correlation analyses.

Microsoft SQL Server

Microsoft SQL Server 2012 and later are supported.

If you select to connect to the Microsoft SQL Server database with Windows Authentication Mode, you can select Microsoft or JTDS
open source from the Db Version list.

MySQL
When creating a connection to MySQL via JDBC, it is not mandatory to include the database name in the JDBC URL. All databases
are retrieved whether or not the URL specified in the JDBC URL field includes the database name.

For example, if you specify jdbc:mysql://192.168.33.41:3306/tbi?noDatetimeStringSync=true , where tbi is the database name,
or jdbc:mysql://192.168.33.41:3306/?noDatetimeStringSync=true , all databases are retrieved.
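
To see the difference between the two URLs, the following Python sketch splits a MySQL JDBC URL into its parts. The `parse_mysql_jdbc_url` function is a hypothetical helper that only relies on the standard `urllib.parse` module. It does not connect to a database.

```python
from urllib.parse import parse_qs, urlparse


def parse_mysql_jdbc_url(jdbc_url: str) -> dict:
    """Split a MySQL JDBC URL into host, port, database and parameters."""
    prefix = "jdbc:"
    if not jdbc_url.startswith(prefix + "mysql://"):
        raise ValueError("not a MySQL JDBC URL")
    parsed = urlparse(jdbc_url[len(prefix):])
    return {
        "host": parsed.hostname,
        "port": parsed.port,
        # An empty path means that no database name was given.
        "database": parsed.path.lstrip("/") or None,
        "params": {k: v[0] for k, v in parse_qs(parsed.query).items()},
    }


with_db = parse_mysql_jdbc_url(
    "jdbc:mysql://192.168.33.41:3306/tbi?noDatetimeStringSync=true")
without_db = parse_mysql_jdbc_url(
    "jdbc:mysql://192.168.33.41:3306/?noDatetimeStringSync=true")
print(with_db["database"], without_db["database"])
```

Both URLs point to the same server and port. Only the optional database segment differs, which is why the list of retrieved databases is the same.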

To support surrogate pairs in data and metadata, you need to edit the following properties in the MySQL server configuration file:

[client]
default-character-set=utf8mb4

[mysql]
default-character-set=utf8mb4

[mysqld]
character-set-server=utf8mb4

Netezza
The Netezza database does not support regular expressions. If you want to use regular expressions with this database, you must:

Install the SQL Extensions Toolkit package on a Netezza system. Use the regexp_like function provided in this toolkit in the
SQL template as documented in
http://pic.dhe.ibm.com/infocenter/ntz/v7r0m3/topic/com.ibm.nz.sqltk.doc/r_sqlext_regexp_like.html.
Add the indicator definition for Netezza in the Pattern Matching folder in Talend Studio under Libraries > Indicators > System
Indicators.

The query template you need to define for Netezza is the following: SELECT COUNT(CASE WHEN
REGEXP_LIKE(<%=COLUMN_NAMES%>,<%=PATTERN_EXPR%>) THEN 1 END), COUNT(*) FROM <%=TABLE_NAME%><%=WHERE_CLAUSE%>.
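
To show how such a template turns into an executable query, the Python sketch below substitutes sample values for the placeholders. It is an illustration only: the `render_query` function, the sample column, pattern and table names, the use of COUNT(*) as the second aggregate, and the assumption that the WHERE clause placeholder expands to a leading " WHERE ..." are not taken from Talend Studio.

```python
# Template with the placeholders used in the text above; COUNT(*) as the
# second aggregate is an assumption of this sketch.
TEMPLATE = (
    "SELECT COUNT(CASE WHEN REGEXP_LIKE(<%=COLUMN_NAMES%>,<%=PATTERN_EXPR%>) "
    "THEN 1 END), COUNT(*) FROM <%=TABLE_NAME%><%=WHERE_CLAUSE%>"
)


def render_query(template, column, pattern, table, where=""):
    """Replace each placeholder with a concrete value."""
    values = {
        "<%=COLUMN_NAMES%>": column,
        "<%=PATTERN_EXPR%>": pattern,
        "<%=TABLE_NAME%>": table,
        # Assumed expansion: empty, or a clause with a leading space.
        "<%=WHERE_CLAUSE%>": f" WHERE {where}" if where else "",
    }
    query = template
    for placeholder, value in values.items():
        query = query.replace(placeholder, value)
    return query


query = render_query(TEMPLATE, "email", "'^[a-z]+@'", "customer")
print(query)
```

The first aggregate counts the rows that match the pattern and the second counts all rows, so the two values together give the match rate.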

Oracle
To support surrogate pairs, the NLS_CHARACTERSET parameter of the database must be set to UTF8 or AL32UTF8.

The default character set parameters are:

NLS_CHARACTERSET=WE8ISO8859P15
NLS_NCHAR_CHARACTERSET=AL16UTF16

Note: To check the database parameters, you can run the following SQL query: SQL> SELECT * FROM NLS_DATABASE_PARAMETERS;
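
Once you have the result of that query, checking the requirement is a simple lookup. The Python sketch below assumes that the rows were already fetched into a dictionary of parameter names and values. The `supports_surrogate_pairs` function is a hypothetical helper and does not query the database itself.

```python
SUPPORTED_CHARACTER_SETS = {"UTF8", "AL32UTF8"}


def supports_surrogate_pairs(nls_parameters: dict) -> bool:
    """Check the NLS_CHARACTERSET value against the supported character sets."""
    charset = nls_parameters.get("NLS_CHARACTERSET", "")
    return charset.upper() in SUPPORTED_CHARACTER_SETS


# Values as they would appear with the defaults listed above.
defaults = {
    "NLS_CHARACTERSET": "WE8ISO8859P15",
    "NLS_NCHAR_CHARACTERSET": "AL16UTF16",
}
print(supports_surrogate_pairs(defaults))
```

With the default values, the check fails, which indicates that the database character set must be changed before profiling data that contains surrogate pairs.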

PostgreSQL
When you connect to a PostgreSQL database via a JDBC connection, the INT4 and INT8 data types are replaced by a String data
type. As a consequence, if your analysis uses the T-Swoosh algorithm, the survivorship functions are for strings, not for numbers.

To change the data type:

1. Close the analysis and switch to the Integration perspective.

2. Expand Metadata, right-click the database connection, select Retrieve Schema and click Next.
3. Select the check box of the table to update.
4. When Creation status is set to Success, click Next.
5. If columns with no database type must be of Integer type, set DB Type to INT.
6. Click Finish and close the dialog box.
7. Switch to the Profiling perspective and open the analysis.
8. In Survivorship Rules for Columns, delete and add back the columns you updated. You can see the survivorship functions for
numbers (Largest and Smallest).
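
The reason the column type matters can be shown with plain Python comparisons. This sketch does not reproduce the T-Swoosh algorithm or the Talend survivorship functions. It only illustrates that picking the largest or smallest value gives different answers for strings and for numbers.

```python
# The same values, once as strings (as retrieved through JDBC in the case
# described above) and once converted to integers.
as_strings = ["9", "10", "100"]
as_numbers = [int(value) for value in as_strings]

# Strings are compared character by character, so "9" sorts after "10".
largest_string = max(as_strings)
smallest_string = min(as_strings)

# Numbers are compared by value.
largest_number = max(as_numbers)
smallest_number = min(as_numbers)

print(largest_string, smallest_string)
print(largest_number, smallest_number)
```

This is why the columns need the Integer type before numeric functions such as Largest and Smallest give meaningful results.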

SAP Hana
Profiling data from SAP Hana is only possible for Table, View and Calculation View schemas.

The Soundex frequency statistics indicators support the English alphabet only.

Snowflake
Profiling data from Snowflake requires a JDBC connection.

For more information, see https://docs.snowflake.net/manuals/user-guide/jdbc-configure.html.

Teradata
In the Teradata database, the regular expression function is installed by default only starting from version 14. If you want to use
regular expressions with older versions of this database, you must install a User Defined Function in Teradata and add the indicator
definition for Teradata in Talend Studio.

Catalogs and schemas in database systems

The structure of a database defines how objects are organized in the database. Different data storage structures are used to store
objects in databases. For example, the highest-level structure (such as "Catalog" followed by "Schema" and finally by "Table") is not
applicable to all database types.

Database name           Version    Catalog    Schema
AS/400                  V5R4
DB2                     -
DB2 ZOS                 -
Informix                -
Ingres                  -
Microsoft SQL Server    -
MySQL                   -
Netezza                 -
Oracle                  -
PointBase               -
PostgreSQL              -
SQLite                  -
Sybase                  -
Teradata                -

Connecting to a file

Before you begin

Before proceeding to analyze data in a delimited file, you must first set up the connection to this file.

Procedure
1. Expand the Metadata folder.

2. Right-click FileDelimited connections and then select Create File Delimited Connection to open the New Delimited File
wizard.

3. Follow the steps defined in the wizard to create a connection to a delimited file.

You can then create a column analysis and drop the columns to analyze from the delimited file metadata in the DQ
Repository tree view to the open analysis editor.

For more information, see Creating a basic column analysis on a file.

Managing connections to data sources

Several management options are available for each of the connections created in Talend Studio.

Managing database connections

Many management options are available for database connections including editing and duplicating the connection or adding a task
to it.

The sections below explain in detail these management options.

Opening or editing a database connection

You can edit the connection to a specific database and change the connection metadata and the connection information.

Before you begin

A database connection is created in the Profiling perspective of Talend Studio.

Procedure

1. In the DQ Repository tree view, expand Metadata > DB Connection.

2. Either:

Double-click the database connection you want to open, or,

right-click the database connection and select Open in the contextual menu.

The Database Connection wizard is displayed.

3. Go through the steps in the wizard and modify the database connection settings as required.

4. Click Finish to validate the modifications.

A dialog box opens prompting you to reload the updated database connection.

5. Select the reload option if you want to reload the new database structure for the updated database connection.

Note: If you select the don't reload option, you will still be able to execute the analyses using the connection even after
you update it.

If the database connection is used by profiling analyses in the Studio, another dialog box is displayed to list all the analyses
that use the database connection. It alerts you that if you reload the new database structure, all the analyses using the
connection will become unusable, although they will still be listed in the DQ Repository tree view.

6. Click OK to accept reloading the database structure or Cancel to cancel the operation and close the dialog box.
A number of confirmation messages are displayed one after the other.

7. Click OK to close the messages and reload the structure of the new connection.

Filtering a database connection

After setting a specific database connection in the studio, you may not want to view all databases in the DQ Repository tree view of
your Studio.

You can filter your database connections to list only the databases that match the filter you set. This option is very helpful
when a connection contains a large number of databases.

Before you begin

A database connection is created in the Profiling perspective of Talend Studio.

Procedure

1. In the DQ Repository tree view, expand Metadata > DB Connections.

2. Right-click the database connection you want to filter and select Package Filter to open the corresponding dialog box.

3. In the Package Filter field, enter the complete name of the database you want to view and then click Finish to close the
dialog box.
Only the database that matches the filter you set is listed under the database connection in the DQ Repository tree view.

4. If you want to cancel the filter, do the following:

a. In the Package Filter dialog box, delete the text from the Package Filter field.
b. Click Finish to close the dialog box.
All databases are listed under the selected database connection in the DQ Repository tree view.

Duplicating a database connection

To avoid creating a DB connection from scratch, you can duplicate an existing one in the DB Connections list and modify its
metadata to obtain a new connection.

Before you begin

A database connection is created in the Profiling perspective of Talend Studio.

Procedure

1. In the DQ Repository tree view, expand Metadata > DB Connections.

2. Right-click the connection you want to duplicate and select Duplicate from the contextual menu.

Results

The duplicated database connection is listed in the connection list in the DQ Repository tree view as a copy of the original
connection. You can now open the duplicated connection and modify its metadata as needed.

Adding a task to a database connection or any of its elements

You can add a task to a database connection to use it as a reminder to modify the connection or to flag a problem that needs to be
solved later, for example. You can also add a task to a catalog, a table or a column in the connection.

Before you begin

A database connection is created in the Profiling perspective of Talend Studio.

Procedure
1. Expand Metadata > DB connections.

2. Right-click the connection to which you want to add a task, and then select Add task... from the contextual menu.
The Properties dialog box opens showing the metadata of the selected connection.

3. In the Description field, enter a short description for the task you want to attach to the selected connection.

4. On the Priority list, select the priority level and then click OK to close the dialog box.

Results

The created task is added to the Tasks list.

What to do next

You can follow the same steps in the above procedure to add a task to a catalog, a table or a column in the connection. For further
information, see Adding a task to a column in a database connection.

For more information on how to access the task list, see Displaying the task list.

Filtering tables/views in a database connection

You can filter the tables/views to list under any database connection.

This option is very helpful when the database to which the studio is connecting contains a very large number of tables. In that
case, a message is displayed prompting you to set a table filter on the database connection so that only the tables you define
are listed in the DQ Repository tree view.

Before you begin

A database connection is created in the Profiling perspective of Talend Studio.

Procedure

1. In the DQ Repository tree view, expand Metadata > DB Connections.

2. Expand the database connection in which you want to filter tables/views and right-click the desired catalog/schema.

3. Select Table/View Filter from the list to display the corresponding dialog box.

4. Set a table and a view filter in the corresponding fields and click Finish to close the dialog box.

Results
Only tables/views that match the filter you set are listed in the DQ Repository tree view.
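
The effect of a table filter can be pictured with a small Python sketch that lists only the tables whose names match a
pattern. It uses an in-memory SQLite database and the SQL LIKE operator. The table names, the cust% pattern and the
list_tables helper are assumptions made for illustration, and the exact pattern syntax accepted by the Table/View Filter
dialog box may differ.

```python
import sqlite3

def list_tables(conn, name_pattern="%"):
    """List table names matching a SQL LIKE pattern, in alphabetical order."""
    cur = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name LIKE ? ORDER BY name",
        (name_pattern,),
    )
    return [row[0] for row in cur]

conn = sqlite3.connect(":memory:")
for table in ("customer", "customer_address", "orders"):
    conn.execute(f"CREATE TABLE {table} (id INTEGER)")

all_tables = list_tables(conn)         # no filter: every table is listed
filtered = list_tables(conn, "cust%")  # only names starting with "cust"
print(all_tables)
print(filtered)
```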

Deleting a database connection

You can move a database connection to the studio recycle bin whether it is used by analyses or not.

Before you begin

A database connection is created in Talend Studio. For further information, see Connecting to a database.

Procedure
1. In the DQ Repository tree view, expand Metadata > DB Connections.
2. Right-click a database connection and select Delete in the contextual menu.

The database connection is moved to the Recycle Bin.

You can still run the analyses that use the connection in the recycle bin. However, an alert message will be displayed next to
the connection name in the analysis editor.

3. To delete the connection from the Recycle Bin, do the following:

a. Right-click the database connection in the Recycle Bin and choose Delete from the contextual menu.
b. Click OK on the confirm deletion dialog box that opens.
If the connection is not used by any analysis, it is deleted from Talend Studio.

If the connection is used by one or more analyses in Talend Studio, a dialog box is displayed to list such analyses:

Either click OK to close the dialog box without deleting the database connection from the recycle bin, or
select the Force to delete all the dependencies check box and then click OK to delete the database connection from
the Recycle Bin and to delete all the dependent analyses from the Data Profiling node.

You can also permanently delete the database connection by emptying the recycle bin.

4. To empty the Recycle Bin, do the following:

a. Right-click the Recycle Bin and select Empty recycle bin.

If the connection is not used by any analysis in the current Studio, a confirmation dialog box is displayed.

b. Click Yes to empty the recycle bin.

If the connection is used by one or more analyses in the studio, a dialog box is displayed to list these analyses.

c. Click OK to close the dialog box without removing the connection from the recycle bin.

Restoring a database connection

You can restore a deleted database connection from the Talend Studio recycle bin.

Procedure

In the Recycle Bin, right-click the connection and select Restore.

Results
The database connection is moved back to the Metadata node.

Managing file connections

A few management options are available for file connections, including editing or deleting the connection, adding a task to it, or
importing and exporting the connection.

You can edit the connection to a specific file and change the connection metadata and the connection information.

Before you begin

A file connection is created in the Profiling perspective of Talend Studio.

Procedure

1. In the DQ Repository tree view, expand Metadata > FileDelimited connections.

2. Double-click the file connection you want to open.

The connection wizard is displayed.

3. Go through the steps in the wizard and modify the file connection settings as required.

4. Click Finish to validate the modifications.

Other management procedures for file connection are the same as those for databases.

For further information on how to add a task to a file connection, see Adding a task to a database connection or any of its
elements.

For further information on how to delete or restore a file connection, see Deleting a database connection.

For further information on how to import or export a file connection, see Importing data profiling items.

Profiling database content

Analyzing databases

You can analyze the content of a database to get an overview of the number of tables in the database, the number of rows per
table, and the indexes and primary keys.

You can also analyze one specific catalog or schema in a database, if a catalog or schema is used in the physical structure of the
database.

Creating a database content analysis

From the Profiling perspective of Talend Studio, you can create an analysis to examine the content of a given database.

Before you begin, make sure that you have defined at least one database connection in the Profiling perspective of Talend Studio.

To create a database content analysis, you must first define the relevant analysis and then select the database connection you want
to analyze.

From the Statistical information view, you can:

Click a catalog or a schema to list all tables included in it along with a summary of their content: number of rows, keys and
user-defined indexes.

The selected catalog or schema is highlighted in blue. Catalogs or schemas highlighted in red indicate potential problems in
data.

Right-click a catalog or a schema and select Overview analysis to analyze the content of the selected item.
Right-click a table or a view and select Table analysis to create a table analysis on the selected item.
Click any column header in the analytical table to sort the data listed in catalogs or schemas alphabetically.

Defining the connection overview analysis

Procedure

1. In the DQ Repository tree view, expand Data Profiling.

2. Right-click the Analyses folder and select New Analysis.

The Create New Analysis wizard opens.

3. In the filter field, start typing connection overview analysis, select Connection Overview Analysis from the list that is
displayed and click Next.

As a shortcut, you can create a database content analysis by right-clicking the database under Metadata > DB
connections and selecting Overview analysis from the contextual menu.

4. In the Name field, enter a name for the current analysis.

Note:

Avoid using special characters in the item names including:

"~", "!", "`", "#", "^", "&", "*", "\\", "/", "?", ":", ";", "\"", ".", "(", ")", "'", "¥", "'", """, "«", "»", "<", ">".

These characters are all replaced with "_" in the file system and you may end up creating duplicate items.

5. Set the analysis metadata (purpose, description and author name) in the corresponding fields and click Next.
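
The note in step 4 warns that special characters are replaced with "_" and that this can produce duplicate items. The
Python sketch below shows how two different item names can end up identical after such a replacement. The to_file_name
helper and the sample names are hypothetical, and the character set is only a subset of the characters listed in the note.

```python
# A subset of the special characters listed in the note (not exhaustive).
SPECIAL_CHARS = set("~!`#^&*\\/?:;\".()'¥«»<>")

def to_file_name(item_name):
    """Replace every special character with "_" (illustrative only)."""
    return "".join("_" if ch in SPECIAL_CHARS else ch for ch in item_name)

first = to_file_name("sales:2022")
second = to_file_name("sales/2022")
print(first, second)
print(first == second)  # True when the two distinct names collide
```

Using only letters, digits and underscores in analysis names avoids this kind of collision.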

Selecting the database connection you want to analyze

Procedure
1. Expand DB Connections and select a database connection to analyze, if more than one exists.

2. Click Next.

3. Set filters on the tables and/or views you want to analyze in their corresponding fields according to your needs using the SQL
language.
By default, the analysis examines all tables and views in the database.

4. Click Finish to close the Create New Analysis wizard.

A folder for the newly created analysis is listed under the Analyses folder in the DQ Repository tree view, and the connection
editor opens with the defined metadata.

Note: The display of the connection editor depends on the parameters you set in the Preferences window. For more
information, see Setting preferences of analysis editors and analysis results.

5. Click Analysis Parameters and do the following:

a. In the Number of connections per analysis field, set the number of concurrent connections allowed per analysis to
the selected database connection.
You can set this number according to the resources available on the database, that is, the number of concurrent
connections the database can support.
b. Check/modify filters on table and/or views, if any.
You can use context values.
c. Select the Reload databases check box if you want to reload all databases in your connection on the server when you
run the overview analysis.
When you try to reload a database, a message will prompt you for confirmation as any change in the database
structure may affect existing analyses.

6. In the Context Group Settings view, select from the list the context environment you want to use to run the analysis.
The table in this view lists all context environments and their values you define in the Contexts view in the analysis editor. For
further information, see Using context variables in analyses.

7. Press F6 to execute the analysis.

A message opens at the bottom of the editor to confirm that the operation is in progress and analysis results are opened in
the Analysis Results view.
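
To give an idea of the kind of summary an overview analysis reports (number of tables, rows per table, keys and
user-defined indexes), here is a minimal Python sketch against an in-memory SQLite database. It is not how Talend Studio
computes its results; the overview helper, the table names and the data are assumptions, and the catalog queries are
specific to SQLite.

```python
import sqlite3

def overview(conn):
    """Return {table: {"rows": n, "keys": k, "indexes": i}} for a SQLite database."""
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    summary = {}
    for table in tables:
        rows = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
        # PRAGMA table_info: the sixth field (pk) is non-zero for primary key columns.
        keys = sum(1 for col in conn.execute(f'PRAGMA table_info("{table}")') if col[5])
        # Indexes created with CREATE INDEX keep their SQL text in sqlite_master;
        # indexes that SQLite creates automatically have NULL there.
        indexes = conn.execute(
            "SELECT COUNT(*) FROM sqlite_master "
            "WHERE type = 'index' AND tbl_name = ? AND sql IS NOT NULL",
            (table,)).fetchone()[0]
        summary[table] = {"rows": rows, "keys": keys, "indexes": indexes}
    return summary

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER);
    CREATE INDEX idx_orders_customer ON orders (customer_id);
    INSERT INTO customer VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO orders VALUES (10, 1), (11, 1), (12, 2);
""")
result = overview(conn)
print(len(result), "tables")
print(result)
```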

Creating a catalog or schema analysis

You can use the Profiling perspective of Talend Studio to analyze one specific catalog or schema in a database, if this entity is used in
the physical structure of the database.

The result of the analysis gives analytical information about the content of this schema, for example number of rows, number of
tables, number of rows per table and so on.

Before you begin

At least one database connection has been created to connect to a database that uses the "catalog" or "schema" entity. For further
information, see Connecting to a database.

Procedure

1. Under DB connections in the DQ Repository tree view, right-click the catalog or schema for which you want to create a content
analysis and select Overview analysis from the contextual menu.
This example shows how to create a schema analysis.

2. In the wizard that opens, enter a name for the current analysis.

Note:

Avoid using special characters in the item names including:

"~", "!", "`", "#", "^", "&", "*", "\\", "/", "?", ":", ";", "\"", ".", "(", ")", "'", "¥", "'", """, "«", "»", "<", ">".

These characters are all replaced with "_" in the file system and you may end up creating duplicate items.

3. If required, set the analysis metadata (purpose, description and author name) in the corresponding fields and click Next.

4. Set filters on the tables and/or views you want to analyze in their corresponding fields according to your needs using the SQL
language.
By default, the analysis examines all tables and views in the catalog.

5. Click Finish.
A folder for the newly created analysis is listed under Analyses in the DQ Repository tree view, and the analysis editor opens
with the defined metadata.

6. Press F6 to execute the analysis.

A message opens at the bottom of the editor to confirm that the operation is in progress and analysis results are opened in
the Analysis Results view.

From the Statistical information view, you can:

Click the schema to list all tables included in it along with a summary of their content: number of rows, keys and
user-defined indexes.

The selected schema is highlighted in blue. Schemas highlighted in red indicate potential problems in data.

Right-click a schema and select Overview analysis to analyze the content of the selected item.

Right-click a table or a view and select Table analysis to create a table analysis on the selected item. You can also view
the keys and indexes of a selected table. For further information, see Displaying keys and indexes of database tables.

Click any column header in the analytical table to sort the listed data alphabetically.

Previewing data in the SQL editor

After you create a connection to a database, you can open a view in Talend Studio to see actual data in the database.

Before you begin

A database connection is created in the Profiling perspective of Talend Studio.

Procedure
1. In the DQ Repository tree view, expand Metadata > DB Connections.

2. Browse to a table in a given database connection, right-click it and select Preview.

The SQL editor opens in Talend Studio listing the data in the selected table.

3. Optional: Modify the query as required and save it.

Results

The query is listed under the Libraries > Source Files folder in the DQ Repository tree view.

Displaying keys and indexes of database tables

After analyzing the content of a database, you can display the details of the keys and user-defined indexes of a given table. This
information can be very useful to the database administrator.

Before you begin

At least one database content analysis has been created and executed in the Profiling perspective of Talend Studio.

Procedure

1. In the Analysis Results view of the analysis editor, click a catalog or a schema under Statistical Information.
All the tables included in the selected catalog or schema are listed along with a summary of their content: number of rows,
keys and user-defined indexes.

2. In the table list, right-click the table and select View keys.
You cannot display the key details of tables in a Hive connection.

The Database Structure and the Database Detail views display the structure of the analyzed database and information about
the primary key of the selected table.

3. Optional: If one or both views do not show, select Window > Show View > Database Structure or Window > Show View >
Database Detail.

4. In the table list, right-click the table and select View indexes.
You cannot display the index details of tables in a Hive connection.
The Database Structure and the Database Detail views display the structure of the analyzed database and information about
the user-defined index of the selected table.

5. If required, click any of the tabs in the Database Detail view to display the relevant metadata about the selected table.
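
Outside the Studio, comparable key and index details can be read from the database catalog. The Python sketch below does
this for SQLite with PRAGMA statements. The key_and_index_details helper, the account table and the index name are
hypothetical, and other database systems expose this information through different catalog views.

```python
import sqlite3

def key_and_index_details(conn, table):
    """Return (primary key columns, {index name: [indexed columns]}) for a table."""
    # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk).
    pk_columns = [col[1] for col in conn.execute(f'PRAGMA table_info("{table}")') if col[5]]
    indexes = {}
    # PRAGMA index_list rows start with (seq, name, unique, ...).
    for idx in conn.execute(f'PRAGMA index_list("{table}")').fetchall():
        name = idx[1]
        if name.startswith("sqlite_autoindex"):
            continue  # skip indexes that SQLite creates on its own
        # PRAGMA index_info rows: (seqno, cid, name).
        indexes[name] = [c[2] for c in conn.execute(f'PRAGMA index_info("{name}")')]
    return pk_columns, indexes

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE account (account_id INTEGER PRIMARY KEY, email TEXT, country TEXT);
    CREATE INDEX idx_account_country ON account (country);
""")
pk_columns, indexes = key_and_index_details(conn, "account")
print(pk_columns)
print(indexes)
```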

Synchronizing metadata connections and database structures

When a source database is changed or updated, the relevant connection structure in Talend Studio must follow that change or
update as well. Otherwise, errors may occur when you try to analyze a column that has been modified or deleted in the
database.

You can synchronize the connection structure displayed in the DQ Repository tree view with the database structures to eliminate any
inconsistency. You can perform synchronization at the following three levels:

DB connection: to refresh the catalog and schema lists,

Tables: to refresh the list of tables,
Columns: to refresh the list of columns.
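
Conceptually, synchronization at any of these levels amounts to comparing a stored list with the live list in the database.
The Python sketch below shows that comparison for a column list. The compare_columns helper and the column names are
hypothetical and only illustrate the idea; they do not reflect how the Studio stores or reloads metadata.

```python
def compare_columns(stored, live):
    """Compare a stored column list with the live one; return (added, removed)."""
    stored_set, live_set = set(stored), set(live)
    added = sorted(live_set - stored_set)    # in the database, not yet in the repository
    removed = sorted(stored_set - live_set)  # in the repository, gone from the database
    return added, removed

stored_columns = ["id", "name", "fax"]   # what the repository currently lists
live_columns = ["id", "name", "email"]   # what the database now contains
added, removed = compare_columns(stored_columns, live_columns)
print("added:", added)
print("removed:", removed)
```

An analysis that still references a removed column is the kind of case that the confirmation messages in the procedures
below warn about.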

Synchronizing and reloading catalog and schema lists

You can compare and match the catalog and schema lists in the DQ Repository tree view with those in the database.

Before you begin

A database connection is created in the Profiling perspective of Talend Studio.

Procedure

1. In the DQ Repository tree view, expand Metadata > DB Connections.

2. Right-click the database connection you want to synchronize with the database and select Reload database list.

A message will prompt you for confirmation as any change in the database structure may affect the analyses created on
these catalogs or schemas from Talend Studio.

3. Click OK to close the confirmation message, or Cancel to stop the operation.

Results

The selected database connection is updated with the new catalogs and schemas, if any.

Synchronizing and reloading table lists

You can compare and match the table lists in the DQ Repository tree view with those in the database.

Before you begin

A database connection is created in the Profiling perspective of Talend Studio.

Procedure

1. In the DQ Repository tree view, expand Metadata > DB Connections.

2. Browse through the entities in your database connection to reach the Tables folder you want to synchronize with the
database.

3. Right-click the Tables folder and select Reload table list.

A message will prompt you for confirmation as any change in the database structure may affect the analyses created on
these tables from Talend Studio.

4. Click OK to close the confirmation message, or Cancel to stop the operation.

The selected table list is updated with the new tables in the database, if any.

Synchronizing and reloading column lists

You can compare and match the column lists in the DQ Repository tree view with those in the database.

Before you begin

A database connection is created in the Profiling perspective of Talend Studio.

Procedure

1. In the DQ Repository tree view, expand Metadata > DB Connections.

2. Browse through the entities in your database connection to reach the Columns folder you want to synchronize with the
database.

3. Right-click the Columns folder and select Reload column list.

A message will prompt you for confirmation as any change in the database structure may affect the analyses created on
these columns from Talend Studio.

4. Click OK to close the confirmation message, or Cancel to stop the operation.

Results

The selected column list is updated with the new columns in the database, if any.

Redundancy analyses
What are redundancy analyses?

Redundancy analyses are column comparison analyses that better explore the relationships between tables through:

Comparing identical columns in different tables,

Matching foreign keys in one table to primary keys in the other table and vice versa.

Redundancy analyses support only database tables.

Comparing identical columns in different tables

From Talend Studio, you can create an analysis that compares two identical sets of columns in two different tables. This redundancy
analysis supports only database tables.

Prerequisite(s): At least one database connection is set in the Profiling perspective of the studio. For further information, see
Connecting to a database.

Through this view, you can also access the actual analyzed data via the Data Explorer.

To access the analyzed data rows, right-click any of the lines in the table and select:

Option                 To...

View match rows        access a list of all rows that could be matched in the two identical column sets

View not match rows    access a list of all rows that could not be matched in the two identical column sets

View rows              access a list of all rows in the two identical column sets

Warning: The data explorer does not support connections that have an empty user name, such as Single sign-on of MS SQL Server.
If you analyze data using such a connection and you try to view data rows in the Data Explorer perspective, a warning message
prompts you to set your connection credentials to the SQL Server.

The figure below illustrates the data explorer list of all rows that could be matched in the two sets, eight in this example.

From the SQL editor, you can save the executed query and list it under the Libraries > Source Files folder in the DQ Repository tree
view if you click the save icon on the editor toolbar. For more information, see Saving the queries executed on indicators.

The figure below illustrates the data explorer list of all rows that could not be matched in the two sets, three in this example.

Defining the redundancy analysis

Procedure

1. In the DQ Repository tree view, expand Data Profiling.

2. Right-click the Analyses folder and select New Analysis.

The Create New Analysis wizard opens.

3. In the filter field, start typing redundancy analysis, select Redundancy Analysis from the list and click Next.

4. In the Name field, enter a name for the current analysis.

Note:

Avoid using special characters in the item names including:

"~", "!", "`", "#", "^", "&", "*", "\\", "/", "?", ":", ";", "\"", ".", "(", ")", "'", "¥", "'", """, "«", "»", "<", ">".

These characters are all replaced with "_" in the file system and you may end up creating duplicate items.

5. Set the analysis metadata (purpose, description and author name) in the corresponding fields and then click Next.

Selecting the identical columns you want to compare

Procedure

1. Expand DB connections and in the desired database, browse to the columns you want to analyze, select them and then click
Finish to close the wizard.
A file for the newly created analysis is listed under the Analysis folder in the DQ Repository tree view. The analysis editor
opens with the defined analysis metadata.

The display of the analysis editor depends on the parameters you set in the Preferences window. For more information, see
Setting preferences of analysis editors and analysis results.

2. Click Analyzed Column Sets to open the view where you can set the columns or modify your selection.
In this example, you want to compare identical columns in the account and account_back tables.

3. From the Connection list, select the database connection relevant to the database to which you want to connect.
You can find in this list all the database connections you create and centralize in the Studio repository.

4. Click A Column Set to open the Column Selection dialog box.

5. Browse the catalogs/schemas in your database connection to reach the table holding the columns you want to analyze.
You can filter the table or column lists by typing the desired text in the Table filter or Column filter fields respectively. The lists
will show only the tables/columns that correspond to the text you type in.

6. Click the table name to list all its columns in the right-hand panel of the Column Selection dialog box.

7. In the list to the right, select the check boxes of the column(s) you want to analyze and click OK to proceed to the next step.
You can drag the columns to be analyzed directly from the DQ Repository tree view to the editor.
If you right-click any of the listed columns in the Analyzed Columns view and select Show in DQ Repository view, the selected
column will be automatically located under the corresponding connection in the tree view.

8. Click B Column Set and follow the same steps to select the second set of columns or drag it to the right column panel.

9. Select the Compute only number of A rows not in B check box if you want to match the data from the A set against the data
from the B set and not vice versa.

10. Select the Ignore Null check box if you want to ignore the NULL values when matching.
This check box is available only if you have installed the R2021-05 Studio monthly update or a later one provided by Talend.

Finalizing and executing the analysis

Procedure

1. In the Data Filter view, enter an SQL WHERE clause to filter the data on which to run the analysis, if required.

2. In the Analysis Parameter view and in the Number of connections per analysis field, set the number of concurrent
connections allowed per analysis to the selected database, if required.
You can set this number according to the resources available on the database, that is, the number of concurrent
connections the database can support.

3. If you have defined context variables in the Contexts view in the analysis editor, complete the following steps:
a. Use the Data Filter and Analysis Parameter views to set/select context variables to filter data and to decide the
number of concurrent connections per analysis respectively.
b. In the Context Group Settings view, select from the list the context environment you want to use to run the analysis.
For further information about contexts and variables, see Using context variables in analyses.

4. Save the analysis and press F6 to execute it.

A confirmation message is displayed.

5. Click OK if you want to continue the operation.

Results

The Analysis Results view opens showing the analysis results.

In this example, 72.73% of the data present in the columns in the account table could be matched with the same data in the columns
in the account_back table.
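
The percentage is consistent with the row counts mentioned earlier in this section: eight matched rows and three
unmatched rows give 8 / 11, or about 72.73%, assuming the rate is computed as matched rows over all rows of the A set.
The Python sketch below reproduces that arithmetic. The match_rate helper and the sample rows are hypothetical and do not
represent the query the Studio generates.

```python
def match_rate(rows_a, rows_b):
    """Return (matched, not_matched, percent matched) for rows of A looked up in B."""
    lookup = set(rows_b)
    matched = sum(1 for row in rows_a if row in lookup)
    not_matched = len(rows_a) - matched
    percent = round(100 * matched / len(rows_a), 2) if rows_a else 0.0
    return matched, not_matched, percent

# Hypothetical data: 11 rows in set A, 8 of which also exist in set B.
rows_a = [(i, f"name{i}") for i in range(1, 12)]
rows_b = [(i, f"name{i}") for i in range(1, 9)]
matched, not_matched, percent = match_rate(rows_a, rows_b)
print(matched, not_matched, percent)  # prints: 8 3 72.73
```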

Matching primary and foreign keys

You can create an analysis that matches foreign keys in one table to primary keys in the other table and vice versa. This redundancy
analysis supports only database tables.
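
The two directions of this match can be sketched with plain SQL run from Python against an in-memory SQLite database:
foreign key values with no matching primary key, and primary key values that no foreign key references. The customer and
orders tables and their data are assumptions made for illustration; the queries are not the ones the Studio executes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customer VALUES (1), (2), (3);
    INSERT INTO orders VALUES (10, 1), (11, 1), (12, 4);
""")

# Foreign key values that have no matching primary key.
orphans = [r[0] for r in conn.execute("""
    SELECT DISTINCT o.customer_id FROM orders o
    LEFT JOIN customer c ON c.id = o.customer_id
    WHERE c.id IS NULL ORDER BY o.customer_id""")]

# Primary key values that are never referenced by a foreign key.
unused = [r[0] for r in conn.execute("""
    SELECT c.id FROM customer c
    LEFT JOIN orders o ON o.customer_id = c.id
    WHERE o.customer_id IS NULL ORDER BY c.id""")]

print("foreign keys without a primary key:", orphans)
print("primary keys without a foreign key:", unused)
```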

Prerequisites
A database connection is created in the Profiling perspective of Talend Studio.

Access the actual analyzed data

To match primary and foreign keys in tables, do the following:

Through this view, you can also access the actual analyzed data via the data explorer.

To access the analyzed data rows, right-click any of the lines in the table and select:

Option                 To...

View match rows        access a list of all rows that could be matched in the two identical column sets

View not match rows    access a list of all rows that could not be matched in the two identical column sets

View rows              access a list of all rows in the two identical column sets

The figure below illustrates in the data explorer the list of all analyzed rows in the two columns.

Defining the analysis to match primary and foreign keys in tables

Procedure
1. In the DQ Repository tree view, expand the Data Profiling folder.

2. Right-click the Analyses folder and select New Analysis.

The Create New Analysis wizard opens.

3. In the filter field, start typing redundancy analysis, select Redundancy Analysis and then click Next.

4. In the Name field, enter a name for the current analysis.

Note:

Avoid using special characters in the item names including:

"~", "!", "`", "#", "^", "&", "*", "\\", "/", "?", ":", ";", "\"", ".", "(", ")", "'", "¥", "‘", "”", "«", "»", "<", ">".

These characters are all replaced with "_" in the file system and you may end up creating duplicate items.

5. Set the analysis metadata (purpose, description and author name) in the corresponding fields and then click Finish.
A file for the newly created analysis is displayed under the Analyses folder in the DQ Repository tree view. The analysis editor
opens with the defined analysis metadata.
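
The naming note in step 4 can be illustrated with a short Python sketch. The helper to_file_name below is hypothetical and is not part of Talend Studio. It only mimics the behavior described in the note, replacing each special character with "_" (limited here to the ASCII characters of the list), to show how two different item names can end up identical.

```python
# ASCII subset of the special characters listed in the note above.
SPECIAL_CHARACTERS = set('~!`#^&*\\/?:;".()\'<>')


def to_file_name(item_name):
    """Hypothetical helper: replace every special character with an underscore."""
    return "".join("_" if ch in SPECIAL_CHARACTERS else ch for ch in item_name)


# Two distinct item names that differ only by a special character.
first = to_file_name("keys:customer")
second = to_file_name("keys/customer")

print(first, second)
```

Both names map to the same string, which is the kind of duplicate the note warns about.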

Selecting the primary and foreign keys

Procedure
1. Click Analyzed Column Sets to display the corresponding view.
In this example, you want to match the foreign keys in the customer_id column of the sales_fact_1998 table with the primary
keys in the customer_id column of the customer table, and vice versa. This explores the relationship between the two
tables to show, for example, whether every customer placed an order in 1998.

2. From the Connection list, select the database connection relevant to the database to which you want to connect.
This list contains all the connections you have created and centralized in the Talend Studio repository.

3. Click A Column Set to open the Column Selection dialog box.

If you want to check the validity of the foreign keys, select the column holding the foreign keys for the A set and the column
holding the primary keys for the B set.

4. Browse the catalogs/schemas in your database connection to reach the table holding the column you want to match.

In this example, the column to be analyzed is customer_id that holds the foreign keys.
You can filter the table or column lists by typing the desired text in the Table filter or Column filter fields respectively. The lists
will show only the tables/columns that correspond to the text you type in.

5. Click the table name to display all its columns in the right-hand panel of the Column Selection dialog box.

6. In the list to the right, select the check box of the column holding the foreign keys and then click OK to proceed to the next
step.
You can drag the columns to be analyzed directly from the DQ Repository tree view to the editor.
If you right-click any of the listed columns in the Analyzed Columns view and select Show in DQ Repository view, the selected
column will be automatically located under the corresponding connection in the tree view.

7. Click B Column Set and follow the same steps to select the column holding the primary keys or drag it from the DQ
Repository to the right column panel.
If you select the Compute only number of rows not in B check box, you will look for any missing primary keys in the column
in the B set.

8. Click Data Filter in the analysis editor to display the view where you can set a filter on each of the analyzed columns.

9. Press F6 to execute this key-matching analysis.

A confirmation message is displayed.

10. Click OK in the message if you want to continue the operation.

The execution of this type of analysis may take some time. Wait until the Analysis Results view opens automatically showing
the analysis results.

Results

In this example, every foreign key in the sales_fact_1998 table is identified with a primary key in the customer table. However,
98.22% of the primary keys in the customer table could not be identified with foreign keys in the sales_fact_1998 table. These
primary keys are for the customers who did not order anything in 1998.
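
As a rough illustration of what this kind of key-matching analysis measures, the following Python sketch uses the standard sqlite3 module with a tiny in-memory database. The table and column names come from the example above, but the rows and the helper count_not_in are invented for illustration, and the query is not necessarily the one Talend Studio generates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY);
    CREATE TABLE sales_fact_1998 (customer_id INTEGER);
    INSERT INTO customer VALUES (1), (2), (3), (4);
    INSERT INTO sales_fact_1998 VALUES (1), (1), (3);
""")


def count_not_in(conn, table_a, table_b, column="customer_id"):
    """Count the distinct values of `column` in table_a that are absent from table_b."""
    query = (
        f"SELECT COUNT(DISTINCT a.{column}) FROM {table_a} a "
        f"LEFT JOIN {table_b} b ON a.{column} = b.{column} "
        f"WHERE b.{column} IS NULL"
    )
    return conn.execute(query).fetchone()[0]


# Foreign keys that do not point to any existing primary key.
orphan_foreign_keys = count_not_in(conn, "sales_fact_1998", "customer")
# Primary keys that are never referenced by a foreign key.
unused_primary_keys = count_not_in(conn, "customer", "sales_fact_1998")

print(orphan_foreign_keys, unused_primary_keys)
```

With this made-up data, every foreign key has a matching primary key, while two of the four primary keys are never referenced, which is the same shape of result as in the example above.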

Table analyses
Steps to analyze database tables

About this task

You can examine the data available in single tables of a database and collect information and statistics about this data.

The sequence of profiling data in one or multiple tables may involve the following steps:

Procedure

1. Defining one or more tables on which to carry out data profiling processes that will define the content, structure and quality
of the data included in the table(s).

2. Creating SQL business rules based on WHERE clauses and adding them as indicators to table analyses.

3. Creating column functional dependencies analyses to detect anomalies in the column dependencies of the defined table(s)
through defining columns as either "determinant" or "dependent".

What to do next

Check Analyzing tables in databases for information about the different options to analyze a table.

Analyzing tables in databases

Table analyses can range from simple table analyses to table analyses that use SQL business rules or table analyses that detect
anomalies in the table columns.

Using Talend Studio, you can better explore the quality of data in a database table through any of the following:

Creating a simple table analysis through analyzing all columns in the table using patterns.
Adding data quality rules as indicators to table analysis.
Detecting anomalies in column dependencies.
Comparing a set of columns and creating groups of similar records using blocking and matching keys and/or survivorship
rules.

Creating a simple table analysis (Column Set Analysis)

You can analyze the content of a set of columns. This set can represent only some of the columns in the defined table or the table as
a whole.

The analysis of a set of columns focuses on a column set (full records) and not on separate columns, as is the case with the column
analysis. The statistics presented in the analysis results (row count, distinct count, unique count and duplicate count) are measured
on the full records across the whole data set; the values within each column are not analyzed separately.
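
The following Python sketch shows one way to read these four statistics when they apply to full records. The sample rows and the function record_statistics are invented, and the definitions used (unique = records occurring exactly once, duplicate = distinct records occurring more than once) are an assumption about how the indicators are usually defined, not a description of the Studio's internal code.

```python
from collections import Counter

# Each tuple is one full record (account_num, education, email).
records = [
    ("1001", "Bachelors", "a@example.com"),
    ("1002", "Masters", "b@example.com"),
    ("1001", "Bachelors", "a@example.com"),  # exact duplicate of the first record
    ("1003", "Bachelors", "a@example.com"),  # shares values, but is a different record
]


def record_statistics(rows):
    """Compute simple statistics on whole records rather than on single columns."""
    occurrences = Counter(rows)
    return {
        "row_count": len(rows),
        "distinct_count": len(occurrences),
        "unique_count": sum(1 for n in occurrences.values() if n == 1),
        "duplicate_count": sum(1 for n in occurrences.values() if n > 1),
    }


stats = record_statistics(records)
print(stats)
```

Note that the fourth record repeats individual column values but still counts as unique, because the comparison is made on the record as a whole.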

With the Java engine, you may also apply patterns on each column and the result of the analysis will give the number of records
matching all the selected patterns together. For further information, see Adding patterns to the analyzed columns.

Note: When you use the Java engine to run a column set analysis on big sets or on data with many problems, it is advisable to
define a maximum memory size threshold to execute the analysis as you may end up with a Java heap error. For more
information, see Defining the maximum memory size threshold.

Creating an analysis of a set of columns using patterns

This type of analysis provides simple statistics on the full records of the analyzed column set and not on the values within each
column separately. For more information about simple statistic indicators, see Simple statistics.

With this analysis, you can use patterns to validate the full records against all patterns and have a single-bar result chart that shows
the number of the rows that match "all" the patterns.

Defining the set of columns to be analyzed

Before you begin, you have defined at least one database connection in the Profiling perspective of Talend Studio.

Defining the analysis

Procedure

Set column analysis metadata (Purpose, Description and Author) in the corresponding fields and click Next to proceed to the next
step.

Selecting the set of columns you want to analyze

Procedure

1. Expand DB Connections.

2. In the desired database, browse to the columns you want to analyze, select them and then click Finish to close this New
analysis wizard.
In this example, you want to analyze a set of six columns in the customer table: account number (account_num), education
(education), email (email), first name (fname), second name (lname) and gender (gender). The statistics presented in
the analysis results are the row count, distinct count, unique count and duplicate count, which all apply to records (values of
a set of columns).
The analysis editor opens with the defined analysis metadata, and a folder for the newly created analysis is displayed under
Analyses in the DQ Repository tree view.

Sample data is displayed in the Data Preview section and the selected columns are displayed in the Analyzed Columns
section of the analysis editor.

3. In the Data preview section, select:

Option                            To...

New Connection                    open a wizard and create or change the connection to the data source from within
                                  the editor.
                                  The Connection field on top of this section lists all the connections created in
                                  the Studio.

Select Columns                    open the Column Selection dialog box where you can select the columns to analyze
                                  or change the selection of the columns listed in the table.
                                  From the open dialog box, you can filter the table or column lists by using the
                                  Table filter or Column filter fields respectively.

n first rows or n random rows     list in the table N first data records from the selected columns or list N random
                                  records from the selected columns.

Refresh Data                      display the data in the selected columns according to the criteria you set.

Run with sample data              run the analysis only on the sample data set in the Limit field.

4. In the Limit field, set the number for the data records you want to display in the table and use as sample data.

Adding patterns to the analyzed columns

You can add patterns to one or more of the analyzed columns to validate the full record (all columns) against all the patterns, and
not to validate each column against a specific pattern, as is the case with the column analysis. The results chart is a single bar chart
for all the patterns used. This chart shows the number of rows that match "all" the patterns.
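
The "all match" idea can be sketched in a few lines of Python. The regular expressions, the sample rows and the function matches_all are hypothetical; in particular, the expressions are not the pattern definitions shipped with Talend Studio, and treating a missing value as a non-match is an assumption.

```python
import re

# One hypothetical pattern per analyzed column.
patterns = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[A-Za-z]{2,}"),
    "gender": re.compile(r"[MF]"),
}

rows = [
    {"email": "ann@example.com", "gender": "F"},
    {"email": "bob-at-example.com", "gender": "M"},  # email does not match
    {"email": "cy@example.org", "gender": "X"},      # gender does not match
    {"email": None, "gender": "M"},                  # missing email
]


def matches_all(row, patterns):
    """Return True only if every patterned column of the record matches its pattern."""
    return all(
        row[column] is not None and pattern.fullmatch(row[column]) is not None
        for column, pattern in patterns.items()
    )


all_match = sum(matches_all(row, patterns) for row in rows)
not_all_match = len(rows) - all_match

print(all_match, not_all_match)
```

A record counts as a match only when all of its patterned columns are valid, so a single failing column is enough to move the whole record to the non-match side.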

Warning: Before you can use a specific pattern with a set of columns analysis, you must manually set the pattern definition for
Java in the pattern settings, if it does not already exist. Otherwise, a warning message is displayed prompting you to
set the definition of the Java regular expression.

Before you begin

An analysis of a set of columns is open in the analysis editor in the Profiling perspective of Talend Studio.

Procedure

Select the check box(es) of the expression(s) you want to add to the selected column, then click OK.
The added regular expression(s) display(s) under the analyzed column(s) in the Analyzed Columns view and the All Match
indicator is displayed in the Indicators list in the Indicators view.

Finalizing and executing the analysis of a set of columns

Before executing this set of columns analysis, you still need to define the indicator settings, data filter and analysis parameters.

Before you begin

A column set analysis has already been defined in the Profiling perspective of Talend Studio.

Procedure

1. In the Analysis Parameters view:

In the Number of connections per analysis field, set the number of concurrent connections allowed per analysis to
the selected database connection.

You can set this number according to the database available resources, that is the number of concurrent connections
each database can support.

From the Execution engine list, select the engine, Java or SQL, you want to use to execute the analysis.

If you select the Java engine, the Store data check box is selected by default and cannot be unselected. Once
the analysis is executed, the profiling results are always available locally to drill down through the Analysis
Results > Data view.

Executing the analysis with the Java engine uses disk space as all data is retrieved and stored locally. If you
want to free up some space, you may delete the data stored in the main Talend Studio directory, at
Talend-Studio/workspace/project_name/Work_MapDB.

If you select the SQL engine, you can use the Store data check box to decide whether to store locally the
analyzed data and access it in the Analysis Results > Data view.

Note: If the data you are analyzing is very big, it is advisable to leave the Store data check box unselected
in order not to store the results at the end of the analysis computation.

2. Save the analysis and press F6 to execute it.

The analysis editor switches to the Analysis Results view where you can read the analysis results in tables and graphics. The
graphical result provides the simple statistics on the full records of the analyzed column set and not on the values within
each column separately.

When you use patterns to match the content of the set of columns, another graphic is displayed to illustrate the match and
non-match results against the totality of the used patterns.

3. In the Simple Statistics table, right-click an indicator result and select View Rows or View Values.

If you run the analysis with the Java engine, a list of the analyzed data is opened in the Profiling perspective.
If you run the analysis with the SQL engine, a list of the analyzed data is opened in the Data Explorer perspective.

4. In the Data view, click Filter Data to filter the valid/invalid data according to the used patterns.
You can filter data only when you run the analysis with the Java engine.
For further information, see Filtering data against patterns.

Filtering data against patterns

After analyzing a set of columns against a group of patterns and having the results of the rows that match or do not match "all" the
patterns, you can filter the valid/invalid data according to the used patterns.

Before you begin

An analysis of a set of columns is open in the analysis editor in the Profiling perspective of Talend Studio.

You have used the Java engine to execute the analysis.

Procedure

1. In the analysis editor, click the Analysis Results tab at the bottom of the editor to open the detailed result view.

2. Click Data to open the corresponding view.

A table lists the actual analyzed data in the analyzed columns.

3. Click Filter Data on top of the table.

A dialog box is displayed listing all the patterns used in the column set analysis.

4. Select the check box(es) of the pattern(s) according to which you want to filter data.

5. Select a display option as follows:

Select          To...

All data        show all analyzed data.

matches         show only the data that matches the selected pattern.

non-matches     show the data that does not match the selected pattern(s).

6. Click Finish to close the dialog box.

Results

In this example, data is filtered against the Email Address pattern, and only the data that does not match is displayed.

All email addresses that do not match the selected pattern appear in red. Any data row that has a missing value appears with a red
background.
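
The three display options can be pictured as a simple filter over the analyzed values. The Python sketch below is only an illustration: the EMAIL expression is a hypothetical stand-in for the Email Address pattern, the function filter_values is invented, and treating a missing value as a non-match is an assumption.

```python
import re

# Hypothetical pattern, not the Email Address definition shipped with Talend Studio.
EMAIL = re.compile(r"[^@\s]+@[^@\s]+\.[A-Za-z]{2,}")

emails = ["ann@example.com", "bob-at-example.com", None, "cy@example.org"]


def filter_values(values, pattern, display="All data"):
    """Return the values to show for one of the three display options."""

    def is_match(value):
        # Assumption: a missing value never matches the pattern.
        return value is not None and pattern.fullmatch(value) is not None

    if display == "matches":
        return [value for value in values if is_match(value)]
    if display == "non-matches":
        return [value for value in values if not is_match(value)]
    return list(values)


matching = filter_values(emails, EMAIL, "matches")
non_matching = filter_values(emails, EMAIL, "non-matches")

print(matching)
print(non_matching)
```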

The Previous and Next buttons under the table help you to navigate back and forth through pages.

Numbered buttons are displayed under the table to access pages directly:

when you open the Data view for the first time after running the analysis,
if you did not select a pattern in the Filter Data dialog box, or
if you selected All data as the display option in the Filter Data dialog box.

Creating a column analysis from a simple table analysis

You can create a column analysis on one or more columns defined in a simple table analysis (column set analysis).

Before you begin

A simple table analysis is defined in the analysis editor in the Profiling perspective of Talend Studio.

Procedure

1. In the Analyzed Columns view, right-click the column(s) you want to create a column analysis on.

2. Follow the steps outlined in Creating a basic analysis on a database column to continue creating the column analysis.

Creating a table analysis with SQL business rules

You can set up SQL business rules based on WHERE clauses and add them as indicators to table analyses. You can also define
expected thresholds on the value of the SQL business rule indicator. The defined range is used for measuring the quality of the
data in the selected table.

It is also possible to create an analysis with SQL business rules on views in a database. The procedure is exactly the same as that for
tables.

For more information, see Creating a table analysis with an SQL business rule with a join condition.

Note: When you use the Java engine to run a column set analysis on big sets or on data with many problems, it is advisable to
define a maximum memory size threshold in the Talend Studio Preferences to execute the analysis as you may end up with a
Java heap error.

Creating an SQL business rule

SQL business rules can be simple rules with WHERE clauses. They can also have join conditions in them to combine common values
between columns in database tables and give a result data set.

Creating the business rule

Procedure

1. In the DQ Repository tree view, expand Libraries > Rules.

2. Right-click SQL.

3. From the contextual menu, select New Business Rule to open the New Business Rule wizard.

Consider as an example that you want to create a business rule to match the age of all customers listed in the age column
of a defined table. You want to filter all the age records to identify those that fulfill the specified criterion.

4. In the Name field, enter a name for this new SQL business rule.

Note:

Avoid using special characters in the item names including:

"~", "!", "`", "#", "^", "&", "*", "\\", "/", "?", ":", ";", "\"", ".", "(", ")", "'", "¥", "‘", "”", "«", "»", "<", ">".

These characters are all replaced with "_" in the file system and you may end up creating duplicate items.

5. Set other metadata (purpose, description and author name) in the corresponding fields and then click Next.

6. In the Where clause field, enter the WHERE clause to be used in the analysis.
In this example, the WHERE clause is used to match the records where customer age is greater than 18.

7. Click Finish to close the New Business Rule wizard.

A sub-folder for this new SQL business rule is displayed under the Rules folder in the DQ Repository tree view. The SQL
business rule editor opens with the defined metadata.

Note: In the SQL business rule editor, you can modify the WHERE clause or add a new one directly in the Data quality rule
view.

8. If required, set a value in the Criticality Level field.

This will act as an indicator to measure the importance of the SQL business rule.
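
To make the effect of such a WHERE clause concrete, here is a minimal Python sketch using the standard sqlite3 module. The customers table and its rows are invented for illustration, and the queries only approximate the idea of counting matching and non-matching rows; they are not the queries generated by Talend Studio.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (name TEXT, age INTEGER);
    INSERT INTO customers VALUES
        ('Ann', 34), ('Bob', 17), ('Cy', 18), ('Dee', 52), ('Eve', 12);
""")

# The WHERE clause of the business rule from the example: customers older than 18.
where_clause = "age > 18"

row_count = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
match_count = conn.execute(
    f"SELECT COUNT(*) FROM customers WHERE {where_clause}"
).fetchone()[0]
# Every row that does not satisfy the rule counts as a non-match.
non_match_count = row_count - match_count

print(row_count, match_count, non_match_count)
```

Note that the customer aged exactly 18 does not satisfy age > 18 and therefore counts as a non-match.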

Creating a join condition

About this task

This step is optional. You can decide to create a business rule without a join condition and use it with only the WHERE clause in
the table analysis.

For an example of a table analysis with a simple business rule, see Creating a table analysis with a simple SQL business rule. For an
example of a table analysis with a business rule that has a join condition, see Creating a table analysis with an SQL business rule
with a join condition.

Procedure

1. In the SQL business rule editor, click Join Condition to open the corresponding view.

2. Click the [+] button to add a row in the Join Condition table.

3. Expand the Metadata folder in the DQ Repository tree view, and then browse to the columns in the tables on which you want
to create the join condition.
This join condition will define the relationship between a table A and a table B using a comparison operator on a specific
column in both tables. In this example, the join condition will compare the "name" value in the Person and Person_Ref
tables that have a common column called name.

Note: You must be careful when defining the join clause. To get an easy-to-understand result, it is advisable to
make sure that the joined tables do not have duplicate values. For further information, see Creating a table analysis with
an SQL business rule with a join condition.

4. Drop the columns from the DQ Repository tree view to the Join Condition table.
A dialog box is displayed prompting you to select where to place the column: in TableA or in TableB.

5. Select a comparison condition operator between the two columns in the tables and save your modifications.
In the analysis editor, you can now drop this newly created SQL business rule onto a table that has an "age" column. When
you run the analysis, the join to the second column is done automatically.

Warning: The table to which you add the business rule must contain at least one of the columns used in the SQL business
rule.

Editing an SQL business rule

About this task

To edit an SQL business rule, do the following:

Procedure

1. In the DQ Repository tree view, expand Libraries > Rules > SQL.

2. Right-click the SQL business rule you want to open and select Open from the contextual menu.

The SQL business rule editor opens displaying the rule metadata.

3. Modify the business rule metadata or the WHERE clause as required.

4. Click the save icon on top of the editor to save your modifications.
The SQL business rule is modified as defined.

Creating a table analysis with a simple SQL business rule

You can create analyses on either tables or views in a database using SQL business rules. The procedure for creating such an
analysis is the same for a table or a view.

Prerequisite(s):

At least one SQL business rule has been created in the Profiling perspective of Talend Studio.

At least one database connection is set in the Profiling perspective of Talend Studio.

In this example, you want to add the SQL business rule created in Creating an SQL business rule to a top_custom table that contains
an age column. This SQL business rule will check the customer ages to identify those who are older than 18.

Defining the table analysis

Procedure

1. In the DQ Repository tree view, expand Data Profiling.

2. In the filter field, start typing business rule analysis, select Business Rule Analysis and click Next.

3. Set column analysis metadata (Purpose, Description and Author) in the corresponding fields and click Next to proceed to the
next step.

Selecting the table you want to analyze

Procedure

1. Expand DB Connections, browse to the table to be analyzed and select it.

2. Click Finish to close the Create New Analysis wizard.

Note: You can directly select the data quality rule you want to add to the current analysis by clicking the Next button in
the New Analysis wizard, or you can do that at a later stage in the Analyzed Tables view as shown in the following steps.

The analysis editor opens with the defined analysis metadata, and a folder for the newly created analysis is displayed under
Analyses in the DQ Repository tree view.

3. If required:

Click Select Tables to open the Table Selection dialog box and select new table(s) to analyze.

You can filter the table or column lists by typing the desired text in the Table filter or Column filter fields respectively.
The lists will show only the tables/columns that correspond to the text you type in.

Select another connection from the Connection list to connect to a different database. This list has all the
connections created in Talend Studio. If the tables listed in the Analyzed Tables view do not exist in the new database
connection you want to set, you receive a warning message that enables you to continue or cancel the operation.

4. Right-click any of the listed columns in the Analyzed Columns view and select Show in DQ Repository view.
The selected column is automatically located under the corresponding connection in the tree view.

Selecting the business rule

Procedure

1. Click the icon next to the table name where you want to add the SQL business rule.
The Business Rule Selector dialog box is displayed.

2. Expand the Rules folder and select the check box(es) of the predefined SQL business rule(s) you want to use on the
corresponding table(s).

3. Click OK.
The selected business rule is listed below the table name in the Analyzed Tables view.

You can also drag the business rule directly from the DQ Repository tree view to the table in the analysis editor.

4. If required, right-click the business rule and select View executed query.
The SQL editor opens in the Studio to display the query.

5. Save the analysis and press F6 to execute it.

An information pop-up opens to confirm that the operation is in progress and the analysis editor switches to the Analysis
Results view.

All age records in the selected table are evaluated against the defined SQL business rule. The analysis results have two bar
charts: the first is a row count indicator that shows the number of rows in the analyzed table, and the second is a match and
non-match indicator that indicates in red the age records from the "analyzed result set" that do not match the criteria (age
of 18 or below).

6. Right-click the business rule results in the second table, or right-click the result bar in the chart itself and select:
You can carry out a table analysis in a direct and more simplified way. For further information, see Creating a table analysis
with an SQL business rule in a shortcut procedure.

Creating a table analysis with an SQL business rule with a join condition

In some cases, you may need to analyze database tables or views using an SQL business rule that has a join clause that combines
records from two tables in a database. This join clause will compare common values between two columns and give a result data
set. Then the data in this set will be analyzed against the business rule.

Depending on the analyzed data and the join clause itself, several different results of the join are possible, for example #match + #no
match > #row count, #match + #no match < #row count or #match + #no match = #row count.
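
The case where #match + #no match exceeds the row count can be reproduced with a small Python sketch based on the standard sqlite3 module. The person and person_ref tables below are invented stand-ins, the rule age > 18 is taken from the example rule, and the queries are an approximation of the idea rather than the SQL that Talend Studio actually runs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (name TEXT, age INTEGER);
    CREATE TABLE person_ref (name TEXT);
    INSERT INTO person VALUES
        ('Ann', 30), ('Bob', 15), ('Cy', 40), ('Dee', 22), ('Eve', 17), ('Fay', 50);
    -- 'Ann' appears twice in the reference table, so the join duplicates her row.
    INSERT INTO person_ref VALUES
        ('Ann'), ('Ann'), ('Bob'), ('Cy'), ('Dee'), ('Eve'), ('Fay');
""")

JOIN = "FROM person p JOIN person_ref r ON p.name = r.name"


def scalar(query):
    """Run a query that returns a single value and return that value."""
    return conn.execute(query).fetchone()[0]


row_count = scalar("SELECT COUNT(*) FROM person")
# The rule is evaluated on the result set of the join, not on the table itself.
match_count = scalar(f"SELECT COUNT(*) {JOIN} WHERE p.age > 18")
no_match_count = scalar(f"SELECT COUNT(*) {JOIN} WHERE NOT (p.age > 18)")

print(row_count, match_count, no_match_count)
```

Here the table has 6 rows but the join produces 7, because one name is duplicated in the reference table. Conversely, removing a name from person_ref would drop rows from the join and make the sum smaller than the row count.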

The example below explains in detail the case where the data set in the join result is bigger than the row count (#match + #no match
> #row count) which indicates duplicates in the processed data.

Before you begin:

At least one SQL business rule has been created in the Profiling perspective of Talend Studio.
At least one database connection is set in the Profiling perspective of Talend Studio.

In this example, you want to add the SQL business rule created in Creating an SQL business rule to a Person table that contains the
age and name columns. This SQL business rule will check the customer ages to identify those who are older than 18. The business
rule also has a join condition that compares the "name" value between the Person table and another table called Person_Ref
through analyzing a common column called name.

Below is a capture of both tables:

Below is a capture of the result of the join condition between these two tables:

The result set may contain duplicate rows, as is the case here. Thus the results of the analysis may become a bit harder to understand.
The analysis here will not analyze the rows of the table that match the business rule but it will run on the result set given by the
business rule.

1. Define the table analysis and select the table you want to analyze.

The selected table is listed in the Analyzed Tables view.

2. Add the business rule with the join condition to the selected table through clicking the icon next to the table name.

This business rule has a join condition that compares the "name" value between two different tables through analyzing a
common column.

3. Save the analysis and press F6 to execute it.

An information pop-up opens to confirm that the operation is in progress and the analysis editor switches to the Analysis
Results view.

All age records in the selected table are evaluated against the defined SQL business rule. The analysis results have two bar
charts: the first is a row count indicator that shows the number of rows in the analyzed table, and the second is a match and
non-match indicator that indicates the age records from the "analyzed result set" that do not match the criteria (age of 18
or below).

Note: If a join condition is used in the SQL business rule, the number of the rows of the join (#match + #no match) can be
different from the number of the analyzed rows (row count).

4. Right-click the Row Count row in the first table and select View rows.

The SQL editor opens in the Studio to display a list of the analyzed rows.

5. Right-click the business rule results in the second table, or right-click the result bar in the chart itself and select:

Below is the list of the invalid rows in the analyzed table.

6. In the SQL editor, click the save icon on the toolbar to save the executed query on the SQL business rule and list it under the
Libraries > Source Files folder in the DQ Repository tree view.

To better understand the Business Rule Statistics bar chart in the analysis results, do the following:

1. In the analysis editor, right-click the business rule and select View executed query.

The SQL editor opens in the Studio.

2. Modify the query in the top part of the editor to read as follows:
SELECT * FROM `my_person_joins`.`person` Person JOIN `my_person_joins`.`Person_ref` Person_ref ON (Person.`name`=Person_ref.`Name`)

This will list the result data set of the join condition in the editor.

3. In the top left corner of the editor, click the icon to execute the query.

The query result, that is the analyzed result set, is listed in the bottom part of the editor.

4. Go back to the analysis editor and click the Analysis Results tab at the bottom of the editor to open a detail view of the
analysis results.

The analyzed result set may contain more or fewer rows than the analyzed table. In this example, the number of match and
non-match records (5 + 2 = 7) exceeds the number of analyzed records (6) because the join of the two tables generates more
rows than expected.

Here 5 rows (71.43%) match the business rule and 2 rows do not match. Because the join generates duplicate rows, this
result does not mean that 5 rows of the analyzed table match the business rule. It only means that 5 rows among the 7 rows
of the result set match the business rule. In fact, some rows of the analyzed table may not even be analyzed against the
business rule. This happens when the join excludes these rows. For this reason, it is advised to check for duplicates on the
columns used in the join of the business rule in order to make sure that the join does not remove or add rows in the analyzed
result set. Otherwise the interpretation of the result is more complex.

In the Analysis Results view, if the number of match and non-match records exceeds the number of analyzed records, you can
generate a ready-to-use analysis that will analyze the duplicates in the selected table.
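
The difference between the row count and the match/non-match totals can be reproduced outside the Studio. The following is a minimal sketch, assuming an in-memory SQLite database and hypothetical sample rows chosen so that the counts line up with the figures quoted above (6 analyzed rows, 7 joined rows, 5 matches, 2 non-matches). The table contents are illustrative only and are not the data of the demo project.

```python
import sqlite3

# Hypothetical sample data: 6 persons, and a reference table in which
# two names are duplicated and one name (Frank) is missing.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (name TEXT, age INTEGER);
CREATE TABLE person_ref (name TEXT);
INSERT INTO person VALUES
  ('Alice', 25), ('Bob', 17), ('Carol', 30),
  ('Dave', 15), ('Eve', 40), ('Frank', 22);
INSERT INTO person_ref VALUES
  ('Alice'), ('Alice'), ('Bob'), ('Carol'), ('Carol'), ('Dave'), ('Eve');
""")

def scalar(sql):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]

JOIN = "FROM person p JOIN person_ref r ON p.name = r.name"

row_count = scalar("SELECT COUNT(*) FROM person")          # analyzed table
join_rows = scalar(f"SELECT COUNT(*) {JOIN}")               # analyzed result set
match = scalar(f"SELECT COUNT(*) {JOIN} WHERE p.age > 18")  # rule: older than 18
no_match = join_rows - match

print(row_count, join_rows, match, no_match)
print(f"{100 * match / join_rows:.2f}% of the result set matches the rule")
```

In this sketch the duplicated names in person_ref add rows to the result set, while the person with no reference row is dropped by the join, so the percentage describes the joined rows and not the analyzed table.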

Generating an analysis on the join results to analyze duplicates

In some cases, when you use an SQL business rule with a join clause to analyze database tables that have some duplicate records,
the join results show that there are more rows in the join than in the analyzed table.

You can generate a ready-to-use analysis to analyze these duplicate records. The results of this analysis help you to better
understand why there are more records in the join results than in the table.

Before you begin

A table analysis with an SQL business rule, that has a join condition, is defined and executed in the Profiling perspective of Talend
Studio. The join results must show that there are duplicates in the table.

For more information, see Creating a table analysis with an SQL business rule with a join condition.

Procedure

1. After creating and executing an analysis on a table that has duplicate records as outlined in Creating a table analysis with an
SQL business rule with a join condition, click the Analysis Results tab at the bottom of the analysis editor.

2. Right-click the join results in the second table and select Analyze duplicates.

The Column Selection dialog box opens with the analyzed tables selected by default.

3. Modify the selection in the dialog box if needed and then click OK.

Two column analyses are generated, listed under the Analyses folder in the DQ Repository tree view, and opened in the
analysis editor.

4. Save the analysis and press F6 to execute it.

The analysis results show two bars, one representing the row count of the data records in the analyzed column and the other
representing the duplicate count.

5. Click Analysis Results at the bottom of the analysis editor to access the detail result view.

6. Right-click the row count or duplicate count results in the table, or right-click the result bar in the chart itself and select:

Option        To...

View rows     open a view on a list of all data rows or duplicate rows in the analyzed column.

View values   open a view on a list of the duplicate data values of the analyzed column.
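
As a rough illustration of what the generated duplicate analysis measures, the sketch below counts rows and duplicated values in a single join column. It assumes that the duplicate count is the number of distinct values that occur more than once, and the sample names are hypothetical.

```python
from collections import Counter

def duplicate_stats(values):
    """Return the row count, distinct count and duplicated values of a column."""
    counts = Counter(values)
    # Assumption: a "duplicate" is a distinct value that appears more than once.
    duplicates = {value: n for value, n in counts.items() if n > 1}
    return {
        "row_count": len(values),
        "distinct_count": len(counts),
        "duplicate_count": len(duplicates),
        "duplicate_values": duplicates,
    }

# Hypothetical content of the column used in the join condition.
names = ["Alice", "Alice", "Bob", "Carol", "Carol", "Dave", "Eve"]
stats = duplicate_stats(names)
print(stats)
```

Each duplicated value in a join column multiplies the rows it joins to, which is why such a column produces more rows in the join results than in the analyzed table.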

Creating a table analysis with an SQL business rule in a shortcut procedure

You can use a simplified way to create a table analysis with a predefined business rule. All you need to do is start from the
table name under the relevant DB Connection folder.

Before you begin

At least one SQL business rule is created in the Profiling perspective of Talend Studio.
At least one database connection is set in the Profiling perspective of Talend Studio.

Procedure

1. In the DQ Repository tree view, expand Metadata > DB Connections, and then browse to the table you want to analyze.

2. Right-click the table name and select Table analysis from the list.
The New Table Analysis wizard is displayed.

3. Enter the metadata for the new analysis in the corresponding fields and then click Next to proceed to the next step.

4. Expand Rules > SQL and then select the check box(es) of the predefined SQL business rule(s) you want to use on the
corresponding table(s).

5. Click OK to proceed to the next step.

The table name along with the selected business rule are listed in the Analyzed Tables view.

6. Save the analysis and press F6 to execute it.

An information pop-up opens to confirm that the operation is in progress and the analysis editor switches to the Analysis
Results view.

Detecting anomalies in columns (Functional Dependency Analysis)

This type of analysis helps you to detect anomalies in column dependencies by defining columns as either "determinant" or
"dependent" and then analyzing values in dependent columns against those in determinant columns. This analysis supports only
database tables.

This type of analysis detects to what extent a value in a determinant column functionally determines another value in a dependent
column.

This can help you identify problems in your data, such as values that are not valid. For example, if you analyze the dependency
between a column that contains United States Zip Codes and a column that contains states in the United States, the same Zip Code
should always have the same state. Running the functional dependency analysis on these two columns will show if there are any
violations of this dependency.
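
The Zip Code example can be sketched in a few lines. This is only an illustration of the idea of a dependency violation, not the query the Studio generates, and the Zip Code and state values are hypothetical.

```python
from collections import defaultdict

def dependency_violations(rows):
    """Return determinant values that map to more than one dependent value."""
    seen = defaultdict(set)
    for determinant, dependent in rows:
        seen[determinant].add(dependent)
    return {d: sorted(values) for d, values in seen.items() if len(values) > 1}

# Hypothetical (zip_code, state) records; 94105 appears with two states.
rows = [
    ("10001", "NY"),
    ("10001", "NY"),
    ("94105", "CA"),
    ("94105", "NV"),
    ("60601", "IL"),
]
violations = dependency_violations(rows)
print(violations)
```

A Zip Code that appears with a single state any number of times is consistent; only a Zip Code associated with two or more different states is reported.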

Defining the analysis to detect anomalies in columns

Before you begin

At least one database connection is set in the Profiling perspective of Talend Studio.

Procedure

1. In the DQ Repository tree view, expand Data Profiling.

2. In the filter field, start typing functional dependency analysis , select Functional Dependency Analysis from the list and
click Next.

3. Set column analysis metadata (Purpose, Description and Author) in the corresponding fields and click Next to proceed to the
next step.

Selecting the columns as either "determinant" or "dependent"

Procedure

1. Expand DB connections, browse to the database you want to analyze, select it and then click Finish to close the New Analysis
wizard.
The analysis editor opens with the defined analysis metadata, and a folder for the newly created analysis is displayed under
Analyses in the DQ Repository tree view.

The Data Preview section shows sample data from all the table columns.

2. In the Left Column panel, click A Columns Set to open the Column Selection dialog box.
Here you can select the first set of columns against which you want to analyze the values in the dependent columns. You can
also drag the columns directly from the DQ Repository tree view to the left column panel.
In this example, you want to evaluate the records present in the city column and those present in the state_province
column against each other to see if state names match the listed city names and vice versa.

3. In the Column Selection dialog box, expand DB Connections and browse to the column(s) you want to define as determinant
columns.
You can filter the table or column lists by typing the desired text in the Table filter or Column filter fields respectively. The lists
will show only the tables/columns that correspond to the text you type in.

4. Select the check box(es) next to the column(s) you want to analyze and click OK to proceed to the next step.
The selected column(s) are displayed in the Left Columns panel of the Analyzed Columns Set view. In this example, we select
the city column as the determinant column.

5. Do the same to select the dependent column(s) or drag it/them from the DQ Repository tree view to the Right Columns
panel. In this example, we select the state_province column as the dependent column. This relation will show if the state
names match the listed city names.
If you right-click any of the listed columns in the Analyzed Columns view and select Show in DQ Repository view, the selected
column is automatically located under the corresponding connection in the tree view.

6. Click the Reverse columns tab to automatically reverse the defined columns and thus evaluate the reverse relation, that is,
which city names match the listed state names.
You can select to connect to a different database by selecting another connection from the Connection list in the Data
Preview section. This list shows all the connections created in the Studio. If the columns listed in the Analyzed Columns Set
view do not exist in the new database connection you want to set, you will receive a warning message that enables you to
continue or cancel the operation.

Finalizing and executing the functional dependency analysis

Procedure

1. In the Analysis Parameter view and in the Number of connections per analysis field, set the number of concurrent
connections allowed per analysis to the selected database, if required.
You can set this number according to the available resources of the database, that is the number of concurrent connections
the database can support.

2. Save the analysis and press F6 to execute it.

An information pop-up opens to confirm that the operation is in progress and the analysis editor switches to the Analysis
Results view.

This functional dependency analysis evaluated the records present in the city column and those present in the
state_province column against each other to see if the city names match the listed state names and vice versa. The
returned results, in the %Match column, indicate the functional dependency strength for each determinant column. The
records that do not match are indicated in red.
The #Match column in the result table lists the number of distinct determinant values in each of the analyzed columns.
The #row column in the analysis results lists the actual relations between the determinant attribute and the dependent
attribute. In this example, #Match in the first row of the result table represents the number of distinct cities, and #row
represents the number of distinct pairs (city, state_province). Since these two numbers are not equal, the functional
dependency relationship here is only partial and the ratio of the numbers (%Match) measures the actual dependency
strength. When these numbers are equal, you have a "strict" functional dependency relationship, that is to say each city
appears with only one state.

Note: The presence of null values in either of the two analyzed columns will lessen the "dependency strength". The
system does not ignore null values, but rather counts them as values that violate the functional dependency.

3. In the Analysis Results view, right-click any of the dependency lines and select:

Option                                        To...

View valid/invalid rows                       access a list in the SQL editor of all valid/invalid rows measured according to the
                                              functional dependencies analysis

View valid/invalid values                     access a list in the SQL editor of all valid/invalid values measured according to the
                                              functional dependencies analysis

View detailed valid/detailed invalid values   access a detailed list in the SQL editor of all valid/invalid values measured according
                                              to the functional dependencies analysis

From the SQL editor, you can save the executed query and list it under the Libraries > Source Files folders in the DQ
Repository tree view if you click the save icon on the editor toolbar. For more information, see Saving the queries executed
on indicators.
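
The relation between #Match, #row and %Match described in this procedure can be sketched as follows. This is a simplified illustration under stated assumptions: %Match is taken to be the ratio of distinct determinant values to distinct (determinant, dependent) pairs, the special handling of null values is not modeled, and the city and state values are hypothetical.

```python
def dependency_strength(rows):
    """Return (#Match, #row, %Match) for (determinant, dependent) pairs."""
    distinct_determinants = {determinant for determinant, _ in rows}
    distinct_pairs = set(rows)
    match = len(distinct_determinants)
    row = len(distinct_pairs)
    # Assumption: %Match is the ratio of the two counts, as a percentage.
    percent = 100.0 * match / row if row else 0.0
    return match, row, percent

# Hypothetical (city, state_province) records.
partial = [
    ("Springfield", "IL"), ("Springfield", "MA"),
    ("Portland", "OR"), ("Portland", "ME"),
    ("Boston", "MA"), ("Boston", "MA"),
]
strict = [("Boston", "MA"), ("Portland", "OR"), ("Boston", "MA")]

print(dependency_strength(partial))
print(dependency_strength(strict))
```

With the first data set there are 3 distinct cities for 5 distinct pairs, so the dependency is only partial. With the second, both counts are equal and the dependency is strict.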

Analyzing tables in delimited files

From the Profiling perspective, you can better explore the quality of data in tables in delimited files through either:

Creating a simple table analysis through analyzing all columns in the table using patterns.
Comparing a set of columns and creating groups of similar records using blocking and matching keys and/or survivorship
rules.

Creating a column set analysis on a delimited file using patterns

This type of analysis provides simple statistics on the number of records falling in certain categories, including the number of rows,
the number of null values, the number of distinct and unique values, the number of duplicates, or the number of blank fields. For
more information about these indicators, see Simple statistics.

It is also possible to add patterns to this type of analysis and have a single-bar result chart that shows the number of the rows that
match "all" the patterns.
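
The two kinds of results described above can be sketched as follows. This is an illustration only: the statistics are computed over whole rows of the selected column set, the duplicate count is assumed to be the number of distinct rows that occur more than once, and the sample rows and regular expressions are hypothetical rather than patterns shipped with the Studio.

```python
import re
from collections import Counter

def simple_statistics(rows):
    """Compute basic counts over the rows of a set of columns."""
    counts = Counter(rows)
    return {
        "row_count": len(rows),
        "distinct_count": len(counts),
        "unique_count": sum(1 for n in counts.values() if n == 1),
        "duplicate_count": sum(1 for n in counts.values() if n > 1),
    }

def all_match_count(rows, patterns):
    """Count the rows in which every listed column matches its pattern."""
    compiled = {col: re.compile(p) for col, p in patterns.items()}
    return sum(
        all(rx.fullmatch(row[col] or "") for col, rx in compiled.items())
        for row in rows
    )

# Hypothetical (email, gender) rows from a delimited file.
rows = [
    ("a@x.com", "F"),
    ("b@x.com", "M"),
    ("a@x.com", "F"),
    ("not-an-email", "M"),
    ("c@x.com", "X"),
]
# Hypothetical patterns keyed by column position.
patterns = {0: r"[^@\s]+@[^@\s]+\.[a-z]+", 1: r"[FM]"}

stats = simple_statistics(rows)
matching = all_match_count(rows, patterns)
print(stats, matching)
```

A row is counted as a match only when all its columns satisfy their patterns, which corresponds to the single-bar chart mentioned above.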

Defining the set of columns to be analyzed in a delimited file

You can analyze the content of a set of columns in a delimited file. This set can represent only some of the columns in the defined
table or the table as a whole.

You can then execute the created analysis using the Java engine.

Prerequisite(s): At least one connection to a delimited file is set in the Profiling perspective of the studio. For further information, see
Connecting to a database.

Warning: When carrying out this type of analysis, the set of columns to be analyzed must not include a primary key column.

Defining the column set analysis

Procedure

Set column analysis metadata (Purpose, Description and Author) in the corresponding fields and click Next to proceed to the next
step.

Selecting the set of columns you want to analyze in the delimited file

Procedure

1. Expand the FileDelimited connection and browse to the set of columns you want to analyze.

2. Select the columns to be analyzed, and then click Finish to close this New analysis wizard.
The analysis editor opens with the defined analysis metadata, and a folder for the newly created analysis is displayed under
Analyses in the DQ Repository tree view.

Sample data is displayed in the Data Preview section and the selected columns are displayed in the Analyzed Column
section of the analysis editor.

3. If required, select another connection from the Connection box in the Analyzed Columns view. This box lists all the
connections created in the Studio with the corresponding database names.
By default, the delimited file connection you have selected in the previous step is displayed in the Connection box.

4. If required, click the Select columns to analyze link to open a dialog box where you can modify your column selection.

Note: You can filter the table or column lists by typing the desired text in the Table filter or Column filter fields
respectively. The lists will show only the tables/columns that correspond to the text you type in.

5. In the column list, select the check boxes of the column(s) you want to analyze and click OK to proceed to the next step.
In this example, you want to analyze a set of six columns in the delimited file: account number (account_num), education
(education), email (email), first name (fname), last name (lname) and gender (gender). You want to identify the number of
rows, the number of distinct and unique values and the number of duplicates.

Adding patterns to the analyzed columns in the delimited file

Before you begin

An analysis of a set of columns is open in the analysis editor in the Profiling perspective of Talend Studio.

Procedure

Finalizing and executing the column set analysis on a delimited file

Before executing the column set analysis, you still need to define the indicator settings, data filter and analysis parameters.

Before you begin

A column set analysis is defined in the Profiling perspective of Talend Studio.

Procedure

1. In the Analysis Parameters view, select the Allow drill down check box to store locally the data that will be analyzed by the
current analysis.

2. In the Max number of rows kept per indicator field enter the number of the data rows you want to make accessible.

Note: The Allow drill down check box is selected by default, and the maximum analyzed data rows to be shown per
indicator is set to 50.

3. Save the analysis and press F6 to execute it.

Results

The editor switches to the Analysis Results view and displays the graphical result corresponding to the Simple Statistics indicators
used to analyze the defined set of columns.

When you use patterns to match the content of the set of columns, another graphic is displayed to illustrate the match and non-
match results against the totality of the used patterns.

Filtering analysis data against patterns

The procedure to filter the data of the analysis of a delimited file is the same as that for the database analysis. For further
information, see Filtering data against patterns.

Creating a column analysis from the analysis of a set of columns

You can create a column analysis on one or more columns defined in the set of columns analysis.

Before you begin

A simple table analysis is defined in the analysis editor in the Profiling perspective of Talend Studio.

Procedure

1. In the Analyzed Columns view, right-click the column(s) you want to create a column analysis on.

2. Follow the steps outlined in Creating a basic column analysis on a file to continue creating the column analysis on a
delimited file.

Analyzing duplicates

You can use the match analysis in the Profiling perspective of Talend Studio to compare columns in databases or delimited files and
create groups of similar records using the VSR or the T-Swoosh algorithm.

This analysis provides you with a simple way to create match rules, test them on a set of columns and see the results directly in the
editor.

You can also use the Profiling perspective to define match rules in a match rule editor and save them in the Talend Studio repository.

Creating a match analysis

The match analysis enables you to compare a set of columns in databases or in delimited files and create groups of similar records
using blocking and matching keys and/or survivorship rules.

About this task

This analysis enables you to create match rules and test them on data to assess the number of duplicates. Currently, you can test
match rules only on columns in the same table.

Prerequisite(s): At least one database or file connection is defined under the Metadata node.

The sequence of setting up a match analysis involves the following steps:

Procedure
1. Creating the connection to a data source from inside the editor if no connection has been defined under the Metadata folder
in the Studio tree view.
For further information, see Configuring the match analysis.

2. Defining the table or the group of columns you want to search for similar records using match processes.
For further information, see Defining a match analysis from the Analysis folder or Defining a match analysis from the
Metadata folder.

3. Defining blocking keys to reduce the number of pairs that need to be compared.
For further information, see Defining a match rule.

4. Defining match keys, the match methods according to which similar records are grouped together. For further information,
see Defining a match rule.

5. Exporting the match rules from the match analysis editor and centralizing them in the Studio repository.
For further information, see Importing or exporting match rules.

Defining a match analysis from the Analysis folder

Procedure

1. In the DQ Repository tree view, expand Data Profiling.

2. Right-click the Analysis folder and select New Analysis.

The Create New Analysis wizard opens.

3. Start typing match in the filter field, select Match Analysis and then click Next to open a wizard.

4. Set the analysis name and metadata and then click Next.

Note:

Avoid using special characters in the item names including:

"~", "!", "`", "#", "^", "&", "*", "\\", "/", "?", ":", ";", "\"", ".", "(", ")", "'", "¥", "'", """, "«", "»", "<", ">".

These characters are all replaced with "_" in the file system and you may end up creating duplicate items.

5. Expand DB connections or FileDelimited connections depending on whether the columns you want to match are in a database or a
delimited file.

6. Browse to the columns you want to match, select them and then click Finish.
The columns you select should be in the same table. Currently, the match analysis does not work on columns in different
tables.
The match analysis editor opens listing the selected columns.

You can also define a match analysis starting from the table or columns you want to match. For further information, see
Defining a match analysis from the Metadata folder.

7. Modify the parameters in the match analysis editor according to your needs.
For further information, see Configuring the match analysis.

Defining a match analysis from the Metadata folder

Procedure

1. In the DQ Repository tree view, expand Metadata.

2. Do either of the following operations:

Browse the database or the file connection to the table you want to match, right-click it and select Match Analysis, or

Browse the database or the file connection to the columns you want to match, right-click them and select Analyze
matches.

The columns you select should be in the same table. Currently, the match analysis does not work on columns in different
tables.
The match analysis editor opens listing all columns in the table or the group of selected columns.

3. Set the analysis name and metadata and click Next to open the analysis editor.

Note:

Avoid using special characters in the item names including:

"~", "!", "`", "#", "^", "&", "*", "\\", "/", "?", ":", ";", "\"", ".", "(", ")", "'", "¥", "'", """, "«", "»", "<", ">".

These characters are all replaced with "_" in the file system and you may end up creating duplicate items.

4. Modify the parameters in the match analysis editor according to your needs.
For further information, see Configuring the match analysis.
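
The note about special characters in item names can be illustrated with a short sketch. It assumes a plain character-by-character replacement with "_" and uses only an ASCII subset of the listed characters. The function name and the item names are hypothetical and do not reflect the Studio's internal code.

```python
# ASCII subset of the special characters listed in the note (assumption:
# each one is replaced individually with an underscore).
SPECIAL = set('~!`#^&*\\/?:;".()\'<>')

def file_system_name(item_name):
    """Return the name an item would get once special characters are replaced."""
    return "".join("_" if ch in SPECIAL else ch for ch in item_name)

# Two different hypothetical item names end up with the same stored name.
first = file_system_name("match:v1")
second = file_system_name("match?v1")
print(first, second, first == second)
```

Two item names that differ only by a special character map to the same name after replacement, which is how duplicate items can appear.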

Configuring the match analysis

Procedure

1. In the Limit field in the match analysis editor, set the number for the data records you want to use as a data sample.
Data is displayed in the Data Preview table.

2. If required, click any column name in the table to sort the sample data in an ascending or descending order.

3. In the match analysis editor, select:

Option                          To...

                                locate the selected table under the Metadata node in the tree view.

New Connection                  create a connection to a database or to a file from inside the match analysis editor where
                                you can expand this new connection and select the columns on which to do the match.
                                For further information about how to create a connection to data sources, see Connecting
                                to a database and Connecting to a file.

Select Data                     update the selection of the columns listed in the table.
                                If you change the data set for an analysis, the charts that display the match results of the
                                sample data will be cleared automatically. You must click Chart to compute the match
                                results for the new data set you have defined.

Refresh Data                    refresh the view of the columns listed in the table.

n first rows / n random rows    list in the table the first N data records from the selected columns, or N random
                                records from the selected columns.

Select Blocking Key             define the column(s) from the input flow according to which you want to partition the
                                processed data in blocks.
                                For more information, see Defining a match rule.

Select Matching Key             define the match rules and the column(s) from the input flow on which you want to apply
                                the match algorithm.
                                For more information, see Defining a match rule.

Results

The Data Preview table has some additional columns which show the results of matching data. These columns indicate the
following:

Column            Description

GID               represents the group identifier.

GRP_SIZE          counts the number of records in the group, computed only on the master record.

MASTER            identifies, by true or false, whether the record used in the matching comparisons is a master record.
                  There is only one master record per group. Each input record is compared to the master record; if
                  they match, the input record is added to the group.

SCORE             measures the distance between the input record and the master record according to the matching
                  algorithm used.

GRP_QUALITY       only the master record has a quality score, which is the minimal value in the group.

ATTRIBUTE_SCORE   lists the match score and the names of the columns used as key attributes in the applied rules.

Defining a match rule

You can define match rules from the match analysis editor by defining:

blocking keys, the column(s) from the input flow according to which you want to partition the processed data in blocks,

matching keys and survivorship rules, the match algorithms you want to apply on columns from the input flow.

Defining a blocking key

About this task

Defining a blocking key is not mandatory but strongly advisable. Using a blocking key to partition data in blocks reduces the number
of records that need to be examined as comparisons are restricted to record pairs within each block. Using blocking column(s) is
very useful when you are processing a big data set.

Procedure

1. In the Data section, click the Select Blocking Key tab and then click the name of the column(s) you want to use to partition
the processed data in blocks.
Blocking keys that have the exact name of the selected columns are listed in the Blocking Key table.

You can define more than one column in the table, but only one blocking key will be generated and listed in the BLOCK_KEY
column in the Data table.
For example, if you use an algorithm on the country and lname columns to process records that have the same first
character, data records that have the same first letter in the country and last names are grouped together in the same block.
Comparison is restricted to records within each block.
To remove a column from the Blocking key table, right-click it and select Delete or click on its name in the Data table.

2. Select an algorithm for the blocking key, and set the other parameters in the Blocking Key table as needed.
In this example, only one blocking key is used. The first character of each word in the country column is retrieved and listed
in the BLOCK_KEY column.

3. Click Chart to compute the generated key, group the sample records in the Data table and display the results in a chart.
This chart allows you to visualize the statistics regarding the number of blocks and to adapt the blocking parameters
according to the results you want to get.
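
The effect of a blocking key on the number of comparisons can be sketched as follows. This is an illustration under stated assumptions: the key is built from the first character of the country and lname values, which mirrors the example above but is not the Studio's implementation, and the records are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def blocking_key(record, columns):
    """Build a key from the first character of each blocking column."""
    return "".join(record[c][:1].upper() for c in columns)

def candidate_pairs(records, columns):
    """Return the record pairs that share the same blocking key."""
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record, columns)].append(record)
    return [pair for block in blocks.values() for pair in combinations(block, 2)]

# Hypothetical sample records.
records = [
    {"country": "France", "lname": "Martin"},
    {"country": "France", "lname": "Martinez"},
    {"country": "Finland", "lname": "Makinen"},
    {"country": "Germany", "lname": "Schmidt"},
    {"country": "Greece", "lname": "Samaras"},
    {"country": "Spain", "lname": "Garcia"},
]

all_pairs = list(combinations(records, 2))
blocked_pairs = candidate_pairs(records, ["country", "lname"])
print(len(all_pairs), len(blocked_pairs))
```

Without blocking, every record is compared with every other record. With the key, only records inside the same block are paired, so the number of comparisons drops sharply as the data set grows.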

Defining a matching key with the VSR algorithm

Procedure

1. In the Record linkage algorithm section, select Simple VSR Matcher if it is not selected by default.

2. In the Data section, click the Select Matching Key tab and then click the name of the column(s) on which you want to apply
the match algorithm.
Matching keys that have the exact names of the selected input columns are listed in the Matching Key table.

To remove a column from this table, right-click it and select Delete or click on its name in the Data table.

3. Select the match algorithms you want to use from the Matching Function column and the null operator from the Handle Null
column.
In this example, two match keys are defined: the Levenshtein and Jaro-Winkler match methods are applied to first names and last names respectively to detect duplicate records.
If you want to use an external user-defined matching algorithm, select Custom and use the Custom Matcher column to load
the Jar file of the user-defined algorithm.
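
As an illustration of what a match function computes, the following Python sketch derives a similarity score between 0 and 1 from the Levenshtein edit distance. The normalization by the length of the longer string and the function names are assumptions made for this sketch; the scoring used by the studio may differ.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Count the single-character edits needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Scale the distance to a score where 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / max(len(a), len(b))


print(levenshtein_similarity("Jon", "John"))
```

A score like this one is what gets compared with the thresholds you set in the match rule.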

Defining a matching key with the T-Swoosh algorithm

Procedure

First, make sure to select the column(s) on which to apply the match algorithm, either from the Data section by using the Select Matching Key tab, or directly from the Matching Key table.

Creating a match key

Procedure

1. In the Record linkage algorithm section, select T-Swoosh.

2. In the Match and Survivor section, you define the criteria to use when matching data records. Click the [+] button to add a
new rule, and then set the following criteria.

Match Key Name: Enter the name of your choice for the match key.

Matching Function: Select the type of matching you want to perform from the drop-down list. Select Custom if you
want to use an external user-defined matching algorithm.

Custom Matcher: This item is only used with the Custom matching function. Browse and select the Jar file of the user-
defined algorithm.

Threshold: Specify the match score (between 0 and 1) above which two values should be considered a match.

Confidence Weight: Set a numerical weight (between 1 and 10) to the column you want to use as a match key. This
value is used to give greater or lesser importance to certain columns when performing the match.

Handle Null: Specify how to deal with data records which contain null values.

nullMatchNull: If both records contain null values, consider this a match.

nullMatchNone: If one record contains a null, do not consider this a match.

nullMatchAll: If one record contains a null, consider this a match.

Survivorship Function: Select how two similar records will be merged from the drop-down list.

Concatenate: It adds the content of the first record and the content of the second record together - for
example, Bill and William will be merged into BillWilliam. In the Parameter field, you can specify a separator to
be used to separate values.

Prefer True (for booleans): It always sets booleans to True in the merged record, unless all booleans in the
source records are False.

Prefer False (for booleans): It always sets booleans to False in the merged record, unless all booleans in the
source records are True.

Most common: It validates the most frequently-occurring field value in each duplicates group.

Most recent or Most ancient: The former validates the latest date value and the latter the earliest date value in
each duplicates group. The relevant Reference column must be of the Date type.

Longest or Shortest: The former validates the longest field value and the latter the shortest field value in each
duplicates group.

Largest or Smallest: The former validates the largest numerical value and the latter the smallest numerical
value in a duplicates group.

Warning: Make sure you select Largest or Smallest as the survivorship function when the match key is of
numeric type.

Reference column: If you set Survivorship Function to Most recent or Most ancient, use this item to select the reference
column.

Parameter: For the Concatenate survivorship function, this item is used to specify a separator you want to use for
concatenating data.

3. In the Match threshold field, enter the match probability threshold.

Two data records match when the probability is above this value.

In the Confident match threshold field, set a numerical value between the current Match threshold and 1.

4. In the Survivorship Rules For Columns section, define how data records survive for certain columns. Click the [+] button to
add a new rule, and then set the following criteria:

Input Column: Enter the column to which you want to apply the survivorship rule.

Survivorship Function: Select how two similar records will be merged from the drop-down list.

Parameter: For the Most trusted source survivorship function, this item is used to set the name of the data source
you want to use as a base for the master record. For the Concatenate survivorship function, this item is used to
specify a separator to use for concatenating data.

If you specify a survivorship function for a match key in the Match and Survivor section and also specify one for the
same column in the Survivorship Rules For Columns section, the survivorship function selected in the Match and Survivor
section is applied to the column.

5. In the Default Survivorship Rules section, you define how to survive matches for certain data types: Boolean, Date,
Number and String.
a. Click the [+] button to add a new row for each data type.
b. In the Data Type column, select the relevant data type from the drop-down list.
c. In the Survivorship Function column, select how two similar records will be merged from the drop-down list. Note
that, depending on the data type, only certain choices may be relevant.

Warning: Make sure you select Largest or Smallest as the survivorship function when the match key is of numeric
type.

d. Parameter: For the Most trusted source survivorship function, this item is used to set the name of the data source
you want to use as a base for the master record. For the Concatenate survivorship function, this item is used to
specify a separator you want to use for concatenating data.

If you specify the survivorship function for a column in the Survivorship Rules For Columns section and also specify the
survivorship function for the data type of the column in the Default Survivorship Rules section, the survivorship function
selected in the Survivorship Rules For Columns section is applied to the column.

If you do not specify the behavior for any or all data types, the default behavior (the Most common survivorship function) will
be applied, that is, the most frequently-occurring field value in each duplicates group will be validated.

6. Save your changes.
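
The precedence between the three levels of survivorship rules described in this procedure can be sketched in Python as follows. The function names, the dictionaries and the sample records are assumptions made for illustration; only a subset of the survivorship functions is modeled, and this is not the studio's implementation.

```python
from collections import Counter


def most_common(values):
    """Return the most frequent value (the first seen wins a tie)."""
    return Counter(values).most_common(1)[0][0]


# A subset of the survivorship functions, modeled as plain callables.
SURVIVORSHIP = {
    "Most common": most_common,
    "Concatenate": lambda values: "".join(values),  # empty separator
    "Longest": lambda values: max(values, key=len),
    "Shortest": lambda values: min(values, key=len),
    "Largest": max,
    "Smallest": min,
    "Prefer True": any,   # True unless every source value is False
    "Prefer False": all,  # False unless every source value is True
}


def pick_function(column, data_type, match_key_rules, column_rules, type_defaults):
    """Resolve the function name, from the most specific rule to the default."""
    if column in match_key_rules:
        return match_key_rules[column]
    if column in column_rules:
        return column_rules[column]
    return type_defaults.get(data_type, "Most common")


def merge_group(records, schema, match_key_rules, column_rules, type_defaults):
    """Build one merged record from a group of duplicates."""
    merged = {}
    for column, data_type in schema.items():
        name = pick_function(column, data_type, match_key_rules, column_rules, type_defaults)
        merged[column] = SURVIVORSHIP[name]([record[column] for record in records])
    return merged


# Hypothetical configuration and data.
schema = {"fname": "String", "city": "String", "age": "Number", "active": "Boolean"}
records = [
    {"fname": "Bill", "city": "Paris", "age": 41, "active": False},
    {"fname": "William", "city": "Paris", "age": 42, "active": True},
]
merged = merge_group(
    records,
    schema,
    match_key_rules={"fname": "Concatenate"},
    column_rules={"fname": "Longest", "age": "Largest"},
    type_defaults={"Boolean": "Prefer True"},
)
print(merged)
```

In this sketch, fname is merged with Concatenate because the match key rule takes precedence over the column rule, and city falls back to Most common because no rule covers it.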

Editing rules and displaying sample results

Procedure
1. To define a second match rule, place your cursor on the top right corner of the Matching Key table and click the [+] button
to create a new rule.
Follow the steps outlined in Defining a match rule to define matching keys.
When you define multiple rules in the match rule editor, an OR match operation is conducted on the analyzed data.
Records are evaluated against the first rule, and the records that match are not evaluated against the second rule, and so on.

2. Click the button at the top right corner of the Matching Key or Match and Survivor section and replace the default name
of the rule with a name of your choice.

If you define more than one rule in the match analysis, you can use the up and down arrows in the dialog box to change the
rule order and thus decide what rule to execute first.

3. Click OK.
The rules are named and ordered accordingly in the section.

4. In the Match threshold field, enter the match probability threshold.

Two data records match when the probability is above this value.
In the Confident match threshold field, set a numerical value between the current Match threshold and 1.
If the GRP-QUALITY calculated by the match analysis is equal to or greater than the Confident match threshold, you can be
confident about the quality of the group.

5. Click Chart to compute the groups according to the blocking key and match rule you defined in the editor and display the
results of the sample data in a chart.

This chart shows a global picture of the duplicates in the analyzed data. The Hide groups less than parameter is set to 2
by default. This parameter lets you decide which groups to show in the chart; you usually want to hide the smallest
groups.
The chart in the above image indicates that, out of the 1000 sample records examined and after excluding unique items
(the Hide groups less than parameter being set to 2):

49 groups have 2 items each. In each group, the 2 items are duplicates of each other.

7 groups have 3 duplicate items and the last group has 4 duplicate items.

Also, the Data table indicates the match details of items in each group and colors the groups in accordance with their colors
in the match chart.
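
The figures read from the chart can be reproduced with a short Python sketch that counts how many groups exist for each group size. The function name and the way the sample is rebuilt are assumptions; in particular, the count of 877 unique records is derived here by assuming that every record outside the listed groups is unique.

```python
from collections import Counter


def group_size_distribution(group_ids, hide_groups_less_than=2):
    """Map each group size to the number of groups of that size."""
    records_per_group = Counter(group_ids)
    groups_per_size = Counter(records_per_group.values())
    return {
        size: count
        for size, count in sorted(groups_per_size.items())
        if size >= hide_groups_less_than
    }


# Rebuild a 1000-record sample: one group identifier per record.
group_ids = []
next_id = 0
for size, count in [(1, 877), (2, 49), (3, 7), (4, 1)]:
    for _ in range(count):
        group_ids.extend([next_id] * size)
        next_id += 1

distribution = group_size_distribution(group_ids)
print(len(group_ids), distribution)
```

Lowering hide_groups_less_than to 1 would also show the groups that contain a single, unique record.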

How to show the match results

About this task

To collect duplicates from the input flow according to the match types you define, Levenshtein and Jaro-Winkler in this example, do
the following:

Procedure
Save the settings in the match analysis editor and press F6.
The analysis is executed. The match rule and blocking key are computed against the whole data set and the Analysis Results view
opens in the editor.

In this view, the charts give a global picture about the duplicates in the analyzed data. In the first tables, you can read statistics about
the count of processed records, distinct records with only one occurrence, duplicate records (matched records) and suspect records
that did not match the rule. Duplicate records represent the records that matched with a good score - above the confidence
threshold. One record of the matched pair is a duplicate that should be discarded and the other is the survivor record.
In the second table, you can read statistics about the number of groups and the number of records in each group. You can click any
column header in the table to sort the results accordingly.
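
One possible reading of these statistics is sketched below in Python: records in groups of one are unique, groups whose quality reaches the confident match threshold are counted as matched, and the remaining groups are suspect. The field names id, size and quality, the sample values and this exact split are assumptions made for illustration, not the studio's documented logic.

```python
def classify_groups(groups, confident_match_threshold):
    """Split group identifiers into unique, matched and suspect."""
    result = {"unique": [], "matched": [], "suspect": []}
    for group in groups:
        if group["size"] == 1:
            result["unique"].append(group["id"])
        elif group["quality"] >= confident_match_threshold:
            result["matched"].append(group["id"])
        else:
            result["suspect"].append(group["id"])
    return result


# Hypothetical groups, as they might be summarized after an analysis run.
groups = [
    {"id": 1, "size": 1, "quality": 1.0},
    {"id": 2, "size": 2, "quality": 0.95},
    {"id": 3, "size": 3, "quality": 0.82},
]
classified = classify_groups(groups, confident_match_threshold=0.9)
print(classified)
```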

Importing or exporting match rules

You can import match rules from the studio repository and use them in the match editor to test them on your data. You can also
export match rules from the match editor and save them in the studio repository.

Importing match rules from the repository

Procedure

1. In the match editor, click the icon on top of the editor.

2. In the Match Rule Selector wizard, select the match rule you want to import into the match analysis editor and use on the
analyzed data.
A warning message displays in the wizard if the match rule you want to import is defined on columns that do not exist in the
analyzed data. Ignore the message as you can define input columns later in the match analysis editor.

3. Select the Overwrite current Match Rule in the analysis check box if you want to replace the rule in the editor with the rule
you import, otherwise, leave the box unselected.

4. Click OK.
The match rule is imported and the matching and blocking keys and/or survivorship rules are listed in the Matching Key and
Blocking Key tables respectively.

5. Click in the Input column and select from the list the column on which you want to apply the imported blocking and
matching keys.
If the analyzed data contains a column that matches the input column in the imported keys, it is automatically set
in the Input column; you do not need to define it yourself.
When you analyze data with multiple conditions, the match results will list data records that meet any of the defined rules.
When you execute the match analysis, an OR match operation is conducted on data and data records are evaluated against
the first rule and the records that match are not evaluated against the other rules.
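
The ordered OR evaluation described above can be pictured with the following Python sketch. The rule names, the predicates and the sample records are hypothetical; the point is only that evaluation stops at the first rule that matches.

```python
def first_matching_rule(record_a, record_b, rules):
    """Return the name of the first rule that matches the pair, or None."""
    for name, predicate in rules:
        if predicate(record_a, record_b):
            return name  # the remaining rules are not evaluated for this pair
    return None


# Rules are tried in the order in which they are listed.
rules = [
    ("same_email", lambda a, b: a["email"] == b["email"]),
    ("same_name", lambda a, b: a["lname"] == b["lname"] and a["fname"][0] == b["fname"][0]),
]

john = {"fname": "John", "lname": "Doe", "email": "jd@example.com"}
jon = {"fname": "Jon", "lname": "Doe", "email": "jd@example.com"}
jane = {"fname": "Jane", "lname": "Doe", "email": "jane@example.com"}
ann = {"fname": "Ann", "lname": "Lee", "email": "ann@example.com"}

print(first_matching_rule(john, jon, rules))
print(first_matching_rule(john, jane, rules))
print(first_matching_rule(john, ann, rules))
```

Changing the order of the list changes which rule is credited with a match, which is why the rule order matters.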

Exporting match rules to the repository

Procedure

1. In the match editor, click the icon on top of the editor.

2. In the open wizard, enter a name for the rule and set other metadata, if needed.

3. Click Finish.
The rule editor opens on the rule settings and the rule is saved and listed under Libraries > Rules > Match in the DQ
Repository tree view.

Creating a match rule

In data quality, match rules are used to compare a set of columns and create groups of similar records using blocking and matching
keys and/or survivorship functions.

From the studio, you can create match rules with the VSR or the T-Swoosh algorithm and save them in the studio repository. Once
the rules are centralized in the repository, you can import them into the match analysis editor and test them on your data to
group duplicate records. For further information about the match analysis, see Creating a match analysis.

The two algorithms produce different match results for two reasons:

First, with the VSR algorithm, the master record is simply the first input record of the group. Therefore, the list of
match groups may depend on the order of the input records.

Second, the output records do not change with the VSR algorithm, whereas the T-Swoosh algorithm creates new records.

Defining the rule

About this task

Procedure

1. In the DQ Repository tree view, expand Libraries > Rules.

2. Right-click Match and select New Match Rule.

3. In the New Match Rule wizard, enter a name and set other metadata, if required.

Note:

Avoid using special characters in the item names including:

"~", "!", "`", "#", "^", "&", "*", "\\", "/", "?", ":", ";", "\"", ".", "(", ")", "'", "¥", "‘", "”", "«", "»", "<", ">".

These characters are all replaced with "_" in the file system and you may end up creating duplicate items.

Consider as an example that you want to create a rule to match customer full names.

4. Click Finish.
A match rule editor opens in the studio and the new match rule is listed under Libraries > Rules > Match in the DQ Repository
tree view.

In the Record Linkage algorithm view, the Simple VSR Matcher algorithm is selected by default.

5. Start defining the match rule items as described in Rules with the VSR algorithm and Rules with the T-Swoosh algorithm.
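
The note in step 3 about special characters can be illustrated with a short Python sketch that shows how two different item names may collide once the characters are replaced. The function name is hypothetical and only an ASCII subset of the listed characters is modeled.

```python
# ASCII subset of the special characters listed in the note.
SPECIAL_CHARACTERS = set("~!`#^&*\\/?:;\".()'<>")


def file_system_name(item_name: str) -> str:
    """Replace every special character with an underscore."""
    return "".join("_" if char in SPECIAL_CHARACTERS else char for char in item_name)


print(file_system_name("full.name"))
print(file_system_name("full:name"))
```

Both names map to the same string, so the two items could not be told apart in the file system.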

Duplicating a rule

About this task

To avoid creating a match rule from scratch, you can duplicate an existing one and adapt its metadata and definition to obtain
a new rule.

To duplicate a rule, do the following:

Procedure

1. In the DQ Repository tree view, expand Libraries > Rules > Match.

2. Browse through the match rule list to reach the rule you want to duplicate.

3. Right-click its name and select Duplicate.

The duplicated rule is created under the Match folder in the DQ Repository tree view.

4. Double-click the duplicated rule to open it and modify its metadata and/or definition as needed.

Rules with the VSR algorithm

The VSR algorithm takes a set of records as input and groups similar records together according to the defined match
rules. It compares pairs of records and assigns them to groups. The first processed record of each group is the master record of the
group, so the order of the records matters and can affect which records become master records.

The VSR algorithm compares each record with the master of each group and uses the computed distances from the master records to
decide which group the record should go to.

In the match analysis, the matching results of the VSR algorithm may vary depending on the order of the input records. If possible,
put the records in which you have more confidence first in the input flow, to have better algorithm accuracy.
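
The effect of the input order can be seen in the following Python sketch of master-based grouping. It is a simplified illustration, not the studio's code: the function names and the threshold value are assumptions, and the similarity score comes from difflib.SequenceMatcher in the Python standard library, used here only as a stand-in for a match function.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity score between 0 and 1."""
    return SequenceMatcher(None, a, b).ratio()


def group_with_masters(records, match_threshold):
    """Assign each record to the group whose master it resembles most."""
    groups = []  # the first element of each group is its master record
    for record in records:
        best_group, best_score = None, match_threshold
        for group in groups:
            score = similarity(record, group[0])
            if score >= best_score:
                best_group, best_score = group, score
        if best_group is None:
            groups.append([record])  # the record becomes a new master
        else:
            best_group.append(record)
    return groups


print(group_with_masters(["Jon", "John", "Johnny"], 0.75))
print(group_with_masters(["John", "Jon", "Johnny"], 0.75))
```

With Jon processed first, Johnny is too far from the master and starts its own group; with John processed first, all three names end up in the same group.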

You can import and test the rule on your data in the match analysis editor. For further information, see Importing or exporting match
rules.

Defining a blocking key from the match analysis

About this task

Defining a blocking key is not mandatory but advisable. Using a blocking key partitions data into blocks and thus reduces the number
of records to be examined, as comparisons are restricted to record pairs within each block. Using blocking keys is very useful when
you are processing large data sets.

Procedure

1. In the rule editor and in the Generation of Blocking Key section, click the [+] button to add a row to the table.

2. Set the parameters of the blocking key as follows:

Blocking Key Name: Enter a name for the column you want to use to reduce the number of record pairs that need to
be compared.

Pre-algorithm: Select from the drop-down list an algorithm and set its value where necessary.

Defining a pre-algorithm is not mandatory. This algorithm is used to clean or standardize data before processing it
with the match algorithm and thus improve the results of data matching.

Algorithm: Select from the drop-down list the match algorithm you want to use and set its value where necessary.

Post-algorithm: Select from the drop-down list an algorithm and set its value where necessary.

Defining a post-algorithm is not mandatory. This algorithm is used to clean or standardize data after processing it
with the match algorithm and thus improve the outcome of data matching.

3. If required, follow the same steps to add as many blocking keys as needed.
When you import a rule with many blocking keys into the match analysis editor, only one blocking key will be generated and
listed in the BLOCK_KEY column in the Data table.
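
The chaining of the optional pre-algorithm, the algorithm and the optional post-algorithm can be sketched in Python as a simple pipeline. The three functions used here (removing diacritical marks, keeping the first characters, upper-casing) are hypothetical examples chosen for illustration and do not necessarily correspond to entries of the drop-down lists.

```python
import unicodedata


def remove_diacritics(text: str) -> str:
    """Strip accents by dropping combining marks after decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(char for char in decomposed if not unicodedata.combining(char))


def first_n_chars(n: int):
    """Build a function that keeps the first n characters of a value."""
    return lambda text: text[:n]


def build_blocking_key(value, pre=None, algorithm=None, post=None):
    """Apply the configured steps in order, skipping those left undefined."""
    for step in (pre, algorithm, post):
        if step is not None:
            value = step(value)
    return value


key = build_blocking_key(
    "Élodie",
    pre=remove_diacritics,
    algorithm=first_n_chars(3),
    post=str.upper,
)
print(key)
```

Cleaning the value before the key is cut means that accented and unaccented spellings of the same name land in the same block.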

Defining a matching key

Procedure

1. In the rule editor and in the Matching Key table, click the [+] button to add a row to the table.

2. Set the parameters of the matching key as follows:

Match Key Name: Enter the name of your choice for the match key.

Matching Function: Select the type of matching you want to perform from the drop-down list. Select Custom if you
want to use an external user-defined matching algorithm.

In this example, two match keys are defined: the Levenshtein and Jaro-Winkler match methods are applied to
first names and last names respectively to detect duplicate records.

Custom Matcher: This item is only used with the Custom matching function. Browse and select the Jar file of the user-
defined algorithm.

Handle Null: Specify how to deal with data records which contain null values.

3. In the Match threshold field, enter the match probability threshold. Two data records match when the probability is above
this value.
In the Confident match threshold field, set a numerical value between the current Match threshold and 1. Above this
threshold, you can be confident about the quality of the group.

4. To define a second match rule, place your cursor on the top right corner of the Matching Key table and then click the [+]
button.
Follow the steps to create a match rule.

When you define multiple conditions in the match rule editor, an OR match operation is conducted on the analyzed data.
Records are evaluated against the first rule and the records that match are not evaluated against the second rule.

5. If required, place your cursor on the top right corner of the table, click the button, and then replace the default names of
the rules with names of your choice.

You can also use the up and down arrows in the dialog box to change the rule order and thus decide what rule to execute
first.

6. Click OK.
The rules are named and ordered accordingly in the Matching Key table.

7. Save the match rule settings.

The match rule is saved and centralized under Libraries > Rules > Match in the DQ Repository tree view.
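
To give an idea of the kind of score that the Jaro-Winkler method mentioned in step 2 produces, and how it relates to the threshold set in step 3, here is a Python sketch of the textbook algorithm. The prefix scale of 0.1 and the maximum prefix length of 4 are the values commonly used for this measure; whether the studio uses the same constants is an assumption.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: shared characters within a window, minus transpositions."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    matches = 0
    for i, char in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not matched2[j] and s2[j] == char:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order = 0
    k = 0
    for i, char in enumerate(s1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if char != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order / 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Boost the Jaro score for strings that share a common prefix."""
    score = jaro(s1, s2)
    prefix = 0
    for char1, char2 in zip(s1, s2):
        if char1 != char2 or prefix == 4:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


score = jaro_winkler("MARTHA", "MARHTA")
print(round(score, 4), score >= 0.9)
```

In this sketch the pair would be reported as a match for any Match threshold below the computed score.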

Rules with the T-Swoosh algorithm

You can use the T-Swoosh algorithm to find duplicates and to define how two similar records are merged to create a master record,
using a survivorship function. These new merged records are used to find new duplicates.

The differences between the T-Swoosh and the VSR algorithms are the following:

When using the T-Swoosh algorithm, the master record is in general a new record that does not exist in the list of input
records.
When using the T-Swoosh algorithm, you can define a survivorship function for each column to create a master record.
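
The idea that merged records are reused to find new duplicates can be sketched in Python with a simplified merge loop. This is an illustration in the spirit of the algorithm, not the studio's implementation: the function names, the matching test (same phone or same email) and the merge policy (keep the longest non-null value) are assumptions.

```python
def is_match(a, b):
    """Two records match when they share a non-null phone or email."""
    return any(a[key] is not None and a[key] == b[key] for key in ("phone", "email"))


def merge(a, b):
    """Survivorship stand-in: keep the longest non-null value of each column."""
    return {
        key: max((value for value in (a[key], b[key]) if value is not None),
                 key=len, default=None)
        for key in a
    }


def merge_duplicates(records):
    """Merge matching records and compare each merged record again."""
    pending = list(records)
    masters = []
    while pending:
        current = pending.pop(0)
        for index, master in enumerate(masters):
            if is_match(current, master):
                # The merged record goes back in the queue so that it can
                # attract records that matched neither of its sources.
                pending.append(merge(current, masters.pop(index)))
                break
        else:
            masters.append(current)
    return masters


# Hypothetical input: the first two records only match through the third.
records = [
    {"name": "Jonathan Doe", "phone": "555-0100", "email": None},
    {"name": "J. Doe", "phone": None, "email": "jd@example.com"},
    {"name": "John Doe", "phone": "555-0100", "email": "jd@example.com"},
]
masters = merge_duplicates(records)
print(masters)
```

The single master produced here combines values from several input records and is not identical to any of them.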

Creating a match key

Procedure

1. In the Record linkage algorithm section, select T-Swoosh.

2. In the Match and Survivor section, you define the criteria to use when matching data records. Click the [+] button to add a
new rule, and then set the following criteria.

Match Key Name: Enter the name of your choice for the match key.

Matching Function: Select the type of matching you want to perform from the drop-down list. Select Custom if you
want to use an external user-defined matching algorithm.

Custom Matcher: This item is only used with the Custom matching function. Browse and select the Jar file of the user-
defined algorithm.

Threshold: Specify the match score (between 0 and 1) above which two values should be considered a match.

Handle Null: Specify how to deal with data records which contain null values.

nullMatchNull: If both records contain null values, consider this a match.
