TalendOpenStudio DQ GettingStarted 7.0.1 EN
7.0.1
Contents
Copyleft
Introduction to Talend Open Studio for Data Quality
Functional architecture of Talend Open Studio for Data Quality
Prerequisites to using Talend Open Studio for Data Quality
Memory requirements
Software requirements
Installing Java
Setting up the Java environment variable on Windows
Setting up the Java environment variable on Linux
Installing 7-Zip (Windows)
Downloading and installing Talend Open Studio for Data Quality
Downloading Talend Open Studio for Data Quality
Installing Talend Open Studio for Data Quality
Configuring and setting up your Talend product
Launching the Studio for the first time
Installing additional packages
Profiling data
Setting up input data
Identifying anomalies in data
Browsing non-match data
What's next
Copyleft
Adapted for 7.0.1. Supersedes previous releases.
Publication date: April 13, 2018
This documentation is provided under the terms of the Creative Commons Public License (CCPL).
For more information about what you can and cannot do with this documentation in accordance with
the CCPL, please read: http://creativecommons.org/licenses/by-nc-sa/2.0/.
Notices
Talend is a trademark of Talend, Inc.
All brands, product names, company names, trademarks and service marks are the properties of their
respective owners.
License Agreement
The software described in this documentation is licensed under the Apache License, Version 2.0 (the
"License"); you may not use this software except in compliance with the License. You may obtain
a copy of the License at http://www.apache.org/licenses/LICENSE-2.0.html. Unless required by
applicable law or agreed to in writing, software distributed under the License is distributed on an "AS
IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
License for the specific language governing permissions and limitations under the License.
This product includes software developed at ASM, ANTLR, Apache ActiveMQ, Apache Ant, Apache
Axiom, Apache Axis, Apache Axis 2, Apache Chemistry, Apache Common Http Client, Apache Common
Http Core, Apache Commons, Apache Commons Bcel, Apache Commons Lang, Apache Datafu, Apache
Derby Database Engine and Embedded JDBC Driver, Apache Geronimo, Apache HCatalog, Apache
Hadoop, Apache Hbase, Apache Hive, Apache HttpClient, Apache HttpComponents Client, Apache
JAMES, Apache Log4j, Apache Neethi, Apache POI, Apache Pig, Apache Thrift, Apache Tomcat, Apache
Xml-RPC, Apache Zookeeper, CSV Tools, DataNucleus, Doug Lea, Ezmorph, Google's phone number
handling library, Guava: Google Core Libraries for Java, H2 Embedded Database and JDBC Driver,
HighScale Lib, HsqlDB, JSON, JUnit, Jackson Java JSON-processor, Java API for RESTful Services, Java
Universal Network Graph, Jaxb, Jaxen, Jetty, Joda-Time, Json Simple, MapDB, MetaStuff, Paraccel JDBC
Driver, PostgreSQL JDBC Driver, Protocol Buffers - Google's data interchange format, Resty: A simple
HTTP REST client for Java, SLF4J: Simple Logging Facade for Java, SQLite JDBC Driver, The Castor
Project, The Legion of the Bouncy Castle, Woden, Xalan-J, Xerces2, XmlBeans, XmlSchema Core,
atinject. Licensed under their respective license.
Introduction to Talend Open Studio for Data Quality
Talend provides unified development and management tools to integrate and process all of your data
with an easy-to-use visual designer.
From Talend Open Studio for Data Quality, users can access and examine the data available in
different data sources and collect statistics and information about this data.
Prerequisites to using Talend Open Studio for Data Quality
This chapter provides basic software and hardware information required and recommended to get
started with your Talend Open Studio for Data Quality.
• Memory requirements
• Software requirements
It also guides you through installing and configuring required and recommended third-party tools:
• Installing Java
• Setting up the Java environment variable on Windows or Setting up the Java
environment variable on Linux
• Installing 7-Zip (Windows)
Memory requirements
To make the most out of your Talend product, please consider the following memory and disk space
usage:
Software requirements
To make the most out of your Talend product, please consider the following system and software
requirements:
Required software
• Operating System for Talend Studio:
Optional software
• 7-Zip. See Installing 7-Zip (Windows).
Installing Java
To use your Talend product, you need Oracle Java Runtime Environment installed on your computer.
Procedure
1. From the Java SE Downloads page, under Java Platform, Standard Edition, click the JRE Download.
2. From the Java SE Runtime Environment 8 Downloads page, click the radio button to Accept License
Agreement.
3. Select the appropriate download for your Operating System.
4. Follow the Oracle installation steps to install Java.
Results
Once Java is installed on your computer, you need to set up the JAVA_HOME environment variable. For
more information, see:
• Setting up the Java environment variable on Windows.
• Setting up the Java environment variable on Linux.
Setting up the Java environment variable on Windows
Procedure
1. Go to the Start Menu of your computer, right-click on Computer and select Properties.
2. In the Control Panel Home window, click Advanced system settings.
3. In the System Properties window, click Environment Variables....
4. Under System Variables, click New... to create a variable. Name the variable JAVA_HOME, enter the
path to the Java 8 JRE, and click OK.
Example of default JRE path: C:\Program Files\Java\jre1.8.0_77.
5. Under System Variables, select the Path variable and click Edit... to add the previously defined
JAVA_HOME variable at the end of the Path environment variable, separated by a semicolon.
Example: <PathVariable>;%JAVA_HOME%\bin.
Setting up the Java environment variable on Linux
Procedure
1. Find the JRE installation home directory.
Example: /usr/lib/jvm/jre1.8.0_65
2. Export it in the JAVA_HOME environment variable.
Example:
export JAVA_HOME=/usr/lib/jvm/jre1.8.0_65
export PATH=$JAVA_HOME/bin:$PATH
3. Add these lines at the end of the user profile in the ~/.profile file or, as a superuser, at the end
of the global profile in the /etc/profile file.
4. Log on again.
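After logging on again, it can be useful to confirm that JAVA_HOME is visible and points to a Java installation. The following is a minimal Python sketch, not part of the Talend tooling: the function name check_java_home is hypothetical, and the check assumes that a valid installation contains a bin/java (or bin/java.exe on Windows) file.

```python
import os
from pathlib import Path


def check_java_home(env=None):
    """Return (ok, message) describing whether JAVA_HOME looks usable.

    Assumption: a usable JAVA_HOME contains bin/java (Linux) or bin/java.exe (Windows).
    """
    env = os.environ if env is None else env
    java_home = env.get("JAVA_HOME")
    if not java_home:
        return (False, "JAVA_HOME is not set")
    bin_dir = Path(java_home) / "bin"
    # Accept either the Linux or the Windows executable name.
    if not any((bin_dir / name).is_file() for name in ("java", "java.exe")):
        return (False, f"no java executable under {bin_dir}")
    return (True, f"JAVA_HOME points to {java_home}")


if __name__ == "__main__":
    ok, message = check_java_home()
    print(message)
```

The same check applies on Windows after the JAVA_HOME and Path variables have been set as described earlier.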
Installing 7-Zip (Windows)
Procedure
1. Download the 7-Zip installer corresponding to your Operating System.
2. Navigate to your local folder, locate and double-click the 7z exe file to install it.
Results
The download will start automatically.
Downloading and installing Talend Open Studio for Data Quality
Downloading Talend Open Studio for Data Quality
Procedure
1. Go to the Talend Open Studio for Data Quality download page.
2. Click DOWNLOAD FREE TOOL.
Results
The download will start automatically.
Installing Talend Open Studio for Data Quality
On Windows, Talend recommends that you install 7-Zip and use it to extract files. For more information,
see Installing 7-Zip (Windows).
To install the studio, follow the steps below:
Procedure
1. Navigate to your local folder, locate the TOS zip file and move it to another location with a path as
short as possible and without any space character.
Example: C:/Talend/
2. Unzip it by right-clicking on the compressed file and selecting 7-Zip > Extract Here.
If you do not want to use 7-Zip, you can use the default Windows unzipping tool.
Procedure
1. Right-click the compressed file and select Extract All.
2. Click Browse and navigate to the C: drive.
3. Select Make new folder and name the folder Talend. Click OK.
4. Click Extract to begin the installation.
On Linux, follow the steps below:
Procedure
1. Navigate to your local folder, locate the zip file and move it to another location with a path as
short as possible and without any space character.
Example: home/user/talend/
2. Unzip it by right-clicking on the compressed file and selecting Extract Here.
Configuring and setting up your Talend product
This chapter provides basic information required to configure and set up your Talend Open Studio for
Data Quality.
Launching the Studio for the first time
Procedure
1. Double-click the executable file corresponding to your operating system, for example:
• TOS_*-win-x86_64.exe, for Windows.
• TOS_*-linux-gtk-x86_64, for Linux.
• TOS_*-macosx-cocoa.app, for Mac.
2. In the User License Agreement dialog box that opens, read and accept the terms of the end user
license agreement to proceed.
Results
The Talend Studio opens briefly, then the Connect to TalendForge wizard opens. You can connect to it
to benefit from the Talend community or Skip this step.
Installing additional packages
Procedure
1. When the Additional Talend Packages wizard opens, install additional packages by selecting the
Required and Optional third-party libraries check boxes and clicking Finish.
This wizard opens each time you launch the studio if any additional package is available for
installation unless you select the Do not show this again check box. You can also display this
wizard by selecting Help > Install Additional Packages from the menu bar.
For more information, see the section about installing additional packages in the Talend Open
Studio for Data Quality Installation and Upgrade Guide.
2. In the Download external modules window, click the Accept all button at the bottom of the wizard
to accept all the licenses of the external modules used in the studio.
Depending on the libraries you selected, you may need to accept their license more than once.
Wait until all the libraries are installed before starting to use the studio.
3. If required, restart your Talend Studio for certain additional packages to take effect.
Profiling data
This chapter takes the example of a company that provides movie rental and streaming video services,
and shows how such a company could make use of Talend Studio.
You will work with data about your customers as you learn how to validate email addresses for
customers and standardize phone numbers before sending them to the Customer Support System.
Setting up input data
Procedure
1. Open the MySQL Workbench to launch an instance of the database.
2. From the menu bar, select Server > Data Import to open the import wizard.
3. Select the Import from Self-Contained File option and browse to where you have stored the
gettingstarted.sql file.
4. Select the schema to which you want to import the data, or click New... to define a new schema.
5. Click Start Import in the lower right corner.
Results
The gettingstarted database is imported into the MySQL database.
Identifying anomalies in data
Procedure
1. Create a column analysis on customer email addresses and phone numbers. For further
information, see Defining a column analysis.
2. From the analysis editor, connect to the database that holds the customer data. For further
information, see Creating the database connection.
3. Add indicators to provide simple statistics on the data, such as row, blank and duplicate counts. For
further information, see Setting system indicators.
4. Add standard patterns against which to match email addresses and phone numbers. For further
information, see Setting patterns.
5. Execute the analysis to show results in tables and charts. For further information, see Showing
analysis results.
6. Access a view of the analyzed data to see invalid records. For further information, see Browsing
non-match data.
Defining a column analysis
Procedure
1. In the DQ Repository tree view, right-click Analyses and select New Analysis.
The [Create New Analysis] wizard opens.
2. Start typing Basic column analysis in the search field, select Basic Column Analysis from the list
and click Next.
3. In the Name field, enter a name for the analysis.
The Name field is mandatory. Do not use spaces or special characters in the analysis name.
4. Set a purpose and a description for the analysis, and click Finish to open the analysis editor.
The Purpose and Description fields are not mandatory, but it is advisable to fill them in because this
information is displayed in Detail View when you select the analysis.
Results
The new analysis is listed under the Analyses folder in the DQ Repository tree view.
Creating the database connection
Procedure
1. In the analysis editor, click the New Connection tab to open the [Create New Connection] wizard.
2. From the Connection Type list, select DB connections and click Next.
3. Click Finish to create the database connection, list it under the Metadata node and open a new
step in the wizard.
4. Expand the database connection, click on the table name and select the check boxes of the
columns on which you want to create the analysis.
5. Click OK to close the wizard and list the columns in the analysis editor.
You can click Refresh Data to display the actual data in the analysis editor.
Setting system indicators
Procedure
1. In the Data Preview section in the analysis editor, click Select indicators to open the [Indicator
Selection] dialog box.
2. Expand Simple Statistics and select Row Count, Blank Count and Duplicate Count. Click OK to close
the dialog box.
You want to see the row, blank and duplicate counts in the Email and Phone columns to see how
consistent the data is.
Indicators are added accordingly to the columns in the Analyzed Columns section.
3. Click the icon next to the Duplicate Count and Blank Count indicators and set 0 in the Upper
threshold field.
Defining thresholds on the Email and Phone columns is helpful because any duplicate or blank
count above the threshold is then written in red in the analysis results.
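To make the indicators and the upper threshold concrete, here is a minimal Python sketch of how row, blank and duplicate counts could be computed for one column and compared with a threshold of 0. It is an illustration only, not the studio's implementation: the function names and sample values are hypothetical, and the definitions of "blank" and "duplicate" used here are assumptions stated in the comments.

```python
from collections import Counter


def simple_statistics(values):
    """Return row, blank and duplicate counts for one column of string values."""
    non_null = [v for v in values if v is not None]
    occurrences = Counter(non_null)
    return {
        "row_count": len(values),
        # Assumption: a blank is an empty or whitespace-only string (None is not blank).
        "blank_count": sum(1 for v in non_null if v.strip() == ""),
        # Assumption: each distinct value that appears more than once counts once.
        "duplicate_count": sum(1 for n in occurrences.values() if n > 1),
    }


def over_upper_threshold(stats, upper=0):
    """Names of the blank/duplicate indicators whose count exceeds the upper threshold."""
    return [name for name in ("blank_count", "duplicate_count") if stats[name] > upper]


# Hypothetical sample of an Email column.
emails = ["ann@example.com", "", "ann@example.com", "bob@example.com", "  ", None]
stats = simple_statistics(emails)
print(stats)
print(over_upper_threshold(stats))
```

With an upper threshold of 0, any blank or duplicate value at all is enough to flag the indicator, which corresponds to the counts being written in red.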
Setting patterns
This column analysis uses predefined patterns to match the content of the Email and Phone columns
against standard email and US phone patterns respectively. This describes the content, structure and
quality of the email addresses and phone numbers, and gives the percentage of the data that matches
the standard formats and the percentage that does not.
Procedure
1. In the Data Preview section in the analysis editor, click the icon next to the Email column to
open the [Pattern Selector] dialog box.
2. Expand Regex > internet, select the Email Address check box and click OK to close the dialog box.
The pattern is added to the column in the Analyzed Columns section.
3. Click the icon next to the Phone column to open the [Pattern Selector] dialog box.
4. Expand Regex > phone, select the US phone numbers check box and click OK to close the dialog
box.
The pattern is added to the column in the Analyzed Columns section.
5. Click the icon next to the Email Address and US phone numbers patterns and set 98.0 in the
Lower threshold (%) fields.
If the percentage of records that match a pattern is lower than 98%, it is written in red in
the analysis results.
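The pattern matching and lower threshold described above can be sketched in a few lines of Python. The two regular expressions below are simplified stand-ins chosen for illustration; they are assumptions and are not the Email Address and US phone numbers patterns shipped with the studio. The function names are hypothetical as well, and the sketch assumes that null values count as non-matching.

```python
import re

# Simplified illustrative patterns (assumptions, not the studio's predefined regexes).
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+")
US_PHONE_PATTERN = re.compile(r"(\+?1[ .-]?)?(\(\d{3}\)|\d{3})[ .-]?\d{3}[ .-]?\d{4}")


def match_percentage(values, pattern):
    """Percentage of values that match the pattern in full (None never matches)."""
    if not values:
        return 0.0
    matches = sum(1 for v in values if v is not None and pattern.fullmatch(v))
    return 100.0 * matches / len(values)


def below_lower_threshold(percentage, lower=98.0):
    """True when the match percentage is lower than the lower threshold (%)."""
    return percentage < lower


# Hypothetical sample values.
emails = ["ann@example.com", "bob@example", "carl@example.org", ""]
phones = ["555-123-4567", "(555) 123-4567", "12345"]

for name, values, pattern in (
    ("Email", emails, EMAIL_PATTERN),
    ("Phone", phones, US_PHONE_PATTERN),
):
    pct = match_percentage(values, pattern)
    print(name, round(pct, 2), "below threshold" if below_lower_threshold(pct) else "ok")
```

A lower threshold of 98.0 therefore flags a column as soon as more than 2% of its values fail to match the pattern.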
Showing analysis results
Procedure
1. In the Analysis Parameters, select java from the Execution engine list to run the analysis with the
Java engine.
2. In the analysis editor, press F6 to execute the analysis or click the Run button.
The editor switches to the Analysis Results view. The analysis results show the generated charts
for the analyzed columns, accompanied by tables that detail the statistical and pattern-matching
results.
The results for the Email column look like the following:
Results
The result sets for the Email and Phone columns give the count of the records that match and those
that do not match the standard email pattern and the standard US phone numbers respectively. The
results also give the blank and duplicate counts. This shows that the data is not very consistent and
that it needs to be corrected.
Browsing non-match data
Procedure
1. In the Analysis Results view, right-click the Blank Count in the statistic results of the Email column
and select, for example, View rows.
A view opens listing all the blank rows in the Email column.
2. In the Analysis Results view, right-click the result in the Pattern Matching of the Email column and
select, for example, View invalid values.
Results
A view opens listing all the invalid email addresses.
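Conceptually, the two views opened above are filters over the analyzed rows. The following minimal Python sketch shows that idea on in-memory records. The row layout (the id and email keys), the function names and the simplified email pattern are all hypothetical, and the sketch assumes that a blank or missing value also counts as invalid.

```python
import re

# Simplified illustrative pattern (an assumption, not the studio's Email Address regex).
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+")


def blank_rows(rows, column):
    """Rows whose value in the column is an empty or whitespace-only string."""
    return [r for r in rows if r.get(column) is not None and r[column].strip() == ""]


def invalid_rows(rows, column, pattern):
    """Rows whose value in the column does not fully match the pattern."""
    return [r for r in rows if not pattern.fullmatch(r.get(column) or "")]


# Hypothetical customer records.
customers = [
    {"id": 1, "email": "ann@example.com"},
    {"id": 2, "email": ""},
    {"id": 3, "email": "bob@example"},
]

print(blank_rows(customers, "email"))
print(invalid_rows(customers, "email", EMAIL_PATTERN))
```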
What's next
You have learned how Talend Studio helps you profile your data and collect statistics and information
about it in order to assess the quality of the data against defined goals.
You have seen:
• How to use the Profiling perspective of the studio to analyze customer email addresses and phone
numbers by using out-of-the-box indicators and patterns.
• How the analysis results show the matching and non-matching address records and how it is
possible to browse such data.
Once you succeed with the simple procedures outlined in Identifying anomalies in data,
you can start digging deeper to see in detail all the profiling capabilities of Talend Studio.
For further information about Talend Studio, see Talend Studio User Guide.
To learn more about Talend products and solutions, visit www.talend.com.