DataFlux Data Management Studio:
Basics
Course Notes
DataFlux Data Management Studio: Basics Course Notes was developed by Kari Richardson. Editing
and production support was provided by the Curriculum Development and Support Department.
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of
SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product
names are trademarks of their respective companies.
DataFlux Data Management Studio: Basics Course Notes
Copyright © 2012 SAS Institute Inc. Cary, NC, USA. All rights reserved. Printed in the United States of
America. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in
any form or by any means, electronic, mechanical, photocopying, or otherwise, without the prior written
permission of the publisher, SAS Institute Inc.
Book code E2229, course code DQ22DMP1, prepared date 09May2012. DQ22DMP1_001
ISBN 978-1-61290-309-5
For Your Information iii
Table of Contents
Exercises.................................................................................................................. 3-53
Demonstration: Profiling Data Using Text File Input ............................................ 3-56
Demonstration: Profiling Data Using Filtered Data and an SQL Query ................ 3-61
Demonstration: Profiling Data Using a Collection ................................................ 3-68
Exercises.................................................................................................................. 3-70
Demonstration: Data Profiling – Additional Analysis (Self-Study) ........................ 3-73
Exercises................................................................................................................ 4-108
Demonstration: Creating an Entity Resolution Job ............................................... 4-112
Demonstration: Creating a Data Job to Compare Clusters (Optional) .................. 4-138
Course Description
This course is for data quality stewards who perform data management tasks, such as data quality, data
enrichment, and entity resolution.
To learn more…
For information about other courses in the curriculum, contact the SAS
Education Division at 1-800-333-7660, or send e-mail to [email protected].
You can also find this information on the Web at support.sas.com/training/
as well as in the Training Course Catalog.
For a list of other SAS books that relate to the topics covered in this
Course Notes, USA customers can contact our SAS Publishing Department
at 1-800-727-3228 or send e-mail to [email protected]. Customers outside
the USA, please contact your local SAS office.
Also, see the Publications Catalog on the Web at support.sas.com/pubs for
a complete list of books and a convenient order form.
Prerequisites
There currently are no prerequisites for this course.
Chapter 1 Introduction to DataFlux
Methodology and Course Flow
1.1 Introduction
Objectives
Provide an overview of DataFlux.
Explore the DataFlux Data Management methodology.
Introduction to DataFlux
DataFlux is a leading provider of data management
solutions.
Founded in 1997
The methodology diagram shows the three phases and their steps: PLAN (Define, Discover), ACT (Design, Execute), and MONITOR (Evaluate, Control).
The DataFlux Data Management Methodology is a step-by-step process for performing data management
tasks, such as data quality, data integration, data migrations, and MDM (Master Data Management).
When organizations plan, take action on, and monitor data management projects, they build the
foundation to optimize revenue, control costs, and mitigate risks. No matter what type of data you
manage, DataFlux technology can help you gain a more complete view of corporate information.
PLAN
DEFINE The planning stage of any data management project starts with this essential first step.
This is where the people, processes, technologies, and data sources are defined.
Roadmaps are built that include articulating the acceptable outcomes. Finally, the
cross-functional teams across business units and between business and IT communities
are created to define the data management business rules.
DISCOVER A quick inspection of your corporate data would probably find that it resides in many
different databases. The data is managed by many different systems, with many
different formats and representations of the same data. This step of the methodology
enables you to explore metadata to verify that the right data sources are included
in the data management program. It also enables you to create detailed data profiles
of identified data sources to understand their strengths and weaknesses.
ACT
DESIGN After you complete the first two steps, this phase enables you to take the different
structures, formats, data sources, and data feeds, and create an environment that
accommodates the needs of your business. At this step, business and IT users build
jobs to enforce business rules for data quality and data integration. They create data
models to house data in consolidated or master data sources.
EXECUTE After business users establish how the data and rules should be defined, the IT staff can
install them in the IT infrastructure and determine the integration method – real time,
batch, or virtual. These business rules can be reused and redeployed across
applications, which helps increase data consistency in the enterprise.
MONITOR
EVALUATE This step of the methodology enables users to define and enforce business rules
to measure the consistency, accuracy, and reliability of new data as it enters the
enterprise. Reports and dashboards about critical data metrics are created for business
and IT staff members. The information gained from data monitoring reports is used
to refine and adjust the business rules.
CONTROL The final stage in a data management project involves examining any trends to validate
the extended use and retention of the data. Data that is no longer useful is retired.
The project’s success can then be shared throughout the organization. The next steps
are communicated to the data management team to lay the groundwork for future data
management efforts.
Objectives
Review the steps for the course.
Provide an overview of the course data.
This step involves becoming familiar with the DataFlux Data Management Studio interface. As part of the
exploration, you define a repository, verify the course QKB (Quality Knowledge Base), and define a
new data connection.
1.2 Course Flow
The PLAN phase of the DataFlux Data Management Methodology involves exploring the structural
information in the data sources and analyzing their data values. As a result of that analysis, this phase
builds any needed schemes.
The analysis and exploration of the data sources can lead to the discovery of data quality issues. The ACT
phase is designed to create the data jobs that cleanse or correct the data. The steps involved do the following:
• Standardize, parse, and/or case the data
• Correctly identify types of data (identification analysis)
• Perform methods to remove duplicates from data sources, or to join tables with no common key
After data corrections are applied, a final phase in the DataFlux Data Management Methodology
is to set up processes to monitor various aspects of the data.
Course Data
dfConglomerate has many lines of business (for example, Gifts, Grocery, Clothing, and Auto) and is in
need of an enterprise-level view of its data. This course works with two lines of data:
• dfConglomerate Gifts (used in demonstrations)
• dfConglomerate Grocery (used in exercises)
Founded in 1926, dfConglomerate pursued a policy of managed growth and acquisitions to transform
itself into an organization with worldwide operations and more than 30,000 employees. Headquartered
in Toledo, Ohio, dfConglomerate has offices in London, Frankfurt, Rio de Janeiro, Tokyo, and Bangalore.
Because dfConglomerate operates across dozens of different lines of business, the reach of its products
and services encompasses everything from heavy mining and manufacturing to automobiles and
consumer goods. dfConglomerate prides itself on its decentralized management structure, with each
division having a high degree of autonomy over its own operations.

Recently, corporate-level executives realized that the company's history of mergers and acquisitions
endowed it with a legacy of data silos, incompatible systems, and bad and duplicate data. It became
apparent that the company lacked the essential enterprise-level view of its data. It also realized that to
achieve this view, the company needed to improve the overall quality of its data. To do so, dfConglomerate
began establishing data management centers of excellence, and used these as pilot programs with the
intent of replicating them across the enterprise.
There are five tables that exist in the dfConglomerate Gifts Microsoft Access database.
There are two tables that exist in the dfConglomerate Grocery Microsoft Access database.
Chapter 2 DataFlux Data
Management Studio: Getting Started
2.1 Introduction
Objectives
Provide an overview of the components of the
DataFlux Data Management Platform.
Define key components and terms of DataFlux Data
Management Studio.
Define a DataFlux Data Management Studio
repository.
DataFlux Data Management Studio provides a single interface for both business and IT users. Built on
DataFlux's technology foundation, DataFlux Data Management Studio provides a unified development
and delivery environment, giving organizations a single point of control to manage data quality, data
integration, master data management (MDM), or any other enterprise data initiative. DataFlux Data
Management Studio enables top-down design of business process and rules, as well as bottom-up
development and delivery of specific components in a single user environment. Building jobs and
business rules in DataFlux Data Management Studio enables transparency in the design process,
collaborative development, and change management during and after the deployment.
Working in unison with DataFlux Data Management Studio, DataFlux Data Management Server helps
form the backbone of the DataFlux Data Management Platform. The technology can implement the rules
created in DataFlux Data Management Studio in both batch and real-time environments. Through an
innovative, standards-based service-oriented architecture (SOA), DataFlux Data Management Server
frees you from writing and testing hundreds of lines of extra code to enable real-time data quality, data
integration, and data governance routines. DataFlux Data Management Server is the workhorse that
enables pervasive data quality, data integration, and master data management (MDM) throughout your
organization.
DataFlux provides integrated views of data located in multiple repositories without the need to physically
consolidate the data. This virtual integration enables powerful querying capabilities, as well as improved
source data management. Data federation efficiently joins data from many heterogeneous sources, without
moving or copying the data. The result is improved performance and speed, while reducing costly and
labor-intensive data movement and consolidation. DataFlux Federation Server enables organizations to
view data from multiple sources through an integrated, virtual data view. While the data remains stored in
its source application, multiple users can view integrated data without physically moving or changing
data. The data appears as if it were physically integrated in one place, either as a table, file, or Web call.
The DataFlux Authentication Server acts as a central point of security for users of the DataFlux Data
Management Platform. By providing authentication services, the Authentication Server helps the users of
DataFlux Data Management Studio securely connect to the DataFlux Data Management Server and/or
database servers across your enterprise.
The interface screenshot labels the main menu, navigation pane, navigation riser bars, information pane, secondary toolbar, secondary tabs, resource panes, work area, details pane, and the control to detach the selected tab.
In this demonstration, you explore and navigate the interface for DataFlux Data Management Studio.
1. Select Start → All Programs → DataFlux Data Management Studio 2.2.
A splash screen appears as DataFlux Data Management Studio is initialized.
An interactive Help window for the DataFlux Data Management Methodology also appears by
default.
2. Clear the Display Data Management Methodology on Startup check box. (This check box appears
in the right of the bottom panel of this interactive Help window.)
The DataFlux Data Management Methodology Help can be redisplayed in DataFlux Data
Management Studio by clicking anywhere in the Data Management Methodology portlet.
The left side of the interface displays navigation information that includes a navigation pane and a
Navigation Risers Bar area. The right side of the interface displays the information pane for the
selected element of the navigation risers bar. The main menu and a toolbar are also on the Home tab.
4. Verify that the main menu and toolbar appear on the Home tab.
5. Verify that the left side of the interface displays navigation information.
6. Verify that the right side of the interface displays the Information pane. The Information pane
includes areas such as Recent Files, Methodology, Data Roundtable Discussions, Documentation,
Settings, and Links.
There are four main items in the Define portion of the methodology (Connect to Data, Explore
Data, Define Business Rules, Build Schemes). Selecting any of these items provides a brief
overview and a link to more in-depth help.
b. Click Explore Data.
c. Click Click here to learn more.
The online Help for DataFlux Data Management Studio appears for the topic of Data
Explorations.
The Data riser bar allows a user to work with collections, data connections, and master data
foundations.
Collections folder – A collection is a set of fields that are selected from tables that are accessed
from different data connections. A collection provides a convenient way for a user to build a data
set using those fields. A collection can be used as an input source for other components in Data
Management Studio, such as the Data Viewer, profiles, queries, and data explorations.
Data Connections folder – Data connections are used to access data in jobs, profiles, data
explorations, and data collections.
Master Data Foundations folder – The Master Data Foundation feature in Data Management
Studio uses master data projects and entity definitions to develop the best possible record for a
specific resource, such as a customer or a product, from all of the source systems that might
contain a reference to that resource.
The information area provides overview information about all defined data connections such as
names, descriptions, and types.
Different types of data connections can be made in DataFlux Data Management Studio:
• ODBC connections
• Federated Server connections
• Localized DSN connections
• Custom connections
• SAS Data Set connections
ODBC Connection – Displays the Microsoft Windows ODBC Data Source Administrator dialog box,
which you can use to create ODBC connections.
Federated Server Connection – Enables you to create a federated server data connection. A federated
connection enables you to establish a threaded connection to a database.
Localized DSN Connection – Enables you to create a localized DSN connection definition for a
specific data source that is available via the federated server. This connection definition is used when
you access federated tables. These localized connections are federated server connections to a DBMS
that are named and created as an extra connection to the same source in metadata.
Custom Connection – Enables you to create a custom connection string for non-ODBC connection
types. These custom strings are useful for establishing native connections to third-party databases or
drawing data from more than one type of data input.
SAS Data Set Connection – Enables you to create SAS data set connections.
12. Click the DataFlux Sample data connection in the Navigation pane.
The tables accessible through this data connection are revealed in the information area.
13. In the Navigation area, click the Data Management Servers riser bar.
In this Navigation pane, you can define new server instances or import jobs to process on the server.
The Administration navigation pane enables you to manage various items such as repositories, the
Quality Knowledge Base (QKB), reference sources (data packs), and macro files.
What Is a Repository?
A Data Management Studio repository is a collection of data and metadata around DataFlux objects.
DataFlux Data Management Studio repositories are used to do the following:
• Organize work
• Show lineage
A repository consists of two parts:
• Data storage
• File storage
This demonstration illustrates the steps necessary to create a DataFlux Data Management Studio
repository.
1. If necessary, access the Administration riser bar.
a. Select Start → All Programs → DataFlux Data Management Studio 2.2.
b. Verify that the Home tab is selected.
c. Click the Administration riser bar.
2. Click Repository Definitions in the list of Administration items on the Navigation pane.
The information pane displays overall repository information.
7) Click Open.
c. Click Browse next to the Folder field in the File storage area.
1) Navigate to S:\Workshop\dqdmp1\Demos.
2) Click Make New Folder.
3) Type files as the value for the name of the new folder.
4) Press ENTER.
5) Click OK.
d. Clear Private.
The final settings for the new repository definition should resemble the following:
e. Click OK.
A message window appears and states that the repository does not exist.
f. Click Yes.
g. Click Close.
The repository is created and connected.
It is a best practice to use all lowercase and no spaces for folders that might
be used on the DataFlux Data Management Server because some server
operating systems are case sensitive. Because profiles and explorations run on
a DataFlux Data Management Server, you want to adhere to this best
practice.
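The naming convention in this note can be captured in a small helper. This is an illustrative sketch (the function name and exact rules are ours, not part of DataFlux):

```python
def normalize_folder(name: str) -> str:
    """Lowercase a folder name and replace spaces with underscores,
    matching the best practice for folders that may be deployed to a
    case-sensitive DataFlux Data Management Server host."""
    return name.strip().lower().replace(" ", "_")

# For example, "Batch Jobs" becomes "batch_jobs".
```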
The final set of folders for the Basics Demos repository should resemble the following:
Exercises
2. Updating the Set of Default Folders for the Basics Exercises Repository
Create two additional folders for the new Basics Exercises repository:
• output_files
• profiles_and_explorations
The final set of folders should resemble the following:
2.2 Quality Knowledge Base and Reference Sources
Objectives
Define a Quality Knowledge Base (QKB).
Define reference sources.
The Quality Knowledge Base (QKB) is a collection of files and configuration settings that contain all the
DataFlux data management algorithms. DataFlux is Unicode compliant, which means that it can read data
from any language. However, DataFlux only has definitions in the Quality Knowledge Base (QKB) to
standardize, identify, fuzzy match, and so on for data from particular locales. Each QKB supports data
management operations for a specific business area. The QKB for Contact Information supports
management of commonly used contact information for individuals and organizations — for example,
names, addresses, company names, and phone numbers. The supported locales for the QKB for Contact
Information are indicated in dark blue (or black) on the accompanying slide. Locales are continuously
added, so contact your DataFlux representative for future locale support.
In addition, the DataFlux Data Management Studio application can perform cleansing functions for
non-contact information data as well. Depending on the activities you want to perform, you might need to
create your own definitions in the DataFlux Quality Knowledge Base using the Data Management
Studio – Customize module. This module enables you to modify existing definitions or create new
definitions in the QKB. (The Data Management Studio – Customize module has its own training course.)
Quality Knowledge Base locations are registered on the Administration Riser Bar in DataFlux Data
Management Studio. One QKB location needs to be designated as the default QKB location.
Reference sources include data packs such as Geo+Phone data and World data.
A reference object is typically a database used by DataFlux Data Management Studio to compare user
data to a reference source (for example, USPS Address Data). You cannot directly access or modify
references.
Reference source locations are registered on the Administration riser bar in DataFlux Data Management
Studio. One reference source location of each type should be designated as the default for that type.
The screenshot shows the Administration riser bar with the Reference Sources item selected and summary information for reference sources displayed.
This demonstration illustrates the verification of the course QKB, as well as the reference sources that can
be defined on the course image.
1. If necessary, select Start → All Programs → DataFlux Data Management Studio 2.2.
DataFlux Data Management Studio appears.
2. Verify that the Home tab is selected.
3. Click the Administration riser bar.
Objectives
Explore data connections.
Data connections enable you to access your data in Data Management Studio from many types of data
sources.
• ODBC Connection – Displays the Microsoft Windows ODBC Data Source Administrator dialog box,
which you can use to create ODBC connections.
• Federated Server Connection – Enables you to create a federated server data connection. A federated
connection enables you to establish a threaded connection to a database.
• Localized DSN Connection – Enables you to create a localized DSN connection definition for a
specific data source that is available via the federated server. This connection definition is used when
you access federated tables. These localized connections are federated server connections to a DBMS.
They are named and created as an extra connection to the same source in metadata.
• Custom Connection – Enables you to create a custom connection string for non-ODBC connection
types. These custom strings are useful for establishing native connections to third-party databases or
drawing data from more than one type of data input.
• SAS Data Set Connection – Enables you to create SAS data set connections.
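Behind the scenes, connection definitions like these resolve to keyword=value connection strings. As a rough illustration of that format (the driver, host, and key names below are invented, not an exact DataFlux or ODBC specification):

```python
def parse_conn_string(conn: str) -> dict:
    """Split a 'KEY=value;KEY=value' style connection string into a
    dictionary, skipping empty segments."""
    pairs = (seg.split("=", 1) for seg in conn.split(";") if seg)
    return {key.strip(): value for key, value in pairs}

# Hypothetical custom connection string for a non-ODBC source:
custom = parse_conn_string("DRIVER=SomeNativeDriver;HOST=dbhost;PORT=1521")
# custom["HOST"] is "dbhost"
```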
2.3 Data Connections
The screenshots show the Data Connections item selected, with summary information for data connections displayed, and the Data Viewer.
This demonstration illustrates how to define and work with a data connection, including using the data
viewer and generating queries.
1. If necessary, select Start → All Programs → DataFlux Data Management Studio 2.2.
DataFlux Data Management Studio appears.
2. Verify that the Home tab is selected.
3. Click the Data riser bar.
j. Click Next.
1) Click With SQL Server authentication using a login ID and password entered by the
user.
2) Type sa in the Login ID field.
3) Type sas in the Password field.
k. Click Next. No changes are needed.
l. Click Next. No changes are needed.
m. Click Finish.
The new data source appears on the System DSN tab of the ODBC Data Source Administrator
window.
The dbo, INFORMATION_SCHEMA, and sys files are structures created and maintained by
SQL Server. Only valid and usable data sources should be shown in DataFlux Data
Management Studio.
4) Click OK.
Only the two valid, usable data sources are displayed.
7. If necessary, save the connection information for the two dfConglomerate data connections.
a. Click the Data Connections item in the navigation pane.
The information area displays information for all data connections.
b. Right-click dfConglomerate Gifts and select Save User Credentials.
c. Right-click dfConglomerate Grocery and select Save User Credentials.
This action saves the data source connection information. If the data source connection
requires a username and password, you are prompted for that information. It is a
recommended best practice to save connection information.
a. In the Data Connections navigation area, click to expand the dfConglomerate Gifts data
connection.
2) Click to move the JOB TITLE field to the Selected Fields pane.
4) Click OK.
The data is now displayed in sorted order (by descending JOB TITLE).
d. Click OK.
The new query appears on a tab.
2) Click EMAIL.
The Select Field window displays a sample of the selected field’s values.
3) Click OK.
15. Click the Code tab on the Details pane to view the generated SQL code.
16. Select View → Show Details Pane to toggle off the Details pane.
Many types of functions are available to customize the SQL query further.
23. Select File → Close to close the query.
Exercises
• View the data on the Data tab and perform a sort by the columns BRAND and NAME.
• Review the field attributes on the Fields tab.
QUESTION: How many fields are in the BREAKFAST_ITEMS table?
Answer:
Answer:
Answer:
• Create a graph using the Graph tab with the following specifications:
Chart type: Area
X axis: NAME
Y axis: SIZE
QUESTION: What NAME value has the highest value (or sum of values) of SIZE
for the sample defined by the default “row count range” of 30?
Answer:
g) Click Open.
3) Click Browse next to the Folder field in the File Storage area.
a) Navigate to S:\Workshop\dqdmp1\Exercises.
b) Click Make New Folder.
c) Type files as the name of the new folder.
d) Press ENTER.
e) Click OK.
4) Clear Private.
5) Click OK. A message window appears and states that the repository does not exist.
6) Click Yes.
Information regarding the repository initialization appears in a window:
2.4 Solutions to Exercises
7) Click Close.
The repository is created and connected:
2. Updating the Set of Default Folders for the Basics Exercises Repository
a. If necessary, select Start → All Programs → DataFlux Data Management Studio 2.2.
b. Verify that the Home tab is selected.
c. Click the Folders riser bar.
d. Click to expand the Basics Exercises repository.
6) Click OK.
The data is displayed in sorted order (by BRAND and then by NAME).
QUESTION: What NAME value has the highest value (or sum of values) of SIZE
for the sample defined by the default “row count range” of 30?
Objectives
Define a data collection.
Chapter 3 PLAN
The screenshot shows two collections in the Basics Demos repository, with one collection selected.
You can use data collections to group data fields from different tables and/or database connections.
These collections can be used as input sources for data explorations and profiles.
3.1 Creating Data Collections
This demonstration illustrates the steps necessary to create a data collection. Two collections are built:
one with address information and one with notes.
1. If necessary, select Start → All Programs → DataFlux Data Management Studio 2.2.
DataFlux Data Management Studio appears.
2. Verify that the Home tab is selected.
3. Click the Data riser bar.
d. Click OK.
The new collection appears on a tab.
All the desired fields from the Customers table are selected.
h) Click Add.
i) Clear the selections for the Customers table.
5) Add fields from the Employees table.
Exercises
Objectives
Define and investigate data explorations.
A data exploration reads data from databases and assigns the fields in the selected tables to categories
that are predefined in the Quality Knowledge Base (QKB). Data explorations perform this categorization
by matching column names. You also have the option of sampling the data in the table to determine
whether the data belongs to one of the specific categories in the QKB.
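The column-name matching idea can be illustrated with a toy categorizer (the category names and keywords below are invented; real data explorations use match and identification definitions from the QKB):

```python
# Invented category keywords for illustration; real data explorations
# use match and identification definitions from the QKB.
CATEGORY_KEYWORDS = {
    "Name": ["name", "contact"],
    "Address": ["address", "city", "state", "zip"],
    "Phone": ["phone", "fax"],
}

def categorize(field_name: str) -> str:
    """Assign a field to the first category whose keywords appear
    in the (lowercased) column name."""
    lowered = field_name.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(kw in lowered for kw in keywords):
            return category
    return "Unknown"
```

A sample-data check, as described above, would additionally test the field's values, not just its name, against the category definitions.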
For example, your customer metadata might be grouped into one catalog and your address metadata might
be grouped in another catalog. After you organize your metadata into manageable chunks, you can
identify relationships between the metadata by table-level profiling. Creating a data exploration enables
you to analyze tables in databases to locate potential matches and plan for the profiles that you need to
run.
After you identify possible matches, you can plan the best way to handle your data and create a profile job
for any database, table, or field. Thus, you can use a data exploration of your metadata to decide on the
most efficient and profitable way to profile your physical data.
The Data Exploration map is used to quickly review the relationships between all of the databases, tables,
and fields that are included in a Data Exploration project.
The Field Match report displays a list of the database fields that match a selected field.
The Identification Analysis report lists the database fields that match categories in the definitions that you
selected for field name and sample data analysis.
The Table Match report displays a list of database tables that match a selected table or field.
This demonstration illustrates creating a data exploration, and then examining the resultant report.
1. If necessary, select Start → All Programs → DataFlux Data Management Studio 2.2.
2. Verify that the Home tab is selected.
3. Click the Folders riser bar.
4. If necessary, click to expand the Basics Demos repository.
5. Right-click the profiles_and_explorations folder and select New Data Exploration.
a. Type Ch3D2_dfConglomerate_Data_Exploration in the Name field.
b. Click OK.
The new data exploration appears on a tab.
d. Click Add.
e. Click Close to close the Add Tables window.
3) Click once in the Match Definition field. This reveals a selection tool.
4) Click in the Match Definition field and select Field Name.
3) Click once in the Identification Definition field. This reveals a selection tool.
4) Click in the Identification Definition field and select Field Name.
4) Click once in the Identification Definition field. This reveals a selection tool.
5) Click in the Identification Definition field and select Contact Info.
b. Locate the MANUFACTURERS table from the dfConglomerate Grocery data connection.
Objectives
Define and investigate data profiles.
3.3 Creating Data Profiles 3-33
Identify Tables to Profile within Data Connections
Identify Columns to Profile from Selected Table
The first four metric selections take additional processing time to calculate, so be judicious in
their selection.
Visualizations are customized charts that you create based on your data and the metrics that you apply.
Distributions can be calculated for the frequency counts of the actual values of a field. To view the
record(s) where a particular distribution occurs, double-click on the value. The list of distribution values
can also be filtered and visualized.
Distributions can be calculated for the pattern frequency counts (the pattern of the value in the field). To
view the record(s) where a particular pattern distribution occurs, double-click on the pattern distribution
value. The list of pattern distribution values can also be filtered and visualized.
For a pattern, the following rules apply:
• “A” represents an uppercase letter
• “a” represents a lowercase letter
• “9” represents a digit.
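The pattern rules above can be sketched in Python (an illustration only, not the DataFlux implementation): each character is mapped to its class, and the resulting patterns are counted across a field's values.

```python
from collections import Counter

def pattern(value):
    """Map each character to its class: A = uppercase letter,
    a = lowercase letter, 9 = digit. Other characters (spaces,
    punctuation) are kept as-is."""
    out = []
    for ch in value:
        if ch.isupper():
            out.append("A")
        elif ch.islower():
            out.append("a")
        elif ch.isdigit():
            out.append("9")
        else:
            out.append(ch)
    return "".join(out)

def pattern_frequency(values):
    """Frequency count of each distinct pattern in a field."""
    return Counter(pattern(v) for v in values)
```

For example, pattern("NC") yields "AA", so a column of two-letter state codes would show a single dominant AA pattern, with any other patterns standing out as candidates for cleanup.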
Percentiles provide a numeric layout of your data at a percentile interval that you specify. The percentile
interval is specified when you set your metrics.
The Outliers tab lists the top X minimum and maximum value outliers. The number of listed minimum
and maximum values is specified when you set your metrics.
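Conceptually, the two metrics could be computed as follows (a simplified sketch using the nearest-rank method; the product's exact algorithm is not documented here):

```python
def percentiles(values, interval):
    """Return {percentile: value} at the given interval
    (e.g., interval=25 yields quartiles), using nearest rank."""
    data = sorted(values)
    result = {}
    for p in range(interval, 101, interval):
        rank = max(1, round(p / 100 * len(data)))
        result[p] = data[rank - 1]
    return result

def outliers(values, n):
    """Top-n minimum and top-n maximum values, as listed
    on the Outliers tab."""
    data = sorted(values)
    return data[:n], data[-n:][::-1]
```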
This demonstration illustrates the basics of setting the properties for a data profile, running the data
profile, and then viewing various aspects of the data profile report.
1. If necessary, select Start → All Programs → DataFlux → Data Management Studio 2.2.
DataFlux Data Management Studio appears.
2. Click the Folders riser bar.
3. If necessary, click to expand the Basics Demos repository.
4. Right-click the profiles_and_explorations folder and select New → Profile.
a. Type Ch3D3_dfConglomerate_Profile in the Name field.
b. Click OK.
The new profile appears on a tab.
Answer: Click to expand the data connection. Then click each of the desired tables.
The first four metric selections take additional processing time to calculate,
so be judicious in their selection.
c. Click OK to close the Metrics window.
8. Select File → Save Profile to save the profile to this point.
Some metrics display (not applicable), which indicates that the metric calculation is not
applicable to that field type.
Almost 83% of the data has the pattern of three capital letters. It would be worthwhile to
investigate fixing the patterns of the nine observations that do not have the AAA pattern.
g. Select Insert → New Note.
1) Type Check the patterns of this field. in the Add Note window.
If the desired visualization is not easy to interpret as a bar chart, it can easily be changed by
choosing one of the other available chart types.
14. Investigate a different table that was also profiled in this job.
Similarly, the tabs for Pattern Frequency Distribution, Percentiles, and Outliers display
(Not calculated) because these options were also cleared in the metric override for this field.
15. Select File → Close Profile.
Exercises
QUESTION: How many distinct values exist for the UOM field in the BREAKFAST_ITEMS
table?
Answer:
QUESTION: What are the distinct values for the UOM field in the BREAKFAST_ITEMS table?
Answer:
QUESTION: What is the ID field value for the records with PK as a value for the UOM field in the
BREAKFAST_ITEMS table?
Answer:
• Filter the list of distinct values to show only those with a frequency count more than 5.
Hint: Select View → Report Filter.
• Add a note to the UOM field. Type Standardize the UOM field. After it is added, verify that the
note appears in the UOM field.
• View the pattern frequency distribution for the PHONE field in the MANUFACTURERS table.
View the records that match the pattern (999)999-9999. Add a note saying to standardize this
PHONE field.
• Create a line graph to visually explore the comparison between the minimum length and maximum
length metrics. Compare these metrics for all fields in the MANUFACTURERS table.
QUESTION: How many distinct values exist for the STATE field in the Contacts table?
Answer:
QUESTION: How many distinct patterns exist for the STATE field in the Contacts table?
Answer:
QUESTION: What does a bar chart graphic (visualization) tell you about the comparison of the data
length and maximum length metrics for all fields in the Contacts table? (Do not
include the DATE, DELETE_FLG, ID, and MATCH_CD fields in the graphic
view.)
Answer:
QUESTION: What is the ID field value for the record(s) with a “bad” PHONE field pattern in the
Contacts table? Add a field level note saying to fix this value.
Answer:
• Add a table level note saying to check for patterns for the STATE and PHONE fields in the
Contacts table.
Text Files
Filtered Data
Collections
46
A sample interval can be specified for any input type. By default, the sample interval is 1. However,
another interval can be specified by selecting Actions → Change Sample Interval.
This demonstration illustrates the steps necessary to use a text file as an input data source for a data
profile.
1. If necessary, select Start → All Programs → DataFlux → Data Management Studio 2.2.
DataFlux Data Management Studio appears.
2. Click the Folders riser bar.
3. Click the Basics Demos repository.
4. Click Profile.
a. Double-click the profiles_and_explorations folder. (This action makes this folder the value of
the Save in field.)
b. Type Ch3D4_TextFile_Profile in the Name field.
c. Click OK.
The new profile appears on a tab.
e. Click in the Text qualifier field and select " (double quotation mark).
3) Click OK.
The Fields area is populated with information about the fields in the text file.
The final settings for the Delimited File Information window should resemble the following:
The field information is returned to the profile. The fields here can be “processed” just as you did for
a table source.
8. If necessary, select all fields by clicking (check box) in front of Gift Tradeshow Information.
This demonstration illustrates the steps necessary to access and filter data as input for a data profile.
In addition, input for a data profile is an SQL query.
1. If necessary, select Start → All Programs → DataFlux → Data Management Studio 2.2.
DataFlux Data Management Studio appears.
2. Click the Folders riser bar.
3. Click the Basics Demos repository.
4. Click Profile.
a. Double-click the profiles_and_explorations folder. (This action makes this folder the value of
the Save in field.)
b. Type Ch3D5_FilterAndSQLQuery_Profile in the Name field.
c. Click OK.
The new profile appears on a tab.
5. Set up the SQL query.
a. On the Properties tab, click in front of the dfConglomerate Gifts data connection.
b. Click the Products table object.
c. Select Insert → New SQL Query.
d. Type Miscellaneous Products _SQL_Query_ in the SQL Query name field.
2) Click Test.
6. Select all fields for the SQL query results by clicking (check box) in front of Miscellaneous
Products _SQL_Query_.
a. On the Properties tab, if necessary, click in front of the dfConglomerate Gifts data
connection.
b. Click the Products table object.
c. Select Insert → New Filtered Table.
d. Type Miscellaneous Products _Filter_ in the Filter name field.
IMPORTANT:
The SQL query and the filter generate the same results.
However, the filter pulled all records, and the filtering was processed on the machine where the
profile was run. For an SQL query, the database does the filtering.
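The distinction can be sketched with Python's sqlite3 module (an illustration only; DataFlux connects through its own data connections): an SQL query lets the database engine do the filtering, while a client-side filter fetches every row and tests each one locally.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (code TEXT, category TEXT)")
conn.executemany("INSERT INTO products VALUES (?, ?)",
                 [("P1", "Miscellaneous"), ("P2", "Gifts"),
                  ("P3", "Miscellaneous")])

# SQL query: the WHERE clause runs inside the database engine.
pushed_down = conn.execute(
    "SELECT code FROM products WHERE category = 'Miscellaneous'").fetchall()

# Filter: all rows are pulled to the client, then tested one by one.
all_rows = conn.execute("SELECT code, category FROM products").fetchall()
filtered = [code for code, category in all_rows
            if category == "Miscellaneous"]

# Both approaches yield the same records; they differ only in
# where the filtering work happens.
```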
This demonstration illustrates the steps necessary to access data from a collection as input for a data
profile.
1. If necessary, select Start → All Programs → DataFlux → Data Management Studio 2.2.
DataFlux Data Management Studio appears.
2. Click the Data riser bar.
6. Click (Profile tool) from the toolbar at the top of the Data navigation pane.
a. Double-click the profiles_and_explorations folder. (This action makes this folder the value
of the Save in field.)
b. Type Ch3D6_Collection_Profile_DESCRIPTIONS in the Name field.
c. Click OK.
The new profile appears on a tab.
7. Verify that the “description” fields are selected from various data connections.
a. Verify that four tables are selected.
b. Click the Customers table and verify that the NOTES field is selected.
c. Click the Employees table and verify that the NOTES field is selected.
d. Click the Orders table and verify that the NOTES field is selected.
e. Click the MANUFACTURERS table and verify that the NOTES field is selected.
8. Select all metrics.
a. Select Tools → Default Profile Metrics.
b. Verify that all metrics are selected.
c. Click OK to close the Metrics window.
9. Select File → Save Profile to save the profile.
10. Select Actions → Run Profile Report.
a. Type Profiling a Collection of NOTES fields in the Description field.
b. Click OK to close the Run Profile window.
The profile executes. The status of the execution is displayed.
The Report tab becomes active.
11. Review the profile report.
c. Click the Customers table and verify that the NOTES field was the only analyzed field.
d. Click the Employees table and verify that the NOTES field was the only analyzed field.
e. Click the Orders table and verify that the NOTES field was the only analyzed field.
f. Click the MANUFACTURERS table and verify that the NOTES field was the only analyzed
field.
12. Continue to investigate the profile report (if desired).
13. Select File → Close Profile.
Exercises
Find more information about the data contained in the Manufacturer_Contact_List.txt file.
Some specifics follow:
• Use the Basics Exercises repository to store the profile. (Add to the profile_and_explorations
folder.) Provide a name of Ch3E5_Manufacturer_TextFile_Profile.
• File attributes:
Table name: Manufacturer_Contacts
File Type: Delimited
Filename and location: S:\Workshop\dqdmp1\data\Text Files\Manufacturer_Contact_List.txt
Text qualifier: Double quotation mark
Field delimiter: Comma
Number of rows to skip: 1
Fields: Use the Suggest feature.
(The first row contains field names.
Guess field types and lengths from file content.)
• Use the Export feature (to export the file layout) for this text file. Name the exported file
S:\Workshop\dqdmp1\data\Output Files\Manufacturers_Contact_List.dfl.
• Select all fields for profiling.
• Select all profiling metrics.
• Save and run the profile.
QUESTION: What can be said about the comparison of the Pattern Count and Unique Count
metrics for all fields? (Hint: Create a bar chart to display the comparison.)
Answer:
Redundant Data analysis enables you to quickly review whether there is redundant data for selected fields
and tables.
The Primary Key/Foreign Key analysis can be used to count the common values in two different tables
and determine whether it is possible to set a primary key/foreign key relationship and thus create
a parent and a child table. This analysis is useful in maintaining the referential integrity between
tables.
This demonstration investigates how to set up and review a redundant data analysis as well as a primary
key/foreign key analysis.
1. If necessary, select Start → All Programs → DataFlux → Data Management Studio 2.2.
DataFlux Data Management Studio appears.
2. Click the Folders riser bar.
The view contains two portions: Grid Data Table and the Data Venn Diagram.
The Grid Data Table displays the relationships between the fields that were selected for analysis.
You can click on a cell to see the relationship displayed in the Venn diagram. You can also click
the Grid data radio buttons (Left outliers, Common, Right outliers) to control which of the
relationships are displayed in the table.
The colored dots in the table cells indicate the percentage of matching rows between the two
selected fields:
• Green reflects a low level of redundancy (under 20% by default).
• Yellow reflects a medium level of redundancy (between 20% and 50% by default).
• Red reflects a high level of redundancy (more than 50% by default).
These levels can be changed in the Options dialog box on the Report tab.
c. Click in the cell where the BUSINESS PHONE from Customers intersects with the BUSINESS
PHONE from Orders.
This value for the BUSINESS PHONE was found four times in the Customers table, and once
in the Employees table.
e. Click Close.
17. Select File → Close Profile.
3.4 Designing Data Standardization Schemes 3-83
Objectives
Define and create standardization schemes.
The analysis of an individual field can be counted as a whole (phrase) or based on each one of the field’s
elements. For example, the field value DataFlux Corporation is treated as two permutations if the
analysis is set as Element, but it would be treated only as one permutation if the analysis is set as Phrase.
The above analysis is a phrase analysis of the Company Name field. Similar company names are
grouped together based on the match definition and sensitivity selected for the analysis.
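The phrase-versus-element difference can be sketched as follows (hypothetical code, not DataFlux internals): phrase analysis counts each full field value as one unit, while element analysis counts each word within the value.

```python
from collections import Counter

def phrase_counts(values):
    """Phrase analysis: each field value is a single unit."""
    return Counter(values)

def element_counts(values):
    """Element analysis: each whitespace-delimited element is a unit,
    so 'DataFlux Corporation' contributes two elements."""
    return Counter(word for v in values for word in v.split())
```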
A standardization scheme can be built based on the analysis report. When a scheme is applied, if the input
data is equal to the value in the Data column, then the data is changed to the value in the Standard
column. The standard of DataFlux was selected by the Scheme Build function because it was the
permutation with the most occurrences in the analysis report.
The following special values can be used in the Standard column:
• //Remove — The matched word or phrase is removed from the input string.
• % — The matched word or phrase is not updated. (This is used to show that a word or phrase is
explicitly marked for no change.)
The Scheme Builder window appears with the Report side populated with an alphabetical list of
the values found in the Frequency Distribution report. The values are grouped with similar names.
The group is determined by the selected match definition and sensitivity.
By default, all values are “moved” to the Scheme side of the Scheme Builder window. The multi-item
groups are given a standard value, which is the most frequently occurring value within the
grouping on the report side.
e. Change the standard value of dfConglomerate Incorporated.
1) Scroll on the Scheme side and locate the group of records with the standard value of
dfConglomerate Incorporated.
2) Right-click on one of the dfConglomerate Incorporated standard values and select Edit.
a) Type dfConglomerate Inc. in the Standard field.
b) Click OK.
Notice that the change applies to all items in the group.
2) Right-click on one of the Eta Technologies standard values and select Edit.
a) Type ETA Computers in the Standard field.
b) Click OK.
Notice that the change applies to all items in the group.
This removes the value from the field where this scheme is applied.
b) Click OK.
j. Make the scheme case insensitive.
1) Click Options.
2) Clear the Case sensitive check box.
3) Click OK.
12. Save the scheme to the default QKB.
a. Select File → Save.
b. Type Ch3D8 Company Phrase Scheme in the Name field.
c. Click Save.
13. Select File → Exit to close the Scheme Builder window.
14. If necessary, click the Report tab.
c. Locate the group of records that begin with Arrowhead Mills, Inc.
d. Click the first of these records.
e. Press and hold the SHIFT key and click the last of these records.
These five values are added to the scheme. Notice that the values are no longer highlighted in red.
23. Update the existing scheme for a set of values.
a. At the bottom of the Report side, locate the Standard field.
b. Type Breadshop Natural Foods in the Standard field.
c. Locate the group of records that begin with Breadshop Natural Foods.
d. Click the first of these records.
e. Press and hold the SHIFT key and click the last of these records.
i. Add Dr, Dr., Drive, and Drive, to the scheme with a standard of Drive.
1) Type Drive in the Standard field.
2) In the report, click the Dr value. Then hold down the CTRL key, and click the Dr., Drive, and
Drive, values.
3) Click Add To Scheme.
9. Investigate and set a standard for the Avenue values.
a. Select Report → Find in Report.
b. Type Avenue in the Find what field.
c. Click the Match whole word only check box.
d. Click Up.
e. Click OK.
Exercises
f) Click Add.
g) Clear the selections for the Employees table.
i. Add phone number fields from the dfConglomerate Grocery data connection.
QUESTION: How many phone number fields were added to the collection?
Answer: Ten (10)
3. Profiling the Tables in dfConglomerate Grocery
a. If necessary, select Start → All Programs → DataFlux → Data Management Studio 2.2.
b. Click the Folders riser bar.
c. Click the Basics Exercises repository.
3.5 Solutions to Exercises 3-103
d. Click Profile.
1) Double-click the profiles_and_explorations folder. (This action makes this folder the value
of the Save in field.)
2) Type Ch3E3_dfConglomerateGrocery_Profile in the Name field.
3) Click OK.
e. Verify that the Properties tab is selected.
f. Click the dfConglomerate Grocery check box. An X appears in the check box.
g. Define profile options.
1) Select Tools → Default Profile Metrics.
2) Verify that all metrics are selected.
3) Click OK to close the Metrics window.
h. Select File → Save Profile to save the profile to this point.
i. Override metrics for two ID columns.
Answer: Click to expand the BREAKFAST_ITEMS table. Click the UOM field.
The Column Profiling tab shows the Unique Count metric to have a value
of 6 (six).
QUESTION: What are the distinct values for the UOM field in the BREAKFAST_ITEMS
table?
5) Click Close to close the Pattern Frequency Distribution Drill Through window.
4) Type Comparing Min and Max Lengths for all fields in the Description field.
5) Select Line as the value for the Chart type field.
6) Verify that the data type is set to Field metrics.
7) Click Select/unselect all under the Fields area. (This selects all fields.)
8) Click Minimum Length and Maximum Length as the choices under the Metrics area.
9) Click OK to close the Chart Properties window.
1) Double-click the profiles_and_explorations folder. (This action makes this folder the value
of the Save in field.)
2) Type Ch3E4_DataFluxSample_Profile in the Name field.
3) Click OK.
e. Verify that the Properties tab is selected.
QUESTION: What does a bar chart graphic (visualization) tell you about the comparison of the
data length and maximum length metrics for all fields? (Do not include the fields
DATE, DELETE_FLG, ID, and MATCH_CD in the graphic view.)
Answer:
(i) Click the Contacts table.
(ii) Click the Visualizations tab.
(iii) Click to the right of the Chart field.
(iv) Type Comparing Data and Max Lengths for all fields in the Description field.
(v) Select Bar as the value for the Chart type field.
(vi) Verify that the data type is set to Field metrics.
(vii) Click Select/unselect all in the Fields area. (This selects all fields.)
(viii) Clear the selections for DELETE_FLG, ID, and MATCH_CD fields.
(ix) Click Data Length and Maximum Length as the choices in the Metrics area.
(x) Click OK to close the Chart Properties window.
The Visualizations tab now displays the following:
One field (ADDRESS) has a defined length much longer than the maximum
length. Several additional fields have defined data lengths longer than the
maximum length used.
QUESTION: What is the ID field value for the record(s) with a “bad” PHONE field pattern in the
Contacts table?
Answer:
1) Double-click the profiles_and_explorations folder. (This action makes this folder the value
of the Save in field.)
2) Type Ch3E5_Manufacturer_TextFile_Profile in the Name field.
3) Click OK.
QUESTION: What can be said about the comparison of the Pattern Count and Unique
Count metrics for all fields?
1) Click the text file Manufacturer_Contacts.
2) Click the Visualizations tab.
3) Click to the right of the Chart field.
a) Type Comparing Pattern and Unique Count metrics for all fields in the Description
field.
b) Select Bar as the value for the Chart type field.
c) Verify that the Data type is set to Field metrics.
d) Click Select/unselect all in the Fields area.
e) Click Pattern Count and Unique Count in the Metrics area.
f) Click OK.
j. Click Close.
k. From the Tables riser bar, click the MANUFACTURERS table.
l. Right-click the CONTACT_CNTRY field and select Build a Scheme.
m. Accept the defaults in the Report Generation window and click OK.
n. Double-click the value USA.
o. Click all three values in the report.
p. Click Add To Scheme.
2) Click in the Text qualifier field and select " (double quotation mark).
Objectives
Discuss and explore the basic DataFlux Data
Management Studio options that affect data jobs.
Create and execute a simple data job.
4-4 Chapter 4 ACT
This dialog box enables you to set global options for data jobs as well as other areas of DataFlux Data
Management Studio.
4.1 Introduction to Data Jobs 4-5
In this demonstration, you investigate and set various DataFlux Data Management Studio options.
1. If necessary, select Start → All Programs → DataFlux → Data Management Studio 2.2.
DataFlux Data Management Studio appears.
2. Verify that the Home tab is selected.
3. Select Tools → Data Management Studio Options.
The Data Management Studio Options window appears.
10. If necessary, click in the Default Locale field and select English (United States).
11. Click OK to save the change and close the Data Management Studio Options window.
Use the nodes in the Data Inputs section to specify different types of input to a data job. Use the nodes
in the Data Outputs section to specify different types of output from a data job. Data jobs can have more
than one input node and output node.
Run Tool
The Run tool submits the job contained in the Data Flow Editor for processing and creates the output(s)
specified in the job.
The Preview icon initiates a preview run and displays the results at the point of the selected node.
A preview of a Data Output node does not show field name changes or deletions. This enables
the flexibility to continue your data flow after a Data Output node. In addition, previewing
a Data Output node does not create the output. You must run the data job to create the output.
The Log tab displays the status of the last job run.
In this demonstration, you create a data job that reads records from the Products table (in the
dfConglomerate Gifts data source). The data job filters the data, and then writes the filtered records
to a text file.
1. If necessary, select Start → All Programs → DataFlux → Data Management Studio 2.2.
2. Verify that the Home tab is selected.
3. Click the Folders riser bar.
4. Click the Basics Demos repository.
a. Double-click the batch_jobs folder. (This action makes this folder the value of the Save in field.)
b. Type Ch4D2_Products_Misc in the Name field.
c. Click OK.
The new data job appears on a tab.
Home tab
You can access the properties window for any node by right-clicking the node
and selecting Properties.
d. Type Products Table in the Name field.
All the fields from the table are automatically selected because the Output Fields option
was set to All in the Data Management Studio Options window.
Fields can be deselected or selected for output from the node by using the right and left
arrow keys:
The output names for the fields can be changed by double-clicking the field name
and entering the new name.
The order of the output fields from the node can be controlled by highlighting field(s)
and using the up and down arrow keys.
f. Move the CATEGORY field so that it follows the PRODUCT NAME field.
1) Scroll in the Selected list box of Output fields and locate the CATEGORY field.
2) Click the CATEGORY field.
3) Click the up arrow seven (7) times so that CATEGORY appears after PRODUCT NAME.
g. Click OK to save the changes and close the Data Source Properties window.
The Data Source node appears in the job diagram and the updated information is displayed.
7. Edit the Basic Settings for the Data Source node, and then Preview the data.
a. If necessary, click View → Show Details Pane.
b. If necessary, click the Basic Settings tab.
c. Type Data Source as the value for the Description field.
You must explicitly select Preview because you turned off the Auto Preview option.
a. In the Nodes resource pane, click to collapse the Data Inputs grouping of nodes.
f. In the Audit action area, verify that Fails is selected for Row validation and
Remove row from output is selected as the action.
The final settings for the Data Validation node should resemble the following:
10. Establish an appropriate description and preview the Data Validation node.
a. If necessary, click the Data Validation node in the job diagram.
b. Verify that the Details pane is displayed.
c. If necessary, click the Basic Settings tab.
d. Type Subset for Misc as the value for the Description field.
e. Click the Data Validation node in the data flow diagram.
f. Click the Preview tool ( ).
A sample of records appears on the Preview tab of the Details panel.
g. Verify that the CATEGORY field values are all Miscellaneous.
2) Navigate to S:\Workshop\dqdmp1\Demos\files\output_files.
3) Type Ch4D2_Products_Misc.txt in the File name field.
4) Click Save.
c. Specify attributes for the file.
1) Verify that the Text qualifier field is set to “ (double quotation mark).
2) Verify that the Field delimiter field is set to Comma.
3) Click Include header row.
4) Click Display file after job runs.
d. Specify only the desired fields of PRODUCT CODE and PRODUCT NAME.
g. Click OK to save the changes and close the Text File Output Properties window.
13. Establish an appropriate description and preview the Text File Output node.
a. If necessary, click the Text File Output node in the job diagram.
b. Verify that the Details pane is displayed.
c. If necessary, click the Basic Settings tab.
d. Type Output to Text File as the value for the Description field.
e. Verify the note entered for the node appears.
f. Click the Text File Output node in the data flow diagram.
14. Verify that the added note is reflected on the Text File Output node.
a. Locate the Notes icon in the Text File Output node.
2) Click next to the Output table field. The Select Table window appears.
d) Type Products_Misc in the Enter a name for the new table field.
e) Click OK to close the New Table window.
f) Click OK to close the Select Table window.
To undo the grouping, you can right-click the grouping element and select
Ungroup Selected Groups.
The tool found on the Data Flow toolbar can be used as an alternative to right-
clicking for some actions.
c. Right-click in the background of the data flow (not on a particular node) and select Print →
Print Preview.
The Print Preview window appears.
d. Scroll to the last page and verify that the node-specific note appears.
Recall the option that was set to enable the printing of the node-specific notes.
f. Right-click in the background of the data flow (not on a particular node) and select
Clear Run Results → Clear All Run Results.
g. Right-click in the background of the data flow (not on a particular node) and select Save
Diagram As Image → Entire Diagram.
h. Navigate to S:\Workshop\dqdmp1\demos\files\output_files and click Save. (Accept the default
name of the .png file.)
i. View the newly created .png file.
1) Access a Windows Explorer window.
2) Navigate to S:\Workshop\dqdmp1\demos\files\output_files.
3) Double-click the .png file.
The node-specific notes are not “printed” when the diagram is saved as an image.
Exercises
QUESTION: How many records were selected in the filter and therefore written to the text file?
Answer:
QUESTION: How many records were read from the source table?
Answer:
Objectives
Explore applying standardization schemes versus
standardization definitions.
Explore parsing.
Explore casing.
Explore identification analysis versus right-fielding.
Standardization Scheme
A scheme is a simple find-and-replace action based on
the information in the scheme file.
Partial Scheme:
Data → Standard
Street → St
St. → St
ST. → St
Rd. → Rd
Road → Rd
RD. → Rd
4.2 Data Quality Jobs 4-35
Standardization Definition
A standardization definition is more complex than a standardization scheme.
If you standardize a data value using both a definition and a scheme, the definition is applied first
and then the scheme is applied.
Data standardization does not perform a validation of the data (for example, Address
Verification). Address verification is a separate component of the DataFlux Data Management
Studio application. (It is discussed in another chapter.)
Parsing Example
Address Information: 123 N MAIN ST APT 201
Parsed Information:
Street Number: 123
Pre-direction: N
Street Name: MAIN
Street Type: ST
Post-direction: (none)
Address Extension: APT
Address Extension Number: 201
Parsing separates multi-part field values into multiple, single-part fields. For example, if you have a
Name field that includes the value Mr. John Q. Smith III, Esq., you can use parsing to create six separate
fields:
Name Prefix: Mr.
Middle Name: Q.
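A greatly simplified sketch of the idea follows (the real parse definitions in the QKB are far more robust; the `parse_name` function and its token handling are illustrative only):

```python
def parse_name(value):
    """Naive name parser: split on the comma and on spaces, then
    assign tokens positionally. Real parse definitions handle many
    more layouts and vocabularies."""
    prefixes = {"Mr.", "Mrs.", "Ms.", "Dr."}
    suffixes = {"Jr.", "Sr.", "II", "III", "IV"}
    # Appendages such as 'Esq.' follow a comma: 'Smith III, Esq.'
    head, _, appendage = value.partition(",")
    words = head.split()
    tokens = {"Name Appendage": appendage.strip()}
    if words and words[0] in prefixes:
        tokens["Name Prefix"] = words.pop(0)
    if words and words[-1] in suffixes:
        tokens["Name Suffix"] = words.pop()
    if len(words) == 3:
        (tokens["Given Name"],
         tokens["Middle Name"],
         tokens["Family Name"]) = words
    return tokens

parsed = parse_name("Mr. John Q. Smith III, Esq.")
```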
Casing Example
Input Data
Dataflux corporation
For best results, if available, select an applicable definition when you use Proper casing.
Bob Brown & Sons Bob Brown & Sons
Identification analysis and right fielding use the same listing of definitions from the QKB, but in different
ways. Identification analysis identifies the type of data in a field and right fielding moves the data into
fields based on its identification.
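The contrast might be sketched like this (the `identify` logic is a hypothetical stand-in; real identification definitions come from the QKB): identification analysis labels each value, while right-fielding uses that same label to route the value into the appropriate output field.

```python
import re

def identify(value):
    """Toy identification: decide whether a value looks like an
    organization or an individual name."""
    if re.search(r"\b(Inc|Corp|Sons|LLC|Co)\b", value):
        return "ORGANIZATION"
    return "INDIVIDUAL"

def right_field(values):
    """Right-fielding: move each value into a field named for
    its identified type."""
    routed = {"ORGANIZATION": [], "INDIVIDUAL": []}
    for v in values:
        routed[identify(v)].append(v)
    return routed
```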
In this demonstration, you create a data job that standardizes selected fields from the Customers table.
You parse and case the e-mail information and write the results to a text file.
1. If necessary, select Start → All Programs → DataFlux → Data Management Studio 2.2.
2. Verify that the Home tab is selected.
3. Click the Folders riser bar.
4. Click the Basics Demos repository.
a. Double-click the batch_jobs folder. (This action makes this folder the value of the Save in field.)
b. Type Ch4D3_Customers_DataQuality in the Name field.
c. Click OK.
The new data job appears on a tab.
The final settings for the Data Source Properties should resemble the following:
f. Click OK to save changes and close the Data Source Properties window.
7. Establish an appropriate description and preview the Data Source node.
a. If necessary, click the Data Source node in the job diagram.
b. Verify that the Details pane is displayed.
c. If necessary, click the Basic Settings tab.
d. Type Data Source as the value for the Description field.
e. Click the Data Source node in the data flow diagram.
f. Click the Preview tool ( ).
A sample of records appears on the Preview tab of the Details panel.
The standardization for a field can use a standardization definition and/or a scheme. If
both are specified, the definition is applied first. Then the scheme is applied to the results
from the definition.
Selecting the Preserve null values option ensures that a field that is null when it enters the
node remains null after the node. Selecting this option is recommended if the output will be
written to a database table.
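The "definition first, then scheme" ordering can be sketched as follows, with a made-up definition and scheme standing in for real QKB objects:

```python
# Sketch of standardization ordering (hypothetical rules, not a QKB definition).
def standardize_definition(value):
    # A definition applies general rules, e.g. punctuation removal and casing.
    return " ".join(value.replace(".", "").split()).title()

# A scheme is an exact lookup table applied to the definition's output.
scheme = {"Nc": "NC", "N Carolina": "NC"}

def standardize(value):
    result = standardize_definition(value)   # 1) definition first
    return scheme.get(result, result)        # 2) then scheme on the result

print(standardize("n.c."))   # definition yields "Nc"; the scheme maps it to "NC"
```

If the scheme were applied first, "n.c." would miss the lookup entirely, which is why the order matters.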
n. Click OK to close the Standardization Properties window.
4.2 Data Quality Jobs 4-43
g. Scroll to the right to view the new cased e-mail token fields.
2) Navigate to S:\Workshop\dqdmp1\Demos\files\output_files.
3) Type Ch4D3_CustomerInfo.txt in the File name field.
4) Click Save.
c. Specify attributes for the file.
1) Verify that the Text qualifier field is set to “ (double quotation mark).
2) Verify that the Field delimiter field is set to Comma.
3) Click the Include header row check box.
4) Click the Display file after job runs check box.
18. Establish an appropriate description and preview the Text File Output node.
a. If necessary, click the Text File Output node in the job diagram.
b. Verify that the Details pane is displayed.
c. If necessary, click the Basic Settings tab.
d. Type Output to Text File as the value for the Description field.
19. Select File → Save.
20. Run the job.
a. Verify that the Data Flow tab is selected.
b. Select Actions → Run Data Job.
c. Verify that the text file appears.
In this demonstration, you examine the difference between identification analysis and right fielding.
1. If necessary, select Start → All Programs → DataFlux Data Management Studio 2.2.
DataFlux Data Management Studio appears.
2. Verify that the Home tab is selected.
3. Click the Folders riser bar.
4. Click the Basics Demos repository.
5. Click Data Job.
a. Double-click the batch_jobs folder. (This action makes this folder the value of the Save in field.)
b. Type Ch4D4_RightFielding_IDAnalysis in the Name field.
c. Click OK.
The new data job appears on a tab.
6. Add the Text File Input node to the Data Flow Editor.
a. Verify that the Nodes riser bar is selected in the resource pane.
b. If necessary, click to collapse the Data Outputs grouping of nodes.
h. Click OK to save the changes and close the Text File Input Properties window.
The Text File Input node appears in the job diagram with updated information.
24. Establish an appropriate description and preview the Text File Input node.
a. If necessary, click the Text File Input node in the job diagram.
b. Verify that the Details pane is displayed.
c. If necessary, click the Basic Settings tab.
d. Type Text File Input as the value for the Description field.
e. Click the Text File Input node in the data flow diagram.
f. Click the Preview tool.
A sample of records appears on the Preview tab of the Details panel.
The Contact field has a mixture of individual names as well as corporate names. Right fielding
and identification analysis both use identification definitions to help correctly identify the
information found in a field.
c. Under the Input fields area, double-click the Contact field to move it from the Available list box
to the Selected list box.
d. Under the Output fields area, add three new fields.
1) Click Add.
The final settings for the Right Fielding node should resemble the following:
The Contact field is moved to the end of the list so that it can be more easily compared to
the results of the Right Fielding node.
3) Click OK to close the Additional Outputs window.
g. Click OK to close the Right Fielding Properties window.
10. Establish an appropriate description and preview the Right Fielding node.
a. If necessary, click the Right Fielding node in the job diagram.
b. Verify that the Details pane is displayed.
c. If necessary, click the Basic Settings tab.
d. Type Right Fielding as the value for the Description field.
e. Click the Right Fielding node in the data flow diagram.
f. Click the Preview tool.
A sample of records appears on the Preview tab of the Details panel.
g. Scroll to the right to view the new column information.
The Right Fielding node correctly identified data values from the Contact field as either
Company (Organization) values, Person (Individual) values, or Unknown values.
11. Add an Identification Analysis node to the job flow.
a. Verify that the Quality grouping of nodes is expanded.
b. Double-click the Identification Analysis node.
The node appears in the data flow. The Identification Analysis Properties window appears.
d. Specify the correct identification definition for the selected Contact field.
2) Select Individual/Organization.
3) Verify that Contact_Identity is the value for the Output Name field.
e. Click Preserve null values.
The final settings for the Identification Analysis node should resemble the following:
13. Establish an appropriate description and preview the Identification Analysis node.
a. If necessary, click the Identification Analysis node in the job diagram.
b. Verify that the Details pane is displayed.
c. If necessary, click the Basic Settings tab.
d. Type Identification Analysis as the value for the Description field.
e. Click the Identification Analysis node in the data flow diagram.
f. Click the Preview tool.
A sample of records appears on the Preview tab of the Details panel.
g. Scroll to the right to view the new column information.
The Identification Analysis node also correctly identified data values from the Contact field as
either Organization or Individual values. Right fielding moves the data to an appropriate field, but
identification analysis creates a field with values to indicate whether each observation is an
INDIVIDUAL or ORGANIZATION value.
14. Add a Text File Output node to the job flow.
a. In the Nodes resource pane, click to collapse the Quality grouping of nodes.
2) Navigate to S:\Workshop\dqdmp1\Demos\files\output_files.
3) Type Ch4D4_Prospects_RF_ID.txt in the File name field.
4) Click Save.
c. Specify attributes for the file.
1) Verify that the Text qualifier field is set to “ (double quotation mark).
2) Verify that the Field delimiter field is set to Comma.
3) Click the Include header row check box.
4) Click the Display file after job runs check box.
Exercises
• Add the MANUFACTURERS table in the dfConglomerate Grocery data connection as the data
source. Select the following fields.
ID
MANUFACTURER
CONTACT
CONTACT_ADDRESS
CONTACT_CITY
CONTACT_STATE_PROV
CONTACT_POSTAL_CD
CONTACT_CNTRY
CONTACT_PHONE
POSTDATE
• Standardize the following fields. Accept the default names for the standardized fields.
Field Name Definition Scheme
CONTACT Name
CONTACT_ADDRESS Address
• Parse the standardized CONTACT field using the Name parse definition. Be sure to preserve null
values. Select only three tokens and rename the output fields according to the following:
Token Name Output Name
• Output the standardized, parsed data to a new table named MANUFACTURERS_STND (in the
dfConglomerate Grocery data connection). If the data job runs multiple times, ensure that the
records for each run are the only records in the table. In addition, only output the following fields
with output names:
Field Name Output Name
ID ID
MANUFACTURER_Stnd MANUFACTURER
FIRST_NAME FIRST_NAME
MIDDLE_NAME MIDDLE_NAME
LAST_NAME LAST_NAME
CONTACT_ADDRESS_Stnd CONTACT_ADDRESS
CONTACT_CITY CONTACT_CITY
CONTACT_STATE_PROV_Stnd CONTACT_STATE_PROV
CONTACT_POSTAL_CD CONTACT_POSTAL_CD
CONTACT_CNTRY_Stnd CONTACT_CNTRY
CONTACT_PHONE CONTACT_PHONE
POSTDATE POSTDATE
• Verify that the records were written to the MANUFACTURERS_STND table in the
dfConglomerate Grocery data connection.
a. Double-click the batch_jobs folder. (This action makes this folder the value of the Save in field.)
b. Type Ch4D5_StandardizationDefinitionAndScheme in the Name field.
c. Click OK.
The new data job appears on a tab.
6. Add the Text File Input node to the Data Flow Editor.
a. Verify that the Nodes riser bar is selected in the Resource pane.
b. If necessary, click to collapse the Data Outputs grouping of nodes.
c. Click in front of the Data Inputs grouping of nodes.
d. Double-click the Text File Input node.
An instance of the node appears in the data flow and the properties window for the node appears.
7. Specify the properties of the Text File Input node.
a. Type Prospective Customers in the Name field.
b. Click next to the Input file field.
2) Navigate to S:\Workshop\dqdmp1\Demos\files\output_files.
3) Type Ch4D5_Customer_StdState.txt in the File name field.
4) Click Save.
c. Specify attributes for the file.
1) Verify that the Text qualifier field is set to “ (double quotation mark).
2) Verify that the Field delimiter field is set to Comma.
3) Click Include header row.
4) Click Display file after job runs.
d. Accept all fields as selected fields for the output file.
e. Export the field layout.
1) Click Export.
2) If necessary, navigate to S:\Workshop\dqdmp1\Demos\files\output_files.
3) Type Ch4D5_Customer_StdState.dfl in the File name field.
4) Click Save.
f. Click OK to save the changes and close the Text File Output Properties window.
14. Edit the Basic Settings for the Text File Output node.
a. If necessary, select View → Show Details Pane.
b. If necessary, click the Basic Settings tab.
c. Type Text File Output as the value for the Description field.
The data job should resemble the following:
3. Click Profile.
a. Double-click the profiles_and_explorations folder. (This action makes this folder the value of
the Save in field.)
b. Type Ch4D5_TextFile_Profile in the Name field.
c. Click OK.
The new profile appears on a tab.
1) Navigate to S:\Workshop\dqdmp1\Demos\files\output_files.
2) Click the Ch4D5_Customer_StdState.txt file.
3) Click Open.
The path and filename are returned to the Input file name field.
b. Click Import below the Fields area.
1) If necessary, navigate to S:\Workshop\dqdmp1\Demos\files\output_files.
2) Click the Ch4D5_Customer_StdState.dfl file.
3) Click Open.
c. Click Number of rows to skip and verify that the default value is set to 1 (one).
d. Click OK to close the Delimited File Information window.
The fields area is populated with information about the fields in the text file.
7. Click the State check box.
8. Click the State_Stnd check box.
9. Select all metrics.
a. Select Tools → Default Profile Metrics.
b. Verify that all metrics are selected.
c. Click OK to close the Metrics window.
10. Select File → Save Profile to save the profile.
11. Select Actions → Run Profile Report.
a. Type Profiling State fields in the Description field.
b. Click OK to close the Run Profile window.
The profile executes. The status of the execution is displayed.
The Report tab becomes active.
12. Review the Profile report.
Objectives
Explore address verification and geocoding.
Address Verification

Original Address:
940 Cary Parkway
27513

Verified Address Information:
Street Address: 940 NW CARY PKWY
City: CARY
State: NC
Zip: 27513
Zip+4: 2792
County Name: WAKE
Congressional District: 4
Address verification identifies, corrects, and enhances address information. Data files can be licensed to
verify address information for U.S., Canada, and international addresses.
Geocoding

Original Address:
940 NW CARY PKWY
CARY, NC 27513-2792

Verified Geocode Information:
Latitude: 35.811753
Longitude: -78.802326
The latitude and longitude information that geocoding returns can be used to map locations and plan
efficient delivery routes. Geocoding can be licensed to return this information for the centroid of the
postal code or at the roof-top level.
Currently, there are geocoding data files only for the United States and Canada. Also,
roof-top level geocoding is currently available only for the United States.
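As one illustration of what the returned coordinates enable, the great-circle distance between two geocoded points can be computed with the standard haversine formula. The second coordinate pair below (downtown Raleigh, NC) is approximate and used only for demonstration:

```python
import math

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two latitude/longitude points, in miles."""
    r = 3958.8  # mean Earth radius in miles
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

# Distance from the geocoded Cary, NC address above to downtown Raleigh, NC.
print(round(haversine_miles(35.811753, -78.802326, 35.7796, -78.6382), 1))
```

Route-planning tools apply this kind of calculation pairwise across a set of geocoded delivery addresses.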
4.3 Data Enrichment Jobs (Self-Study) 4-77
In this demonstration, you use an Address Verification node to verify the addresses present in the
Customers table from dfConglomerate Gifts. The resultant verified ZIP field is used in a Geocoding
node to produce additional desired geocode columns. This final result set is written to a text file.
1. If necessary, select Start → All Programs → DataFlux Data Management Studio 2.2.
2. Verify that the Home tab is selected.
3. Click the Folders riser bar.
4. Click the Basics Demos repository.
a. Double-click the batch_jobs folder. (This action makes this folder the value of the Save in field.)
b. Type Ch4D6_AddressVerification_Example in the Name field.
c. Click OK.
The new data job appears on a tab.
The node appears in the data flow. The Address Verification (US/Canada) Properties window
appears.
9. Specify properties for the Address Verification (US/Canada) node.
a. Type Address Verification in the Name field.
b. Verify that United States is selected in the Address area.
1) Click under Field Type for the ADDRESS field and click Address Line 1.
2) Click under Field Type for the CITY field and click City.
3) Click under Field Type for the STATE/PROVINCE field and click State.
4) Click under Field Type for the ZIP/POSTAL CODE field and click Zip.
2) Click the EMAIL field in the Output fields area, and then click .
3) Click the JOB TITLE field in the Output fields area, and then click .
4) Click the BUSINESS PHONE field in the Output fields area, and then click .
5) Click the HOME PHONE field in the Output fields area, and then click .
6) Click the MOBILE PHONE field in the Output fields area, and then click .
7) Click the FAX NUMBER field in the Output fields area, and then click .
8) Click the COUNTRY/REGION field in the Output fields area, and then click .
9) Click the NOTES field in the Output fields area, and then click .
The final set of additional output fields should resemble the following:
Each result code has a text code and a numeric code:
CITY (12): Could not locate city, state, or ZIP code in the USPS database. At least
city and state or ZIP code must be present in the input.
MULTI (13): Ambiguous address. There are two or more possible matches for this
address with different data.
OVER (15): One or more input strings is too long (maximum 100 characters).
c. In the Output fields area, click to move all available fields to the Selected list box.
d. Click OK to save the settings and close the Geocoding Properties window.
Geocode_Result_Code The result code indicates whether the record was successfully
geocoded. Other possible codes are as follows:
• DP - The match is based on the delivery point.
• PLUS4 - The match failed on the delivery point, so the match is
based on ZIP+4.
• ZIP - The ZIP+4 match failed, so the match is based on the ZIP
code.
• NOMATCH - The first three checks failed, so there is no match in
the geocoding database.
Geocode_Latitude This is the numerical horizontal map reference for address data.
Geocode_Census_Tract This is a U.S. Census Bureau reference number assigned using the
centroid latitude and longitude. This number contains references to the
State and County codes.
2) Navigate to S:\Workshop\dqdmp1\Demos\files\output_files.
3) Type Ch4D6_Customer_Verify_Geocode.txt in the File name field.
4) Click Save.
c. Specify attributes for the file.
1) Verify that the Text qualifier field is set to “ (double quotation mark).
2) Verify that the Field delimiter field is set to Comma.
3) Click Include header row.
4) Click Display file after job runs.
d. Accept all fields as selected fields for the output file.
e. Export the field layout.
1) Click Export.
2) If necessary, navigate to S:\Workshop\dqdmp1\Demos\files\output_files.
3) Type Ch4D6_Customer_Verify_Geocode.dfl in the File name field.
4) Click Save.
f. Click OK to save the changes and close the Text File Output Properties window.
16. Establish an appropriate description for the Text File Output node.
a. If necessary, click the Text File Output node in the job diagram.
b. Verify that the Details pane is displayed.
c. If necessary, click the Basic Settings tab.
d. Type Output Text File as the value for the Description field.
Exercises
• Use the MANUFACTURERS table from dfConglomerate Grocery as the data source.
• Add an Address Verification (US/Canada) node to the job flow. Use the following specifications:
− Map the following input fields:
Field Name Field Type
MANUFACTURER Firm
CITY City
STATE_PROV State
POSTAL_CD Zip
State State_Verified
• Retain only the following list of original fields for additional outputs:
ID STATE/PROV
MANUFACTURER POSTAL_CD
STREET_ADDR COUNTRY
CITY PHONE
• Add a Text File Output node to the job flow. Use the following specifications:
Output File:
S:\Workshop\dqdmp1\Exercises\files\OutputFiles\Ch4E3_Manufacturers_Verify.txt.
Text qualifier: “ (double quotation mark)
Field delimiter: Comma
Include header row
Display the file after job runs.
Specify the following fields for the text file with the specified output name:
ID ID
Firm_Verified Manufacturer
Address_Verified Address
City_Verified City
State_Verified State
ZIP_Verified ZIP
US_County_Name US_County_Name
US_Result_Code US_Result_Code
• Save and run the job. Verify that the text file is created properly.
Objectives
Define match codes.
Describe cluster conditions.
4.4 Entity Resolution Jobs 4-91
Match Codes
Name Match Code @ 85 Sensitivity
John Q Smith 4B~2$$$$$$$C@P$$$$$$
Johnny Smith 4B~2$$$$$$$C@P$$$$$$
Jonathon Smythe 4B~2$$$$$$$C@P$$$$$$
DataFlux combines the strengths of probabilistic and deterministic matching: it provides dichotomous
(true/false) results while also taking advantage of the flexible scoring of the probabilistic matching
technique. With a single pass through the data, the DataFlux matching engine produces an unambiguous
match code that represents the identifying variables that were specified by the user. This match code is
the key to the DataFlux technology.
• Prior to match code generation, the matching engine parses the data into its components and implicitly
removes all ambiguities in the system. Using industry-leading data quality algorithms, it removes non-
essential matching information and standardizes the data.
• The matching engine enables near matching by enabling you to specify a match threshold (sensitivity)
for each identifying variable. Match codes are generated in direct correlation to the desired sensitivity.
If the sensitivity is higher, the matching requirements are more stringent and the match code is more
specific. This enables you to specify high matching requirements for some variables and lower
requirements for others.
• The matching engine enables the specification of configurable business rules that impact the match.
The rules are implicitly applied before generating the match code. This enables you to make specific
corrections in the data before performing any linkage. Changing the matching logic requires the use
of the Data Management Studio – Customize module.
• Regardless of the data source, the matching engine creates the same match code as long as the same
match definition and sensitivity setting are selected when the match code is generated. This enables
matching across multiple data sources without constantly reprocessing and re-indexing the data, which
is something that probabilistic matching systems cannot do.
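The actual match-code algorithm is proprietary, but the interplay of standardization and sensitivity can be sketched with a toy function. The nickname table, the initial-dropping rule, and the truncation formula below are invented purely for illustration:

```python
# Toy match-code sketch (NOT the DataFlux algorithm): standardize the value,
# then keep more or less of it depending on the sensitivity setting.
NICKNAMES = {"johnny": "john", "jon": "john"}

def toy_match_code(name, sensitivity=85):
    # Standardize: lowercase, strip periods, drop single-letter initials.
    tokens = [t for t in name.lower().replace(".", "").split() if len(t) > 1]
    tokens = sorted(NICKNAMES.get(t, t) for t in tokens)
    # Higher sensitivity keeps more characters per token -> stricter matching.
    keep = max(1, sensitivity // 25)
    return "".join(t[:keep] for t in tokens)

# At 85 sensitivity, these near-identical names collapse to one code.
codes = {toy_match_code(n) for n in ("John Q Smith", "Johnny Smith", "Jon Smith")}
print(codes)
```

Lowering the sensitivity shortens the code, so more values collide into the same code, which mirrors how a lower sensitivity loosens the matching requirements.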
The Clustering node provides the ability to match records based on multiple conditions. Create the
conditions that support your business needs.
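The effect of multiple conditions can be sketched with a small union-find pass over hypothetical match-code fields: two records link if they agree on every field of any one condition, and linkage is transitive across the cluster:

```python
from itertools import combinations

# Hypothetical records carrying match-code fields.
records = [
    {"id": 1, "name_mc": "A", "addr_mc": "X", "phone_mc": "P1"},
    {"id": 2, "name_mc": "A", "addr_mc": "X", "phone_mc": "P2"},  # links to 1 via condition 1
    {"id": 3, "name_mc": "A", "addr_mc": "Y", "phone_mc": "P2"},  # links to 2 via condition 2
    {"id": 4, "name_mc": "B", "addr_mc": "Z", "phone_mc": "P9"},
]
# Condition 1 OR condition 2 (each condition is an AND of its fields).
conditions = [("name_mc", "addr_mc"), ("name_mc", "phone_mc")]

parent = {r["id"]: r["id"] for r in records}
def find(i):
    while parent[i] != i:
        i = parent[i]
    return i

for a, b in combinations(records, 2):
    if any(all(a[f] == b[f] for f in cond) for cond in conditions):
        parent[find(a["id"])] = find(b["id"])   # union the two clusters

clusters = {r["id"]: find(r["id"]) for r in records}
print(clusters)   # records 1-3 share one cluster ID; record 4 stands alone
```

Note that records 1 and 3 satisfy neither condition directly, yet they land in the same cluster through record 2; the Clustering node chains matches the same way.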
In this demonstration, you will generate clusters of data based on three conditions. The three conditions
will take advantage of match codes generated on various fields, as well as standardized field information.
A match report will display the results of the clustering.
1. If necessary, select Start → All Programs → DataFlux Data Management Studio 2.2.
2. Verify that the Home tab is selected.
3. Click the Folders riser bar.
4. Click the Basics Demos repository.
a. Double-click the batch_jobs folder. (This action makes this folder the value of the Save in field.)
b. Type Ch4D7_ClusterRecords_Example in the Name field.
c. Click OK.
The new data job appears on a tab.
f. Click OK to save the changes and close the Data Source Properties window.
7. Establish an appropriate description for the Data Source node.
a. If necessary, click the Data Source node in the job diagram.
b. Verify that the Details pane is displayed.
c. If necessary, click the Basic Settings tab.
d. Type Data Source as the value for the Description field.
8. Add a Match Codes (Parsed) node to the job flow.
a. In the Nodes resource pane, click to collapse the Data Inputs grouping of nodes.
The (Parsed) nodes are used when the input data is parsed. Because the Name Match
definition is designed for a full name, if you only generate a match code on
FIRST_NAME, in most cases the definition would assume this was a last name. Thus,
the matching of nicknames would not happen. For example, Jon would not match John.
The Allow generation of multiple match codes per definition option requires the creation
of a special match definition in the Data Management Studio – Customize module.
The Generate null match codes for blank field values option generates a NULL match
code if the field is blank. If this option is not selected, then a match code of all $ symbols
is generated for the field. When you match records, a field with NULL does not equal
another field with NULL. A field with all $ symbols equals another field with all $
symbols.
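The practical difference between the two blank-field behaviors can be sketched as follows. The six-character code length and the helper names are assumptions for illustration; only the NULL-vs-$ comparison semantics come from the note above:

```python
# Sketch of the blank-field options (hypothetical 6-character codes).
def match_code_for_blank(value, generate_null=True):
    if value.strip() == "":
        return None if generate_null else "$" * 6
    return value  # real match-code generation elided

# SQL-style semantics: NULL never equals NULL, but "$$$$$$" equals "$$$$$$".
def codes_match(a, b):
    return a is not None and b is not None and a == b

print(codes_match(match_code_for_blank(""), match_code_for_blank("")))  # null codes never match
print(codes_match(match_code_for_blank("", generate_null=False),
                  match_code_for_blank("", generate_null=False)))       # $-codes do match
```

So selecting Generate null match codes keeps records with blank fields from clustering together on those fields, while leaving it cleared lets blanks match blanks.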
10. Establish an appropriate description and preview the Match Codes (Parsed) node.
a. If necessary, click the Match Codes (Parsed) node in the job diagram.
b. Verify that the Details pane is displayed.
c. If necessary, click the Basic Settings tab.
d. Type Parsed Input as the value for the Description field.
e. Click the Match Code (Parsed) node in the data flow diagram.
f. Click the Preview tool.
A sample of records appears on the Preview tab of the Details panel.
g. Scroll to the right to view the Name_MatchCode field.
The final settings for the Match Codes Properties window should resemble the following:
13. Establish an appropriate description and preview the Match Codes node.
a. If necessary, click the Match Codes node in the job diagram.
b. Verify that the Details pane is displayed.
c. If necessary, click the Basic Settings tab.
d. Type Match Codes as the value for the Description field.
e. Click the Match Codes node in the data flow diagram.
1) Navigate to S:\Workshop\dqdmp1\Demos\files\output_files.
2) Type Ch4D7_Customers_MatchReport in the File name field.
3) Click Save.
c. Type Customers Match Report – Three Conditions in the Report Title field.
d. Click Launch Viewer after job is completed.
At this point, the Match Report Properties window should resemble the following:
2) Double-click Cluster_ID to move this field from the Available list box to the Selected
list box.
3) Double-click ID to move this field from the Available list box to the Selected list box.
4) Double-click COMPANY to move this field from the Available list box to the Selected
list box.
5) Double-click LAST NAME to move this field from the Available list box to the Selected
list box.
6) Double-click FIRST NAME to move this field from the Available list box to the Selected
list box.
7) Double-click ADDRESS to move this field from the Available list box to the Selected
list box.
8) Double-click CITY to move this field from the Available list box to the Selected list box.
9) Double-click STATE/PROVINCE to move this field from the Available list box to the
Selected list box.
10) Double-click ZIP/POSTAL CODE to move this field from the Available list box to the
Selected list box.
11) Double-click BUSINESS PHONE_Stnd to move this field from the Available list box to the
Selected list box.
Because you requested all clusters, you get many “clusters” with one record. Use the paging tools to
scroll through the pages of clusters.
Exercises
• Add the MANUFACTURERS table in the dfConglomerate Grocery data connection as the data
source. Select the following fields:
ID
MANUFACTURER
CONTACT
CONTACT_ADDRESS
CONTACT_CITY
CONTACT_STATE_PROV
CONTACT_POSTAL_CD
CONTACT_CNTRY
CONTACT_PHONE
POSTDATE
• Use the specified definitions and sensitivities. Accept the default names for the match code fields.
Be sure to generate null match codes for blank field values and to preserve null values. Generate
match codes for the following fields:
Field Name Definition Sensitivity
MANUFACTURER Organization 75
CONTACT Name 75
CONTACT_ADDRESS Address 85
CONTACT_STATE_PROV State/Province 85
CONTACT_PHONE Phone 95
• Cluster similar records using the following two conditions. Create a cluster field named
Cluster_ID. The output must contain all clusters and be sorted by cluster number.
MANUFACTURER_MatchCode
CONTACT_MatchCode
CONTACT_ADDRESS_MatchCode
CONTACT_POSTAL_CD_MatchCode
OR
MANUFACTURER_MatchCode
CONTACT_MatchCode
CONTACT_STATE_PROV_MatchCode
CONTACT_PHONE_MatchCode
• Create a match report to display the cluster groupings. The match report should be opened
automatically after the job is run. Name the match report
Ch4E4_Manufacturers_MatchReport.mre and store it in the directory
S:\Workshop\dqdmp1\Exercises\files\output_files. Display the following fields for the report:
ID
MANUFACTURER
CONTACT
CONTACT_ADDRESS
CONTACT_CITY
CONTACT_STATE_PROV
CONTACT_POSTAL_CD
CONTACT_CNTRY
CONTACT_PHONE
• Save and run the data job. Review the resulting match report and log.
Record-level rules select which record from a grouping should survive. If there is ambiguity about which
record is the survivor, the tool selects the first remaining record in the grouping.
Field rules are used to “steal” information from other records in the grouping.
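The two rule types can be sketched together on a hypothetical cluster: a record-level rule picks the survivor, and field rules then pull better values from the other records. The sample records and the "longest value" choice are invented for illustration:

```python
# Sketch of survivorship on a hypothetical cluster.
cluster = [
    {"ID": 61, "EMAIL": "jq@example.com", "JOB TITLE": None},
    {"ID": 62, "EMAIL": None,             "JOB TITLE": "Purchasing Manager"},
    {"ID": 63, "EMAIL": None,             "JOB TITLE": "Buyer"},
]

# Record-level rule: the record with the maximum ID survives.
survivor = dict(max(cluster, key=lambda r: r["ID"]))

# Field rules: "steal" a non-null value (here, the longest one) from anywhere
# in the cluster, overwriting the survivor's own value for that field.
for field in ("EMAIL", "JOB TITLE"):
    candidates = [r[field] for r in cluster if r[field] is not None]
    if candidates:
        survivor[field] = max(candidates, key=len)

print(survivor)
```

The survivor keeps its identity from the record-level rule (ID 63) but ends up carrying the e-mail address and the longer job title found on other records in the grouping.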
Entity resolution is the process of merging duplicate records in a single file or multiple files so that
records referring to the same physical object are treated as a single record. Records are matched based on
the information that they have in common. The records that are merged might appear to be different but
can actually refer to the same person or item. For example, a record for John Q. Smith at 220 Academy
Street might be the same person as J. Q. Smith at the same address.
The Entity Resolution File enables you to manually review the merged records and make adjustments as
necessary. This can involve the following tasks:
• Examining clusters
• Reviewing the Cluster Analysis section
• Reviewing related clusters
• Processing cluster records
In this demonstration, you investigate various settings for the Surviving Record Identification node.
In addition, you explore the Entity Resolution File Output node.
1. If necessary, select Start → All Programs → DataFlux Data Management Studio 2.2.
2. Verify that the Home tab is selected.
3. Click the Folders riser bar.
4. Copy an existing job to use as a starting point.
7. Right-click the match report node (labeled Customers Match Report) and select Delete.
8. Add a Surviving Record Identification node to the job flow.
a. In the Nodes resource pane, click in front of the Entity Resolution grouping of nodes.
b. Double-click the Surviving Record Identification node.
The node appears in the data flow. The Surviving Record Identification Properties window
appears.
9. Specify properties for the Surviving Record Identification node.
a. Type Select Best Record in the Name field.
b. Click next to the Cluster ID field and select Cluster_ID.
11. Add a (temporary) Text File Output node to demonstrate surviving record options.
a. In the Nodes resource pane, click to collapse the Entity Resolution grouping of nodes.
2) Navigate to S:\Workshop\dqdmp1\Demos\files\output_files.
3) Type Test.txt in the File name field.
4) Click Save.
c. Specify attributes for the file.
1) Verify that the Text qualifier field is set to “ (double quotation mark).
2) Verify that the Field delimiter field is set to Comma.
3) Click Include header row.
d. Move the Cluster_ID field so that it follows the ID field.
1) Scroll in the Selected list box and locate the Cluster_ID field.
2) Click until Cluster_ID follows ID field.
e. Click OK to save the changes and close the Text File Output Properties window.
13. Establish an appropriate description for the Text File Output node.
a. If necessary, click the Text File Output node in the job diagram.
b. Verify that the Details pane is displayed.
c. If necessary, click the Basic Settings tab.
d. Type Text File Output as the value for the Description field.
14. Save the job.
a. Click the Data Flow tab.
b. Select File → Save.
15. Run the job.
a. Verify that the Data Flow tab is selected.
Notice that the Surviving Record Identification node read 63 rows. For each multi-row cluster, only
one record was selected (the record with the maximum ID value). Therefore, the number of records
written to the text file is 57 rows. Selecting only one record from each cluster is the default action.
16. Edit the properties of the Surviving Record Identification node.
a. Right-click on the Select Best Record node and select Properties.
b. Click Options to the right of the Cluster ID field.
1) Click Keep duplicate records.
2) Type SR_Flag in the Surviving record ID field field.
f. Click to move the SR_Flag up in the list box to follow the Cluster_ID field.
Notice that the Text File Output node wrote all 63 records. The text file shows that the surviving
record in a cluster has true in the SR_Flag field and the duplicate (non-surviving) records have false
in the SR_Flag field.
20. Edit the properties of the Surviving Record Identification node.
a. Right-click the Select Best Record node and select Properties.
b. Click Options to the right of the Cluster ID field.
1) Click Generate distinct surviving record.
2) Click next to Primary key field and select ID.
There is now an extra record for each cluster. The text file shows that the surviving record in a
cluster has true in the SR_Flag field and the duplicate records (non-surviving) have false in the
SR_Flag field. If a field (such as ID in this example) is not chosen as the Primary key field, then
the ID retains its value across all records in a cluster.
With this Options specification, the text file shows that the surviving record in a cluster has a null
value in the SR_Flag field. The duplicate (non-surviving) records have the primary key value of the
surviving record in the SR_Flag field.
26. Update the properties for the Surviving Record Identification node.
a. Right-click the Surviving Record Identification node and select Properties.
b. Click Field Rules in the Output fields area on the lower right.
c. Add the first of two field rules.
1) Click Add in the Field Rules window.
2) Click Add in the Rule expressions area of the Add Field rule window.
3) Click next to Field and select EMAIL.
11) In the Affected fields area, double-click the NOTES field to move it from the Available list
box to the Selected list box.
c. Click to move the EMAIL up in the list box to follow the SR_Flag field.
e. Click to move the JOB TITLE up in the list box to follow the EMAIL field.
With the field rules in place, important information that is potentially spread across multiple records
of a cluster can be retrieved for the surviving record. In this case, the surviving record with ID=63
has an e-mail address as well as a job title.
30. Right-click the Text File Output node (labeled Test) and select Delete.
31. Select Edit → Clear Run Results → Clear All Run Results.
32. Add an Entity Resolution File Output node to the job flow.
1) Double-click EMAIL to move it from the Available fields list box to the Selected fields list
box.
2) Double-click JOB TITLE to move it from the Available fields list box to the Selected fields
list box.
3) Double-click NOTES to move it from the Available fields list box to the Selected fields list
box.
4) Click OK to close the Edit Field Settings window.
a) Navigate to S:\Workshop\dqdmp1\Demos\files\output_files.
b) Type Ch4D8_Demo_ERF_Audit.txt in the File name field.
c) Click Save.
5) Click OK to close the Target Settings window.
j. Click OK to close the Entity Resolution File Output Properties window.
34. Save the job.
a. Click the Data Flow tab.
b. Select File Save.
35. Run the job.
a. Verify that the Data Flow tab is selected.
b. Select Actions Run Data Job.
36. Review the entity resolution file. Use the Entity Resolution Viewer.
a. Verify that the Cluster tab is selected.
A list of the clusters contained in the entity resolution file is displayed.
b. If necessary, click (the last tool on the toolbar) to reveal the Cluster Analysis pane.
c. Move the cursor over the bar where the record count equals 1.
Tooltip text appears and states that 53 clusters consist of one record.
d. Move the cursor over the bar where the record count equals 2.
Tooltip text appears and states that two clusters consist of two records.
e. Move the cursor over the bar where the record count equals 3.
Tooltip text appears and states that two clusters consist of three records.
f. Click the bar with a record count of 3 (three).
The Cluster Details pane is suppressed and the two clusters with a record count of three are
highlighted in the cluster listing.
g. Click Cluster 22 in the cluster listing.
h. If necessary, select View Show Cluster Details Pane.
4.4 Entity Resolution Jobs 4-131
The top row shows the surviving record based on the rules specified in the Surviving Record
Identification node.
• Maximum value of ID.
• EMAIL is not null.
• JOB TITLE is not null AND Longest Value of JOB TITLE.
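Outside of DataFlux, the rule logic above can be sketched in plain Python (a hypothetical re-creation for intuition, not the product's implementation): each rule narrows the candidate records in a cluster until a single survivor remains.

```python
def pick_survivor(records):
    """Apply the listed rules in order, narrowing the candidate set."""
    candidates = list(records)
    rules = [
        # Maximum value of ID
        lambda r, pool: r["ID"] == max(p["ID"] for p in pool),
        # EMAIL is not null
        lambda r, pool: r.get("EMAIL") is not None,
        # JOB TITLE is not null AND longest value of JOB TITLE
        lambda r, pool: r.get("JOB TITLE") is not None
        and len(r["JOB TITLE"]) == max(len(p.get("JOB TITLE") or "") for p in pool),
    ]
    for rule in rules:
        narrowed = [r for r in candidates if rule(r, candidates)]
        if narrowed:
            candidates = narrowed
        if len(candidates) == 1:
            break
    return candidates[0]

# Hypothetical cluster resembling the demonstration data
cluster = [
    {"ID": 61, "EMAIL": None, "JOB TITLE": "Buyer"},
    {"ID": 62, "EMAIL": "kim@example.com", "JOB TITLE": None},
    {"ID": 63, "EMAIL": "kim@example.com", "JOB TITLE": "Purchasing Manager"},
]
print(pick_survivor(cluster)["ID"])  # 63
```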
In this Edit mode, the Cluster records area shows the records of the selected cluster, and the surviving record area shows the surviving record's values, including any that were overwritten according to the field rule specifications.
1) Double-click the value for the Address field of the Surviving record.
This action makes the field editable.
2) Replace the current value of 333 Christian Street with the value 333 Christian St.
3) Press ENTER.
Notice that the edited value is bold.
4) Click .
5) Click Yes.
Cluster 22 now appears as dimmed and is no longer editable.
j. Type 35 in the Go to cluster field.
k. Click .
1) Double-click the value for the BUSINESS PHONE field of the Surviving record.
2) Replace the current value of (718)555-0100 with the value 718-555-0100.
3) Press ENTER.
4) Click .
A check mark appears next to the two clusters where changes are applied.
m. Select File Close Entity Resolution File.
n. Review the audit file that was produced.
1) Select Start All Programs Accessories Windows Explorer to access a Windows
Explorer window.
2) Navigate to S:\Workshop\dqdmp1\Demos\files\output_files.
3) Double-click Ch4D8_Demo_ERF_Audit.txt.
Exercises
• Copy the data job Ch4E4_Manufacturers_MatchReport located in the batch_jobs folder of the
Basics Exercises repository. Paste it in the same batch_jobs folder and rename it
Ch4E5_Manufacturers_SelectBestRecord.
• Remove the Match Report node and add a Surviving Record Identification node with the
following properties:
Name: Select Best Record
Cluster Difference
The Cluster Diff node is used to compare two sets of clustered records by reading in data from a left and a
right table (side). From each side, the Cluster Diff node takes two inputs: a numeric record ID field and a
cluster number field.
Possible values for diff types include the following:
COMBINE A record belongs to a set of records from one or more clusters in the left table that are
combined into a larger cluster in the right table.
DIVIDE A record belongs to a set of records in a cluster in the left table that is divided into two or
more clusters in the right table.
NETWORK A record belongs to a set of records that are involved in one or more different multi-record clusters in the left and right tables.
SAME A record belongs to a set of records that are in the same cluster in both the left and right tables.
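The diff-type classification can be illustrated with a simplified Python sketch (an approximation for intuition, not the node's actual algorithm): for each record, compare the set of records in its left-side cluster with the set in its right-side cluster.

```python
from collections import defaultdict

def diff_types(left, right):
    """left, right: dicts mapping a record ID to its cluster number per side."""
    lsets, rsets = defaultdict(set), defaultdict(set)
    for rec, c in left.items():
        lsets[c].add(rec)
    for rec, c in right.items():
        rsets[c].add(rec)
    result = {}
    for rec in left:
        ls, rs = lsets[left[rec]], rsets[right[rec]]
        if ls == rs:
            result[rec] = "SAME"     # same membership on both sides
        elif ls < rs:
            result[rec] = "COMBINE"  # left cluster absorbed into a larger one
        elif rs < ls:
            result[rec] = "DIVIDE"   # left cluster split apart
        else:
            result[rec] = "NETWORK"  # overlapping multi-record clusters
    return result

# Records 1-3 merge into one right-side cluster; records 4-5 split apart;
# record 6 is unchanged.
left  = {1: 10, 2: 10, 3: 11, 4: 12, 5: 12, 6: 13}
right = {1: 20, 2: 20, 3: 20, 4: 21, 5: 22, 6: 23}
print(diff_types(left, right))
```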
In this demonstration, you investigate the Cluster Diff node and the results that it can produce.
1. If necessary, select Start All Programs DataFlux Data Management Studio 2.2.
2. Verify that the Home tab is selected.
3. Click the Folders riser bar.
4. Copy an existing job to use as a starting point.
a. In the Nodes resource pane, click in front of the Utilities grouping of nodes.
b. Drag the Branch node to the Data Flow tab. The Branch node appears on the job flow but is not
connected to any node.
c. Establish node connections and a description for the Branch node.
1) Click the new Branch node in the data flow.
2) Verify that the Details pane is displayed.
3) If necessary, click the Basic Settings tab.
4) Type Branch as the value for the Name field.
5) Type Branch as the value for the Description field.
6) Click the Node Connections tab.
7) In the Connect from area, click . The Add Nodes window appears.
8) Click the Match Codes node (with a name of Match Codes for Various Fields).
9) Click OK to close the Add Nodes window.
10. Add a Clustering node to the job flow.
a. Click the Branch node in the job flow.
b. In the Nodes resource pane, click to collapse the Utilities grouping of nodes.
d. Establish node connections and a description for the Cluster Diff node.
1) Click the new Cluster Diff node in the data flow.
2) Verify that the Details pane is displayed.
3) If necessary, click the Basic Settings tab.
4) Type Cluster Diff as the value for the Name field.
5) Type Cluster Diff as the value for the Description field.
6) Click the Node Connections tab.
g. Type Diff_Set in the Diff set field in the Output field area.
h. Type Diff_Type in the Diff type field in the Output field area.
i. Verify that Skip rows with “same” diff type is selected.
j. Click Additional Outputs.
2) Navigate to S:\Workshop\dqdmp1\Demos\files\output_files.
3) Type Ch4D9_ClusterDiffReport.html in the File name field.
4) Click Save.
d. Click Display report in browser after job run.
20. Close the browser when you are finished viewing the HTML report.
Objectives
Explore pre-built jobs.
Understand how to specify and verify the node
connections.
Review several “new” nodes.
Branch
Sequencer (Autonumber)
4-148 Chapter 4 ACT
4.5 Multi-Input/Multi-Output Data Jobs (Self-Study) 4-149
In this demonstration, you examine a data job that takes advantage of two inputs and creates two outputs.
The data job uses a Data Joining node, a Branch node, and a Sequencer (Autonumber) node. Each node’s
properties are explored, as well as the data connection information.
1. If necessary, select Start All Programs DataFlux Data Management Studio 2.2.
2. Verify that the Home tab is selected.
3. Click the Folders riser bar.
4. Access an existing job.
The job flow diagram that you access may be “vertical.” In the picture above, the job diagram is horizontal for display purposes. Also, the job flow diagram that you access will have sticky note objects that are not displayed in the picture above.
6. Review the Data Source node.
a. Right-click the Data Source node and select Properties.
b. Verify that the Input table field displays the Products table from the dfConglomerate Gifts data
connection.
c. Verify that all fields from the Products table are selected.
d. Click OK to close the Data Source Properties window.
7. Review the data in the dfConglomerate Products table.
a. Click the Home tab.
b. Click the Data riser bar.
d. In the Output fields area, verify that the fields from the Products data source (the left table) have
updated output names - each field ends with the text _1.
e. In the Output fields area, verify that the fields from the text file input (the right table) have
updated output names - each field ends with the text _2.
The Data Joining node enables multiple inputs. The first node connected to this
node is always considered the left side and the second connected node is
considered the right side.
f. Click OK to close the Data Joining Properties window.
The Branch node enables multiple outputs. It is typically followed by two or more
Data Validation nodes to filter records down a particular path. A Branch node can
have up to 32 connections.
The Memory cache size field specifies the amount of memory (in Megabytes)
allocated to this step.
The Land all data locally before processing continues check box, when selected,
ensures that all data from the data sources is placed on the local machine before
the job continues to be processed. (Selecting this option can reduce the job
performance.)
b. Click OK to close the Branch Properties window.
Because this is a right join, if a match occurs (that is, if PRODUCT CODE equals
PRD_CODE), then the ID_1 field has a value.
c. Click OK to close the Data Validation Properties window.
15. Review the Data Validation node labeled No Match.
a. Right-click the No Match Data Validation node and select Properties.
b. Verify that the expression is ID_1 is null.
Because this is a right join, if a match does not occur (that is, if PRODUCT CODE does
not equal PRD_CODE), then the ID_1 field does not have a value.
c. Click OK to close the Data Validation Properties window.
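The match/no-match split that the two Data Validation nodes perform can be mimicked in a short Python sketch (with hypothetical field and function names): a right join keeps every right-side record, leaves the left-side fields null when no key matched, and suffixes field names with _1 and _2 the way the Data Joining node does.

```python
def right_join(left_rows, right_rows, left_key, right_key):
    joined = []
    for r in right_rows:
        matches = [l for l in left_rows if l[left_key] == r[right_key]]
        for l in matches:
            joined.append({**{k + "_1": v for k, v in l.items()},
                           **{k + "_2": v for k, v in r.items()}})
        if not matches:  # unmatched right record: left-side fields stay null
            joined.append({**{k + "_1": None for k in left_rows[0]},
                           **{k + "_2": v for k, v in r.items()}})
    return joined

db   = [{"ID": 1, "PRODUCT CODE": "A1"}]
text = [{"PRD_CODE": "A1"}, {"PRD_CODE": "B2"}]
rows = right_join(db, text, "PRODUCT CODE", "PRD_CODE")

matched     = [r for r in rows if r["ID_1"] is not None]  # the Match branch
non_matched = [r for r in rows if r["ID_1"] is None]      # the No Match branch
print(len(matched), len(non_matched))  # 1 1
```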
16. Review the Text File Output node labeled Products Matched.
a. Right-click the Text File Output node and select Properties.
b. Review the output file specifications.
c. Verify that all input (Available) fields are selected for output.
d. Click OK to close the Text File Output Properties window.
17. Review the Sequencer (Autonumber) node labeled Create Unique ID.
a. Right-click the Sequencer (Autonumber) node and select Properties.
b. Verify that the field to be created (Field name field) is named PK.
c. Verify that the Start number field is set to 91.
d. Verify that the Interval field is set to 1.
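The node's numbering scheme amounts to a simple counter, sketched below in plain Python (an illustration, not DataFlux code; the 91/1 values mirror this job's settings).

```python
def autonumber(rows, field="PK", start=91, interval=1):
    # Assign field = start, start + interval, start + 2*interval, ...
    for i, row in enumerate(rows):
        row[field] = start + i * interval
    return rows

rows = autonumber([{"NAME": "a"}, {"NAME": "b"}, {"NAME": "c"}])
print([r["PK"] for r in rows])  # [91, 92, 93]
```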
c. Select File Exit to close Microsoft Excel. (If you are prompted, do not save the changes.)
21. Click the Log tab to review the log.
Verify that 8 rows were added to the Products table.
In this demonstration, you examine a data job that takes advantage of two inputs and creates two outputs. The data job generates match codes for a variety of fields from each of the sources, and then performs a data join using various conditions that involve the match codes. The matches are written to one text file, and the non-matches are written to another.
1. If necessary, select Start All Programs DataFlux Data Management Studio 2.2.
2. Verify that the Home tab is selected.
3. Click the Folders riser bar.
4. Copy an existing job to use as a starting point.
The job flow diagram that you access may be “vertical.” In the picture above, the job diagram is horizontal for display purposes. Also, the job flow diagram that you access will have sticky note objects that are not displayed in the picture above.
5. Review the Data Source node.
a. Right-click the Data Source node and select Properties.
b. Verify that the Input table field displays the Customers table from the dfConglomerate Gifts
data connection.
c. Verify that all fields from the Customers table are selected.
d. Click OK to close the Data Source Properties window.
d. In the Output fields area, verify that the fields from the Customers data source (the left table)
have updated Output Names. That is, each field ends with the text _1.
e. In the Output fields area, verify that the fields from the text file input (the right table) have
updated Output Names. That is, each field ends with the text _2.
f. Click OK to close the Data Joining Properties window.
13. Review the Data Validation node labeled Text File - Record Matches DB Record.
a. Right-click the Data Validation node and select Properties.
b. Verify that the expression is ID_1 is not null.
Because this is a right join, if a match occurs (if at least one of the four conditions specified
in the data joining is met), then the ID_1 field has a value.
c. Click OK to close the Data Validation Properties window.
14. Review the Data Validation node labeled Text File - Record Does Not Match DB Record.
a. Right-click this Data Validation node and select Properties.
b. Verify that the expression is ID_1 is null.
Because this is a right join, if a match does not occur (if none of the four conditions specified in the data joining is met), then the ID_1 field does not have a value.
c. Click OK to close the Data Validation Properties window.
15. Review the Text File Output node labeled Matches.
a. Right-click the Text File Output node and select Properties.
b. Review the output file specifications.
c. Verify that all input (Available) fields are selected for output.
d. Click OK to close the Text File Output Properties window.
16. Review the Text File Output node labeled Non-Matches.
a. Right-click the Text File Output node and select Properties.
b. Review the output file specifications.
c. Verify that only the input (Available) fields with names ending in _2 are selected for output.
d. Click OK to close the Text File Output Properties window.
17. If necessary, save the job.
a. Click the Data Flow tab.
b. Select File Save.
18. Run the job.
a. Verify that the Data Flow tab is selected.
b. Select Actions Run Data Job.
c. Select File Exit to close each of the Microsoft Excel sessions. (If you are prompted, do not
save the changes.)
Exercises
• Create a new data job named Ch4E6_MultiInputOutput in the batch_jobs folder of the Basics
Exercises repository.
• Add the MANUFACTURERS table in the dfConglomerate Grocery data connection as the data
source. Select the following fields:
ID
MANUFACTURER
CONTACT
CONTACT_ADDRESS
CONTACT_CITY
CONTACT_STATE_PROV
CONTACT_POSTAL_CD
CONTACT_CNTRY
CONTACT_PHONE
POSTDATE
• Standardize the following fields with the corresponding definitions.
Field Name Definition
MANUFACTURER Organization
CONTACT Name
CONTACT_ADDRESS Address
CONTACT_STATE_PROV State/Province(Abbreviation)
CONTACT_PHONE Phone
• Use the specified definitions and sensitivities. Accept the default names for the match code fields.
Be sure to generate null match codes for blank field values and to preserve null values. Generate
match codes for the following fields
Field Name Definition Sensitivity
MANUFACTURER_Stnd Organization 75
CONTACT_Stnd Name 75
CONTACT_ADDRESS_Stnd Address 85
CONTACT_STATE_PROV_Stnd State/Province 85
• Also add a text input as input. The text file uses “ (double quotation mark) to qualify data, is
comma-delimited, and has a header row.
Text file: S:\Workshop\dqdmp1\data\Text Files\Manufacturer_Contact_List.txt
DFL file: S:\Workshop\dqdmp1\data\Text Files\Manufacturer_Contact_List.dfl
• Add a Standardization node following the Text File Input node and standardize as follows:
Field Name Definition
COMPANY Organization
NAME Name
WORK_ADDRESS Address
WORK_STATE State/Province(Abbreviation)
WORK_PHONE Phone
• Use the specified definitions and sensitivities. Accept the default names for the match code fields.
Be sure to generate null match codes for blank field values and to preserve null values. Generate
match codes for the following fields:
Field Name Definition Sensitivity
COMPANY_Stnd Organization 75
NAME_Stnd Name 75
WORK_ADDRESS_Stnd Address 85
WORK_STATE_Stnd State/Province 85
• Create a unique identifier field in the data flowing from the text file. Name the field ID, start the
values at 1 (one), and increment by 1 (one).
• Add a Data Joining node to the job flow to join the data flowing from the data source with the
data flowing from the text file. The join should be a right join and the join criteria can be one of
the following:
MANUFACTURER_Stnd_MatchCode = COMPANY_Stnd_MatchCode
CONTACT_Stnd_MatchCode = NAME_Stnd_MatchCode
CONTACT_ADDRESS_Stnd_MatchCode = WORK_ADDRESS_Stnd_MatchCode
CONTACT_STATE_PROV_Stnd_MatchCode = WORK_STATE_Stnd_MatchCode
or
MANUFACTURER_Stnd_MatchCode = COMPANY_Stnd_MatchCode
CONTACT_Stnd_MatchCode = NAME_Stnd_MatchCode
CONTACT_ADDRESS_Stnd_MatchCode = WORK_ADDRESS_Stnd_MatchCode
CONTACT_POSTAL_CD_MatchCode = WORK_ZIP_MatchCode
or
MANUFACTURER_Stnd_MatchCode = COMPANY_Stnd_MatchCode
CONTACT_Stnd_MatchCode = NAME_Stnd_MatchCode
CONTACT_PHONE_Stnd = WORK_PHONE_Stnd
CONTACT_POSTAL_CD_MatchCode = WORK_ZIP_MatchCode
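Each condition set above is an AND of match-code equalities. Rendered as a hypothetical Python predicate (field names as listed, rows represented as dicts), the first set would read:

```python
def keys_match(db_row, text_row):
    # First condition set: all four match-code equalities must hold (AND).
    return (
        db_row["MANUFACTURER_Stnd_MatchCode"] == text_row["COMPANY_Stnd_MatchCode"]
        and db_row["CONTACT_Stnd_MatchCode"] == text_row["NAME_Stnd_MatchCode"]
        and db_row["CONTACT_ADDRESS_Stnd_MatchCode"] == text_row["WORK_ADDRESS_Stnd_MatchCode"]
        and db_row["CONTACT_STATE_PROV_Stnd_MatchCode"] == text_row["WORK_STATE_Stnd_MatchCode"]
    )

# Hypothetical match-code values for one pair of rows
db_row = {
    "MANUFACTURER_Stnd_MatchCode": "M1", "CONTACT_Stnd_MatchCode": "C1",
    "CONTACT_ADDRESS_Stnd_MatchCode": "A1", "CONTACT_STATE_PROV_Stnd_MatchCode": "S1",
}
text_row = {
    "COMPANY_Stnd_MatchCode": "M1", "NAME_Stnd_MatchCode": "C1",
    "WORK_ADDRESS_Stnd_MatchCode": "A1", "WORK_STATE_Stnd_MatchCode": "S1",
}
print(keys_match(db_row, text_row))  # True
```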
• Add a Branch node following the Data Joining node.
• Add two Data Validation nodes. One searches for records where ID_1 is not null (matches) and
the other searches for records where ID_1 is null (non-matches).
• Add two Text File Output nodes: one to hold the matches and one to hold the non-matches. Create each file to use “ (double quotation mark) to qualify data and to be comma-delimited. Add a header row.
Add the following fields to the Matches text file:
Field Name Output Name
ID_1 DB_ID
MANUFACTURER_1 MANUFACTURER
CONTACT_1 CONTACT
ID_2 TEXT_ID
COMPANY_2 COMPANY
NAME_2 NAME
WORK_ADDRESS_2 WORK_ADDRESS
WORK_CITY_2 WORK_CITY
WORK_STATE_2 WORK_STATE
WORK_ZIP_2 WORK_ZIP
WORK_PHONE_2 WORK_PHONE
1) Double-click the batch_jobs folder. (This action makes this folder the value of the Save in
field.)
2) Type Ch4E1_Breakfast_Items_OZ in the Name field.
3) Click OK.
The new data job appears on a tab.
f. Add the Data Source node to the Data Flow Editor.
1) Verify that the Nodes riser bar is selected in the Resource pane.
b) Navigate to S:\Workshop\dqdmp1\Exercises\files\output_files.
c) Type Ch4E1_Breakfast_Items_OZ.txt in the File name field.
d) Click Save.
3) Specify attributes for the file.
a) Verify that the Text qualifier field is set to “ (double quotation mark).
b) Verify that the Field delimiter field is set to Comma.
c) Click Include header row.
d) Click Display file after job runs.
4) Re-order selected fields for the output file.
a) Click MANUFACTURER_ID in the Selected area.
5) Click OK to save the changes and close the Text File Output Properties window.
l. Add a note to the data flow.
5) Double-click the batch_jobs folder. (This action makes this folder the value of the Save in
field.)
6) Type Ch4E2_Manufacturers_DataQuality in the Name field.
7) Click OK.
c. Add the Data Source node to the Data Flow Editor.
1) Verify that the Nodes riser bar is selected in the Resource pane.
5) For the CONTACT_ADDRESS field, click under Definition and select Address.
4) Double-click each of the following tokens to move them from the Available list box to the
Selected list box.
Given Name
Middle Name
Family Name
5) Change the output name for the Given Name token.
a) Double-click the default output name for the Given Name token.
b) Type FIRST_NAME.
c) Press ENTER.
6) Change the output name for the Middle Name token.
a) Double-click the default output name for the Middle Name token.
b) Type MIDDLE_NAME.
c) Press ENTER.
7) Change the output name for the Family Name token.
a) Double-click the default output name for the Family Name token.
b) Type LAST_NAME.
c) Press ENTER.
8) Click Preserve null values.
The settings in the Parsing Properties window should resemble the following:
2) Click next to the Output table field. The Select Table window appears.
(1) Type MANUFACTURERS_STND in the Enter a name for the new table field.
(2) Click OK to close the New Table window.
Each field can be removed by clicking the field in the Selected list box and
then clicking . Extended selection is also an option.
b) Move the following fields to the Selected side, in the following order:
ID
MANUFACTURER_Stnd
FIRST_NAME
MIDDLE_NAME
LAST_NAME
CONTACT_ADDRESS_Stnd
CONTACT_CITY
CONTACT_STATE_PROV_Stnd
CONTACT_POSTAL_CD
CONTACT_CNTRY_Stnd
CONTACT_PHONE
POSTDATE
5) Rename the _Stnd fields.
ID ID
MANUFACTURER_Stnd MANUFACTURER
FIRST_NAME FIRST_NAME
MIDDLE_NAME MIDDLE_NAME
LAST_NAME LAST_NAME
CONTACT_ADDRESS_Stnd CONTACT_ADDRESS
CONTACT_CITY CONTACT_CITY
CONTACT_STATE_PROV_Stnd CONTACT_STATE_PROV
CONTACT_POSTAL_CD CONTACT_POSTAL_CD
CONTACT_CNTRY_Stnd CONTACT_CNTRY
CONTACT_PHONE CONTACT_PHONE
POSTDATE POSTDATE
6) Click OK to save the changes and close the Data Target (Insert) Properties window.
m. Save the job.
1) Click the Data Flow tab.
2) Select File Save.
n. Run the job.
1) Verify that the Data Flow tab is selected.
2) Select Actions Run Data Job.
3) Notice the processing information on each node.
o. View the log information.
1) Click the Log tab.
2) Review the information for each of the nodes.
p. Select File Close.
q. View the newly created data.
1) If necessary, click the Home tab.
2) Click the Data riser bar.
1) Double-click the batch_jobs folder. (This action makes this folder the value of the Save in field.)
2) Type Ch4E3_MANUFACTURERS_Verify in the Name field.
3) Click OK.
a) Click under Field Type for the MANUFACTURER field and click Firm.
b) Click under Field Type for the STREET_ADDR field and click Address Line 1.
c) Click under Field Type for the CITY field and click City.
d) Click under Field Type for the STATE_PROV field and click State.
e) Click under Field Type for the POSTAL_CD field and click Zip.
c) Click the CONTACT_ADDRESS field in the Output fields area, and then click .
d) Click the CONTACT_CITY field in the Output fields area, and then click .
e) Click the CONTACT_STATE_PROV field in the Output fields area, and then click .
f) Click the CONTACT_POSTAL_CD field in the Output fields area, and then click .
g) Click the CONTACT_CNTRY field in the Output fields area, and then click .
h) Click the CONTACT_PHONE field in the Output fields area, and then click .
i) Click the NOTES field in the Output fields area, and then click .
j) Click the POSTDATE field in the Output fields area, and then click .
b) Navigate to S:\Workshop\dqdmp1\Exercises\files\output_files.
c) Type Ch4E3_Manufacturer_Verify.txt in the File name field.
d) Click Save.
3) Specify attributes for the file.
a) Verify that the Text qualifier field is set to “ (double quotation mark).
b) Verify that Field delimiter field is set to Comma.
c) Click Include header row.
d) Click Display file after job runs.
4) Specify desired fields for the output file.
a) Click to move all default selected fields from the list box.
ID ID
Firm_Verified Manufacturer
Address_Verified Address
City_Verified City
State_Verified State
ZIP_Verified ZIP
US_County_Name US_County_Name
US_Result_Code US_Result_Code
5) Double-click the batch_jobs folder. (This action makes this folder the value of the Save in field.)
6) Type Ch4E4_Manufacturers_MatchReport in the Name field.
7) Click OK.
c. Add the Data Source node to the Data Flow Editor.
1) Verify that the Nodes riser bar is selected in the Resource pane.
b) Double-click CONTACT_MatchCode to move from the Available fields list box to the
Selected fields list box.
c) Double-click CONTACT_ADDRESS_MatchCode to move from the Available fields list
box to the Selected fields list box.
d) Double-click CONTACT_POSTAL_CD_MatchCode to move from the Available fields
list box to the Selected fields list box.
6) Specify the second condition.
a) Click OR.
b) Double-click MANUFACTURER_MatchCode to move from the Available fields list
box to the Selected fields list box.
c) Double-click CONTACT_MatchCode to move from the Available fields list box to the
Selected fields list box.
d) Double-click CONTACT_STATE_PROV_MatchCode to move from the Available
fields list box to the Selected fields list box.
e) Double-click CONTACT_PHONE_MatchCode to move it from the Available fields list box to the Selected fields list box.
7) Click OK to close the Clustering Properties window.
j. Preview the Clustering node.
1) If necessary, click the Clustering node in the job diagram.
2) Verify that the Details pane is displayed.
3) Click the Preview tool ( ).
a) Navigate to S:\Workshop\dqdmp1\Exercises\files\output_files.
b) Type Ch4E4_Manufacturers_MatchReport in the File name field.
c) Click Save.
3) Type Manufacturers Match Report – Two Conditions in the Report Title field.
4) Click Launch Viewer after job is completed.
5) Click next to Cluster field and select Cluster_ID.
b) Double-click Cluster_ID to move this field from the Available list box to the Selected list
box.
c) Double-click ID to move this field from the Available list box to the Selected list box.
d) Double-click MANUFACTURER to move this field from the Available list box to the
Selected list box.
e) Double-click CONTACT to move this field from the Available list box to the Selected list
box.
f) Double-click CONTACT_ADDRESS to move this field from the Available list box to the
Selected list box.
g) Double-click CONTACT_CITY to move this field from the Available list box to the
Selected list box.
h) Double-click CONTACT_STATE_PROV to move this field from the Available list box
to the Selected list box.
i) Double-click CONTACT_POSTAL_CD to move this field from the Available list box to
the Selected list box.
j) Double-click CONTACT_CNTRY to move this field from the Available list box to the
Selected list box.
k) Double-click CONTACT_PHONE to move this field from the Available list box to the
Selected list box.
7) Click OK to close the Match Report Properties window.
m. Save the job.
1) Click the Data Flow tab.
2) Select File Save.
n. Run the job.
1) In the Nodes resource pane, click in front of the Entity Resolution grouping of nodes.
2) Double-click the Surviving Record Identification node.
The node in the data flow and the Surviving Record Identification Properties window appear.
e. Specify properties for the Surviving Record Identification node.
1) Type Select Best Record in the Name field.
2) Click next to Cluster ID field and select Cluster_ID.
b) Navigate to S:\Workshop\dqdmp1\Exercises\files\output_files.
c) Type Ch4E5_Manufacturer_Best_Record.txt in the File name field.
d) Click Save.
3) Specify attributes for the file.
a) Verify that the Text qualifier field is set to “ (double quotation mark).
b) Verify that the Field delimiter field is set to Comma.
c) Click Include header row.
d) Click Display file after job runs.
b) Double-click Cluster_ID to move this field from the Available list box to the Selected list
box.
c) Double-click ID to move this field from the Available list box to the Selected list box.
d) Double-click SR_ID to move this field from the Available list box to the Selected list box.
e) Double-click POSTDATE to move this field from the Available list box to the Selected
list box.
f) Double-click MANUFACTURER to move this field from the Available list box to the
Selected list box.
g) Double-click CONTACT to move this field from the Available list box to the Selected list
box.
h) Double-click CONTACT_ADDRESS to move this field from the Available list box to the
Selected list box.
i) Double-click CONTACT_CITY to move this field from the Available list box to the
Selected list box.
j) Double-click CONTACT_STATE_PROV to move this field from the Available list box
to the Selected list box.
k) Double-click CONTACT_POSTAL_CD to move this field from the Available list box to
the Selected list box.
l) Double-click CONTACT_CNTRY to move this field from the Available list box to the
Selected list box.
m) Double-click CONTACT_PHONE to move this field from the Available list box to the
Selected list box.
5) Click OK to close the Text File Output Properties window.
h. Save the job.
1) Click the Data Flow tab.
2) Select File Save.
i. Run the job.
1) Verify that the Data Flow tab is selected.
2) Select Actions Run Data Job.
Two groups are highlighted in the above view of the text file.
For Cluster ID=4, the selected record is the record with ID=78. Why?
For Cluster_ID=6, the selected record is the record with ID=11. Why?
3) Select File Exit to close the Notepad window.
j. Select File Close.
6. Creating a Multi-Input/Multi-Output Data Job
a. If necessary, select Start All Programs DataFlux Data Management Studio 2.2.
b. Verify that the Home tab is selected.
c. Click the Folders riser bar.
d. Click the Basics Exercises repository.
e. Click Data Job.
1) Double-click the batch_jobs folder. (This action makes this folder the value of the Save in
field.)
2) Type Ch4E6_MultiInputOutput in the Name field.
3) Click OK.
6) For the MANUFACTURER field, click under Definition and select Organization.
7) For the CONTACT field, click under Definition and select Name.
8) For the CONTACT_ADDRESS field, click under Definition and select Address.
2) Double-click the Standardization node. (Verify that the node was attached to the Text File
Input node).
3) Type Standardize Fields in the Name field.
4) Double-click each of the following fields to move them from the Available list box to the
Selected list box.
COMPANY
NAME
WORK_ADDRESS
WORK_STATE
WORK_PHONE
5) For the COMPANY field, click under Definition and select Organization.
6) For the NAME field, click under Definition and select Name.
7) For the WORK_ADDRESS field, click under Definition and select Address.
13) Double-click the WORK_STATE_Stnd field to move it from the Available list box to the
Selected list box.
14) Click under Definition and select State/Province.
The performance of the data job accelerates if enough memory is available to load
one of the sides into memory.
The data job flow should now resemble the following:
b) Navigate to S:\Workshop\dqdmp1\Exercises\files\output_files.
c) Type Ch4E6_Manufacturers_Matches.txt in the File name field.
d) Click Save.
6) Specify attributes for the file.
a) Verify that the Text qualifier field is set to “ (double quotation mark).
b) Verify that Field delimiter field is set to Comma.
c) Click Include header row.
d) Click Display file after job runs.
7) Specify desired fields for the output file.
a) Click to move all default selected fields from the list.
ID_1 DB_ID
MANUFACTURER_1 MANUFACTURER
CONTACT_1 CONTACT
ID_2 TEXT_ID
COMPANY_2 COMPANY
NAME_2 NAME
b) Navigate to S:\Workshop\dqdmp1\Exercises\files\output_files.
c) Type Ch4E6_Manufacturers_NonMatches.txt in the File name field.
d) Click Save.
6) Specify attributes for the file.
a) Verify that the Text qualifier field is set to “ (double quotation mark).
b) Verify that the Field delimiter field is set to Comma.
c) Click Include header row.
d) Click Display file after job runs.
7) Specify desired fields for the output file.
COMPANY_2 COMPANY
NAME_2 NAME
WORK_ADDRESS_2 WORK_ADDRESS
WORK_CITY_2 WORK_CITY
WORK_STATE_2 WORK_STATE
WORK_ZIP_2 WORK_ZIP
WORK_PHONE_2 WORK_PHONE
a) Click Export.
b) If necessary, navigate to S:\Workshop\dqdmp1\Exercises\files\output_files.
c) Type Ch4E6_Manufacturer_NonMatches.dfl in the File name field.
d) Click Save.
9) Click OK to save the changes and close the Text File Output Properties window.
Objectives
Define a business rule.
Create a row-based business rule.
5-4 Chapter 5 MONITOR
Business rules can be used in both data profiles and data jobs to monitor your data.
Types of Rules
• Row-based rule: Evaluates every row of data passed into the monitoring node.
• Set-based rule: Evaluates and applies rules to all of the input data in totality (for example, evaluates 1000 rows as a set).
• Group-based rule: Evaluates and applies all rules to groups of data (for example, if data is grouped by product code, then the rules are evaluated for each product code).
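The three rule types differ only in how much data each evaluation sees. The following Python sketch is an illustrative analogy only; DataFlux rules are built in the Business Rules Manager expression builder, not in Python.

```python
# Illustrative analogy of the three rule types; not DataFlux syntax.
def row_based(rows, predicate):
    """One pass/fail evaluation per input row."""
    return [predicate(row) for row in rows]

def set_based(rows, aggregate_check):
    """One evaluation over all input rows in totality."""
    return aggregate_check(rows)

def group_based(rows, key, aggregate_check):
    """One evaluation per group of rows (for example, per product code)."""
    groups = {}
    for row in rows:
        groups.setdefault(row[key], []).append(row)
    return {k: aggregate_check(g) for k, g in groups.items()}

rows = [{"product": "A", "price": 10},
        {"product": "A", "price": 0},
        {"product": "B", "price": 5}]
print(row_based(rows, lambda r: r["price"] > 0))   # [True, False, True]
print(set_based(rows, lambda rs: len(rs) >= 3))    # True
print(group_based(rows, "product", lambda g: len(g) < 5))
```

A row-based rule yields one result per row, while a set-based rule reduces the entire input to a single result, and a group-based rule produces one result per group key.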
In this demonstration, you create a row-based business rule to monitor when an e-mail address and a
phone number are missing from a record.
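Restated outside the product, the rule flags a record only when both values are absent. This is a hypothetical Python equivalent of that logic, not the rule expression you build in the demonstration.

```python
# Hypothetical restatement of the demonstration's row-based rule:
# a record violates the rule when BOTH EMAIL and PHONE are missing.
def email_and_phone_missing(record):
    email = (record.get("EMAIL") or "").strip()
    phone = (record.get("PHONE") or "").strip()
    return email == "" and phone == ""

print(email_and_phone_missing({"EMAIL": "", "PHONE": None}))       # True
print(email_and_phone_missing({"EMAIL": "a@b.com", "PHONE": ""}))  # False
```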
1. If necessary, select Start → All Programs → DataFlux Data Management Studio 2.2.
2. Verify that the Home tab is selected.
3. Select Tools → Business Rules Manager → Basics Demos.
A tab appears and the Business Rules Manager appears in the Basics Demos repository.
3) Click OK.
5.1 Business Rules 5-7
3) Click OK.
c. Right-click the Fields folder and select New Field.
1) Type PHONE in the Name field.
2) Type Phone Number in the Description field.
3) Click OK.
The final set of fields should resemble the following:
1) Click the button next to the Field field. The Select Field window appears.
j. Click Get information about the contents of a field (is null, is numeric, etc.) under Get Field
Information in the Step 1 area.
1) Click the button next to the Field field. The Select Field window appears.
Exercises
Is a field numeric?
OR
Is the length of the field equal to 14?
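The exercise rule passes a value when either condition holds. A rough Python sketch follows; the function name is hypothetical and the isdigit test is a simplification of the product's is-numeric check.

```python
def rule_passes(value):
    """Pass when the field is numeric OR its length equals 14 (simplified)."""
    text = str(value)
    return text.isdigit() or len(text) == 14

print(rule_passes("12345"))           # True (numeric)
print(rule_passes("AB-1234567890X"))  # True (length 14)
print(rule_passes("short"))           # False
```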
Objectives
Discuss how to define and explore business rules in
data profiles.
Discuss how to define and explore alerts in data
profiles.
Discuss how to define historical visualizations in a
data profile report.
On the Properties tab for a profile, select the Business Rules sub-tab to apply business rule(s) to the data.
After the profile is run, on the Report tab, select the Business Rules sub-tab to view the number of
violations.
Alerts
On the Properties tab for a profile, select the Alerts sub-tab to set up alert(s) for the data. After the profile
is run, on the Report tab, select the Alerts sub-tab to determine whether any alerts were triggered.
Historical Visualizations
Historical visualizations are customized charts that show how metrics might change over a period of
time.
In this demonstration, you examine the steps necessary to add a business rule that was created previously.
In addition, you demonstrate how to set up an alert and how to locate and view the alerts in the profile
report.
1. If necessary, select Start → All Programs → DataFlux Data Management Studio 2.2.
2. Verify that the Home tab is selected.
3. Click the Folders riser bar.
4. Expand the Basics Demos repository.
5. Expand the profiles_and_explorations folder.
6. Double-click the Ch3D3_dfConglomerate_Profile profile.
7. Click the Properties tab.
a. Click in the Business rule field and select Email and Phone Missing.
b. Click in the Field Name field for PHONE and select BUSINESS PHONE.
d. Click the Log data that violates the business rule check box.
e. Click OK.
The Business Rules tab displays the following:
b. Click Continue.
The Add Standard Metric Alert window appears.
e. Click in the Comparison field under Alert condition and select Metric is greater than.
The Send e-mail option requires that the EMAIL SMTP Server is configured in the
app.cfg file.
h. Click OK.
The Alerts tab displays the following:
a. Click the expand icon in front of dfConglomerate Gifts. Expand Products → PRODUCT CODE. Verify that a warning symbol appears with each selection.
The Alert was triggered and is written on the sub-tab. If the Alert was not triggered, there would be no indicators on the Report tab and nothing written on the Alerts sub-tab for the table.
QUESTION: What is the Pattern Count metric for the PRODUCT CODE field?
c. Double-click the summary of the violations. The Business Rule Violation Drill Through window
appears.
This drill-through information is available only because Log data that violates the
business rule was selected.
d. Click Close to close the Business Rule Violation Drill Through window.
e. Click the Employees table.
f. If necessary, click the Business Rules sub-tab.
There should not be any violations for this table. The business rule was established only for the Customers table.
18. Select File → Close Profile.
In this demonstration, you run a data job that updates the Customers table with standardized phone fields.
A profile on the Customers table is rerun and historical metrics are examined visually.
1. Access a data job in the Basics Solutions repository.
a. If necessary, select Start → All Programs → DataFlux Data Management Studio 2.2.
b. Verify that the Home tab is selected.
c. Click the Folders riser bar.
d. Expand the Basics Solutions repository.
e. Expand the batch_jobs folder.
f. Double-click the Ch5D3_Update_Customers_Table data job.
2) Verify that the standardized phone fields (those ending with the default suffix _Stnd) replace their corresponding phone fields.
The Visualizations tab surfaces the newly specified historic line plot.
Your view might differ depending on the number of times that the above profile was run.
The Pattern Count and Unique Count metrics decreased for this last execution.
6. Select File → Close Profile.
Exercises
QUESTION: Were there violations to the business rule? If so, what records did not pass the rule
check?
Answer:
Answer:
Objectives
Discuss how to define a monitoring task.
Discuss and use several of the available events.
Create a monitoring data job.
View the Monitoring report.
Monitoring Tasks
Monitoring tasks are created by pairing a defined
business rule with one or more events. Some available
events include the following:
• Call a realtime service
• Execute a program
When creating a Monitoring task, select the business rule(s) that are part of the task. For each business rule, select the event(s) that are triggered if there is a rule violation.
If the Send email message event is selected, a separate e-mail is sent for each rule violation.
In the Data Monitoring node, map the fields used in the task to the input fields in the data job. The Export function from the Business Rules Manager suggests a mapping based on field names. If a field is not mapped or is mapped incorrectly, use the drop-down menu to map the task field to the appropriate field name.
A Monitoring task can be applied to one or more data sources, including modified data in a data job. After the job is run, the log lists the number of rows read and the number of events triggered by the Data Monitoring step.
Previewing the Data Monitoring node does not cause an event to be triggered. An event is
triggered only when the job is run.
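The name-based mapping suggestion can be pictured as a simple case-insensitive lookup. This is a hypothetical sketch of the behavior, not the product's implementation.

```python
def suggest_mapping(task_fields, input_fields):
    """Suggest a task-field-to-input-field mapping by matching names
    case-insensitively; unmatched task fields stay unmapped (None)."""
    by_name = {f.lower(): f for f in input_fields}
    return {t: by_name.get(t.lower()) for t in task_fields}

print(suggest_mapping(["EMAIL", "PHONE", "PK"],
                      ["Email", "BUSINESS PHONE", "ID"]))
# {'EMAIL': 'Email', 'PHONE': None, 'PK': None}
```

PHONE and PK come back unmapped here because no input field shares their names, which mirrors the demonstration steps where they are mapped manually to BUSINESS PHONE and ID.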
Monitoring Report
The Monitoring report provides a summary listing of monitoring task executions for a selected repository.
The listing of items can be sorted or grouped to enable easier viewing of the results of the Monitoring task
executions. The data can also be exported as a CSV or XML file.
In this demonstration, you create a task that combines a previously created rule with two events. This task
is exported as a data job, and the data job is executed. The results are reviewed.
1. If necessary, select Start → All Programs → DataFlux Data Management Studio 2.2.
2. Verify that the Home tab is selected.
3. Select Tools → Business Rules Manager → Basics Demos.
A tab appears and the Business Rules Manager appears in the Basics Demos repository.
4. Right-click the Tasks folder and select New Task.
a. Type Email and Phone Missing in the Name field.
b. Double-click the Email and Phone Missing rule to move it from the Available list to the
Selected list.
c. With the rule highlighted in the Selected list, click Rule Details.
The Rule Details window appears.
1) Click Add in the Events window.
a) Click in the Select event field and select Log error to repository.
b) Click Continue.
c) Click the move button to move the three fields (EMAIL, PHONE, PK) from Available fields to Selected fields.
d) Click OK to close the Log Error To Repository Event Properties window.
2) Click Add in the Rule Details window.
a) Click in the Select event field and select Log error to text file.
b) Click Continue.
g) Click OK to close the Log Error To Text File Event Properties window.
3) Click OK to close the Rule Details window.
b. Click Next.
h. Click Next.
i. Click in the Data Source Field field for the PHONE task field and select BUSINESS
PHONE.
j. Click in the Data Source Field field for the PK task field and select ID.
k. Click Next.
l. Type Both Email And Phone Missing in the Description field.
o. Click Finish.
If the folders for Basics Demos do not appear when saving the above data job, do the
following:
• Click the Home tab.
• Click the Administration riser bar.
• Click to expand Repository Definitions.
e. If necessary, click .
14. Access Windows Explorer and examine the text file that was created.
a. Select Start → All Programs → Accessories → Windows Explorer to access Windows Explorer.
b. Navigate to S:\Workshop\dqdmp1\Demos\files\output_files.
c. Double-click Ch5D4_EmailAndPhoneMissing.txt.
A Notepad window appears, showing the logged rule violations.
15. From the data job, select Tools → Monitoring Report → Basics Demos.
The specific violations are listed only because you selected the Log error to repository event. The summary information is written regardless, but the specifics are listed only if there are violations and this event is selected.
17. Click the History Graph tab.
19. Select File → Close to close the tab for the Monitoring Report.
20. Select File → Close to close the tab for the data job.
21. Select File → Close to close the Business Rules Manager tab.
1. If necessary, select Start → All Programs → DataFlux Data Management Studio 2.2.
2. Verify that the Home tab is selected.
3. Click the Folders riser bar.
d. Click OK.
7. Right-click the new data job (Copy of Ch5D4_BothEmailAndPhoneMissing) and select Rename.
a. Type Ch5D5_BothEmailAndPhoneMissing for the new name.
b. Press ENTER.
8. Double-click the new data job (Ch5D5_BothEmailAndPhoneMissing). The data job opens on a new
tab.
There were no violations this time. Therefore, the text file was not created.
14. From the data job, select Tools → Monitoring Report → Basics Demos.
15. Verify that no triggers occurred.
16. Select File → Close to close the tab for the Monitor Viewer.
17. Select File → Close to close the tab for the data job.
Exercises
In this demonstration, you create a new field for use in a new set-based rule. The rule is combined with an event to form a task, and the new task is used in a data job.
1. If necessary, select Start → All Programs → DataFlux Data Management Studio 2.2.
2. Verify that the Home tab is selected.
3. Select Tools → Business Rules Manager → Basics Demos.
4. Create a new field by right-clicking the Fields folder and selecting New Field.
a. Type COUNTRY in the Name field.
b. Type Country Value in the Description field.
c. Click OK.
5. Create a new set-based rule.
a. Right-click the Rules folder and select New Rule.
b. Type Country Pattern Count Check in the Name field.
c. Click Set in the Type area.
1) Click the button next to the Field field. The Select Field window appears.
4) Click PATCOUNT.
5) Click OK to close the Select Field window.
b) Click Continue.
c) Double-click SM$COUNTRY$BLANKCOUNT to move it from Available fields to
Selected fields.
d) Double-click SM$COUNTRY$NULLCOUNT to move it from Available fields to
Selected fields.
e) Double-click SM$COUNTRY$PATCOUNT to move it from Available fields to
Selected fields.
f) Double-click SM$COUNTRY$PATMODE to move it from Available fields to
Selected fields.
g) Click OK to close the Log Error To Repository Event Properties window.
2) Click OK to close the Events window.
b) Click Customers.
c) Click OK to close the Select Table window.
4) Double-click COUNTRY/REGION to move it from the Available list to the Selected list.
5) Click OK to save the changes and close the Data Source Properties window.
g. Add a Standardization node to the job flow.
1) In the Nodes resource pane, click the collapse icon to collapse the Data Inputs grouping of nodes.
2) Click the expand icon in front of the Quality grouping of nodes.
2) Click the drop-down for the Task name field and select Country Pattern Count Check.
5) Click the drop-down under the Field Name field and select COUNTRY/REGION_Stnd (for the task field of COUNTRY).
6) Click OK to save the changes and close the Data Monitoring Properties window.
9. Select File → Save Data Job.
10. Select Actions → Run Data Job.
11. When the job completes, click the Log tab.
A message was written to the log for each violation that was encountered.
Because it is a set rule, the number of aggregate rows read can only be 1, and the number of triggered events can only be 0 or 1.
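The SM$COUNTRY$BLANKCOUNT, NULLCOUNT, and PATCOUNT fields logged earlier are set-level metrics: the whole input reduces to one aggregate row, which is why at most one event can fire. Below is a hedged sketch of how such metrics might be computed; the pattern coding is a plausible re-creation, not the product's exact algorithm.

```python
def char_pattern(value):
    """Code a value as a character pattern: A=uppercase, a=lowercase, 9=digit."""
    return "".join("A" if c.isupper() else
                   "a" if c.islower() else
                   "9" if c.isdigit() else c
                   for c in value)

def set_metrics(values):
    """Reduce a whole column to one aggregate row of set-level metrics."""
    return {
        "BLANKCOUNT": sum(1 for v in values if v == ""),
        "NULLCOUNT": sum(1 for v in values if v is None),
        "PATCOUNT": len({char_pattern(v) for v in values if v not in (None, "")}),
    }

print(set_metrics(["USA", "usa", None, "", "U.S."]))
# {'BLANKCOUNT': 1, 'NULLCOUNT': 1, 'PATCOUNT': 3}
```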
14. Select File → Close to close the tab for the Monitor Viewer.
15. Select File → Close to close the tab for the data job.
Exercises
• Create a new field BRAND (Brand Name) in the Demos Exercises repository.
• Create a group-based rule named Low Brand Count that groups on distinct values of BRAND.
The rule should check for the counts of the values of BRAND being less than 5 (five).
• Create a task (named Low Brand Count) that calls the Low Brand Count rule with a
Log Error to Repository event. Log the fields BRAND and SM$BRAND$COUNT.
• Export the new task as a data job named Ch5E4_LowBrandCount, saving it in Basics Exercises → batch_jobs.
• Run the new job.
Answer:
Answer:
• View the records (if there are any) that triggered the monitoring event.
Monitoring Dashboards
The Monitor dashboard displays alert trigger information for a selected repository on a separate tab. Each tab contains the following elements:
• Triggers Per Date (Last X Days)
• Triggers Per Records Processed
• Triggers Per Source (Last X Days)
• Triggers Per User (Last X Days)
The Triggers per Date graph shows a percentage of triggers per records processed. The graph
scale is from 0 to 100, representing percent. For example, if you have 100 triggers in 5000 rows,
the triggers represent 2.0% of the records processed. If you move your cursor over the data point,
you get a tooltip with helpful information such as “... 11 triggers in 3276 rows.”
The number of days for the Last X Days is configured as part of the Monitoring options by selecting Tools → Data Management Studio Options from the menu.
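The plotted percentage is simply triggers divided by records processed, scaled to the 0-100 range:

```python
def trigger_percentage(triggers, rows_processed):
    """Percentage of processed records that triggered the rule (0-100 scale)."""
    return 100.0 * triggers / rows_processed

print(trigger_percentage(100, 5000))           # 2.0
print(round(trigger_percentage(11, 3276), 2))  # 0.34
```

The second call reproduces the tooltip example above: 11 triggers in 3276 rows is about a third of one percent.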
In this demonstration, you examine the accumulated information in the Basics Demos repository via the
Monitor dashboard.
1. If necessary, select Start → All Programs → DataFlux Data Management Studio 2.2.
DataFlux Data Management Studio appears.
2. Verify that the Home tab is selected.
3. Click the Information riser bar.
4. Click Monitor from the Information pane.
5. Click the tab labeled Basics Demos - Dashboard.
(1) Click the button next to the Field field. The Select Field window appears.
d) Click OR.
e) Click Compare the length of a field to a value under Get Field Information in the Step 1
area.
(1) Click the button next to the Field field. The Select Field window appears.
(2) Click in the Operator field and select not equal to.
3) Click in the Business rule field and select UPC Check CODE.
4) Click in the Comparison field under Alert condition and select Percentage of failed rows
is greater than.
5) Type 0 (zero) in the Value field.
6) Type UPC field - Percentage of failed rows is greater than 0 in the Description field.
7) Click OK.
m. Select File → Save Profile to save the changed profile.
n. Select Actions → Run Profile Report.
1) Type Third profile in the Description field.
2) Verify that Append to existing report is selected.
3) Click OK to close the Run Profile window.
The profile executes. A status of the execution is displayed.
The Report tab becomes active.
o. Review the Profile report.
(1) Click in the Select event field and select Log error to repository.
(3) Click the move button to move the two fields (PK, UPC) from Available fields to Selected fields.
(4) Click OK to close the Log Error To Repository Event Properties window.
b) Click OK to close the Events window.
4) Click OK to close the New Task window.
e. Right-click the Sources folder and select New Source.
1) Type Breakfast Items in the Name field.
2) Type Breakfast Items Table in the Description field.
3) Click OK to close the Source Properties window.
4) Click in the Data Source Field field for the PK task field and select ID.
5) Click Next.
6) Type Check UPC Field in the Description field.
9) Click Finish.
h. Review the data job.
1) Click the Home tab.
2) Verify that the Folders riser bar is selected.
3) Expand Basics Exercises → batch_jobs.
4) Double-click the Ch5E3_UPCCheck data job.
5) Click .
d) Click COUNT.
e) Click OK to close the Select Field window.
9) Click in the Check field and select less than.
5) Double-click BRAND to move it from the Available list to the Selected list.
6) Click Next.
7) Verify that the BRAND field from the data source is mapped to the task field of the same
name.
8) Click Next.
9) Type Low Brand Count Group Rule in the Description field.