
IBM InfoSphere Information Analyzer

Version 8 Release 5

Methodology and Best Practices Guide

SC19-2750-02


Note: Before using this information and the product that it supports, read the information in "Notices and trademarks" on page 107.

Copyright IBM Corporation 2006, 2010. US Government Users Restricted Rights: Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

Contents
Methodology and best practices . . . . . 1
  Product overview . . . . . 1
  Business case . . . . . 2
  Analysis as a business management practice . . . . . 2
  Project applications . . . . . 4
    Data integration projects . . . . . 4
    Operational improvement projects . . . . . 4
    Enterprise data management projects . . . . . 5
  Analysis methodology . . . . . 5
    Planning for analysis . . . . . 5
    Source data for analysis overview . . . . . 7
    Column analysis overview . . . . . 9
    Analysis settings . . . . . 50
  Data quality methodology . . . . . 53
    Data quality analysis and monitoring . . . . . 54
    Structuring data rules and rule sets . . . . . 55
    Naming standards . . . . . 57
    Data rules analysis . . . . . 60
    Data rule sets . . . . . 79
    Metrics . . . . . 88
    Monitoring results . . . . . 92
    Deploying rules, rule sets, and metrics . . . . . 94
    Managing a data quality rules environment . . . . . 95

Contacting IBM . . . . . 101

Accessing product documentation . . . . . 103

Product accessibility . . . . . 105

Notices and trademarks . . . . . 107

Index . . . . . 111


Methodology and best practices


You use IBM InfoSphere Information Analyzer to understand the content, structure, and overall quality of your data at a given point in time. The IBM InfoSphere Information Analyzer Methodology and Best Practices Guide provides a deeper insight into the analytical methods employed by IBM InfoSphere Information Analyzer to analyze source data and rules. The information is organized by analytical function. It gives you both in-depth knowledge and best practices for:

- Data analysis, including:
  - Applying data analysis system functionality
  - Applying data analysis techniques within a function
  - Interpreting data analysis results
  - Making decisions or taking actions based on analytical results
- Data quality analysis and monitoring, including:
  - Supporting business-driven rule definition and organization
  - Applying rules and reusing them consistently across data sources
  - Leveraging multi-level rule analysis to understand broader data quality issues
  - Evaluating rules against defined benchmarks and thresholds
  - Assessing and annotating data quality results
  - Monitoring trends in data quality over time
  - Deploying rules across environments
  - Running ad hoc, scheduled, or command-line execution options

To get the most benefit from the analytical functions, you should be familiar with InfoSphere Information Analyzer, as described in the IBM InfoSphere Information Analyzer User's Guide.

Product overview
IBM InfoSphere Information Analyzer facilitates the analysis of data for knowledge acquisition and data quality management purposes. You can use InfoSphere Information Analyzer to:

- Import metadata from various data environments
- Configure system analysis options
- Create virtual columns from physical columns
- Analyze column data classification
- Analyze column data properties
- Analyze column data completeness and validity
- Analyze column data formats
- Analyze data value commonality across columns
- Analyze table primary keys
- Analyze duplicate primary key values
- Analyze table foreign keys
- Analyze referential integrity
- Create analytical notes to supplement system results
- Capture enterprise data management (EDM) data to supplement system results
- Produce system reports to display system results

Business case
Organizations require a detailed knowledge and understanding of the strengths and weaknesses of their data and its inherent quality. Their ability to gain this knowledge and to apply it to their various data-related initiatives can directly affect the cost and benefits of those initiatives. In many well-publicized cases, strategic data-related projects have either exceeded planned cost and schedule while delivering less than the expected return, or failed completely because of data quality defects that were either underestimated or not known until the implementation stage of the project. In these situations, IBM InfoSphere Information Analyzer can be used to conduct critical data quality assessments at the start of a project to identify and measure existing data defects. By performing this assessment early, the organization can take any necessary corrective action on the data, or work around data problems that cannot be corrected. Further, InfoSphere Information Analyzer can be used to assess and measure data quality throughout the project life cycle: it allows developers to test whether their code or jobs deliver correct and expected results, assists quality assurance teams in verifying functional and system accuracy, and allows business users to gauge the success of system load processes.

Analysis as a business management practice


Organizations need a business management practice that helps to leverage information that exists across multiple systems and also assures quality. User organizations often state that they need to do a better job leveraging information. This problem manifests itself in many ways: sometimes it is information complexity, or a deluge of information. The primary issue is that a great deal of valuable information is locked away in various databases and systems throughout the business, but the organization has no easy way to use this information to improve the business, to compete more effectively, or to innovate. For example, retail companies are often unable to use demand signals from their stores effectively to drive their supply chains. Across all industries, it is common to find that organizations are not using customer analysis to tailor their marketing and sales activities. In other cases, entire classes of information are being ignored, like free-form text fields, simply because they are too difficult and expensive to deal with. Another information issue for many organizations is that they have multiple versions of the truth across their systems. This prevents them from being able to completely understand their customers and tailor their interactions accordingly. It leads to supply chain collaboration problems, because suppliers and customers have differing concepts and definitions of products. It also causes difficulties when trying to comply with information-centric regulations like Sarbanes-Oxley or Basel II, which require definitive information with associated proof.


Many organizations have information issues surrounding trust and control of their data. Organizations do not have trust in their information because the quality cannot be assured, and the source of the information is often uncertain. At the same time, companies want to control who has access to information, understand how it is being used, and govern sensitive information throughout its life cycle. And lastly, organizations encounter strategic obstacles when information inflexibility inhibits their ability to respond quickly to change. They are unable to take advantage of new opportunities for innovation, and their costs of maintaining IT systems continuously escalate as the business demands changes from systems that were not built for change. Data analysis is about addressing these issues through a consistent practice, and subsequently reducing project costs and risk by discovering problems early in the data integration life cycle. In many legacy systems and enterprise applications, metadata, field usage, and general knowledge have changed over time. The data might be perfectly acceptable for whatever purpose it was designed for, but it is often not until you load it into another application that you discover how inappropriate it is for what you want to do. Issues that are found include: different or inconsistent standards, missing data or default values, spelling errors, data in wrong fields, buried information, and data anomalies. The following figure describes the different types of data found in legacy and enterprise systems.
[Figure content: the same business entities are represented differently across Legacy, Finance, CRM, and ERP systems (accounts, products, locations, contacts, households, vendors, and materials). The figure distinguishes: data values that uniquely describe a business entity and are used to tell one from another (customer name, address, date of birth); identifiers assigned to each unique instance of a business entity; relationships between business entities (two customers "householded" together at the same location); and hierarchies among business entities (a parent company owns other companies, different charts of accounts across operations).]

Figure 1. Data in legacy and enterprise systems

Companies today are continually moving toward greater integration that is driven by corporate acquisitions and customer-vendor linkages. As companies try to become more customer-centric, management realizes that data must be treated as a corporate asset and not a division or business unit tool. Unfortunately, many sources will not be in the right form or have the right metadata or even documentation to allow a quick integration for other uses. Most enterprises are running distinct sales, services, marketing, manufacturing and financial applications, each with its own master reference data. There is no one system that is the universally agreed-to system of record. In data integration
efforts, old data must be re-purposed for new systems. Enterprise application vendors do not guarantee a complete and accurate integrated view; they point to their dependence on the quality of the raw input data. However, it is not necessarily a data entry problem, but rather an issue of data integration, standardization, harmonization, and reconciliation. This is not an issue to address after implementation, but at the beginning and then throughout the life cycle, to avoid untimely and expensive fixes later.

Project applications
Although there are many project contexts for using IBM InfoSphere Information Analyzer, they generally tend to fall into one of three categories:

- Data integration projects
- Operational improvement projects
- Enterprise data management projects

Data integration projects


Projects that evaluate the quality of legacy data environments as sources for the creation of new data environments are considered data integration projects. In these projects, the new data environment might or might not replace the legacy source data environment. Common examples of this type of project include consolidation of disparate data sources, implementation of new systems (such as SAP conversions), or implementation of corporate data warehouses. For these projects, it is critical to know the structural integrity, completeness, and validity of the source data as a prerequisite for developing the data integration system itself. IBM InfoSphere Information Analyzer satisfies that need by revealing the full extent of any defects in the data before the data integration system is specified and developed. The acquired insight about the data is shared with ETL developers who can then act on its findings as they build the data integration system. This helps to eliminate any surprises that would otherwise occur during data integration system testing or implementation by doing it right the first time.

Operational improvement projects


Projects that focus on existing data environments and their effect on business operations are considered operational improvement projects. These projects are often part of a larger corporate effort that reviews and improves core business processes. These projects commonly perform a data quality assessment to identify business system problems or areas of opportunity for business process improvement (such as a Six Sigma effort). They might result in one-time data cleanup initiatives or business system enhancements that eliminate the root cause of the data quality defects. Their goal is to achieve a total improvement in the cost and quality of the related business processes. IBM InfoSphere Information Analyzer satisfies that need by measuring the scope of each problem and by identifying individual data records that do not meet the expected end-state of data from the business processes. This information is then shared with analysts and developers who can then act on InfoSphere Information Analyzer findings with a data cleansing tool, such as InfoSphere QualityStage. The
information is also used to research and trace the root causes of the data quality defects back into legacy systems or existing business process procedures, which can then be modified. A specialized case of this project type is an asset rationalization project focusing on the reduction of storage and CPU costs due to processing of extraneous, redundant, or poorly formatted data. In this instance, the process improvement is usually focused on the properties of the data.

Enterprise data management projects


Organizations that proactively manage their corporate data by treating information as a corporate asset typically conduct enterprise data management projects. These projects are used over time to gain control of and to continuously manage individual data environments throughout the organization. Often these organizations achieve this goal by implementing a centralized data management organization that is supported by a distributed network of data stewards in the business groups. These projects require a baseline evaluation of a data environment that can accurately distinguish defective data from defect-free data. After a baseline is established, data stewards are responsible for attaining incremental improvement in the data over time. A reusable set of analysis capabilities must be applied to the data over time to support trend analysis and quality certification of the data for corporate use. IBM InfoSphere Information Analyzer satisfies that need by providing a comprehensive evaluation of the data from all defect perspectives. The evaluation results can be used to highlight critical issues in the data on which data stewards can focus. InfoSphere Information Analyzer also supports the follow-on activities of drill-down research and trend analysis as the data stewards pursue their quality objectives.

Analysis methodology
The analysis methodology information is organized by analytical function. It gives you both in-depth knowledge and best practices for:

- Applying data analysis system functionality
- Applying internal data analysis techniques within a function
- Interpreting data analysis results
- Making decisions or taking actions based on analytical results

To get the most benefit from the analytical functions, you should be familiar with InfoSphere Information Analyzer, as described in the IBM InfoSphere Information Analyzer User's Guide.

Planning for analysis


As with other project methodologies, the foundation for an analysis methodology is planning. In any project that requires data analysis, the analysis needs to stay focused on the goals and objectives of that project. With IBM InfoSphere Information Analyzer, it is easy to profile a broad range of data sources and to analyze in-depth a wide
variety of data. Without appropriate project scoping, though, you can spend a lot of time analyzing data that is not relevant or needed for your project. The planning process for analysis focuses on the following steps:

1. Define the profiling and analysis requirements.
2. Confirm what data is relevant to the project goals.
3. Define an execution plan.
4. Identify further exploration activities.
5. Iterate and revise the plan as necessary.

All projects have a set of goals and objectives. As an analyst, you need to align the profiling and analysis to those goals. If you do not understand the goals of the broader project, it will be difficult to identify those conditions that are important and anomalous. As a starting point in planning, take the core project goals and ask what data is necessary to support those goals, what conditions are expected of the data (if known), and what systems, databases, or files contain the data.

There will be finite time and resources in which to conduct a data analysis. Assess what data is either most critical to the broader project or requires a level of review. Also identify whether attention should be focused strictly on core systems or tables, on specific types or classes of data, or on specific attributes or domains of data. There might be requirements to establish cross-source consistency or confirm key relationships. These form the scope of the analysis effort and confirm the relevance of the data to the project goals.

After you have confirmed the relevant data, define an execution plan for the analysis. This could be a rigorous matrix of tasks or a basic checklist, depending on the scope of work. The execution plan should ensure that:

- The data to analyze is available and accessible within the InfoSphere Information Analyzer project
- Resources are identified and tasked with conducting the data profiling and analysis
- You understand the deliverables from the analysis (such as annotations in InfoSphere Information Analyzer or specific reports to generate)

Annotations should be consistent and clear so that information is communicated effectively and re-work is not required. It is critical to understand that data profiling itself is not magic. InfoSphere Information Analyzer will surface results and statistics about the data, but it is still the task of the data analyst to review those results and statistics and make conclusions, in the context of the project's goals and objectives, about the quality and corrective action necessary for the data.

By identifying the broad project scope and establishing criteria for relevance, analysts can focus their attention on what is important to the project. Data sources that are irrelevant or extraneous should be weeded out (and annotated as such for further reference) so that analysis can continue on the core areas of attention. As analysis continues across the project (or other projects), the knowledge repository about the data grows and provides information for other analysts to take advantage of.


The subsequent sections focus on the specific approaches for analyzing data, particularly on how to leverage the analytical results and statistics from InfoSphere Information Analyzer to address the analytical needs within your project.

Source data for analysis overview


For a given analytical project, you need one or more data sources for analysis. Based on your project objectives and expected goals, you should have a good understanding of what data sources are needed. However, it is likely that the level of knowledge across different sources will range from very high for active, operational databases to very low for sources such as out-sourced vendor packages or external data. The level of anticipated knowledge for a given system might impact the approach for analysis. A data source for IBM InfoSphere Information Analyzer might be a production database, an extract or replica of such a database, an external flat file, and so forth. It is assumed that issues of availability, access, and security have been addressed and that connectivity to the data source has been established as a precursor to analysis. To learn more about connecting to a data source or importing metadata for a data source, see the IBM InfoSphere Information Analyzer User's Guide.

Data sources
An IBM InfoSphere Information Analyzer project can register interest in any data source whose metadata was previously imported into the metadata repository. This action enables InfoSphere Information Analyzer to perform any of its analytical functions against that data source or its analytical results. Analytical results from the data source include frequency distributions of the distinct values within a column, or data samples, which are extracts of actual rows of data from a table. Once a data source's metadata is imported into the system, the data source is defined and viewed as a collection of tables and columns whether or not they are relational in their native state.

Data source subsets


During column analysis, IBM InfoSphere Information Analyzer typically uses the full data source to create column frequency distributions. You can optionally limit the source data used during column analysis by applying a where clause to the analysis. The where clause contains user-defined logic that qualifies only selected data records from the source table to be included in the construction of the column frequency distributions for that table. This feature is helpful if the data sources are extremely large or if the focus of the project is on a logical subset of the data source (such as a specific date range or account type).
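
To make the idea concrete, the following sketch shows a where-clause-style qualification applied before profiling. It is illustrative only: the table, columns, and clause are hypothetical, and the SQL is ordinary SQLite rather than anything specific to InfoSphere Information Analyzer.

    # Illustrative only: restricting analysis input with a where-clause-style filter.
    # Table and column names are hypothetical; do not build SQL this way with untrusted input.
    import sqlite3

    where_clause = "order_date >= '2010-01-01' AND account_type = 'RETAIL'"

    def read_subset(connection, table, where_clause):
        """Return only the rows that satisfy the user-defined qualification."""
        sql = f"SELECT * FROM {table} WHERE {where_clause}"
        return connection.execute(sql).fetchall()

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_date TEXT, account_type TEXT, amount REAL)")
    conn.execute("INSERT INTO orders VALUES ('2009-06-30', 'RETAIL', 10.0), ('2010-02-01', 'RETAIL', 25.0)")
    rows = read_subset(conn, "orders", where_clause)   # only the 2010 retail row qualifies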

Data samples
Certain IBM InfoSphere Information Analyzer analysis functions can be initially run against a data sample of a table for performance or screening purposes. These functions include:

- Column analysis
- Primary key analysis (multicolumn)

An analysis of a data sample is typically used to narrow the likely candidates (for example, multicolumn primary key analysis) in an efficient manner. After running the analysis of the data sample, you can run a full data source analysis for only the likely candidates. Each table in a data source can have one data sample in an InfoSphere Information Analyzer project. The user controls the creation of each data sample and can replace an existing data sample with a new data sample whenever desired. Data samples typically are created with between 2,000 and 50,000 rows. You can specify the data sample size (for example, row count) and select a sampling technique for the system to use in creating the sample. The techniques include:

- Random (randomized selection of rows)
- Sequential (first n rows)
- Nth (a sequential technique that uses every nth row, such as the 100th, 200th, 300th rows, and so on)

Effective use of data samples, particularly for very large data sources, can streamline the analysis tasks by significantly reducing job run times and expediting user decision making. Keep in mind that the use of a data sample might affect the ability to completely assess foreign key and cross-source relationships as common data might not have been included in a data sample.
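
As a rough illustration of the three techniques, the following sketch draws a random, sequential, or nth sample from an in-memory list of rows. The function name, parameters, and defaults are hypothetical and are not the product's implementation.

    # Simplified sketch of the three sampling techniques (not the product's implementation).
    import random

    def sample_rows(rows, size, technique="random", nth=100, seed=None):
        if technique == "random":                      # randomized selection of rows
            rng = random.Random(seed)
            return rng.sample(rows, min(size, len(rows)))
        if technique == "sequential":                  # first n rows
            return rows[:size]
        if technique == "nth":                         # every nth row (100th, 200th, ...)
            return rows[nth - 1::nth][:size]
        raise ValueError("unknown sampling technique")

    data = list(range(1, 100001))                      # stand-in for table rows
    sample = sample_rows(data, size=2000, technique="nth", nth=100)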

Frequency distributions
Typically, the first function applied to a new data source registered to IBM InfoSphere Information Analyzer is column analysis. When that function is performed, the system develops a frequency distribution of the distinct data values in each column based on the source data (for example, tables or columns) that were selected. Each table is read in one pass and each of its columns has a frequency distribution developed concurrently with the other columns. The frequency results are stored in the InfoSphere Information Analyzer database. Each row in the newly created frequency distribution for the column contains the following information:

- Distinct data value
- Frequency count
- Frequency percent

The remainder of the column analysis function analyzes each column's frequency distribution data values, and appends the following information to each row:

- Data type
- Length
- Precision
- Scale
- Validity flag
- Format
- Source (column analysis extraction or manually entered)
- Type (data, null, spaces, or zero)
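
A minimal sketch of the core computation, assuming the column values are already in memory: it produces the distinct value, frequency count, and frequency percent for each entry. The appended inferences (data type, length, and so on) are omitted here.

    # Minimal frequency-distribution sketch: distinct value, count, and percent per column.
    from collections import Counter

    def frequency_distribution(values):
        counts = Counter(values)                       # one pass over the column's values
        total = len(values)
        return [(value, count, 100.0 * count / total)  # (distinct value, frequency count, frequency percent)
                for value, count in counts.most_common()]

    column = ["A", "B", "A", None, "A"]
    for value, count, percent in frequency_distribution(column):
        print(value, count, round(percent, 1))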

Performance considerations
IBM InfoSphere Information Analyzer is optimized to use source data in the most efficient manner possible to achieve the various functional analysis results.


Users can further increase this optimization by utilizing data subsets and data samples.

Column analysis overview


Column analysis is the component of IBM InfoSphere Information Analyzer used to assess individual columns of data. You control the scope of data subjected to column analysis at one time by selecting the database, tables, and columns to be analyzed. The system initiates the process by accessing the data source based on the user-selected data and constructing a frequency distribution for each column. The frequency distribution contains an entry for each distinct data value in a column. The system then analyzes the distinct data values in each frequency distribution to develop some general observations about each column. The remainder of the column analysis process is driven by user review of the column analysis system data. That process consists of any or all of three parts:

Data classification analysis
    Data classification analysis allows you to segregate and organize columns categorically. Such organization can facilitate further review by focusing on core considerations (for example, numeric columns typically fall into a particular valid range).

Column properties analysis
    Column properties analysis allows you to assess the data contents against the defined metadata, validating the integrity of the metadata for use in other systems or identifying columns that are unused or are poorly defined.

Data quality controls analysis
    Data quality controls analysis allows you to assess the data contents for basic, atomic conditions of integrity such as completeness and validity. These are fundamental assessments of data quality, providing the foundation for assertions of trust or confidence in the data.

Analysis functions and techniques


The IBM InfoSphere Information Analyzer analysis functions and techniques are intended to guide you in the proper application of those analysis capabilities and the interpretation of results in your projects. Each of the InfoSphere Information Analyzer analysis functions is explained in terms of:

- The function's purpose and description
- The function's underlying analysis technique
- The function's system-provided capability
- Any system performance considerations related to using the function
- The user's responsibility
- How to interpret the function's results
- What decisions need to be made to complete the function
- What follow-up actions can be taken for the function


Data classification analysis overview


Data classification analysis allows you to segregate and organize columns categorically. Such organization can facilitate further review by focusing on core considerations (for example, numeric columns typically fall into a particular valid range).

Data classification analysis: The data classification analysis function is the process of assigning columns into meaningful categories that can be used to organize and focus subsequent analysis work.

Function The following attributes in IBM InfoSphere Information Analyzer can be used for data classification:

- Data Class (system-inferred): a system-defined semantic business use category for the column
- Data Sub-Class (optional): a user-defined semantic business use category within a data class
- User Class (optional): a user-defined category independent of data class

For the system-inferred data class, a column is categorized into one of the following system-defined data classification designations:

IDENTIFIER
    Columns that contain generally non-intelligent data values that reference a specific entity type (for example, a customer number).
CODE
    Columns that contain finite data values from a specific domain set, each of which has a specific meaning (for example, a product status code).
INDICATOR
    Similar to a code except there are only two permissible binary values in the domain set (for example, a yes/no indicator).
DATE
    Columns that contain data values that are a specific date, time, or duration (for example, a product order date).
QUANTITY
    Columns that contain numerical data that could be used in a computation (for example, a product price).
TEXT
    Columns that contain freeform alphanumeric data values from an unlimited domain set (for example, a product description).
LARGE OBJECT
    Columns that contain large-object data (for example, a product image).
UNKNOWN
    Columns that cannot be classified into one of the above classes by the system algorithm.

There are multiple system reports that can display the columns and their data class designation.


Technique The initial data classification designation for a column is inferred by the system during column analysis processing after the column's frequency distribution has been created. The system uses an algorithm that factors in the cardinality, data type, uniqueness, and length of the column's data values to infer its likely data class. Under certain conditions, the algorithm might not be able to produce a specific data class designation, in which case the system assigns a designation of UNKNOWN. During column analysis review, the system-inferred data classification of the column for review is presented. You can accept that inference as true or override that inference based on your knowledge or research of the column and its data. System capability The system automatically applies the data classification algorithm to each column whenever it performs column analysis processing. Using a system algorithm, InfoSphere Information Analyzer analyzes key information about the column to derive the most probable data class that the column belongs to. This system-inferred data classification designation is then recorded in the repository as the inferred selection and by default becomes the chosen selection, as shown in the following example.

Figure 2. An example of inferred data classification designation
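
The product's classification algorithm is not published; the following sketch is only a rough, hypothetical heuristic in the same spirit, using cardinality, uniqueness, and data type to suggest a class (the actual algorithm also factors in length). The thresholds and rules are assumptions for illustration.

    # Rough, hypothetical classification heuristic (the product's actual algorithm is not published).
    def suggest_data_class(values, data_type):
        non_null = [v for v in values if v is not None]
        if not non_null:
            return "UNKNOWN"                           # nothing but nulls or blanks
        cardinality = len(set(non_null))
        uniqueness = cardinality / len(non_null)
        if data_type in ("date", "time", "timestamp"):
            return "DATE"
        if cardinality == 2:
            return "INDICATOR"                         # two permissible values, e.g. Y/N
        if uniqueness > 0.95:
            return "IDENTIFIER"                        # mostly unique reference values
        if data_type in ("integer", "decimal", "float") and cardinality > 50:
            return "QUANTITY"                          # numeric data usable in computation
        if cardinality <= 50:
            return "CODE"                              # small, finite domain set
        return "TEXT"                                  # freeform values from an unlimited domain

    print(suggest_data_class(["Y", "N", "Y", "Y"], "string"))   # INDICATOR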

User responsibility You view the system-inferred data classification designation during the column analysis review of columns. At the detailed column view, data classification has its own tab for viewing results. On this panel, you can accept the system-inferred data class or override the system's inference by selecting another data classification designation. If you override the system-inferred data class selection, the new selection is recorded in the repository as the chosen selection. The process is completed when you mark the data classification function for the column as reviewed.

Interpreting results Typically, you might know each column by its name or alias, so the decision to accept or override the system-inferred data classification is straightforward. However, when there is little or no familiarity with the column, a close inspection of the frequency distribution values and the defined data type will help you either confirm the system-inferred data class or choose another, more appropriate data class. For example, a common situation occurs when the column is a numeric data type and the choice is between the CODE and QUANTITY data classes. In general, CODE tends to have a lower cardinality with a higher frequency distribution count per data value than QUANTITY does, but there can be exceptions (for example, a Quantity Ordered field).

Frequently a column that is obviously an INDICATOR (the column name often includes the word or string 'Flag') is inferred as a CODE because of the presence of a third or fourth value in the frequency distribution (for example, Y, N, and null). In those cases, it is recommended that the column's data class be set to INDICATOR and that the extra values be marked as Incomplete or Invalid from the Domain Analysis screen, or even corrected or eliminated in the source (for example, convert the nulls to N). Also, sometimes choosing between CODE and IDENTIFIER can be a challenge. In general, if the column has a high percentage of unique values (for example, frequency distribution count equal to 1), it is more likely to be an IDENTIFIER data class.

Another data classification issue is the need to maintain the consistency of data classification designations across columns in the data environment. A common problem is a column (for example, customer number) that is the primary key of a table and thus has all the characteristics of an IDENTIFIER. However, the same column might also appear in another table as a foreign key and have all the characteristics of a CODE, even though it is in essence still an IDENTIFIER column. This is particularly true where the set of valid codes exists in a Reference Table and the codes are simultaneously the values and the identifiers.

With regard to data marked with an UNKNOWN data classification, the most common condition is that the values are null, blank, or spaces, and no other inference could be made. In this situation, the column is most likely not used and the field should be marked accordingly with a note or with a user-defined classification (such as EXCLUDED or NOT USED). In some scenarios, this might represent an opportunity to remove extraneous columns from the database, while in data integration efforts these are fields to ignore when loading target systems.

Decisions and actions You have only one decision to make for each column in data classification. You either accept the system-inferred data class or override the inferred data class by selecting another. After that decision has been made, you can mark the column reviewed for data classification; or, you can add the optional data subclass and user class designations prior to marking the review status complete.


However, data classification provides a natural organizing schema for subsequent analysis, particularly for data quality controls analysis. The following table indicates common evaluation and analysis necessary based on each system-inferred data class. You can establish particular criteria for analysis based on user-defined data classes.
Table 1. Data classification analysis considerations and actions

Identifier
    Properties analysis: Evaluate for Data Type, Length, Nulls, Unique Cardinality
    Domain analysis: Confirm maximum and minimum values
    Validity and format analysis: Look for out-of-range conditions (if applicable); validate format (if text); mark inconsistent or invalid formats (if text, if applicable)

Indicator
    Properties analysis: Evaluate for Length, Nulls, Constant Cardinality
    Domain analysis: Confirm valid values; assess skewing and default values
    Validity and format analysis: Mark invalid values

Code
    Properties analysis: Evaluate for Length, Nulls, Constant Cardinality
    Domain analysis: Confirm valid values; assess skewing and default values
    Validity and format analysis: Mark invalid values

Date/Time
    Properties analysis: Evaluate for Data Type, Nulls, Constant Cardinality
    Domain analysis: Confirm valid values; assess skewing and default values
    Validity and format analysis: Look for out-of-range conditions (if applicable); mark inconsistent or invalid formats (if text or number)

Quantity
    Properties analysis: Evaluate for Data Type, Precision, Scale
    Domain analysis: Confirm valid values; assess skewing and default values
    Validity and format analysis: Look for out-of-range conditions (if applicable); mark inconsistent or invalid formats (if text or number)

Text
    Properties analysis: Evaluate for Data Type, Length, Nulls, Unique, or Constant Cardinality
    Domain analysis: Evaluate for default values, format requirements, and special characters
    Validity and format analysis: Mark invalid special characters; mark invalid format (if applicable)

Performance considerations There are no system performance considerations for the data classification function.

Column properties analysis overview


Column properties analysis allows you to assess the data contents against the defined metadata, validating the integrity of the metadata for use in other systems or identifying columns that are unused or are poorly defined. The column properties analysis is a component of column analysis that determines the optimal technical metadata properties of a column based on the actual data
values in a column at that point in time. These system-generated properties are referred to as system inferences. The system then compares these inferred metadata properties with the defined metadata properties obtained during metadata import, and highlights where they are different for your review and decision. The basic column properties analyzed include:

- Data type
- Length
- Precision (where applicable)
- Scale (where applicable)
- Nullability
- Cardinality

Important: The system automatically performs the system inferences for column analysis on the underlying assumption that all of the data values in the column's frequency distribution are complete and valid. In the completeness and domain phase of column analysis, you can flag certain values as being incomplete or invalid. You can request that the system re-infers the column, which it will do while ignoring any frequency distribution data values that have been flagged as incomplete or as invalid. This will often result in different inference results for the column without the effects of incomplete and invalid data and should produce more accurate property inferences. Data type analysis: Data type analysis is used to refine the existing data type metadata definition for a column based on the actual data values that are present in the column. Function Data type analysis is useful if the original column data type was set without knowledge or regard to the actual data values that the column would contain (for example, Int32 versus Int8). If a new data type for a column is determined from analysis, the existing metadata for the column can be changed in the original data source, or can be used to define the column in a new target schema for the data. Of special note in data type analysis is that when the original metadata for the data source is imported, both the native data type and its equivalent system internal data type are recorded in the repository. The data type analysis described here is performed by using the system's internal data type. The results from data type analysis, which are in the system's internal data type form, can then be translated into a native data type of the user's choice. Technique Each data value in a column's frequency distribution is analyzed to infer the optimum data type that can be used for storing that individual data value. Then, all of the individual data type inferences for the column are summarized by data type to develop a frequency distribution of inferred data types for the column. A system heuristic is then used to determine which one of the inferred data types in that frequency distribution could be used to store all of the column's data values.
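
The following sketch illustrates the general summarize-then-choose technique under simplified assumptions: each value gets the narrowest type that can hold it, the per-value inferences are tallied, and the widest type in the tally is chosen so that every value fits. The type names, ordering, and rules are hypothetical and are not the product's internal types or heuristic.

    # Simplified data type inference: per-value types, summarized, then one type that covers all values.
    from collections import Counter

    ORDER = ["int8", "int32", "decimal", "string"]     # hypothetical narrow-to-wide ordering

    def infer_value_type(value):
        text = str(value)
        try:
            number = int(text)
            return "int8" if -128 <= number <= 127 else "int32"
        except ValueError:
            pass
        try:
            float(text)
            return "decimal"
        except ValueError:
            return "string"

    def infer_column_type(values):
        summary = Counter(infer_value_type(v) for v in values)   # frequency of inferred types
        return max(summary, key=ORDER.index), summary            # widest inferred type holds every value

    chosen, summary = infer_column_type(["12", "130", "7"])      # chosen == "int32"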


System capability During column analysis processing, the system constructs each column's frequency distribution and then analyzes each distinct value to determine which of the available internal data types is optimal for storing that particular data value. After every individual data value has been analyzed, the system summarizes the individual results to create a frequency distribution by inferred data type for that column. If there are multiple inferred data types in the frequency distribution, a system heuristic determines which one of the inferred data types in that frequency distribution could be used to store all of the column's data values. This system-inferred data type is then recorded in the repository as the inferred selection and is also defaulted at this time to be the chosen selection, as shown in the following example.

Figure 3. An example of system-inferred data type

User responsibility You view the data type analysis when the column analysis review of columns is viewed. At the detailed column view, data type analysis has its own panel for viewing results as part of the properties analysis tab. From this panel, you can accept the system-inferred data type (internal) or can use a drop-down list to override the system's inference with another data type (internal). If you override the system-inferred data type selection, the new selection is recorded in the repository as the chosen data type. The process is ultimately completed when you review all of the column properties and mark the column property function as reviewed. Interpreting results Typically, a column's optimal data type will be obvious to you by a quick view of the data type analysis summary. Many columns will result in only a single inferred data type for all of its data values. In that case, unless you are aware of some future data values outside of the capabilities of the inferred data type, you should accept the system's inference. However, when there are multiple inferred data types, you should take note of the frequency count for the selected inferred data type. If that frequency count is low relative to the row count of the table, it might be that some invalid data value or
values are causing the wrong data type to be inferred. (A drill-down from the inferred data type in the summary will show what data values require that data type.) If that is the case, you can either override the inferred data type property or flag those data values as invalid and re-infer the data values. Another common situation that appears in data type analysis is when a CODE column that is defined as a string has a numeric-only domain set. In this case, the system will infer a change to a numeric data type. If your policy is that columns that cannot be used in a computation should not have a numeric data type, you should consider leaving the column as a string. Like other column properties, there is an advantage to maintaining the consistency of data type assignments across columns in the data environment. Decisions and actions You have only one decision to make for each column for data type. Either accept the system-inferred data type or override the inferred data type by selecting another. After that decision is made, you can continue to review the other column properties or mark the column properties review as complete. Performance considerations There are no system performance considerations for the data type analysis function. Length analysis: Length analysis is used to refine the existing length metadata definition for selective columns, such as data type string columns, based on the actual data values that are present in the column. Function Length analysis is useful if the original column length was set without knowledge or regard to the actual data values that the column would contain (for example, VarChar 255). If a different length for a column is determined from analysis, the existing metadata for the column can be changed in the original data source, or can be used to define the column in a new target schema for the data. Technique Each data value in a column's frequency distribution is analyzed to infer the length required for storing that individual data value. Then, all of the individual length inferences for the column are summarized by length to develop a frequency distribution of inferred lengths for the column. The system determines the longest inferred length in that frequency distribution that can store all of the column's data values. System capability During column analysis processing, the system constructs each column's frequency distribution and then analyzes each distinct value to determine what length must be used for storing that particular data value. After every individual data value has been analyzed, the system summarizes the individual results to create a
frequency distribution by inferred lengths for that column. The system uses the longest length as the inferred length for the column since it can hold all of the existing data values. This system-inferred length is then recorded in the repository as the inferred selection and becomes by default the chosen selection, as shown in the following example.

Figure 4. An example of the system-inferred length that is recorded in the repository as the inferred selection
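
A minimal sketch of the length inference described above: infer a length for each value, summarize, and keep the longest, since it can store every existing value. Precision and scale analysis follow the same summarize-and-take-the-maximum pattern. The function is illustrative only.

    # Minimal length inference: the longest per-value length can store every existing value.
    from collections import Counter

    def infer_length(values):
        lengths = Counter(len(str(v)) for v in values if v is not None)
        return max(lengths), lengths                   # (inferred length, frequency of inferred lengths)

    inferred, distribution = infer_length(["CA", "NY", "TEXAS"])   # inferred == 5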

User responsibility You view the length analysis, if applicable, when the column analysis review of columns is viewed. At the detailed column view, length analysis has its own panel for viewing results as part of the Properties Analysis tab. From this panel, you can accept the system-inferred length or can use a drop-down list to override the system's inference with another length. If you override the system-inferred length selection, the new selection is recorded in the repository as the chosen length. The process is ultimately completed when you review all of the column properties and mark the column property function as reviewed. Interpreting results A column's required length is provided in the length analysis summary. Many columns will result in only a single inferred length for all of its data values. In that case, unless you are aware of some future data values outside of the capabilities of the inferred length, you should accept the system's inference. However, when there are multiple inferred lengths, you should take note of the frequency count for the selected inferred length. If that frequency count is low relative to the row count of the table, some invalid data values might be causing an excessive length to be inferred. (A drill-down from the inferred length in the summary will show what data values require that length.) If that is the case, you can either override the inferred length property or flag those data values as invalid and ask the system to re-infer.

Another common situation that appears in length analysis is when a variable length string column is defined with a length of 128 or 255. In this case, the system will infer a change to a length based on the data value with the most characters. If so, it might be a database administration design issue whether to change the column's metadata definition to the inferred length or leave it at the standard of 128 or 255. Like other column properties, there is an advantage in maintaining the consistency of length assignments across columns in the data environment. Decisions and actions You have only one decision to make for each applicable column for length. You can either accept the system-inferred length or override the inferred length by selecting another. After you make this decision, you can continue to review the other column properties or can mark the column properties review as complete. Performance considerations There are no significant system performance considerations for the length analysis function. Precision analysis: Precision analysis is used to refine the existing precision metadata definition for selective columns (for example, data type numeric columns) based on the actual data values that are present in the column. Function Precision analysis is useful if the original column precision was set without knowledge or regard to the actual data values that the column would contain. If a different precision for a column is determined from analysis, the existing metadata for the column can be changed in the original data source, or can be used to define the column in a new target schema for the data. Technique Each data value in a column's frequency distribution is analyzed to infer the precision required for storing that individual data value. Then, all of the individual precision inferences for the column are summarized by precision length to develop a frequency distribution of inferred precisions for the column. The system determines the longest inferred precision length in that frequency distribution that can store all of the column's data values. System capability During column analysis processing, the system constructs each column's frequency distribution, and then analyzes each distinct value to determine what precision length must be used for storing that particular data value. After each data value has been analyzed, the system summarizes the individual results to create a frequency distribution by inferred precision lengths for that column. The system uses the longest precision length as the inferred precision for the column because it can hold all of the existing data values. This system-inferred precision is then
recorded in the repository as the inferred selection and is also defaulted at this time to be the chosen selection, as shown in the following figure.

Figure 5. An example of inferred precision that is recorded in the repository as the inferred selection

User responsibility You can view the precision analysis, if applicable, when the column analysis review of columns is viewed. At the detailed column view, precision analysis has its own panel for viewing results as part of the properties analysis tab. From this panel, you can accept the system-inferred precision or can use a drop-down list to override the system's inference with another precision length. If you override the system-inferred precision selection, the new selection is recorded in the repository as the chosen precision. The process is ultimately completed when you review all of the column properties and mark the column property function as reviewed. Interpreting results Typically, a column's required precision will be obvious to you by a quick view of the precision analysis summary. Sometimes a column will result in only a single inferred precision for all of its data values. In that case, unless you are aware of some future data values outside of the capabilities of the inferred precision, you should accept the system's inference. However, when there are multiple inferred precisions, you should take note of the frequency count for the selected inferred precision. If that frequency count is low relative to the row count of the table, it might be that some invalid data values are causing an excessive precision length to be inferred. (A drill-down from the inferred precision in the summary will show what data values require that precision length.) If that is the case, you can either override the inferred precision length property or can flag those data values as invalid and ask the system to re-infer. Like other column properties, there is an advantage to maintaining the consistency of precision assignments across columns in the data environment. Decisions and actions You can either accept the system-inferred precision or override the inferred precision by selecting another.

Once that decision has been made, you can continue to review the other column properties or can mark the column properties review as complete. Performance considerations There are no significant system performance considerations for the precision analysis function. Scale analysis: Scale analysis is used to refine the existing scale metadata definition for selective columns (for example, data type decimal columns) based on the actual data values that are present in the column. Function Scale analysis is useful if the original column scale was set without knowledge or regard to the actual data values that the column would contain. After analysis, if a different scale for a column is determined, you can change the existing metadata for the column in the original data source, or define the column in a new target schema for the data. Technique Each data value in a column's frequency distribution is analyzed to infer the scale required for storing that individual data value. Then, all of the individual scale inferences for the column are summarized by scale length to develop a frequency distribution of inferred scales for the column. The system determines the longest inferred scale length in that frequency distribution that can store all of the column's data values.


Figure 6. An example of the system inferred scale that is recorded in the repository as the inferred selection
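
As a hypothetical illustration of precision and scale inference for numeric values, the following sketch counts total digits (precision) and digits to the right of the decimal point (scale) for each value and keeps the largest of each; the parsing rules are simplified assumptions.

    # Hypothetical precision/scale inference: largest digit counts observed across the column's values.
    def infer_precision_and_scale(values):
        precision, scale = 0, 0
        for value in values:
            text = str(value).lstrip("-")
            whole, _, fraction = text.partition(".")
            precision = max(precision, len(whole) + len(fraction))   # total digits
            scale = max(scale, len(fraction))                        # digits after the decimal point
        return precision, scale

    print(infer_precision_and_scale(["123.45", "7.5", "1000"]))      # (5, 2)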

User responsibility You can view the scale analysis, if applicable, when the column analysis review of columns is viewed. At the detailed column view, scale analysis has its own panel for viewing results as part of the properties analysis tab. From this panel, you can accept the system-inferred scale or use a drop-down list to override the system's inference with another scale length. If you override the system-inferred scale selection, the new selection is recorded in the repository as the chosen scale. The process is ultimately completed when you review all of the column properties and mark the column property function as reviewed. Interpreting results Typically, a column's required scale will be obvious to you by a quick view of the scale analysis summary. Sometimes a column will result in only a single inferred scale for all of its data values. In that case, unless you are aware of some future data values outside of the capabilities of the inferred scale, you should accept the system's inference. However, when there are multiple inferred scales, you should take note of the frequency count for the selected inferred scale. If that frequency count is low relative to the row count of the table, it might be that some invalid data values are causing an excessive scale length to be inferred. (A drill-down from the inferred scale in the summary will show what data values require that scale length.) If that is the case, you can either override the inferred scale length property or can flag those data values as invalid and ask the system to re-infer. Like other column properties, there is an advantage to maintaining the consistency of scale assignments across columns in the data environment. Decisions and actions You can either accept the system-inferred scale or override the inferred scale by selecting another. Once that decision is made, you can continue to review the other column properties or can mark the column properties review as complete.

Performance considerations There are no significant system performance considerations for the scale analysis function. Nullability analysis: Nullability analysis is used to refine the existing nulls-allowed indicator metadata definition for columns based on the actual data values that are present in the column. Function Nullability analysis is useful if the original column nulls allowed indicator was set without knowledge or regard to the actual data values that the column would contain. If the column's nulls allowed indicator needs to be changed, the existing metadata for the column can be changed in the original data source, or can be used to define the column in a new target schema for the data. Analysis technique The decision to infer whether the column should allow nulls is based on the percentage of nulls that is present in the column relative to a percentage threshold. System capability During column analysis processing, the system constructs each column's frequency distribution and then analyzes each distinct value to determine if the column contains null values. If the column contains nulls, the frequency percentage of those nulls is compared to the nullability threshold (for example, 1%) applicable for that column. If the actual percentage of nulls is equal to or greater than the nullability threshold percentage, the system infers that nulls should be allowed. If the actual null percentage is less than the nullability threshold percentage or there are no nulls in the column, the system infers that nulls are not allowed. This system-inferred nulls allowed indicator is then recorded in the repository as the inferred selection and becomes by default the chosen selection, as shown in the following example.

Figure 7. An example of the system-inferred nulls allowed indicator
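The threshold comparison described above can be pictured with a short Python sketch. This is for illustration only and is not product code; the 1% threshold mirrors the example in the text, and the function name and sample frequency distribution are assumptions.

# Illustrative sketch only: inferring the nulls-allowed indicator from a
# column's frequency distribution. The 1% threshold is an assumed setting.
def infer_nulls_allowed(frequency, total_rows, null_threshold_pct=1.0):
    """frequency maps each distinct value (None for null) to its row count."""
    null_count = frequency.get(None, 0)
    null_pct = 100.0 * null_count / total_rows
    return null_pct >= null_threshold_pct    # True means "infer nulls allowed"

freq = {None: 12, "A": 500, "B": 488}        # hypothetical frequency distribution
print(infer_nulls_allowed(freq, total_rows=1000))    # 1.2% >= 1% -> True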

User responsibility You can view the nulls analysis when the column analysis review of columns is viewed. At the detailed column view, nullability analysis has its own panel for viewing results as part of the properties analysis tab. From this panel, you can accept the system-inferred nulls allowed indicator or can override the system's inference with the opposite choice. If you override the system-inferred nulls allowed indicator selection, the new selection is recorded in the repository as the
chosen nulls allowed indicator. The process is ultimately completed when you review all of the column properties and mark the column property function as reviewed. Interpreting results Typically, the appropriate nulls allowed indicator setting for a column will be obvious to a user by the presence of nulls in the column. For example, there can only be existing nulls in the column if the column currently is set to allow nulls. To change the column to nulls not allowed will require that the current null values be replaced by some other data value. However, if there are no existing nulls in the column, it might be because no nulls have been entered or because the column is set to not allow nulls. In this case, you are free to set the nulls allowed indicator as you wish. Like other column properties, there is an advantage to maintain the consistency of allowing nulls across columns in the data environment. Decisions and actions You can either accept the system-inferred nulls allowed indicator or override the inference with the other choice. Once that decision has been made, you can continue to review the other column properties or can mark the column properties review as complete. Performance considerations There are no significant system performance considerations for the nullability analysis function. Cardinality analysis: Cardinality analysis is used to identify particular constraints of uniqueness or constancy that exist within the actual data values. Function This function is useful in identifying and marking potential natural keys (such as Social Security or vehicle ID numbers that do not serve as primary keys) where the data is expected to be completely or highly unique. If the column's data indicates a single constant value or a highly constant value within the data, this might indicate a field that was created with the expectation of serving as a Code or Indicator, but whose usage has not been maintained. Technique The decision to infer whether the column should be constrained for unique or constant condition is based on the percentage of uniquely distinct (or constant) values that are present in the column against defined thresholds.

System capability During column analysis processing, the system constructs each column's frequency distribution, and then analyzes each distinct value to determine both the column's uniqueness as well as the percentage of the most frequent (constant) value. If the column's distinct values are highly unique, the frequency percentage of uniqueness is compared to the uniqueness threshold (for example, 99%) applicable for that column. If the actual percentage of uniquely distinct values is equal to or greater than the uniqueness threshold percentage, the system infers that the cardinality type should be unique. It also assesses the percentage occurrence of the most common value as a constant. If the actual percentage of the common constant value is equal to or greater than the constant threshold percentage, the system infers that the cardinality type should be constant. If the cardinality type is neither unique nor constant, the system infers the column to be not constrained.

Figure 8. An example of a cardinality type that is neither unique nor constant; the system infers the column to be "not constrained"

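As an illustration of the uniqueness and constant threshold tests described above, the following Python sketch classifies a column from its frequency distribution. It is not product code; the thresholds, the treatment of uniqueness as the share of rows whose value occurs exactly once, and the sample data are assumptions for the example.

# Illustrative sketch only: classify a column's cardinality type from its
# frequency distribution. Threshold values are assumed analysis settings.
def infer_cardinality_type(frequency, total_rows,
                           uniqueness_threshold_pct=99.0,
                           constant_threshold_pct=99.0):
    """frequency maps each distinct value to its row count."""
    unique_values = sum(1 for count in frequency.values() if count == 1)
    uniqueness_pct = 100.0 * unique_values / total_rows
    constant_pct = 100.0 * max(frequency.values()) / total_rows
    if uniqueness_pct >= uniqueness_threshold_pct:
        return "unique"
    if constant_pct >= constant_threshold_pct:
        return "constant"
    return "not constrained"

print(infer_cardinality_type({"Y": 950, "N": 50}, 1000))   # not constrained
print(infer_cardinality_type({"Y": 999, "N": 1}, 1000))    # constant (99.9%)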
User responsibility You can view the cardinality type analysis when the column analysis review of columns is viewed. At the detailed column view, cardinality type analysis has its own panel for viewing results as part of the properties analysis tab. From this panel, you can accept the system-inferred constraint for cardinality type or can override the system's inference with the other choices. If you override the system-inferred cardinality type selection, the new selection is recorded in the repository as the chosen cardinality type. The process is ultimately completed when you review all of the column properties and mark the column property function as reviewed. Interpreting results Typically, the appropriate cardinality type setting for a column is indicated by the presence of either highly unique values or a highly constant value in the column. For fields classified as identifiers or text, the expectation is that the cardinality type will be unique. This can be used to record or annotate natural keys in the data source. The instance of a cardinality type of constant might mean that there is highly skewed data that needs to be assessed for accuracy, or that a cardinality type of constant, which is likely to be extraneous and completely defaulted, adds little or no value to the data source.

Decisions and actions You can either accept the system-inferred cardinality type or override the inference with another choice. Once that decision has been made, you can continue to review the other column properties or mark the column properties review as complete. Performance considerations There are no significant system performance considerations for the cardinality type analysis function.

Data quality controls analysis overview


Data quality controls analysis allows you to assess the data contents for basic, atomic conditions of integrity such as completeness and validity. These are fundamental assessments of data quality, providing the foundation for assertions of trust or confidence in the data. The data quality controls analysis is a component of column analysis that determines the suitability of data values found in a column. It focuses on three aspects of the data values in the column by using three analytical functions. Completeness A function used to identify non-significant values (missing data) in a column Domain A function that determines the validity of the data values in a column Format A function that determines the validity of the character pattern in a data value In data quality controls analysis, the system is used mainly to facilitate user analysis by presenting the data analysis results in a meaningful way and by tabulating the statistical results for the analytical functions. You can make the key decisions that establish the appropriate criteria for each analysis function that is used to control the data quality for the column. Completeness analysis: Completeness analysis is used to identify records that have data values that have no significant business meaning for the column. It is important for you to know what percentage of a column has missing data. Technique The approach to completeness analysis is that you review the column's distinct data values to mark any data value where, in your judgment, no significant business meaning for that column is being conveyed. Examples of these kinds of values include nulls, spaces, zeros, empty, or other default values (for example, Not Applicable, 999999999). When identified, you should mark these values as default whenever a significant data value is normally expected, or providing a replacement data value would improve the data quality condition of the column. If, however, any of these data values (for example, NA) is acceptable to the business, the data value would not be marked as default. When all of the column's data values have been reviewed, the total record count and percentage of
all the records whose data values have been marked as default are considered incomplete, while all other records will be assumed to be complete. System capability During Column Analysis processing, the system constructs each column's frequency distribution. During your review of the results, the system will present the distinct data values from the frequency distribution in a grid that includes a validity flag for each distinct data value. All of the validity flags in the frequency distribution will have been set to valid when the frequency distribution was created. As you set validity flags to default for distinct data values, the system keeps a running list of the incomplete data values and a total of the record count and record percentage, as well as the distinct value count and percentage for incomplete data on the user interface screen. These statistical results and flag settings can be saved in the system at any time and ultimately can be recorded in the repository as the completeness results when all of the data values have been inspected and the domain and completeness analysis flag has been marked as reviewed, as shown in the following figure.

Figure 9. An example of the statistical results and flag settings
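The running totals that the system maintains during this review can be pictured with the following Python sketch, which tallies incomplete rows once a reviewer has flagged certain distinct values as default. It is an illustration only, not product code; the function name and the set of default values are assumptions.

# Illustrative sketch only: tally completeness after a reviewer has flagged
# certain distinct values as "default" (missing or non-significant).
def completeness_summary(frequency, default_values):
    """frequency maps each distinct value to its row count."""
    total_rows = sum(frequency.values())
    incomplete_rows = sum(count for value, count in frequency.items()
                          if value in default_values)
    return {
        "incomplete rows": incomplete_rows,
        "incomplete row %": 100.0 * incomplete_rows / total_rows,
        "incomplete distinct values": sum(1 for v in frequency if v in default_values),
    }

freq = {None: 40, " ": 10, "N/A": 25, "NY": 500, "CA": 425}   # hypothetical
print(completeness_summary(freq, default_values={None, " ", "N/A"}))
# {'incomplete rows': 75, 'incomplete row %': 7.5, 'incomplete distinct values': 3}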

User responsibility You can view the completeness analysis when the column analysis review of columns is viewed. At the detailed column view, domain and completeness analysis has its own tab for viewing results. From this tab, you can view a column's frequency distribution to search for default data values. If a perceived default data value is found, you should mark that data value by overriding the validity flag setting with default. This process continues until you are satisfied that all of the incomplete data values have been found and flagged. You can also undo any decision that a data value is default by resetting the validity flag to
another setting. The process is ultimately completed when you review all of the data values and mark the column domain and completeness function as reviewed. Interpreting results Typically, the incomplete data values for a column will be obvious to you. If the column is a data class of CODE or INDICATOR, the number of data values needed to be inspected is limited. However, if the column is a data class of TEXT, QUANTITY, IDENTIFIER, or DATE, there might be many data values to inspect which could be very time-consuming. Two approaches that can produce the right results in less time include looking at the high end and low end of the frequency distribution first by data value and then by frequency count. It might also be useful to pre-establish criteria by which a data value will be marked as default. For example, if the column is defined to allow nulls, null should never be marked as default. Another conflict which can arise is when to mark a data value as default versus invalid. A general rule is that if there is an attempt to enter a significant value, but it is inaccurate (for example, a spelling error), it should be considered an invalid data value. If, however, the data value provides no hint to what the real data value should be, it is considered an incomplete data value. This distinguishes the data values based on the likely manner in which they would be rectified, correction versus research. Like other column properties, there is an advantage to maintain the consistency of how incomplete data values are identified across columns in the data environment. Decisions and actions You have multiple decisions to make for each column for completeness: v Search for and mark data values deemed to have incomplete data values (for example, missing data). v Once all the decisions for a column have been made, you can continue to review the column's domain values for validity or mark the column domain and completeness review as complete. At this point, you can also create a reference table with a list of the incomplete data values for use in other system settings. Performance considerations There are no significant system performance considerations for the completeness analysis function. You should consider the column's data class and its cardinality (for example, total number of distinct data values) when reviewing the analysis. Domain analysis: Domain analysis is used to identify records that have invalid data values in the column. It is important for you to know what percentage of a column has invalid data.

Function Only complete data values are considered when determining invalid data values. Data values that are judged to be incomplete take precedence over data values that are invalid. Domain analysis technique The approach to domain analysis is that you, or the system, review the column's distinct data values to mark any data value considered to be invalid. The system has multiple types of domain analysis (for example, techniques) that can be used to perform the function. The criteria differ for determining valid from invalid data values, but all result in identifying and marking the column's data values as invalid when appropriate. The system assumes which type to use based on the data class of the column in question. However, you can choose to use any of the domain analysis types for any column regardless of its data class. The three domain analysis types are: v Value, where you manually review all data values v Range, where you set minimum and maximum valid data values v Reference File, where the system uses an external validity file System capability During column analysis processing, the system constructs each column's frequency distribution. During your review of the results, the system presents the domain analysis results in one of two ways: Value or Range, based on the data class of the column. The user interface contains a domain type selection box that allows you to change to a different domain analysis type if desired. For columns that are not data class QUANTITY or DATE, the system assumes the Value domain analysis type will be used. The distinct data values from the frequency distribution are displayed in a grid that includes a validity flag for each distinct data value. All of the validity flags in the frequency distribution were set to valid when the frequency distribution was created. Note: Some data values might have already been set to default if completeness analysis was previously performed. You can examine the data values and change any data value's validity flag from valid to invalid. As you set validity flags to invalid for distinct data values, the system keeps a running list of the invalid data values and a total of the record count and record percentage, as well as the distinct value count and percentage, for invalid data on the screen. These statistical results and flag settings can be saved in the system at any time and ultimately can be recorded in the repository as the column's domain results when all of the data values have been inspected and the domain and completeness analysis has been marked as reviewed, as shown in the following figure.

Figure 10. An example of the statistical results and flag settings

For columns that are data class QUANTITY or DATE, the system assumes the Range domain analysis type will be used. The frequency distribution's outliers (a number of the lowest distinct data values and the same number of the highest distinct data values from the sorted frequency distribution) are displayed in a grid that includes a validity flag for each distinct data value. All of the validity flags in the frequency distribution will have been set to valid when the frequency distribution was created. Note: Some data values might have already been set to default if completeness analysis was performed prior to domain analysis. You visually inspect the displayed low and high values and can change the validity flag on any of those data values to minimum (for example, a low value) or to maximum (for example, a high value) to establish the range of valid values. If the required minimum or maximum data value is not in the display, you can increase the number of outliers being displayed by the system. Also, if the minimum or maximum data value required is not in the frequency distribution, you can manually enter the appropriate data values into a text field. The system can also provide additional domain analysis for Range if you need it. You can request that the system perform a quintile analysis of the column's frequency distribution. In a quintile analysis, the system divides the sorted (for example, numeric sort for numeric columns, chronological sort for date and time columns, and alpha sort for alphanumeric columns) frequency distribution into five segments with an equal number of distinct data values. Note: The fifth segment might not be equal if the cardinality is not perfectly divisible by five.

After the system analysis process is completed, the system displays a graph that shows the distribution of the records across all five segments as well as the low and high distinct data values of each segment, and so on. This holistic view of the column's data might help you to determine where the minimum and the maximum values for the column should be set. As you set validity flags to minimum or maximum, the system automatically sets the validity flags for data values lower than the minimum value or higher than the maximum value to invalid unless the flag is already set to default. Both the minimum value and the maximum value are considered to be valid values. As this system operation happens, the system keeps a running list of the invalid data values and a total of the record count and record percentage, as well as the distinct value count and percentage, for invalid data on the screen. These statistical results and flag settings can be saved in the system at any time and ultimately can be recorded in the repository as the domain results when all of the data values have been inspected and the domain and completeness analysis flag has been marked as reviewed, as shown in the following figure.

Figure 11. An example of the statistical results and flag settings
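To make the Range behavior concrete, the following Python sketch applies a reviewer-chosen minimum and maximum to a set of validity flags, marking out-of-range values as invalid unless they are already flagged as default. This is an illustration only, not product code; the flag names and sample dates are assumptions.

# Illustrative sketch only: apply a chosen minimum and maximum to a
# frequency distribution's validity flags.
import datetime

def apply_range(flags, minimum, maximum):
    """flags maps each distinct value to its current validity flag."""
    for value, flag in flags.items():
        if flag != "default" and not (minimum <= value <= maximum):
            flags[value] = "invalid"
    return flags

hire_dates = {
    datetime.date(1899, 1, 1): "default",     # placeholder value, already flagged
    datetime.date(1985, 6, 30): "valid",
    datetime.date(2031, 2, 14): "valid",      # future date, outside the range
}
print(apply_range(hire_dates,
                  minimum=datetime.date(1950, 1, 1),
                  maximum=datetime.date(2010, 12, 31)))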

The third domain analysis type is known as the Reference File type. If you have an external file known to contain all of the valid data values permissible in the column (for example, a zip code file), it can be used by the system to automatically detect invalid values in the column's frequency distribution.
The system will take each distinct data value from the frequency distribution and search the external file to see if that value is in the file. Any distinct data value not found on the external file will have its validity flag automatically set to invalid by the system. As you or the system sets validity flags to invalid for distinct data values, the system keeps a running list of the invalid data values and a total of the record count and record percentage, as well as the distinct value count and percentage for invalid data on the user interface screen. These statistical results and flag settings can be saved in the system at any time and ultimately can be recorded in the repository as the domain results when all of the data values have been evaluated and the domain and completeness analysis flag has been marked as reviewed. User responsibility You view the domain analysis when the Column Analysis review of columns is viewed. At the detailed column view, domain and completeness analysis has its own tab for viewing results. The system will present either the Value Domain Type window or the Range type window based on the column's data class. You can either accept that domain analysis type and proceed, or choose a different domain analysis type to be used for the column. Once the appropriate domain analysis type has been selected, you focus on different aspects of the process depending upon which domain analysis type is chosen. For Value Domain Type, the key task is to examine the data values. This should be done carefully for columns that have a relatively small cardinality (for example, CODE or INDICATOR columns). If the column has a large cardinality (for example, TEXT or IDENTIFIER columns), it might be too time-consuming to inspect each value. In that case, two approaches that can produce the right results in less time include looking at the high end and low end of the frequency distribution first by data value and then by frequency count. If the column has data values that can be logically sorted (for example, DATE, QUANTITY, or sequentially assigned IDENTIFIER columns), Range type is the most efficient means to perform domain analysis. An appropriate minimum and maximum value should be carefully chosen to define the valid range. If a reliable external file exists, Reference File is the preferred domain analysis type. Some caution should be used when using reference files created internally in the organization as opposed to reference files made available by external organizations. An internal reference file can sometimes be inaccurate and the root cause of invalid data found in the data sources. For Range and Reference File, the domain analysis process is completed after you are satisfied with the results. For the Value Domain type, the process is completed after the visual inspection of all relevant data values has been performed. Finally, you flag the domain and completeness status as completed for that column.
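As a rough illustration of the Reference File type, the following Python sketch flags frequency-distribution values that do not appear in an external file of valid values. It is not product code; the one-value-per-line file layout, file name, and function name are assumptions for the example.

# Illustrative sketch only: find distinct values that are absent from an
# external reference file of valid values.
def check_against_reference(frequency, reference_path):
    """Return the distinct values (and row counts) not found in the reference file."""
    with open(reference_path) as handle:
        valid_values = {line.strip() for line in handle if line.strip()}
    return {value: count for value, count in frequency.items()
            if value not in valid_values}

# Hypothetical usage, assuming a file of one valid postal code per line:
# invalid = check_against_reference({"10001": 420, "1O001": 3}, "zip_codes.txt")
# print(invalid)   # {'1O001': 3} if "1O001" is not in the reference file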

Interpreting results Typically, most of the invalid data values for a column will be obvious to you upon examination. The causes of this invalid data can include inaccurate data capture, system processing defects, or inaccurate data conversions. The significance of the amount of invalid data in a column is dependent upon the nature of the column and its impact on business processes and system operations. If the analysis of data is driven by the need to use the data as a source for new data integration, it is critical that the data either be corrected or that the integration system analysts be aware of the invalid data and how it should be treated. To improve and maintain the quality of domain values, it is also important to trace problems to their root causes so that you can take corrective action and prevent further proliferation of the problems. Decisions and actions You have multiple decisions to make for each column for domain analysis. You can choose the appropriate domain analysis type to be used for the column. Then you must decide the validity of individual data values, determine the minimum and maximum valid data values, or determine the availability of an acceptable external reference file. After all the decisions for a column have been made, you can continue to review the column's domain values for validity or can mark the column domain and completeness review as complete. At this point, you can also request the creation of a validity table that will create a single-column reference file that contains all of the valid distinct data values from the column's frequency distribution. This validity table can be exported to other products or systems for application to the data source, as shown in the following figure.

Figure 12. An example of the window where you select the reference table type

Also, if you wish to initiate corrective action on the invalid data, you can begin the process for any distinct data value flagged as invalid by entering a new replacement value into a transformation value field in the frequency distribution grid. You can then request the creation of a mapping table that will create a two-column reference file with the first column containing the existing data value,
and the second column containing the entered replacement value. This mapping table can be exported to other products or systems for application to the data source. Performance considerations There are two important performance considerations when conducting domain analysis. v When using value domain analysis type, you should be aware of the significant time required to inspect every data value for columns with high cardinality in their frequency distribution. v Also, when using range domain analysis type, requesting a quintile analysis can create a lengthy processing job for columns with high cardinality in their frequency distribution. Format analysis: Format analysis is used to validate the pattern of characters used to store a data value in selective columns (for example, telephone numbers, Social Security numbers) that have a standard general format. Format analysis is useful when the format of characters is critical to automated procedures or to the display of those data values. Analysis technique Each data value in a column's frequency distribution is analyzed to create a general format for that data value that represents the character pattern of that data value. Then, all of the individual general formats for the column are summarized to develop a frequency distribution of general formats for the column. You can then inspect the general formats in the summary and decide for each one if that format conforms to the format requirements of the column, or violates the format requirements of the column. System capability During column analysis processing, the system constructs each column's frequency distribution and then analyzes each distinct value to develop a general format for that data value. To create the general format from the distinct data value, the system converts every alphabetic character to an a or A, depending on the capitalization of the character; converts every numeric character to a 9; and does not change spaces and special characters, as shown in the following table.
Table 2. General Format examples

Data value      General format
John Smith      Aaaa Aaaaa
7256-AGW        9999-AAA
PR-609-54       AA-999-99
VB              AA

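The conversion rules shown in Table 2 can be expressed in a few lines of Python. The sketch below is for illustration only (the product performs this derivation internally); it reproduces the alphabetic-to-a/A and numeric-to-9 mapping described above and then summarizes formats across a frequency distribution.

# Illustrative sketch only: derive a general format for each value and
# summarize the formats found in a frequency distribution.
from collections import Counter

def general_format(value):
    out = []
    for ch in value:
        if ch.isalpha():
            out.append("A" if ch.isupper() else "a")   # keep capitalization
        elif ch.isdigit():
            out.append("9")
        else:
            out.append(ch)                             # spaces and specials unchanged
    return "".join(out)

print(general_format("John Smith"))   # Aaaa Aaaaa
print(general_format("PR-609-54"))    # AA-999-99

def format_summary(frequency):
    """frequency maps each distinct value to its row count."""
    summary = Counter()
    for value, count in frequency.items():
        summary[general_format(value)] += count
    return summary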
Using all of the individual general formats, the system develops a frequency distribution by general format, which is referred to as the summary. Each general format in the summary has its own conformance flag. Initially, the system assumes that each general format conforms to the format requirements. When you flag a general format as "violates the standard", you can then also request that the system
flag all of the affected data values as invalid, as shown in the following figure.

Figure 13. An example of general formats and possible status values

User responsibility You can view the format analysis when the column analysis review of columns is viewed. At the detailed column view, format analysis has its own tab for viewing results. The system presents the summary of general formats used in the column. You inspect the general formats summary to identify and flag which general formats violate the requirements of the column. If needed, you can also request that the system flag every distinct data value that uses that general format to be invalid. The process is ultimately completed when you review all of the general formats and mark the format analysis function as reviewed. Interpreting results Most columns do not have stringent format requirements that must be enforced. For columns that have format requirements that must be enforced, there will typically be a limited number of general formats to inspect. When general formats are in violation of the requirements or are unexplainable, you can gain more insight by performing a drill-down of the general format in question. A list of the actual data value or values that are in that general format is displayed. This can be helpful in deciding if the general format conforms to or violates the column's requirements. The severity of the column's requirements and the need for corrective action will usually dictate the need to flag data values with general format violations as invalid data. If you wish to initiate corrective action on the invalid data, you can begin the process for any distinct data value flagged as invalid by manually entering a new replacement value, in a conforming general format, into a transformation value field in the frequency distribution grid. You can then request
the creation of a mapping table that will create a two-column reference file with the first column containing the existing data value, and the second column containing the entered replacement value. This mapping table can be exported to other products or systems for application to the data source. Like other column properties, there is an advantage to maintaining the same format requirements for the same logical column if it appears in multiple locations. Finally, in some cases, you might want to enforce the formats of TEXT columns (for example, names, addresses, descriptions). While the format analysis could be used to do this, it is not the most efficient method. The companion InfoSphere QualityStage product has additional pattern analysis capabilities that can be used to analyze those types of columns more efficiently. Decisions and actions First, you must decide if a column requires enforcement of the format of characters in the data values. If it does, use format analysis and decide for each general format whether it conforms to or violates the column's requirements. Finally, you decide if the data values that have violated general format should be treated as invalid data. Performance considerations There are no significant system performance considerations for the format analysis function.

Table analysis overview


Table analysis is the component of IBM InfoSphere Information Analyzer that is used to analyze data from a table perspective. InfoSphere Information Analyzer focuses table analysis on two areas: v The confirmation or identification of a primary key for a table v Given a primary key, performing a duplicate check of primary key values in the table Primary key analysis: The primary key analysis function is used to identify the column or columns in a table that are qualified and suitable to be the primary key of the table. Primary keys are either single-column (preferred) or multicolumn, which use a combination of columns. The function is used for tables that do not have a defined primary key in their imported metadata or for tables where a change in the primary key is wanted. This function, used in conjunction with the duplicate check function, is used to select a new primary key. Analysis technique The analysis for primary key is based on the system identifying and ranking primary key candidates. You can review the candidates and select an appropriate primary key.

The analysis and ranking of candidates is based on the uniqueness of the data values within the primary column or columns. Because uniqueness is an attribute of every primary key, it is used to identify primary key candidates. However, since not every unique column or combination of columns is suitable to be the primary key, user judgment is required to choose the best primary key from among the candidates. The analysis is performed in two phases. v First, individual columns are analyzed to identify potential single-column primary key candidates. The uniqueness of every column in the table is determined directly from the column's frequency distribution. The system displays the individual columns ranked in descending order by their uniqueness. If you can select a suitable single-column primary key, the analysis is complete and there is no need to perform the second phase of analysis. However, if there are no single-column candidates or no candidate is suitable to function as the primary key, then the second phase of analysis, multicolumn analysis, needs to be performed. v The second phase of primary key analysis, multicolumn analysis, is performed whenever single-column analysis fails to produce a primary key. Multicolumn analysis automatically combines the individual columns into logical column combinations. It then tests the uniqueness of the concatenated data values produced for each combination as if the concatenated columns were a single column. The multicolumn candidates are displayed to the user by the system ranked in descending order by their uniqueness percentage. You review the multicolumn primary key candidates being displayed and select the column combination which is most suitable to function as the primary key. After a single-column or multicolumn primary key has been selected, you can perform a duplicate check to confirm and accept the selection. System capability You initiate primary key analysis by selecting a table. The system analyzes the frequency distribution for each column in the table to determine if it is qualified to be the primary key. This qualification is based on the percentage of unique distinct data values in the frequency distribution (for example, a data value with a frequency count of 1). The uniqueness percentage is calculated by dividing the total number of unique distinct data values in the frequency distribution by the total number of rows in the table. The uniqueness percentage for each column is compared against the applicable primary key threshold percentage in the analysis options. If the uniqueness percentage is equal to or greater than the primary key threshold percentage, then the system will flag the column as a primary key candidate. The system presents the single-column primary key analysis results on a tab. The single-column primary key candidates are displayed ranked in descending order by their uniqueness percentage. You review the single-column candidates to determine if any of the columns are both qualified and suitable (for example, business-wise) to be the primary key. If so, you can select those columns for further testing in the duplicate check function. If not, you move to the multicolumn analysis phase. The following figure provides an example.

Figure 14. An example of the single-column primary key analysis results
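The uniqueness calculation described above can be illustrated with the following Python sketch, which counts the distinct values that occur exactly once and compares the resulting percentage to a primary key threshold. It is not product code; the threshold value and sample data are assumptions.

# Illustrative sketch only: single-column primary key candidacy based on
# the uniqueness percentage of a frequency distribution.
def is_primary_key_candidate(frequency, total_rows, pk_threshold_pct=99.0):
    """frequency maps each distinct value to its row count."""
    unique_values = sum(1 for count in frequency.values() if count == 1)
    uniqueness_pct = 100.0 * unique_values / total_rows
    return uniqueness_pct >= pk_threshold_pct, uniqueness_pct

freq = {"C1001": 1, "C1002": 1, "C1003": 1, "C1004": 2}   # one duplicated value
print(is_primary_key_candidate(freq, total_rows=5))        # (False, 60.0)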

Select the tab for multicolumn analysis. The system gives you three options to define the analysis task. v The first is the capability to select a subset of columns from the table for analysis. You can remove columns from analysis that are known not to be part of a logical primary key (for example, TEXT or QUANTITY). The second option is the capability to set the composite maximum number, which controls the maximum number (for example, 2-7) of columns that can be combined in the multicolumn analysis. The third option is to use either the source table or a data sample from the source table to perform the analysis. v After you specify these options, the system begins by generating all the possible column combinations starting at 2 and up to and including the composite maximum number (for example, composite max number = 4 generates all 2, 3, and 4 column combinations). The system concatenates each column combination and makes a distinct count query to either the original source table or its data sample. The count in the query response is divided by the total number of rows in the source table or in the data sample to derive a uniqueness percentage for that column combination. The uniqueness percentage for each column combination is compared against the applicable primary key threshold percentage in the analysis options. If the uniqueness percentage is equal to or greater than the primary key threshold percentage, then the system will flag the column combination as a primary key candidate. v The system presents the multicolumn primary key analysis results on a tab. The multicolumn primary key candidates are displayed in descending order by their uniqueness percentage. Then you review the multicolumn candidates to determine if any of the columns combinations are both qualified and suitable to be the primary key. If so, you can select those column combinations for further testing in the duplicate check function. If not, return to the multicolumn analysis phase, set new options, and continue until you identify a satisfactory multicolumn primary key.

Figure 15. An example of the multicolumn primary key candidates ranked in descending order by their uniqueness percentage
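For a concrete picture of multicolumn analysis, the following Python sketch enumerates column combinations up to a composite maximum and tests the uniqueness of their concatenated values over sample rows. It is an illustration only; the column names, sample rows, and threshold are assumptions, and the product issues distinct-count queries against the source table or data sample rather than holding rows in memory as shown here.

# Illustrative sketch only: generate column combinations and rank those whose
# concatenated values meet the primary key threshold.
from itertools import combinations

def multicolumn_candidates(rows, columns, composite_max, pk_threshold_pct=99.0):
    """rows is a list of dicts keyed by column name."""
    candidates = []
    for size in range(2, composite_max + 1):
        for combo in combinations(columns, size):
            values = [tuple(row[c] for c in combo) for row in rows]
            uniqueness_pct = 100.0 * len(set(values)) / len(values)
            if uniqueness_pct >= pk_threshold_pct:
                candidates.append((combo, uniqueness_pct))
    return sorted(candidates, key=lambda item: item[1], reverse=True)

rows = [{"branch": "01", "acct": "A1"}, {"branch": "01", "acct": "A2"},
        {"branch": "02", "acct": "A1"}]
print(multicolumn_candidates(rows, ["branch", "acct"], composite_max=2))
# [(('branch', 'acct'), 100.0)]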

User responsibility You start the process by viewing the primary key analysis results for a table. The system initially displays the single-column analysis results for that table. Then you review the table's columns, ranked in descending order by their uniqueness, to determine if any of the individual columns is qualified (for example, perfectly unique or nearly unique) and suitable to be the table's primary key. If such a column is identified, you select that column and move to the duplicate check function. If no such column is identified, select the multicolumn analysis tab. For multicolumn analysis, you define the analysis scope by selecting which of the table's columns are to be included in the analysis and by setting a composite maximum number the system should use. The composite maximum number (2-7) controls the maximum number of columns the system should include in a column combination. You also have the option to direct the system to perform the multicolumn analysis against the original source table or a sample from the source table. If a data sample for the table already exists it can be used or a new data sample can be created. With these parameter settings, you can initiate the system's multicolumn analysis request. Similar to single-column analysis results, the system displays the multicolumn analysis results. You review the columns combinations, ranked in descending order by their uniqueness, to determine if any of the column combinations is qualified (for example, perfectly unique or nearly unique) and suitable to be the table's primary key. If such a column combination is identified, you select that column combination and go to the duplicate check function. If no such column combination is identified, you can change the parameter settings (for example, selected columns or composite max) and re-initiate another instance of multicolumn analysis. This process can be repeated until a multicolumn primary key is identified and selected.

Interpreting results The primary key analysis process is generally straightforward for single-column primary keys, but more complex for multicolumn primary keys. For single-column analysis results, the focus your review is typically on the uniqueness percentage of each column and its data class. To be a primary key, the column should have perfect or near perfect uniqueness. There might be duplicate data values in the column but typically very few, if any. Given acceptable uniqueness, the column's data class will usually be an IDENTIFIER or CODE or, perhaps sometimes a DATE. It is less likely that the column is TEXT or an INDICATOR, and almost never a QUANTITY. For multicolumn analysis, because the analysis can be lengthy, it is important to narrow the analysis scope. The initial step should be to limit the columns to those more likely to be part of a multicolumn key. This can be done in two ways. First, any general knowledge you have about the table and its columns should be used to eliminate unlikely primary key columns from the analysis. Secondly, the data class of the column can be used to eliminate unlikely primary key columns. TEXT and QUANTITY columns should be eliminated followed by any CODE, INDICATOR or DATE columns whose definitions make them unlikely primary key columns. Finally, because of the way most tables have been designed, columns at the beginning of the table should be given more consideration than those in the middle or end of the table. Once the columns have been selected for multicolumn analysis, the next step is to set the composite maximum number. Keep in mind that each increase in the composite maximum number (for example, from 2 to 3) results in an exponential increase in the number of column combinations that the system will generate. Accordingly, if the composite maximum number needs to be increased iteratively in the search for the multicolumn primary key, it is helpful to further reduce the number of columns participating in the analysis. In multicolumn analysis, unless you have a strong indication of what the primary key column combination is, you should work with a data sample rather than the source table if it has a large number of rows. Because uniqueness is a primary key trait that must be true for any data sample taken from the source table, use a data sample as another means to optimize the function's performance. Finally, to be a primary key, the column or the column combination should not contain any null values. Decisions and actions You make several decisions when performing primary key analysis. The primary decision is the selection of a single-column or multicolumn primary key candidate to be the primary key. The decision is based on the uniqueness percentage inherent in the candidate and the suitability of the columns to function as a primary key column. After you make primary key selection, proceed to the duplicate check function. When performing multicolumn analysis, you make important decisions that affect the scope of analysis: v The selection of columns
v The composite maximum number v Use of data source or data sample Performance considerations There are no significant system performance considerations for the single-column primary key analysis. However, for multicolumn primary key analysis, the system's analysis workload is determined by the number of column combinations to be analyzed and the use of the data source versus a data sample. The number of column combinations is based on the number of columns selected and the composite maximum number. As the composite maximum number increases, the number of column combinations to be analyzed increases exponentially. Duplicate check analysis: Duplicate check analysis is used to test defined or selected primary keys for duplicate primary key values. Duplicate primary key data values should be researched and corrected to maintain the basic integrity of the table. Duplicates are more likely if the column or columns are part of a selected primary key that was not previously defined. Analysis technique For single-column primary keys, duplicate check is performed directly against the column's frequency distribution by searching for any distinct data values that have a frequency count greater than one (for example, duplicate values). For multicolumn primary keys, the system builds a frequency distribution of the concatenated data values from the primary key's column combination. After the frequency distribution is created, the duplicate check can be performed directly against the primary key's frequency distribution by searching for any distinct data values that have a frequency count greater than one (for example, duplicate values). The frequency distribution is saved for possible later use in the referential integrity function. System capability When you request that the system perform the duplicate check function, it is in the context of a defined or selected primary key. If the primary key is an individual column, the system uses the column's existing frequency distribution to search for duplicates and nulls. Any distinct data values that have a frequency count greater than one are considered duplicate values. The duplicate values, summary statistics, and a count of any nulls are displayed on the results screen. If you are satisfied with the duplicate check results, you can select the column as the primary key if you have not already done so. However, if the primary key is a column combination, the system first creates a frequency distribution of the concatenated data values from the column combination. The system uses that frequency distribution to search for duplicates and nulls as it does for an individual column frequency distribution. Any distinct data values (concatenated) that have a frequency count greater than one are considered duplicate values. The duplicate values, summary statistics, and a count of any nulls are displayed in the results. If you are satisfied with the duplicate
check results, you can select the column combination as the primary key if you have not already done so. The following figure provides an example.

Figure 16. An example of duplicate values, summary statistics, and a count of any nulls on the results screen
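The duplicate search itself is simple to picture: the following Python sketch scans a frequency distribution (keyed by single values, or by concatenated column values for a multicolumn key) for entries with a frequency count greater than one and for nulls. It is an illustration only; the sample key values are hypothetical.

# Illustrative sketch only: duplicate check against a frequency distribution.
def duplicate_check(frequency):
    """frequency maps each distinct key value to its row count."""
    duplicates = {value: count for value, count in frequency.items()
                  if count > 1 and value is not None}
    null_count = frequency.get(None, 0)
    return duplicates, null_count

freq = {("01", "A1"): 1, ("01", "A2"): 3, None: 2}   # multicolumn key example
print(duplicate_check(freq))                          # ({('01', 'A2'): 3}, 2)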

User responsibility You initiate the duplicate check function for a defined or selected primary key. The system displays the results, including a statistical summary of unique and duplicate primary key data values. Any duplicate data values are also listed in the results. If the duplicate check results are satisfactory, you can accept the column or column combination as the primary key. If not, you can return to the primary key analysis function to search for a different primary key column or columns. Interpreting results Generally, the results of duplicate check are that there are no duplicate values or nulls. These results are usually sufficient confirmation that the proper column or columns were selected to be the primary key. Occasionally, duplicate check shows an extremely low number of duplicate or null values and should be further researched to determine their root cause and to see if they are correctable. If they are correctable, it might be acceptable to select the column or columns as the primary key. However, any needed corrections in the data needs to be made before the primary key can actually be implemented. If there are more than a few duplicate values or nulls, it is likely that this column should not be the primary key. In that case, you should return to the primary key analysis function to continue the search.

Decisions and actions You have only one basic decision to make for each primary key candidate subjected to duplicate check. You can either accept the column or columns as the primary key or not. In the case of multicolumn primary key candidates, you also can sequence the columns in the combination into any order. This becomes important later in that any foreign key candidates to this primary key should also be sequenced in the same column order. Performance considerations There are no significant system performance considerations for the duplicate check function when it involves a single-column primary key. For multicolumn primary keys, the system needs to create a new frequency distribution by using concatenated data values from the primary key candidate column combination. The runtime for that task is determined by the number of rows in the source table.

Cross-table analysis overview


Cross-table analysis is the component of IBM InfoSphere Information Analyzer used to analyze data from across tables. InfoSphere Information Analyzer focuses table analysis on two areas. v The commonality of column domain values v The referential integrity of foreign keys to primary keys Commonality analysis: The commonality analysis function identifies pairs of columns that have a significant number of common domain values. The columns might or might not be in the same data source, and might or might not have the same column name. The function is used to find like columns and redundant columns, and is used within the foreign key analysis function. Analysis technique The commonality analysis approach is that the system compares the frequency distribution data values for pairs of columns to determine the percentage of common data values one column has that are also in the other column. The user defines the tables or columns to participate in the analysis. The system generates all the possible column pairings. (For example, pair A-B is included, but pair B-A is not because it is redundant with A-B.) The total list of generated column pairs is subjected to a column pair compatibility test which eliminates any pair from the list whose columns can be predetermined not to be capable of having common data values. The system uses several column properties (for example, data type, length, general formats, and so forth). Those column pairs that pass the column pair compatibility test continue into the next analysis step. For each compatible column pair, the frequency distribution data values from the first column are compared to the frequency distribution data values of the second column. The process is repeated again with the second column compared to the first column. The analysis is completed with the recording of the A to B and the B to A commonality percentage. The computed percentages are then compared to the
applicable common domain threshold percentage in the analysis options. If the percentage is equal to or greater than the common domain threshold percentage, the column pair (for example, A-B or B-A) is flagged as being in common. System capability The commonality analysis is initiated by the user selecting the tables or columns to participate in the analysis. The system uses these selections to generate the list of every possible column pair combination. The system then proceeds to perform the column pair compatibility test on the full list of column pairs. When that step is completed, the system displays the intermediate results including summary statistics of how many column pairs were generated and how many of those passed or failed the compatibility test. The system also displays a detailed list of the column pairs that passed and would be used in the next analysis step. Review this information and decide whether to continue the process or to return to the original data selection step for modifications. If you choose to continue, the system begins the process of comparing frequency distribution data values for each column pair in turn until all of the column pairs have been analyzed. The process captures the commonality percentages for each pair and any commonality flags that were set. One feature of the analysis is that if the two columns have had their domain values previously compared, that comparison was recorded in a history file with the date it was performed. The system will use the percentages from the history file unless one of the column's frequency distributions has been updated since that date. When you initiate the commonality review, the system presents the results with a tab for each table used in the process. Each tab displays a table's columns and any paired columns (for example, any other column from any other table) that were flagged as having common domain values. If you choose, the system can also display the non-flagged pairs and can flag a column pair as being redundant. The following figure provides an example.

Figure 17. An example of a table's columns and any paired columns that were flagged as having common domain values

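The two-direction comparison described above can be illustrated with the following Python sketch, which computes the percentage of one column's distinct values found in the other and flags the pair against a common domain threshold. It is not product code; the threshold and sample value sets are assumptions.

# Illustrative sketch only: A-to-B and B-to-A commonality of distinct values.
def commonality(values_a, values_b, threshold_pct=95.0):
    shared = values_a & values_b
    a_to_b = 100.0 * len(shared) / len(values_a)
    b_to_a = 100.0 * len(shared) / len(values_b)
    return {"A to B %": a_to_b, "B to A %": b_to_a,
            "common": a_to_b >= threshold_pct or b_to_a >= threshold_pct}

ship_states = {"NY", "CA", "TX"}          # column A distinct values (hypothetical)
state_codes = {"NY", "CA", "TX", "FL"}    # column B distinct values (hypothetical)
print(commonality(ship_states, state_codes))
# {'A to B %': 100.0, 'B to A %': 75.0, 'common': True}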
User responsibility You initiate the commonality analysis process by selecting tables or columns to be analyzed. After generating the column pairs and performing the column pair compatibility tests, the system displays the interim results to the user. After reviewing those results, decide whether to continue the process or return to the data selection step. When the system completes the commonality analysis, view the results. Each table and its columns are displayed on a tab that shows all other columns that have commonality with the columns in that table. If needed, you can flag any of the column pairs as containing a redundant column. Interpreting results When viewing the interim results, focus on two areas. v First, evaluate the total number of column pairs to be processed in the next step. The number generally should be 20% or less of the total number of column pairs generated. It should also be a reasonable workload for the computing resources being used. v Second, verify that any important column pairs have successfully passed the compatibility test. When the actual commonality results are reviewed, keep in mind that some of the column pairs flagged as common are to be expected. Every foreign key column should result in commonality with its corresponding primary key column. Those column pairs should not be flagged as redundant. For column pairs that do not
have a key relationship, your judgment is needed to identify columns that are truly redundant. Decisions and actions You have several key decisions to make during commonality analysis. v Selecting tables and columns to be analyzed v Proceeding with analysis based on interim results v Identifying columns that are truly redundant Performance considerations The most significant system performance consideration for the commonality analysis function is the total number of column pairs that can be reasonably processed as one job. Foreign key analysis: The foreign key analysis function identifies foreign key candidates that refer to the primary key of a table. Foreign keys are either single-column or multicolumn, which use a combination of columns. The function is used for tables that do not have defined foreign keys in their imported metadata or for tables where a change in the foreign key is required. This function, used in conjunction with the referential integrity function, is used to select a new foreign key. Function The foreign key analysis function, used in conjunction with the referential integrity function, is used to confirm existing foreign keys or to select new foreign keys. Foreign key analysis technique The approach to foreign key analysis is based on a variation of the commonality analysis function. The important difference is that the system-generated column pairs are limited to those pairs where the first column is a defined or selected primary key column in its table. The system searches for other columns whose data values are in common with the primary key column. First, define the tables and columns to participate in the analysis. The system generates a restricted list of column pairings where the first column must be a defined or selected primary key column for a table. The restricted list of generated column pairs is then subjected to a column pair compatibility test which will eliminate any pair from the list whose columns can be predetermined not to be capable of having common data values. The system uses several column properties (for example, data type, length, general formats, and so forth). Those column pairs that pass the column pair compatibility test continue into the next analysis step. For each compatible column pair, the frequency distribution data values from the second column are compared to the frequency distribution data values of the first column (for example, a primary key column). The analysis computes the percentage of the second column's frequency distribution data values that can also be found in the first column's frequency distribution. The computed percentages are then compared to the applicable common domain threshold percentage in the analysis options. If the percentage is equal to or greater than the common domain threshold percentage, the column pair is flagged as being in common.

When the commonality analysis process is completed, the system will have identified every column that has commonality with a primary key column whether it is a single-column key or part of a multicolumn key. Columns that have commonality with single-column primary keys are flagged as a foreign key candidate. For multicolumn primary keys, a table must have a combination of columns with commonality to every column that is part of the primary key in order to be flagged as a foreign key candidate. System capability The foreign key analysis is initiated by selecting the tables or columns to participate in the analysis. The system uses these selections to generate a restricted list of column pair combinations where the first column in the pair must be a primary key column. The system then proceeds to perform the column pair compatibility test on the restricted list of column pairs. When that step is completed, the system displays the intermediate results, including summary statistics of how many column pairs were generated and how many of those passed or failed the compatibility test. The system also displays a detailed list of the column pairs that passed and would be used in the next analysis step. Review this information and decide whether to continue the process or to return to the original data selection step for modifications. If you choose to continue, the system begins the process of comparing frequency distribution data values for each column pair in turn until all of the column pairs have been analyzed. The process captures the commonality percentage for the second column in the pair against the first column (for example, primary key column) and any commonality flags that were set. One feature of the analysis is that if the two columns have had their domain values previously compared, they are recorded in a history file with the date the comparison was performed. The system will use the percentage from the history file unless one of the column's frequency distributions was updated since that date. When you initiate the foreign key review, the system presents the results with a tab for each table used in the process. Each tab displays a table's primary key columns and any paired columns (for example, any other column from any other table) that were flagged as having common domain values. If the primary key is a single-column, the system displays any other columns with commonality to the primary key column and is flagged as a foreign key candidate. If the primary key is multicolumn, the system displays any other column combinations from a table where a column in the combination has commonality with one of the primary key columns. All columns of the primary key must be accounted for by a commonality column in order to flag that column combination as a foreign key candidate. User responsibility Initiate the foreign key analysis process by selecting the tables or columns to be analyzed. After generating the restricted column pairs and performing the column pair compatibility tests, the system displays the interim results. After reviewing those results, decide whether to continue the process or to return to the data selection step.

When the system completes the foreign key analysis, view the results. Each table and its primary key columns are displayed on a tab that shows all other columns that have commonality with the primary key columns. Review the entries flagged as foreign key candidates and select the candidates that should be further tested for referential integrity with the primary key.
Interpreting results
When viewing the interim results, focus on the total number of column pairs to be processed in the next step. Also, ensure a reasonable workload for the computing resources being used.
Review the foreign key candidates carefully. In addition to the commonality of columns, the definition of the two tables involved should also be taken into account in determining whether there is an actual logical relationship in the data. For single-column primary keys, it is usually obvious whether the commonality column is in fact a foreign key. Primary key columns classified as IDENTIFIER are typically straightforward decisions. For other data class types, you must ensure that the commonality of domains is deliberate and not accidental.
For multicolumn primary keys, ensure that every commonality column meets the same requirements as already described for single-column keys. It is more likely that a column from the foreign key column combination has an accidental common domain. This is particularly true if the column combination includes DATE, INDICATOR, or CODE columns. Keep in mind that the foreign key candidacy for multicolumn primary keys is based on the collective commonality of individual columns. The true test occurs in the next step, referential integrity analysis, where the concatenated data values of the foreign key candidate are tested against the concatenated data values of the primary key.
Decisions and actions
You have several key decisions to make during foreign key analysis, such as:
v Selecting tables and columns
v Proceeding with analysis based on interim results
v Selecting foreign key candidates for referential integrity analysis
Performance considerations
The most significant system performance consideration for the foreign key analysis function is the total number of column pairs that can be reasonably processed as one job.
Referential integrity analysis:
The referential integrity analysis function tests that every foreign key data value, single-column or multicolumn (concatenated), can access the primary key of a related table.
Function
The function is used to confirm the referential integrity for tables that have defined foreign keys in their imported metadata and to test referential integrity for foreign key candidates prior to their selection as a key. This function, used in conjunction
with the foreign key analysis function, is used to select a new foreign key. Technique The approach to referential integrity analysis is based on comparing the distinct data values in the foreign key's frequency distribution against the distinct data values found in the primary key's frequency distribution. If the primary key is single-column, the system uses the frequency distributions of the respective columns, foreign key and primary key, to make the comparison. However, if the primary key is multicolumn, the system first creates a frequency distribution for the foreign key based on the concatenated data values in the foreign key column combination. Then the system uses the frequency distributions (for example, concatenated values) of the foreign key and primary key to make the comparison. In the comparison, the system determines if any of the foreign key distinct data values cannot be found in the primary key's frequency distribution (for example, a referential integrity violation). The system also computes the number of primary key distinct data values that are not used as a foreign key data value. This might or might not be a problem based on the logical relationship (for example, 1 to n) of the two tables. System capability You initiate the referential integrity analysis by selecting a foreign key candidate or a defined foreign key. If the foreign key is a single-column, the system performs a comparison of the foreign key column's frequency distribution distinct data values against the primary key column's frequency distribution distinct data values. However, if the foreign key is multicolumn, the system first creates a frequency distribution of concatenated data values for the foreign key. This is needed to compare against the frequency distribution of concatenated data values for the primary key that was created in the duplicate check function. The comparison of frequency distributions produces summary statistics of the foreign key to primary key integrity, and the primary key to foreign key situation. The statistics are also represented graphically in a Venn diagram produced by the system. Finally, the system produces a list of the foreign key data values that are referential integrity violations. User responsibility You initiate the referential integrity function for a defined or selected foreign key. If the foreign key to be analyzed is a multicolumn key, the system will alert you that a multicolumn frequency distribution of concatenated data values must be created. At this point, you can sequence the column combination into a specific column order. This should be done, if required, so that the foreign key concatenated data values match the primary key concatenated data values that were created during the duplicate check function.


The system displays the results, including a statistical summary of foreign key to primary key referential integrity, primary key to foreign key information, and a listing of the referential integrity violations. If the referential integrity results are satisfactory, you can accept the column or column combination as the foreign key. If not, you can return to the foreign key analysis function to search for a different foreign key column.
Interpreting results
Reviewing the referential integrity analysis results is fairly straightforward. If there are no referential integrity violations between the foreign key candidate and the primary key, it is a good indication that the foreign key candidate can be accepted as the foreign key. If there are relatively few referential integrity violations, they should be researched to determine their root cause and whether they are correctable. However, if the number of referential integrity violations is significant, you should not accept the candidate to be a foreign key. Also, if you are aware that there should be a 1-to-1 or 1-to-n relationship between the primary key table and the foreign key table, the primary key to foreign key results should be inspected for their integrity.
Decisions and actions
You have only one basic decision to make for each foreign key candidate subjected to referential integrity analysis: you can either accept the columns as the foreign key or not. In the case of multicolumn foreign key candidates, you can also sequence the columns in the combination into any order that you want. This should be done, when needed, to ensure that the foreign key data values are concatenated in the same way as the primary key values were.
Performance considerations
There are no significant performance considerations for the referential integrity function when it involves a single-column foreign key. For multicolumn foreign keys, the system needs to create a new frequency distribution that uses concatenated data values from the foreign key candidate column combination. The runtime for that task is determined by the number of rows in the source table.

Analysis settings
The analysis settings provide a parameter-driven control for how certain analysis is to be performed by the system. The system is installed with default settings for these analysis options, which can be changed by users as required.

Available analysis settings options


Analysis settings can be made applicable at the following levels:
v System
v Project
v Database
v Table
v Column
The following table lists the available analysis options.
Table 3. Analysis settings
Analysis area: Column Analysis; Analysis function: Properties; Analysis option: Nullability Threshold; Default setting: 1.0% (setting range: 0.01%-10.0%); Description: If the column has null values with a frequency percent equal to or greater than the nullability threshold, inference is that the nullability flag is YES. If nulls do not exist in the column or the frequency percent is less than the threshold, the nullability flag is inferred NO.
Analysis area: Column Analysis; Analysis function: Properties; Analysis option: Uniqueness Threshold; Default setting: 99.0% (setting range: 90.0%-100%); Description: If the column has a percentage of unique distinct values equal to or greater than the uniqueness threshold, inference is that the uniqueness flag is YES. If the column has a percentage of unique distinct values less than the threshold, the uniqueness flag is inferred NO.
Analysis area: Column Analysis; Analysis function: Properties; Analysis option: Constant Threshold; Default setting: 99.0% (setting range: 90.0%-100%); Description: If the column has a single distinct value with a frequency percent equal to or greater than the constant threshold, inference is that the constant flag is YES. If there is no distinct value with a frequency count greater than the threshold, the constant flag is inferred NO.
Analysis area: Table Analysis; Analysis function: Primary Key Identification; Analysis option: Primary Key Threshold; Default setting: 99.0% (setting range: 90.0%-100%); Description: If a column has a cardinality percentage equal to or greater than the primary key threshold, the column is inferred to be a primary key candidate as a single column.
Analysis area: Table Analysis; Analysis function: Primary Key Identification; Analysis option: Data Sample Size; Default setting: 2,000 records (setting range: 1-999,999 records); Description: Data sample size controls the number of records to be included in a data sample of the data collection.
Analysis area: Table Analysis; Analysis function: Primary Key Identification; Analysis option: Data Sample Method; Default setting: Sequential; Description: Data sample method indicates to the system which technique of creating the data sample of the specified data sample size should be used: Sequential (first n rows), Nth (every nth row), or Random (based on a random record number generator).
Analysis area: Table Analysis; Analysis function: Primary Key Identification; Analysis option: Composite Key Maximum Columns; Default setting: 2.
Analysis option: Common Domain Threshold; Setting range: 90.0%-100%; Description: The common domain threshold is the percentage of distinct values appearing in one data field's frequency distribution that match distinct values in another data field's frequency distribution. If the percentage of matching distinct values is equal to or greater than the threshold, the two data fields are inferred to have a common domain.

Data quality methodology


The data quality methodology information is organized by analytical function and provides in-depth knowledge and best practices for your data quality strategy. There are a number of key concepts with data quality analysis and monitoring that include capabilities to:
v Support business-driven rule definition and organization
v Apply rules and reuse consistently across data sources
v Leverage multi-level rule analysis to understand broader data quality issues
v Evaluate rules against defined benchmarks/thresholds
v Assess and annotate data quality results
v Monitor trends in data quality over time
v Deploy rules across environments
v Run data quality activities on either an impromptu or scheduled basis using either the user interface or the command line

Data quality analysis and monitoring


Data quality analysis and monitoring is a set of capabilities within IBM InfoSphere Information Analyzer designed to evaluate data against specific customer criteria. These rule evaluation criteria can be used repeatedly over time to see important changes in the quality of the data being validated. You can control the functionality of the data quality rule. The functionality can vary from a simple single column test to multiple rules evaluating multiple columns within and across data sources. You develop the data rules needed for analysis based either on prior profiling results (for example, column analysis) or on the data quality rules defined from the customer's business processes. After a data rule is designed and created, you can set up a rule definition that defines the logic of a data test irrespective of the data source. The rule definition is created by using logical variables or references that will ultimately link to the customer's data. Then you link or bind these logical variables to actual data (for example, data source, table and column or joined tables) to create an executable form of that data rule. This two step process enables the same logical data rule definition to be re-used by binding it to many different sources of data. Each binding creates a new executable form of the same logical data rule, achieving consistency across the data sources. After a data rule definition with binding has been created, the data rule can be executed (for example, tested against actual data). The system executes the data rule which produces relevant statistics and captures designated source records in a system generated output table whenever a source record either satisfies or fails the logic condition of the data rule, depending on user preference. Such data rules can also be shared with the larger community allowing other quality projects to take advantage of this common understanding. These data rules can also be combined into larger units, called rule set definitions, when the rules analyze common data. Such rule sets can contain one or many data rules, or even other rule sets, thus serving as building blocks for broader data evaluation. By evaluating multiple data rules in this larger context, you gain greater insight into the overall quality of the data sources. For example, there might be 50 specific data rules that define good quality data for a customer record. Evaluating each rule in isolation only provides insight into the exceptions for the specific data. But by combining the data rules together into a rule set, not only can you see the exceptions to each rule, but you can also view each record that contains one or more exceptions, so that truly problematic data can be addressed and resolved systematically. Further, by evaluating the rules together, patterns of associated problems emerge where failures in one rule either cause failures in other rules, or indicate underlying issues, facilitating root cause analysis. Finally, by understanding both rule and record issues together, a picture

of the overall quality of the data source is visible, allowing comparison of current quality to established baselines and quick determination of whether data quality is improving or degrading. With either the data rules or rule sets, you can establish benchmarks, or levels of tolerance for problems. Additionally, metrics can be created that look at the statistics generated by the data rules, rule sets, or other metrics; these help establish costs or weights for the data quality problems, or facilitate comparison of results over specified intervals. By executing the data rules, rule sets, and metrics over time, the quality of many data sources can be monitored and tracked, problems can be noted, and reports can be generated for additional review and action.

Structuring data rules and rule sets


The data quality analysis capabilities of IBM InfoSphere Information Analyzer leverage some key concepts that are at the foundation of the ability to reuse data rules across multiple data sources and evaluate data rules in a larger set-based concept.

Logical definitions versus physical rules


Rule definitions represent a logical expression. As logical expressions, rule definitions can include any number of variables representing source or reference data. Such variables can be drawn from a number of sources, but remain logical variables. From logical definitions, one or more physical data rules can be generated. It is at this point that logical variables are linked or bound to actual data for evaluation. The variables can come from:
Physical data sources
Standard metadata imported by using IBM InfoSphere Information Analyzer or other IBM InfoSphere Information Server modules
Terms
Business glossary references created via IBM InfoSphere Business Glossary
Global variables
User-created, defined logical placeholders that can be mapped to one or many physical sources
Local variables
User-created words that represent a concept for a rule definition
Rule definitions are like templates and can be associated with one or many executable data rules. Executable data rules represent an expression that can be processed or executed. Executable data rules require that the variables within a rule definition be bound to a specific physical data source (for example, a source or reference column must be directly linked to one specific physical column within a specific table that has a data connection). Executable data rules are the objects that are actually run by a user and that will produce a specific output of statistics and potentially exceptions.
These levels can be viewed graphically. For instance, your business wants to validate tax identifiers, so you create a conceptual rule that simply says: Validate Tax Identifier (or Validate SSN). This is done from a business perspective, outside of InfoSphere Information Analyzer.


From the conceptual rule definition, you can see that there are two logical rules that express this concept. For example, to validate tax identifiers, there is the condition that the tax identifier must exist, and the condition that the tax identifier must meet a defined format. For actual implementation, these rules are bound and deployed against multiple physical data sources, with different column names. They can also be combined into a rule set, which encapsulates the two conditions within the broader concept of validating the tax identifier.

Figure 18. Conceptual differences of logical and physical definitions. The figure shows three levels:
v Conceptual definition - Concept: Validate SSN
v Logical definition - Rule definition: SSN exists; Rule definition: SSN matches_format 999-99-9999
v Physical definition - Rule: ORA.SSN exists; Rule: DB2.SocSecNum exists
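To make the binding step concrete, the following sketch shows how one logical rule definition from the figure might be bound to two different physical columns to generate two executable rules (the variable name tax_id is illustrative and is not taken from an actual source):
Rule definition (logical): tax_id matches_format '999-99-9999'
Generated rule 1: tax_id bound to the ORA source column SSN
Generated rule 2: tax_id bound to the DB2 source column SocSecNum
Each generated rule runs the same logic, but each produces its own statistics and exception output for the source it is bound to.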

Rules and rule sets


A data rule is a single object that is executed against a specific set of records (either from one source or a set of conjoined sources) and generates a single pass or fail type statistic. This means that, for a rule that tests whether a tax identifier exists, a value is either true (meaning, it exists) or false (meaning, it does not exist). Rules generate counts of exceptions, details of exceptions, and user-defined output results. A rule set is a collection of one or many rules that are executed together as a single unit against a specific set of records (either from one source or a set of conjoined sources) and generate several levels of statistics. Rule sets generate:
v Rule level exceptions (for example, how many records passed or failed a given rule)
v Record level statistics (for example, how many rules did a specific record break and what is the confidence in the record)
v Rule pattern statistics (for example, x number of records broke rules 1, 3, and 10)
v Data source level statistics (for example, what's the average number of rules broken per record)
By grouping rules into sets, you are able to assess rules as separate conditions with several dimensions of output instead of only seeing one piece of the whole picture. You can assess multiple rules in context, providing evaluation and confidence at each level of assessment. By knowing why your record failed and how many times or ways it failed, you reduce your analysis time and correction effort in the long run. By looking for patterns of rule failure across records, you get a view into dependent conditions and problems and can trace multiple conditions to specific failure points, helping to establish root causes.


Local and global variables


All rule definitions contain variables that represent what is evaluated. These variables can simply be typed (for example, any word created to represent the variable, such as 'sourcedata', 'customer_type', or 'column1'), drawn from the listing of available or known data sources (for example, an actual column name), or chosen from the listing of available terms (if IBM InfoSphere Business Glossary is in use) that represent common business terminology in your organization. Variables that are created and chosen from the lists are considered local, meaning that they are part of that specific rule definition. Similarly named variables can occur in other rule definitions, but they do not have any relationship to each other. If you change the variable 'sourcedata' in one rule definition, it has no effect on a separate rule definition that also contains a variable called 'sourcedata'. However, you can create variables that are global. Global (or global logical) variables, when used in multiple rule definitions, do represent the same information, and changes to the global variable are applied consistently to all rules that incorporate them.

Business Problem
The business needs a rule that identifies records that have a balance below a specific value, where:
v The value can vary over time, but has broad applicability
v The value has a business name: 'Minimum Balance'
Solution:
v Create a global variable called 'Minimum Balance'.
v Assign a literal numeric value such as 0.00 to the global variable.
v When building a rule definition to test against a minimum balance, construct it using the Global Variable option.
Global variables provide reusable values that are shared across rules. These can be standard literal values (for example, 'M' for male, 'F' for female, 0, 1000000) or standard reference sources (for example, MASTER_STATE.StateCode). Consequently, there is only one place you need to make a change to update rules with global variables.
Important: A change in the global variable is immediately applied to all rules that utilize it. You do not have to modify the rule definitions or rules directly.
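For example, a rule definition that tests accounts against this global variable might be written as follows (the variable names account_balance and GLBL_Minimum_Balance are illustrative):
account_balance >= GLBL_Minimum_Balance
Records that meet this condition satisfy the minimum-balance requirement; records below the global variable's current value are reported as exceptions. Changing the value assigned to the global variable changes the behavior of every rule bound to it.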

Naming standards
All rule and rule set definitions, all executable data rules and rule sets, all metrics, and all global variables require a name. A name can either facilitate or hinder reuse and sharing of these components, so while names can be freeform and be quickly created when simply evaluating and testing conditions, it is critical to identify an effective naming standard for ongoing quality monitoring. Naming standards are intended to: v Establish standard conventions for naming rules and other quality components v Make it easy to figure out what a rule or rule set definition or a metric is about v Make it easy to figure out what source a rule or rule set is applied to v Find a rule definition, rule, or other quality control when you need it

v Promote reusability
Consistency and clarity are the two key balancing factors in naming standards. It is easy to get wide variation in names, as shown in the following figure.
Figure 19. An example of simple naming standards. The figure shows a list of freely named rule and rule set definitions (for example, ValidUSTaxID, Account Data Quality, IsYesOrNo, GreaterThanZero, BankAccountRuleSet, TransformAcctValid, BankBalance, TransformSavings, TransformOther, TransformChecking), each with its type (data rule definition or rule set definition) and a short description.
It is also easy to adopt a strong set of conventions, but, rigid naming conventions reduce clarity. If you cannot easily interpret the meaning of a rule definition, you will most likely create something new instead of reusing a valid one.
Figure 20. An example of more rigid naming standards. The figure shows cryptic data rule definition names such as FMT_HS_SSN_1 (example cryptic name) and FMT_DM_ACT_BAL (cryptic account balance definition).

Avoid embedding items that can be stored elsewhere (for example, your initials). Other fields such as Description and Created By can store these types of references, facilitate clarity and organization, and can be used for sorting and filtering. A common naming approach is to use a structure like Prefix Name Suffix.
Prefix values can be used to facilitate sorting and organization of rules. Prefix with something that expresses a type or level of rule. For example, the following is a breakdown of different types or levels of rules:
v Completeness (Level 0), use CMP, L0
v Value Validity (Level 1), use VAL, L1
v Structural Consistency (Level 2), use STR, PK, FMT, L2
v Conditional Validity (Level 3), use RUL, CON, L3
v Cross-source or Transformational Consistency (Level 4), use XSC, TRN, L4
The use of a schema like L0 to L4 allows easy sorting, but might be too cryptic. The use of abbreviations is clearer, and will sort, but does not necessarily sort in sequenced order.
Name values help to identify the type of field and the type of rule (for example, SSN_Exists, SSN_Format, AcctBalance_InRange). The choice of a name will typically be based on the type of object (for example, rule definition, rule set, and so on).


Rule definitions
v The type of field evaluated provides for better understanding.
v Names can range from a generic 'template' to something specific: Data Exists could be a template for common reuse; SSN_InRange is specific to typical data.
Rule set definitions
These typically range from a group of rules for the same data to a group of rules for some larger object.
v SSN_Validation would include rule definitions for existence, completeness, format, valid range, and other tests for the specific field type
v Customer_Validation would include rule or rule set definitions for all fields related to the customer such as name, SSN, date of birth, and other values
Rules
v The table and column evaluated provides for better understanding. It might be necessary to include the schema, database, or file name as well.
v Include the definition test type to readily identify how the rule is applied: AcctTable.Name_Exists; CustTable.SSN_InRange.
Rule sets
The schema and table, possibly with the column evaluated, provides better understanding of the data source.
v CustTable.SSN_Validation identifies the relevant table or column
v CustDB.CustTable_Validation identifies the relevant schema or table only, as it evaluates multiple columns
Metrics
v Can be applied to single or multiple rules, rule sets, or other metrics
v Include the type of measure evaluated
v Include the measure interval (for example, day, week, month) if relevant
v Where applied to a single rule or rule set, including the name of the rule or rule set helps understanding: AcctTable.Name_Exists_CostFactor_EOM identifies that there is a cost applied to this rule at end of month; CustDB.CustTable_Validation_DailyVariance_EOD identifies that there is a test of variance at end of day against the prior day.
Global variables
v Are individual instances of variables used in multiple rule definitions
v Distinguish these from specific local variables by including a prefix that helps users who are reading the rule definition understand that this is a global variable (for example, GLBL_)
v Include the type of value or reference (for example, Balance, StateCode)
v Include an identifier that conveys what information is associated with the value or reference (for example, Minimum_Balance, Master_StateCode)
Suffix values can help with filtering, clarity of the type of rule, or to establish iterative versions.

Data rules analysis


At the heart of data rules analysis, the business wants to build data rules to test and evaluate for specific data conditions. The data rule analysis function within IBM InfoSphere Information Analyzer is the component by which you develop a free-form test of data. Collectively, data rules can be used to measure all of the important data quality conditions that need to be analyzed. You can also establish a benchmark for data rule results against which the system will compare the actual results and determine a variance. Data rule analysis can be used on a one-time basis but is often used on a periodic basis to track trends and identify significant changes in the overall data quality condition.
Some common questions need to be answered to establish effective data rules analysis and ongoing quality monitoring:
v What data is involved? Are there multiple parts or conditions to the validation? Are there known 'qualities' about the data to consider? What are the sources of data (for example, external files)? Are there specific data classes (for example, dates, quantities, and other data classes) to evaluate?
v Are there aspects to the 'rule' that involve the statistics from the validation?
v Are there aspects to the 'rule' that involve understanding what happened previously?
As you address these questions, it is also critical to follow some basic guidelines.
Know your goals
Data quality assessment and monitoring is a process, and not everything happens in a product. Data is used by the business for specific business purposes. Ensuring high quality data is part of the business process to meet business goals. Understanding which goals are most critical and important, or which goals can be most easily addressed, should guide you in identifying starting points.
Keep to a well-defined scope
Projects that try to do it all in one pass generally fail. You should keep in mind that it is reasonable and acceptable to develop the data rule incrementally to drive ongoing value. Key business elements (sometimes called KBEs or critical data elements) are often the first targets for assessment and monitoring as they drive many business processes.
Identify what is relevant
v Identify the business rules that pertain to the selected or targeted data elements.
v Document the potential sources of these elements that should be evaluated. These can start with selected systems or incorporate evaluations across systems.
v Test and debug the data rule against identified sources to help ensure the quality of the data rule and, more importantly, the quality and value of the output.
v Remove the extraneous information produced. Not all information generated by a data rule is necessarily going to resolve data quality issues.


Important: A field might fail a rule testing for whether it contains Null values. If it turns out that the field is optional for data entry, the lack of this information might not reflect any specific data quality problem (but might indicate a business or data provisioning problem). Identify what needs further exploration Expand the focus with new and targeted rules, rule sets, and metrics. As you create more rules and evaluate more data, the focus of data quality assessment and monitoring will expand. This will often suggest new rules, broader rule sets, or more specific metrics be put into process beyond the initial set of quality controls. Periodically refresh or update Utilize the knowledge that you already gained. Just as business processes and data sources change, so should the rules, rule sets, and metrics evaluated.

Data rules function


Often after source data has been profiled, the next level of data analysis is performed in the data rule analysis function. Up-front data profiling and analysis is not required to begin a program of data rule analysis and ongoing quality monitoring. Such data quality assessments provide details and insights where few exist; but where sources are well known and ongoing monitoring is critical, data rules can be the immediate focus of a data quality program.
You can create a data rule in one of two ways in the system. The IBM InfoSphere Information Analyzer user interface includes:
v A data rule building capability that guides you in developing data rule logic in a declarative form.
v A free-form expression builder to create a data rule in a more procedural form.

Data classes and data rule types


Different types of data will need different data rules and variations of these typical data rule types. The data classifications inferred through the data quality assessment and investigation are good starting points to focus on the types of data rules you will need. Common data classes and typical examples include:
v Codes: state abbreviation, procedure code, status code
v Indicators: order completion status, employee activated flag
v Identifiers: social security number, customer ID, order ID
v Quantities: list price, order quantity, sales amount
v Dates/Times: shipped date, order date
v Text or free-form descriptive data: product name, customer name, street address
The following table describes the inferred data classes and examples of validation methods.
Table 4. Common rule validations by data class
Identifier: existence (assumed required); all unique, no duplicates (if used as a primary key or uniqueness expected); maximum and minimum values, format (if text)
Indicator: existence if required data; valid values, no default values
Code: existence if required data; valid values, no default values
Date/Time: existence if required data; maximum and minimum values, format (if text); valid values, no default values
Quantity: existence if required data; maximum and minimum values, format (if text); valid values, no default values
Text: existence if required data, though many are not; default values, format requirement, and special characters

The following table further describes the data classes, related sample validation methods, and sample conditions.
Table 5. Methods for assessing Value Validity by Data Class
Identifier - Sample validation methods: out of range (if applicable); inconsistent or invalid formats (if text, if applicable). Sample conditions: Field < minimum OR Field > maximum; Field matches_format 99-99
Indicator - Sample validation method: invalid values. Sample condition: Field in_reference_list {'A', 'B'}
Code - Sample validation method: invalid values. Sample conditions: Field in_reference_list {'A', 'B'}; Field in_reference_column XXX
Date/Time - Sample validation methods: out of range (if applicable); inconsistent or invalid formats (if text or number). Sample conditions: Field < min_date; Field IS_DATE
Quantity - Sample validation methods: out of range (if applicable); invalid format (if text). Sample conditions: Field < min_value; Field IS_NUMERIC
Text - Sample validation methods: invalid special characters; invalid format (if applicable). Sample conditions: Field NOT in_reference_list {'A'}; Field matches_format AAA

Data rule types


Data rules can be grouped into two types: quality control rules and business rules. Data rules tend to originate from: v Follow-up on specific data issues revealed in the profile analysis. v Validation of established business rules pertinent to the user's business operations and practices. Quality control data rules: Quality control data rules are usually developed from profiling related issues that are identified during the profiling analysis. Issues typically include, but are not limited to, columns and tables whose profiling results are either unsatisfactory or suspect. The profiling analysis components lead to corresponding quality control rules that help to manage data as a raw material for producing effective user information. Types of quality control data rules include: v Column property values v Column completeness v Column validity v Column format v Primary key integrity v Foreign key integrity Column property values Column property values include data type, length, precision, scale and nullability. The most common data rule of this type is to determine the number of nulls in a column, as shown in the following example. The other column properties are more typically monitored by re-profiling the data.


Table 6. Example of column property testing in a data rule
Source: Column X; Validation test: EXISTS; Condition: column exists
Source: Column X; Validation test: NOT EXISTS; Condition: column does not exist

Column completeness
Testing for column completeness involves testing for nulls, spaces, zeros, or default values, conditions that were identified during profiling.
Table 7. Example of column completeness testing in a data rule
Source: Column X; Validation test: NOT EXISTS; Operand: OR; Condition: column does not exist
Source: Column X; Validation test: REFERENCE LIST; Reference: DEFAULT VALUES LIST; Condition: column has a default value
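In free-form rule logic, a completeness test that combines several of these conditions might be sketched as follows (the variable name sourcedata and the default values shown are illustrative):
sourcedata EXISTS AND len(trim(sourcedata)) <> 0 AND sourcedata NOT in_reference_list {'N/A', 'UNKNOWN'}
Expressed this way as the valid business condition, records that are populated, non-blank, and not set to a default value meet the rule; all others are reported as exceptions.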

Column validity Testing for valid values in a column ensures that only allowable data values are used in the column (for example, codes). As a data rule it is usually set up in one of three ways: v A single valid value as a literal within the data rule logic v Multiple valid values defined in a list within the data rule logic v Multiple valid values stored in a table externally from the data rule logic To test for valid values use one of the following data rule forms:
Table 8. Example of a test for valid values
Source: Column X; Validation test: NOT =; Reference: "USA"; Condition: column value is not "USA"
Source: Column X; Validation test: NOT REFERENCE LIST; Reference: VALID VALUES LIST; Condition: column value not in valid values list
Source: Column X; Validation test: NOT REFERENCE COLUMN; Reference: VALID VALUES TABLE; Condition: column value not in valid values table
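Written as rule logic, the three approaches to a validity test might look like the following sketches (the column names are illustrative; MASTER_STATE.StateCode follows the reference-source example used earlier in this guide):
country_code = 'USA'
country_code in_reference_list {'USA', 'CAN', 'MEX'}
state_code in_reference_column MASTER_STATE.StateCode
The first form embeds a single literal, the second embeds a list of literals, and the third checks the column against values stored in an external reference table.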

Column format Format testing is used on columns that have specific data formatting requirements (for example, telephone numbers). To test that a column conforms to a specific format:


Table 9. Example of test where a column conforms to a specific format
Source: Column X; Validation test: MATCHES; Reference: VALID FORMAT; Condition: column format matches valid format
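For example, a format test for a North American telephone number stored as text might be sketched as follows (the column name and the exact format mask are illustrative):
phone_number matches_format '(999) 999-9999'
In the format mask, each 9 represents a required digit, and literal characters such as parentheses, spaces, and hyphens must appear exactly as shown.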

To test that a column violates a specific format:


Table 10. Example of test where a column violates a specific format
Source: Column X; Validation test: NOT MATCHES; Reference: VALID FORMAT; Condition: column format does not match valid format

Primary key integrity
A data rule can be used to test the integrity of a primary key column or concatenated primary key columns. Integrity is defined as the primary key having no duplicate data values. To test that primary key data values are unique (integrity):
Table 11. Example of test where primary key data values are unique (integrity) Source Column X Validation test UNIQUENESS Reference Operand Condition Primary key values unique

To test that primary key data values contain duplicate data values (integrity violation):
Table 12. Example of test where primary key data values contain duplicate data values (integrity violation) Source Column X Validation test NOT UNIQUENESS Reference Operand Condition Primary key values not unique

For tables that have multi-column primary keys, concatenate the columns as follows to apply the appropriate UNIQUENESS test.
Table 13. Example of test where tables have multi-column primary keys Source Column X + Column Y + Column Z Validation test UNIQUENESS Reference Operand Condition X + Y + Z is the multi-column primary key

Foreign key integrity
A data rule can be used to test the integrity of single-column or multi-column foreign keys. Integrity is defined as every data value of a foreign key matching a corresponding data value of the primary key referenced by the relationship.

To test that foreign key data values have referential integrity:


Table 14. Example of test where foreign key data values have referential integrity Source Column X (foreign key) Validation test REFERENCE COLUMN Reference PRIMARY KEY Operand Condition Foreign key value matches a primary key value

To test that foreign key data values violate referential integrity:
Table 15. Example of test where foreign key data values violate referential integrity
Source: Column X (foreign key); Validation test: NOT REFERENCE COLUMN; Reference: PRIMARY KEY; Condition: foreign key value does not match a primary key value

For tables that have multi-column foreign keys, concatenate the columns as follows in order to apply the appropriate REFERENCE COLUMN test.
Table 16. Example of test where tables that have multi-column foreign keys Source Column X + Column Y + Column Z Validation test REFERENCE COLUMN Reference Operand Condition X + Y + Z is the multi-column foreign key
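As a rule-logic sketch, a referential integrity test for a single-column foreign key might be written as follows (the table and column names are illustrative):
ORDER.customer_id in_reference_column CUSTOMER.customer_id
For a multi-column foreign key, the participating columns would be concatenated on both sides of the check so that the combined foreign key value is compared against the combined primary key value.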

Business rules: Business rules are the other primary source for data rule analysis. Applicable business rules might already exist or they can evolve as data analysis progresses to more complex criteria. Typically, a business rule in its data rule form will be designed to evaluate whether an expected end-state of data exists following some business or system process (for example, the proper capture of customer information following the enrollment of a new customer). Types of data rules created from business rules include:
Valid Value Combination
One of the most common types of business data rule is one that validates the combination of data values stored in multiple columns within a logical record. An example of a valid value combination might be that certain medical types of service can only be performed at certain places of service, or that city, state, and zip code must be compatible for USA addresses. There are many business policies and practices that can be represented by the combination of data values in more than one column. Usually the valid combinations of data values represent expected end-states of data based on a business system process.
Table 17. Example of valid values combinations
Type of Service: Surgery; Place of Service: Hospital; Validity: Valid
Type of Service: Surgery; Place of Service: Ambulance; Validity: Invalid
Type of Service: Xray; Place of Service: Hospital; Validity: Valid
Type of Service: Xray; Place of Service: Pharmacy; Validity: Invalid

There are generally two ways to develop a valid value combination rule. The first is to detail the combinations within the data rule logic:
Table 18. An example of one way to develop a valid value combination rule
Source: TYPE OF SERVICE Column; Validation test: =; Reference: SURGERY; Operand: AND
Source: PLACE OF SERVICE Column; Validation test: =; Reference: HOSPITAL; Operand: OR
Source: TYPE OF SERVICE Column; Validation test: =; Reference: XRAY; Operand: AND
Source: PLACE OF SERVICE Column; Validation test: =; Reference: OUTPATIENT
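Expressed in free-form rule logic, the same combination test might be sketched as follows (the column names are illustrative):
(type_of_service = 'SURGERY' AND place_of_service = 'HOSPITAL') OR (type_of_service = 'XRAY' AND place_of_service = 'OUTPATIENT')
Each parenthesized group represents one valid combination; a record meets the rule if it matches any of the listed combinations.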

A second approach that might be more efficient when there are a large number of combinations is to use an external table to store the valid values combinations. The external table can be created by: v Making a virtual column from all the columns in the combination v Running Column Analysis to create a frequency distribution of the virtual column values (that is, the combinations) v Marking each combination in the frequency distribution as valid or invalid v Creating a reference table of valid values (that is, combinations) from the frequency distribution v Using the reference table in a valid values combination data rule Given a reference table, the validation can be performed in a single line data rule.
Table 19. An example of a single line data rule with a given reference table
Source: TYPE OF SERVICE Column + PLACE OF SERVICE Column; Validation test: REFERENCE COLUMN; Reference: TOS+POS VALID VALUES TABLE; Condition: TOS + POS value combination matches a TOS + POS valid values combination in the reference table

Or, to find invalid value combinations:


Table 20. An example of a single line data rule to find invalid value combinations
Source: TYPE OF SERVICE Column + PLACE OF SERVICE Column; Validation test: NOT REFERENCE COLUMN; Reference: TOS+POS VALID VALUES TABLE; Condition: TOS + POS value combination does not match a TOS + POS valid values combination in the reference table

Computational
Another common type of business data rule is one that mathematically validates multiple numeric columns that have a mathematical relationship. These computations can be in an equation form (for example, the hourly rate multiplied by the number of hours worked must equal the gross pay amount) or in a set form (for example, the sum of detailed orders must equal the total order amount). There are usually many prescribed computations among numeric columns in a typical database. These business-defined computations can be verified by using a computational data rule.
Table 21. An example of a computational business rule
Source: HOURLY RATE Column x HOURS WORKED Column; Validation test: =; Reference: GROSS PAY AMOUNT Column; Condition: validate calculation of the gross pay amount
Source: SUM(DETAILED ORDER AMOUNTS Columns); Validation test: =; Reference: TOTAL ORDER AMOUNT Column; Condition: validate calculation of the total orders amount
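In free-form rule logic, the first of these computations might be sketched as follows (the column names are illustrative):
hourly_rate * hours_worked = gross_pay_amount
Where rounding differences are expected, the comparison can instead be built with a small tolerance, for example by testing that the computed value falls within a narrow range around the stored amount.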

The computations performed in a computational data rule can vary from simple to complex by creating data expressions that use the appropriate scalar functions and numeric operators. Chronological (also called Ordered Values) Business rules that validate time and duration relationships are known as chronological data rules. These rules can define chronological sequence (for example, a project activity date must be equal or greater than the project start date and equal or less than the project completion date) or chronological duration (for example, a customer payment must be within 30 days of billing to avoid late charges). These time-based relationships can be validated by using a chronological data rule.
Table 22. One example of a chronological business rule
Source: PROJECT ACTIVITY DATE Column; Validation test: >=; Reference: PROJECT START DATE Column; Operand: AND; Condition: project activity cannot occur before start date
Source: PROJECT ACTIVITY DATE Column; Validation test: <=; Reference: PROJECT COMPLETION DATE Column; Condition: project activity cannot occur after completion date

Table 23. Another example of a chronological business rule
Source: PAYMENT DATE Column; Validation test: <=; Reference: BILLING DATE Column + 30 days; Condition: payments must be made in 30 days or less from billing
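As a rule-logic sketch, the payment-terms rule might be written as follows (the column names are illustrative; the reference adds 30 days to the billing date):
payment_date <= billing_date + 30
Records paid within 30 days of billing meet the rule; later payments are reported as exceptions.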

Conditional Business rules that don't conform to a valid values combination, computational or chronological data rule are generally referred to as conditional data rules. These business rules typically contain complex if...then...else logic that might include valid values combinations, computational and chronological conditions (for example, if the customer has an active account, and the last order date is more than one year old then the catalog distribution code should be set to quarterly).
Table 24. Example of compliance testing with this business rule
Source: CUSTOMER ACTIVITY CODE Column; Validation test: =; Reference: ACTIVE; Operand: AND; Condition: customer is active
Source: CUSTOMER LAST ORDER DATE Column; Validation test: >=; Reference: TODAYS DATE - 365 days; Operand: AND; Condition: last order is within the last year
Source: CUSTOMER CATALOG CODE Column; Validation test: =; Reference: QUARTERLY; Condition: customer is scheduled to receive a catalog every three months

Or, to test for non-compliance with this business rule.


Table 25. Example of non-compliance testing with this business rule
Source: CUSTOMER ACTIVITY CODE Column; Validation test: =; Reference: ACTIVE; Operand: AND; Condition: customer is active
Source: CUSTOMER LAST ORDER DATE Column; Validation test: >=; Reference: TODAYS DATE - 365 days; Operand: AND; Condition: last order is within the last year
Source: CUSTOMER CATALOG CODE Column; Validation test: NOT =; Reference: QUARTERLY; Condition: customer is not scheduled to receive a catalog every three months
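As a free-form rule logic sketch, the compliance form of this conditional rule might be written as follows (the column names and the todays_date variable are illustrative; the current date could be supplied through a bound variable):
IF customer_activity_code = 'ACTIVE' AND customer_last_order_date >= todays_date - 365 THEN customer_catalog_code = 'QUARTERLY'
Only records that satisfy the IF condition are evaluated against the THEN condition; active customers with a recent order who are not set to a quarterly catalog are reported as exceptions.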


There are endless possibilities for creating conditional data rules. Many business rules that form the basis for conditional data rules will already exist in legacy application systems. Others will evolve from the data analysis itself as more of the data's relationships and issues are revealed. You should strive to develop the most effective set of data rules applicable to the current situation.

Data rule analysis techniques


Business or data rules are often presented in a complex, compound manner. This is often the product of approaching rules from the standpoint of an existing technical evaluation, such as SQL. To establish effective data rule definitions, it is useful to start by looking for building blocks for the rule (such as noting that a quantity must be in a specific range). Working from those building block conditions, you can test and debug pieces to ascertain results, then incrementally add conditions as needed, or take advantage of rule sets to combine conditions instead of building everything into one rule. This last note is of particular importance. Technical tools or languages such as SQL often require putting many compound conditions together to understand whether a record passes or fails a series of conditions. However, many of these tools do not provide the ability to break down and evaluate the individual conditions and how they relate together. The rule set support in IBM InfoSphere Information Analyzer allows you to look for the individual components and then assess them together so that problems with any or all conditions emerge.
Building data rules: Rules are often presented in a complex, compound manner, so breaking the rule down into its building blocks is a key step in rule construction and analysis. Data rules are created to solve a business problem.
Business problem
The business needs a rule that identifies missing data:
v For 'Factory Worker' only. All other professions should be ignored for this rule.
v 'Factory Worker' might be in upper, lower, or mixed case.
v Identify the percent of records that are valid.
Solution
To solve the business problem, the desired rule will identify missing data for the profession of 'Factory Worker' and will ignore all other professions:
v Look for conditional situations like IF...THEN.
IF profession = 'Factory Worker'

v Look for alternate conditions (for example, ignore all others).
v Look for the specific type of test, such as Exists (for example, 'missing data' is a signal for this type of test).
v Build an initial solution, such as:
IF profession = 'Factory Worker' THEN sourcedata EXISTS

v Test and review the output by using a sample data set. v Look for specific data conditions. For example:


'Professor' or 'Factory Worker' values that might be upper, lower, or mixed case; 'Factory Worker' values that might have leading or trailing spaces; data that might exist but be blank (that is, has spaces only)
v Use functions to address actual data.
v Update the solution, for example:
IF ucase(profession) = 'FACTORY WORKER' THEN sourcedata EXISTS AND len(trim(sourcedata)) <> 0

Positive and negative expressions, and valid and invalid data: Data rule definitions can be created either as a positive statement or expression or as a negative one (for example, SourceData EXISTS or SourceData NOT EXISTS). It is easier and more natural to express a rule definition from a positive perspective, particularly when investigating suspicious conditions from a data quality assessment (for example, SourceData CONTAINS '#'). However, when focusing on an ongoing data quality monitoring program, you should express and define rule definitions from the standpoint of the following question: What is my valid business condition? There are two reasons for doing so.
v First, the statistical results will show the percent or count of data that meets that condition, and the expectation is that what meets the condition is what is valid.
v Second, when organizing multiple rules together and considering the overall quality of data, you want all tests to produce a consistent set of results (that is, all rules should show output that meets the business condition).
By expressing rules consistently based on the question of what the valid business condition is, you will create rules that produce consistently understood results.
Business problem
The business needs to validate that the Hazardous Material Flag (abbreviated HAZMAT) is valid:
v Identify the percent of records that are valid.
v A valid flag occurs when the field is populated, contains one of a given set of values, and does not contain extraneous characters.
Solution
v Create three rule definitions to test that: the field exists (for example, Hazmat EXISTS); the field conforms to the list of valid values 'Y' or 'N' (for example, Hazmat IN_REFERENCE_LIST {'Y', 'N'}); and the field does not include a bad character, specifically a hash (#) character (for example, Hazmat NOT CONTAINS '#').
v Combine the three rule definitions into a rule set called HAZMAT_VALID.
Note: The three rule definitions include both positive and negative expressions, but they all reflect the conditions that the business has determined to be a valid piece of data.
Creating a data rule definition: You create a data rule definition and then generate a data rule executable.


The following information provides a high-level overview of the process you will use to create a data rule definition. 1. Define the data rule definition, including the data rule definition name, a description of the data rule definition, and an optional benchmark.

Figure 21. Example of the Open Data Rule Definition window with the Overview tab selected

2. On the Rule Logic tab (as shown in Figure 22), you define the rule logic line by line by using pre-defined system capabilities and terms or variables to represent real data references. Enter the required test conditions logic, line by line. Each line of logic contains the following elements:
v A Boolean operator (AND or OR) that is the logical connection to the previous line (not required on the first line)
v Open parenthesis (optional)
v Source data (logical variable)
v Condition (optional NOT)
v Type of check (system defined)
v Reference data (literal or logical variable)
v Closed parenthesis (optional)
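For example, a rule definition assembled from these elements might contain the following two lines of logic (the variable name tax_id is illustrative; the format mask follows the SSN example used earlier in this guide):
tax_id EXISTS
AND tax_id matches_format '999-99-9999'
The first line has no Boolean operator because it is the first condition; the second line begins with AND to connect it to the previous line.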


Figure 22. Example of the Open Data Rule Definition window with the Rule Logic tab selected

The result of this task is a data rule definition. This step can be repeated multiple times binding a single data rule definition to multiple data rules, each applying different actual data (for example, a data rule to verify city-state-zip code used for customer, billing, shipping and vendor addresses). Once you save the data rule definition, the system assembles the elements defining the rule logic into a contiguous statement that is stored in the repository. However, this contiguous statement is parsed back to its elements whenever you view the data rule logic. Generating a data rule executable: You generate a data rule executable from the data rule definition. To create an executable data rule from the data rule definition, you will set the bindings for the data rule, by connecting each local variable used in the rule logic to physical data sources in your project. If the data references involve multiple data sources or data tables, you will specify the required joining of data to generate the data rule. The following sample window shows the Bindings and Output tab for the new data rule. The following information provides a high-level overview of the process you will follow to generate a data rule executable. 1. Open the data rule and select the Bindings and Output tab.

Figure 23. Example of the Open Data Rule window with the Bindings and Output tab selected


2. Select the actual data references with which you want to replace the logical variables in the data rule definition.
Note: The data rule definition can be used to create multiple executable data rules by binding the logical variables to different source data. This enables the same data rule logic to be used by different data sources (tables) where applicable. If required by the data rule logic, you can also enable joins of data from different data sources (tables) during this binding process.
3. Finally, you define the output table conditions and output columns required for producing an output table during data rule execution.
System capabilities
All data rules are executed in the same manner. The system retrieves all of the source data records, including the join of records if required, which are then tested against the logic of the data rule. Each source record either meets or does not meet the logical conditions expressed in the data rule. The statistics are updated accordingly and, if the source record matches the designated output conditions of the data rule, it is added to the output table for that data rule. If the data rule has a benchmark for results, the actual results are compared against the benchmark and a variance is calculated. This information is included in the statistics update.
The statistics generated from the execution of a data rule include:
v Number of records tested
v Number of records that met the data rule conditions
v Number of records that did not meet the data rule conditions
v Percentage of records that met the data rule conditions
v Percentage of records that did not meet the data rule conditions
v Number of records in the variance from the data rule benchmark (optional)
v Percentage of records in the variance from the data rule benchmark (optional)
User responsibilities: You have significant responsibilities in the development of data rules. If you expect to create many data rules, give some forethought to the data rule naming conventions. This helps to organize the data rules into the appropriate category and to easily locate specific data rules.
When documented data rules do not already exist, you might conduct a focused business rules brainstorming session with a team of business users and IT staff to effectively discover and prioritize business rules. This is a good team task where decisions are made by consensus, ensuring wide acceptability of a data rule. Also, brainstorming should work through a series of key steps:
v Identify the core objects for analysis and monitoring (for example, customer).
v Identify the primary elements (fields) that comprise the initial focus.
v Identify the classes of the data to evaluate, as this will suggest typical rules for completeness, validity, structure, and so on.
v Look at the range of data rule types to identify other key factors. There should be a good distribution of business rules by type (for example, valid values combinations, computational, chronological, and conditional).
The business rules identified through this process should be documented, follow-up research
should be conducted, and identified conditions translated into data rule definitions before actually building the data rules into the system. Before you start to create a data rule definition, you should be prepared with the required data rule logic information. For simple, single line data rules, the preparation will be minimal. For complex data rules (such as, multiple lines and conditions), it is helpful to have some notes or logic schematics for the data rule before entering the data rule logic into the system. It is critical that you always test the data rule logic thoroughly before it is used in an actual data analysis application. This is especially true for complex data rules that will often execute successfully, but do not necessarily reflect the intended data rule logic. Careful examination of output records and checking the logic are often very useful to ensure that the data rule logic is correct. The art of creating accurate and well-performing data rules will improve over time with added experience. Typically, more data rules are developed than are actually useful to a user. Many times a data rule's usefulness cannot be established until initial results can be examined. It is recommended that you keep these extraneous data rules for possible future use while you focus on the most effective data rule sub-set to perform the actual data analysis. Interpreting results: Data rule results include statistical information and the output table containing source data records that met the conditions of the data rule. At a glance, the data rule results can show: v The number or percentage of records that met the conditions in the data rule v The number or percentage of records that did not meet the conditions in the data rule v Any variance from a benchmark established for the data rule You should review the results as a specific indication of the quality of the source data in relation to the data rule. When looking at the data rule results, you must be aware of how the data rule logic was constructed. For example: Data rule A tests that a column must contain a valid value.
Table 26. An example of a data rule where a column must contain a valid value.
v Source: Column A
v Validation test: REFERENCE COLUMN
v Reference: VALIDITY TABLE A
v Operand/Condition: Column A value matches a value in validity table A

Data rule B tests that a column does not contain an invalid value.

Table 27. An example of a data rule where a column does not contain an invalid value.
v Source: Column B
v Validation test: NOT REFERENCE COLUMN
v Reference: VALIDITY TABLE B
v Operand/Condition: Column B value does not match a value in validity table B

Since data rule A tests for valid data values, perfect results (meaning, all valid values) would show 100% of source records met the conditions of the data rule. On the other hand, because data rule B tests for invalid data values, perfect results (meaning, no invalid values) would show 0% of source records met the conditions of the data rule. While either approach can be taken in a data quality assessment, tests for valid data conditions should be used when building an ongoing data quality monitoring program.

Additional capabilities:

Several features in IBM InfoSphere Information Analyzer extend the capabilities of data rule definitions. The additional capabilities include:

Regular expressions
Regular expressions can be tested with the MATCHES_REGEX data check. Regular expressions are a standard convention for evaluating string data, whether for formats, specific characters, or particular sequences of text data. InfoSphere Information Analyzer uses the Perl Regular Expression Library for these types of evaluations. Many regular expressions that address typical data, including text-based dates, email addresses, URLs, and standard product codes, can be found online via standard search engines. These regular expressions can often be copied and pasted into the InfoSphere Information Analyzer user interface when building these types of rules. For example, a test for a valid credit card, where the data must meet these conditions: length 16, prefixed by a value of '4', and dashes optional, is expressed as:
Credit_card MATCHES_REGEX '/^4\d{3}-?\d{4}-?\d{4}-?\d{4}$/'

Wide range of functions
The range of functions provides the ability to manipulate data ranging from strings to dates and numerics, or vice versa. You can leverage the functions that work with particular types of data, such as the 'Datediff' function that allows you to compare and identify the difference in days between two distinct dates.
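For example, a shipping check might require that an order ships on or after its order date and within 30 days. The following is a hedged sketch of such rule logic; the column names are hypothetical, and the exact argument order of the function should be confirmed in the product reference:
datediff(ship_date, order_date) >= 0 AND datediff(ship_date, order_date) <= 30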

Conditional and compound evaluations
You can form conditional and compound evaluations. Conditional evaluations are the IF...THEN constructions, which provide the ability to evaluate only a subset of records when a particular condition occurs. Compound evaluations use AND and OR expressions to provide the ability to take multiple conditions into consideration.
Note: When you find multiple IF clauses, you should leverage rule sets and test each condition both independently as its own rule and together as distinct evaluations for all records.
For example, validating that records from Source A contain a gender of 'M' or 'F' and do not have an account type of 'X' is expressed as:
IF source = 'A' THEN gender IN_REFERENCE_LIST {'M', 'F'} AND account_type NOT = 'X'
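As a hedged sketch that simply reuses the expression above, the same logic could alternatively be captured as two simpler rule definitions:
IF source = 'A' THEN gender IN_REFERENCE_LIST {'M', 'F'}
IF source = 'A' THEN account_type NOT = 'X'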

Note: This condition could also be expressed through two distinct rules in a rule set, as sketched above: one to evaluate the gender in the reference list and one to evaluate the account type. This creates a greater reusability of each rule in distinct situations or distinct data sources.
Remember: It is optimal to develop rules from a standpoint of what data is valid. In the previous example, the expression indicates that the gender should be either 'M' or 'F' and the account type should be something other than 'X'.

Virtual tables
You can define unique views into data sources without requiring changes to the database itself. These are, in effect, spontaneous slices of the data, filtering out what the user does not care about or need to evaluate. Virtual tables support filtering the data in three ways:
v Horizontal or record-based filter. This is a Where clause applied to the source (for example, limit the data selected to a specific date range).
v Vertical or column-based filter. This is an explicit column selection applied to the source (for example, only include columns A, B, and C).
v Both horizontal and vertical filtering

With virtual tables, you can perform all functions as if the virtual table were a regular table. Virtual tables can be used in all data profiling functions such as column analysis. These virtual tables are processed independently and do not affect any prior results obtained by using the base table itself. Virtual tables can be used to generate data rules for execution. Binding of variables occurs against the virtual table instead of the regular table, and such generated data rules are always applied to the virtual table, however it is defined.

Defining benchmarks:

Benchmarks represent the threshold or tolerance for error associated with a specific condition such as the validation of a data rule. If a benchmark is not established or set at either the level of the rule definition or the data rule executable, the generated statistics will simply reflect how many records met or did not meet the rule. There is no notation to indicate whether this result is acceptable or not. By establishing a benchmark, you are indicating at what point errors in the data should trigger an alert condition that there is a problem in the data. These benchmarks will be visible in the data rule output. There will be additional indications of variance that can be used for subsequent evaluation.

You will set benchmarks based on how you want to track the resulting statistics. If a data rule is set to test the valid business condition, the benchmark can either reflect the percent or count of records that met the data rule, or those that did not meet the data rule. For example: The business expects that the Hazardous Material flag is valid. The tolerance for error is .01%. This can be expressed either as:
Benchmark: did not meet % < .01%
or
Benchmark: met % > 99.99%

As with naming standards, it is best to establish a standard for benchmark measurement to help ensure consistency of experience when monitoring ongoing data quality. Using a positive measure (met %) for the benchmark has the advantage that users see results expressed relative to the target (for example, 99% of target).

Making data rule definitions available for reuse:

A key aspect of building logical business-driven data rule definitions is reusability. The business wants to share a rule definition with others to:
v Provide a standard, consistent rule form
v Leverage already established knowledge
v Utilize a typical definition pattern as a base for new definitions

You can make your data rule definitions available for reuse in several ways.
v You can publish a rule definition, which provides an easy way to share rule definitions across projects. The types of rule definitions to publish include:
Basic templates, which include standard instances of specific checks. For example:
sourcedata EXISTS and len(trim(sourcedata)) <> 0

Specific data templates, which include good tests for typical data. For example:
email_addr matches_regex '\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,6}\b'

- These are likely to have some complexity.
- They have been tested with representative data sources.
Note: Some types of rule definitions that you might not want to publish include those rule definitions that contain sensitive data and rule definitions that test for sensitive data conditions.
v You can copy a rule definition within a project, which allows users to use a specific rule definition as a basis for building and testing new conditions or as a means to create new versions while protecting the original definition.
v Rule definitions can be applied against any number of data sources. Each data rule executable generated and derived from the rule definition ensures consistency and reusability.

Remember: Rule definitions are not specific to a data source, even though the variable starts with a data source selection. Data rules always link back to a single specific rule definition. Changing the rule definition changes the validation logic for all rules derived from the rule definition. The Usage tab on the Data Rule Definition screen provides visibility into how many data rules are derived from the definition and what will be impacted by a change to the definition.

Decisions and actions:

In developing your data rule analysis application, you must make some key decisions. You must decide which data rules are most important to your business requirements and which critical data areas need to be tested the most.

You should review a data rule's results to identify any results that vary from expectations (for example, benchmarks) or that have unusual variance amounts. Assuming that the data rule involved is accurate and does not require any maintenance updates, you should focus on the nature of the failed data and its root causes. Corrective data cleanup action should be initiated as required and the data retested with the same data rule.

Data rule sets


IBM InfoSphere Information Analyzer also provides a mechanism where you can group multiple data rules that use the same source data into a data rule set. Based on all of the individual data rule results, the data rule set also produces overall statistics that apply to that source data in addition to the statistics resulting from the individual data rules. You can also establish benchmarks for data rule set results. When you run a data rule set, the actual results are compared against the benchmarks to determine the relevant variances, all of which can be monitored and trended.

Rule sets provide a number of key advantages in evaluation of overall data quality:
v Rule sets provide support for evaluating data based on many data rule conditions. By using this building block approach, you can adapt to differing requirements of evaluation between different systems, sources, or even business processes. In one case, the business condition for valid tax identifier (ID) might be that the data exists and that it meets a particular format. In another case, the valid tax ID might require both those two conditions and a third that the tax ID is within a specific range or in a reference source. Each of these three conditions can be represented as a unique data rule definition (a sketch of these definitions follows Table 28). The conditions can be combined into larger rule sets that test for each individual instance without impacting the underlying rule definitions and creating broader reusability.
v Rule sets provide scoring of all rules for each record in the set so that results can be viewed in multiple dimensions. This allows you to more readily assess problems associated with both individual rules and individual records, as well as finding patterns of association between rules that would otherwise remain hidden.
v By evaluating multiple rules simultaneously against each record in the targeted data source, the underlying engine will optimize the rule evaluation both for execution, as well as for processing by identifying conditions that cannot occur together. This ensures that the execution is done efficiently and speeds time to delivery of results.

Where data rule definitions can be divided into a range of types applied to different classes of data, rule sets most commonly fall into one of three primary patterns.
Table 28. Three primary patterns for rule sets

Pattern: Multiple rules for one field
Examples: Multiple rules for date of birth:
v Date of Birth is a date
v Date of Birth > 1900-01-01
v If Date of Birth Exists AND Date of Birth > 1900-01-01 and < TODAY Then Customer Type Equals 'P'

Table 28. Three primary patterns for rule sets (continued)

Pattern: Case-based conditions
Examples: Multiple conditions for a source-to-target mapping:
v IF sourcefield = 'A' THEN targetfield = '1'
v IF sourcefield = 'B' THEN targetfield = '2'
v IF sourcefield inRefList {'C', 'I', 'X'} THEN targetfield = '3'
v IF sourcefield NOT inRefList {'A', 'B', 'C', 'I', 'X'} THEN targetfield = '9'

Pattern: Rules for data entities (with multiple fields)
Examples: Multiple conditions for multiple fields that describe a customer:
v Name EXISTS
v SSN EXISTS
v SSN matches_format '999999999'
v Date of Birth is a date
v Date of Birth > 1900-01-01
v Address1 EXISTS
v Address2 NOT CONTAINS 'Test'
v StateCd inRefColumn MasterStateCd
v ZipCode inRefColumn MasterZipCode
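Returning to the tax identifier example in the rule set advantages above, the three conditions could each be captured as a separate rule definition and then combined in a rule set. This is a hedged sketch; the variable and reference names are illustrative, and the checks reuse constructs shown elsewhere in this guide:
taxid EXISTS AND len(trim(taxid)) <> 0
taxid matches_format '999999999'
taxid inRefColumn Master_TaxID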

Data rule set function


When a number of data rules have been developed for the same source data (for example, a table), you can create a data rule set that groups these rules together. A defined rule set is a retrievable object. The primary benefits for grouping rules into rule sets include:
v Complete evaluation and review of exception data at multiple levels
v Improved performance (that is, all the data rules use one pass of data)
v Additional statistics about the source data table

These additional statistics include:
v Number of records that met all rules
v Number of records that failed one or more rules
v Average number of rule failures per record
v Standard deviation of the number of rule failures per record
v Percentage of records that met all rules
v Percentage of records that failed one or more rules
v Average percentage of rule failures per record
v Standard deviation of the percentage of rule failures per record
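As a hedged illustration with hypothetical numbers, suppose a rule set of four rules is run against five records and the individual records fail 0, 0, 1, 2, and 1 rules respectively. Then:
v Records that met all rules = 2 (40%)
v Records that failed one or more rules = 3 (60%)
v Average number of rule failures per record = (0 + 0 + 1 + 2 + 1) / 5 = 0.8
v The standard deviation is computed over the same per-record failure counts (approximately 0.75 here using a population formula; the exact formula the product applies is not shown in this guide).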

Positive and negative expressions, and valid and invalid data


Data rule definitions can be created either as a positive statement or expression or as a negative one (for example, SourceData EXISTS or SourceData NOT EXISTS).

By expressing rules consistently based on the question "What is my valid business condition?", you will create rules that produce consistently understood results.
Important: For the meaningful development of rule sets, it is critical that all rule definitions included adhere to this approach of building rule definitions from the same (usually Valid Data) perspective. If this consistent approach is not applied, the results of rule definitions included in the rule set will be a meaningless grouping of valid and invalid data conditions and statistics.

Creating a rule set


You create a rule set and then generate a rule set executable. A data rule set is developed by using a two-step process.
1. From the Rule Set Definition window, you define the rule set.

Figure 24. Example of the Open Rule Set Definition window with the Overview tab selected

2. You select the data rule definition and the executable data rules to be included in the rule set.

Figure 25. Example of the Open Rule Set Definition window with the Quality Controls tab selected

Generating a rule set executable


You will generate a rule set executable from the rule set definition. You generate the executable by selecting actual data to be used for each term or variable in the rule set. If the data references involve multiple data sources or data tables, you can specify the required joining of data to perform the rule set execution. Use the Bindings and Output tab to associate rule logic variable information, join keys, and output criteria for the rule set.
Important: A rule set constraint is that every data rule definition or executable data rule selected must use the same source data (that is, table or joined table).
You can also define the output conditions that cause a source data record to be added to the output table for the data rule set. The following information provides a high-level overview of the process you will use to generate an executable rule set.
1. Open the rule set definition and select the Bindings and Output tab.

Figure 26. Example of the Open Rule Set window with the Binding and Output tab selected

2. Select the actual data references with which you want to replace the logical variables in the rule set definition.
Note: The rule set definition can be used to create multiple executable rule sets by binding the logical variables to different source data. This enables the same rule logic to be used by different data sources (tables) where applicable. If required by the data rule logic, you can also enable joins of data from different data sources (tables) during this binding process.
3. Finally, you define the output table conditions and output columns required for producing an output table during rule set execution.

The result is the executable rule set. You can repeat this task multiple times, binding a single rule set definition to multiple executable rule sets, each using different actual data (for example, a data rule to verify city-state-zip code used for customer, billing, shipping, and vendor addresses).

System capabilities
All rule sets are executed in the same way. The system retrieves all of the source data records, including the join of records if required, which are then tested one-by-one against the logic of each data rule in the rule set. Each source record either meets or does not meet the logical conditions expressed in a data rule. The statistics (by individual data rule and for the rule set) are updated accordingly. If the source record matches the output conditions of the rule set, it is added to the output table for that rule set. If the rule set has a benchmark for results, the actual results are compared against the benchmark and a variance is calculated. This information is included in the statistics update.

The statistics generated from the execution of a data rule set include:
v Data rule set record-based statistics
- Number of records tested
- Number of records that met the data rule conditions
- Number of records that did not meet the data rule conditions
- Percentage of records that met the data rule conditions
- Percentage of records that did not meet the data rule conditions
- Number of records in the variance from the data rule benchmark (optional)
- Percentage of records in the variance from the data rule benchmark (optional)
v Rule set source-based statistics
- Number of records that met all rules
- Number of records that failed one or more rules
- Average number of rule failures per record
- Standard deviation of the number of rule failures per record
- Percentage of records that met all rules
- Percentage of records that failed one or more rules
- Average percentage of rule failures per record
- Standard deviation of the percentage of rule failures per record

User responsibilities
You have significant responsibilities in the development of rule sets. Your primary responsibility is the selection of data rules to be included in the rule set. One consideration is that every data rule selected will be evaluated against the same set of data, whether that is a single table or joined tables. Another data rule set consideration is that you want to ensure that all of the individual data rules in the set test for conditions on the same basis. For example:
v Data Rule A tests that a column must contain a valid value.
Column A, IN_REFERENCE_COLUMN, Validity Table A

v Data Rule B tests that a column does not contain an invalid value.
Column B, NOT_IN_REFERENCE_COLUMN , Validity Table B

If rules that test for conditions on different bases, such as these, are combined in one rule set, the additional statistics produced for the data rule set will be meaningless and most likely misleading.

Interpreting results
Rule set results include the statistical information about individual data rules and the results related to the source data analyzed by all the data rules in the rule set. At a glance, the rule set results can show:
v The number or percentage of records that met the conditions in all of the data rules in the set
v The number or percentage of records that did not meet the conditions of one or more of the data rules in the set
v Any variance from a benchmark established for the data rule

This is an indication of the overall quality of the source data, particularly if it is to be used by corporate systems to produce user information. Rule set results are multi-level outputs with multi-dimensional sets of statistics. They provide insight in three ways:

Validity
Shows how many records violated one or multiple rules. Validity can be seen either based on the individual rule or the individual record. You can:
v View by rule - summary and detail of rule-level exceptions
v View by record - summary and detail of record-level exceptions
You can think of these views like two directions or two distinct pivots of a spreadsheet. The records are the rows, the rules are distinct columns, and the intersections are the points of issue or exception.

Figure 27. Example of the Rule Set View Output window with the Result tab, with the By Record View selected

Confidence
Is a measure of whether a record conformed or did not conform to expectations. The expectation is generally that the record met all rules. However, this is also a dimension that supports its own benchmark or threshold. Confidence can be seen based on the distribution of exceptions based on all the rules evaluated, or in the associations between rules. You can:
v View by distribution of exceptions - summary of the number of records violating some count or percentage of rules
v View by pattern - when rule A is violated, rule B is also violated
You can think of these views as a summary level and a detail level. The summary shows the extent to which there are either many records with few problems, or many records with many problems. The detail shows the extent to which there are relationships between those problems based on frequency of occurrence together.

Figure 28. Example of the Rule Set View Output window with the Result tab, with the By Distribution View selected

Baseline Comparison
Shows the degree of change between an established point in time and the current condition. This is a graphical and statistical representation of how well the entire data source conformed, improved, or degraded versus an established expectation. You can view by distribution of exceptions, meaning a comparison to an established base set. You can consider this comparison as yielding a summary evaluation. The relationship between the two time intervals shows one of five typical conditions:
v The current graph is narrower and to the left of the baseline. This indicates that there are fewer records with problems and those records have fewer incidents of rule violations.
Important: This is the optimal, ideal change.
v The current graph is wider but to the left of the baseline. This indicates that there are fewer records with problems, but that those records have more incidents of rule violations.
v The current graph is more-or-less the same as the baseline. This indicates that the state of the data has remained largely constant even if the prior problem records have been corrected, as is the case in Figure 29.
Important: This is a steady-state condition.
v The current graph is narrower but to the right of the baseline. This indicates that there are more records with problems though those records have fewer specific incidents of rule violations.
v The current graph is wider and to the right of the baseline. This indicates that there are more records with problems and that those records have more incidents of rule violations.
Important: This is the worst-case change.

Figure 29. Example of the Baseline Set Comparison, a graphical and statistical representation of how well the entire data source conformed, improved, or degraded versus an established expectation. The chart plots Records against % Rules Not Met for the baseline run and the current run, with summary statistics for each run (run date/time, total records executed, mean rules not met, standard deviation, similarity, and degradation).

Decisions and actions


In developing your rule set analysis application, you must make some key decisions. You should review the rule set results and the individual data rule results within the rule set to identify any results that vary from expectations (for example, benchmarks) or that have unusual variance amounts. Assuming that the data rule involved is accurate and does not require any maintenance updates, you should focus on the nature of the failed data and its root causes. Corrective data cleanup action should be initiated as required and the data retested with the same data rules.

Metrics
Metrics are user-defined objects that do not analyze data but provide mathematical calculation capabilities that can be performed on statistical results from data rules, data rule sets, and metrics themselves. Metrics provide you with the capability to consolidate the measurements from various data analysis steps into a single, meaningful measurement for data quality management purposes. Metrics can be used to reduce hundreds of detailed analytical results into a few meaningful measurements that effectively convey the overall data quality condition.

At a basic level, a metric can express a cost or weighting factor on a data rule. For example, the cost of correcting a missing date of birth might be $1.50 per exception. This can be expressed as a metric where:
v The metric condition is:
Date of Birth Rule Not Met # * 1.5

v The possible metric result is:


If Not Met # = 50, then Metric = 75

At a more compound level, the cost for a missing date of birth might be the same $1.50 per exception, whereas a bad customer type is only $.75, but a missing or bad tax ID costs $25.00. The metric condition is:
(Date of Birth Rule Not Met # * 1.5 ) + (Customer Type Rule Not Met # * .75 ) + (TaxID Rule Not Met # * 2.5 )
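As a hedged illustration with hypothetical exception counts, if the Date of Birth rule has 50 exceptions, the Customer Type rule has 20, and the TaxID rule has 4, this metric evaluates as:
(50 * 1.5) + (20 * .75) + (4 * 2.5) = 75 + 15 + 10 = 100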

Metrics might also be leveraged as "super rules" that have access to data rule, rule set, and metric statistical outputs. These can include tests for end-of-day, end-of-month, or end-of-year variances. Or they might reflect the evaluation of totals between two tables, such as a source-to-target process or a source that writes results to both an accepted and a rejected table, where the sum totals must match.

Metrics function
When a large number of data rules are being used, it is recommended that the results from the data rules be consolidated into meaningful metrics by appropriate business categories. A metric is an equation that uses data rule, rule set, or other metric results (that is, statistics) as numeric variables in the equation. The following types of statistics are available for use as variables in metric creation:
v Data rule statistics
- Number of records tested
- Number of records that met the data rule conditions
- Number of records that did not meet the data rule conditions
- Percentage of records that met the data rule conditions
- Percentage of records that did not meet the data rule conditions
- Number of records in the variance from the data rule benchmark (optional)
- Percentage of records in the variance from the data rule benchmark (optional)
v Rule set statistics
- Number of records that met all rules
- Number of records that failed one or more rules
- Average number of rule failures per record
- Standard deviation of the number of rule failures per record
- Percentage of records that met all rules
- Percentage of records that failed one or more rules
- Average percentage of rule failures per record
- Standard deviation of the percentage of rule failures per record
v Metric statistic, which includes the metric value

A key system feature in the creation of metrics is the capability for you to use weights, costs, and literals in the design of the metric equation. This enables you to develop metrics that reflect the relative importance of various statistics (that is, applying weights), that reflect the business costs of data quality issues (that is, applying costs), or that use literals to produce universally-used quality control program measurements such as errors per million parts.
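For example, a literal can be used to scale a rule result into an errors-per-million measure. This is a hedged sketch; the statistic names follow the style of the earlier metric examples, and the actual labels offered when building the metric may differ:
( Customer TaxID Rule Not Met # / Customer TaxID Rule Tested # ) * 1000000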

Creating a metric
You create a metric by using existing data rules, rule sets, and metric statistical results. A metric is developed by using a two-step process.
1. In the Open Metric window, you define the metric, which includes the metric name, a description of the metric, and an optional benchmark for the metric results.
2. In the Open Metric window Measures tab, you define the metric equation line-by-line, by selecting a data rule executable, a data rule set executable, another metric, or a numeric literal for each line. Then you apply numeric functions, weights, costs, or numeric operators to complete the calculation required for each line of the metric.

Figure 30. Example of the Open Metric window with the Measures tab selected

You can then test the metric with test data before it is used in an actual metric calculation situation. Metrics produce a single numeric value as a statistic whose meaning and derivation are based on the design of the equation by the authoring user.

Sample business problems and solutions


The following are examples of typical business problems and metric solutions:

Business problem
The business defines a data quality issue as:
v Records with blank genders or blank addresses
v Blank genders are five times more serious than blank addresses

Solution
Create a metric to assess the results of these two data rule validations together:
( Account Gender Exists # Not Met * 5 ) + ( Address Line 2 Exists # Not Met )

Business problem
The business wants to evaluate and track the change of results in a data rule called AcctGender between one day and the next.

Solution
v There is one existing data rule to measure.
v You create three metrics: one to evaluate the current end of day, one to hold the value for the prior end of day, and one to assess the variance between the current and prior end of day values.
AcctGender_EOD
- (AcctGender_%Met)
- Run at end of day after the rule.
AcctGender_PriorEOD
- (AcctGender_EOD Metric Value)
- Run the next day prior to the rule.
AcctGender_EOD [same as the metric above]
- (AcctGender_%Met)
- Run after the new end of day after the rule.
AcctGender_EODVariance
- (AcctGender_EOD Metric Value - AcctGender_PriorEOD Metric Value)
- Run after the EOD metric.
v A benchmark applied to the AcctGender_EODVariance can be used to trigger alerts.

System capability
The system provides functionality to build simple or complex equations by using pre-defined arithmetic capabilities on the user interface screens. The system can then perform the actual calculation of a metric value based on that equation.

User responsibility
In developing metrics, your primary responsibility is understanding or interpreting the measures desired and their intended use. Common metrics include those that calculate sums, averages, or deviations of data rule and rule set results that are typically compared against a benchmark defined for that metric. Because metrics involve communication of their results and trends to a wide business audience, it is useful to design the metrics as a team effort with representation from many constituencies. The meaning of the metric value should be well documented along with a general understanding of how the metric value is calculated. In general, the use of metrics will evolve over time with ongoing data quality management practices. Existing metrics might require refinement over time and some might be replaced with new ones as users become more focused on critical data quality issues.

Interpreting results
The interpretation of a metric value result is directly related to your understanding of the design of the metric and its equation. The usage of percentile scales (values from 0-100), parts per million, or cost factors (a currency value) will generally have higher meaning. Ensure that the name of the metric reflects the measure used to facilitate understanding. It might be necessary to review the metric results over a period of time to establish a reasonable range of normal results. However, if a metric result is unsatisfactory, it will usually require an investigation into the individual data rule or rule set results used in the metric to determine where the unsatisfactory metric value originated from in the metric's calculation.

Decisions and actions


In developing metrics and their applications, you must make some key decisions.

You should review the metric results to identify any results that vary from expectations (for example, benchmarks) or that have unusual values. Assuming that the metric involved is accurate and does not require any maintenance updates, you should focus on the nature of the failed data and its root causes. This will typically be in one of the data rules or rule sets used in the metric. Corrective data cleanup action should be initiated as required and the data retested with the same metric.

Monitoring results
Data rules, rule sets, and metrics are all executable objects that can be run as needed or scheduled (either internally or via a command line interface). Data rules, rule sets, and metrics each generates historical events, statistics, and detail results. As these objects run repeatedly, they create a series of events which you can track, annotate, report, and trend over time.

Figure 31. Example of the Quality Dashboard window that shows alerts

Monitoring function
When reviewing results, you can choose several monitoring approaches.
v You can work entirely within the user interface, reviewing and annotating information.
v You can choose to export results for additional analysis.
v You can choose to report results and deliver those results to others, either on an as-needed or scheduled basis, and through various alternative formats.

Monitoring technique
You develop a report by using standard out-of-the-box report templates. There is a wide range of report templates, some for the data quality analysis functions and some associated with data rules, rule sets, and metrics. The following chart shows the general structure of report development: working from a template to define and save a specific report, and then running that report to generate ongoing output.

Figure 32. Report development structure (report template, report, report output)

IBM InfoSphere Information Analyzer provides extensive reporting capabilities.


Table 29. Reporting capabilities

Report Templates:
v Include report creation parameters
v Include report runtime parameters
v Are defined for each product
v Do not allow users to define their own templates
v Share the same graphical template
v Include 80+ templates for InfoSphere Information Analyzer

Reports:
v Include report runtime parameters
v Can be scheduled or run anytime as needed
v Can be formatted
v Include access rights

Report Results:
v Can be output as HTML, PDF, Microsoft Word rich text format (RTF), and XML
v Can be added to a favorite folder
v Include history (replace, keep, expiration)

When you define a report, the report is an instance of a report template associated with a project and particular objects, such as specific data rule results. Once the report is saved, it can be run and will produce standard output as defined in the report.

System capabilities
The system provides functionality to build reports by using pre-defined templates in the user interface. The system can then perform the actual report execution based on the defined report parameters.

User responsibility
In monitoring your data quality application, you are responsible for the design and intended use of reports, including the choice of output options. These selections determine how results will be made available to other users. Results can be made available through several places, both in the user interface and the reporting console. The options include:
v IBM InfoSphere Information Analyzer (rich client)
- Home Page presentations
- View saved report results
v Reporting console (browser client), where you can view saved report results
v Additional browser-based options, including generating reports as HTML or XML and making the reports available via a portal
v Utilizing report outputs, including:
- Generating reports as XML and building your own XSLT stylesheet
- Generating reports as XML or TXT where the data can be moved and used elsewhere

Decisions and actions


You should review the data quality plan to identify which users need to review the results and how best to deliver the results based on the overall business goals. You should focus on the nature of the results that you want to capture and how to make the results accessible by other users. This is likely to mean selection of optimal reports, selection of a standard delivery mechanism, and depending on the goals, potentially pushing report results out in a consumable format such as XML for downstream activity.

Deploying rules, rule sets, and metrics


Data rules, rule sets, and metrics in a data quality monitoring environment are typically targeted at production data, which requires more explicit control over tasks, such as definition and change. Typically, design, development, and initial testing occur in a non-production environment, with actual production data quality monitoring occurring in a separate production environment. The same approach can be utilized to share objects between two distinct IBM InfoSphere Information Analyzer environments.

Function
To deploy rules, rule sets, metrics, and global variables from one environment to another, the rule administrator will export the necessary rules, rule sets, metrics, and global variables from the initial environment and then import the package of those items into the second environment.

Technique
The rule administrator will use the Import and Export tasks to deploy the data quality objects. Export occurs in a project context. The rule administrator (which is the required role for this function) selects the objects to export, and then chooses the Export task. You enter the target file location (this will always be a location on the domain server). You select the types of items to include in the Export task, which can include:
v Project folders
v Bindings of variables
v Global variables
v Output configurations
v Result history

Once these are selected, you can proceed with the export.

Import also occurs in a project context in the second environment. The rule administrator in the second environment (which is the required role for the function) selects the Import task, enters the file location where the import file is located, selects the file for import, and proceeds with the import. Audit events are noted in all objects related to export or import.

System capability
The system provides functionality to export and import objects either through the user interface or through a command line interchange function. The system then can perform the actual export or import based on your selected parameters. The system will attempt to re-link or reestablish all connections in objects that are exported and imported. For example, in the case of a data rule with two variables bound to specific tables and columns where the tables and columns used do not exist in the target environment, the system will not be able to reestablish the linkage, though the information pertinent to the linkage or binding will be brought in.

User responsibility
You are responsible for validating that the rules, rule sets, and metrics imported into a new environment can be executed.

Decisions and actions


You should review and validate imported objects such as data rules, rule sets, and metrics to ensure they function in the new environment. Frequently, you will need to re-bind the variables if the basic naming of schemas, tables, or columns varies from one environment to the next. Some re-binding can be avoided, or limited to a single point, through the use of global variables. Best practices indicate that the approach for rule definition, testing, and deployment be clearly established. Naming standards become important in moving from initial design and testing to deployed rules, rule sets, or metrics. Take advantage of copy functions as needed to take a loosely defined rule definition and formalize it to standard naming conventions. Since environments are typically on separate servers, you must establish a standard method to move the package of rules, rule sets, and metrics from one server to another. This should be strictly controlled if the target is a production environment. Note: The exported package of objects can be added into a source control system to help facilitate standard practices and control.

Managing a data quality rules environment


As the users in your environment focus on more business areas and more systems, create more rule definitions, and share information across IBM InfoSphere Information Analyzer projects, there is a greater need for management of this data quality environment.

Organizing your work


Most work in IBM InfoSphere Information Analyzer occurs in the context of a project, which includes development, testing, and monitoring of rules, rule sets, and metrics.
The self-contained project provides the authorized user a selected view of the repository and the activities performed against it. Any number of projects might exist in an InfoSphere Information Analyzer environment. Such projects can have:
v The same or different data sources
v The same or different users, who can have the same or different roles in different projects
v The same or different rules, depending on whether they are developed in the project or drawn from the shared rules

Use the project structure to:
v Create a boundary for your work.
v Incorporate data sources that are relevant and useful.
v Secure analysis by including the right users with the right roles.
v Apply configuration settings that meet your needs.

Within a project, you can establish user-defined folders (much like folders or directories in Microsoft Windows) in which to place and organize rules, rule sets, and metrics.

Business problem
The business needs to logically group rules and metrics associated with a specific employee profession (in this example called 'Factory Workers') to facilitate ongoing data quality monitoring of an existing rule definition and its associated data rule (with more to come).
v The data rule definition is called 'Data Exists Factory Worker'.
v The data rule is called 'Factory Worker Gender Exists'.

Solution
v Create a new folder in the project called 'Factory Workers'.
v Move the first Data Rule Definition to the new 'Factory Workers' folder by opening the Data Rule Definition 'Data Exists Factory Worker' and:
- Select Folders.
- Select Add.
- Select the Factory Worker folder, Add, and then OK.
- Click Save and Close to save the Data Rule Definition.
v Repeat the above for the Data Rule 'Factory Worker Gender Exists'.

Note: The components you added to the folder will be visible in both the general project folder and the new 'Factory Workers' folder. They can be added to other folders as well. Regardless of which folder you open the rule definition or rule from, you are always working with the same item. The folders simply allow you to organize items together to help find or review them.

Decisions and actions


The project administrator should work with the groups participating in rule development to identify what data sources, users, and folders might be needed in the project. Most likely, folders will be added after the initial development work is started or once work has progressed to the level of ongoing data quality monitoring to facilitate user review.

Security, users, and roles


IBM InfoSphere Information Analyzer security leverages the common IBM InfoSphere Information Server security environment. A number of roles are used in InfoSphere Information Analyzer. Projects incorporate user, role, and privilege assignment. Users only see and use the projects that their userid or group has rights to see or use.

There are three InfoSphere Information Analyzer product level roles:
v Project administrator
v Data administrator
v User

There are four InfoSphere Information Analyzer specific project level roles:
v Business analyst
v Data steward
v Data operator
v Drilldown user

Within a project, a group or user can be associated with one or more InfoSphere Information Analyzer roles. These roles are used to decide which functions are available to each user in the project.

For data rules analysis and monitoring, there are four product level roles:
v Rules administrator
v Rules author
v Rules manager
v Rules user

You must have the rules user role to work with rules. The rules manager, rules author, and rules administrator roles are derived from rules user, so if you have one of the others, you are also a rules user. If you are not a rules user, you are not able to see data about rules on the home page or dashboard, and you will not be able to see the Data Quality workspace within the project. While you will be able to configure the home page and dashboard components if you are not in the rules user role, you will not be able to see any of the quality components either in the configuration screens or in the portals on the home page and dashboard.

Rules user
A rules user can:
v View the definitions, rules, rule sets, and metrics.
v Test definitions and metrics, and view the test results.
v View rules, rule sets, and metrics results.

Rules author
A rules author can:
v Create new definitions, rules, rule sets, and metrics.
v Edit components as long as they are not in the following states: Standard, Accepted, Deprecated. The rules author cannot set components to the Standard, Accepted, or Deprecated state.
v Set and remove baselines and remove runs.
v Delete any component they created.

Rules manager
A rules manager can:
v Delete any components.
v Change the status of any component to any other status. Only the rules manager can set components to the Standard, Accepted, or Deprecated statuses.

Rules administrator
A rules administrator can import and export components.

InfoSphere Information Analyzer data operator
v A project-level data operator can run rules, rule sets, and metrics.
v If you do not have the data operator role, you can only test rules within a project.

Business problem example


The business wants to control edits to rules once they have been accepted and approved.

Solution
Establish a specific user within a project as a rules manager:
v The rules manager can review and approve data quality objects (for example, data rule definitions) by changing the status to Approved or Standard.
v Once changed, the data quality objects can no longer be edited.
v Only the rules manager can change the status back to Draft or Candidate for additional editing.

If there are multiple InfoSphere Information Analyzer environments (such as development and production), remember that users and roles can differ between those environments. A given user might be part of related projects in each environment; however, that user might be a rules author in development, responsible for designing and testing new rules, and a rules user in production, able to monitor changes and trends in the rules but unable to make any changes to rules there.

Decisions and actions:

The security and project administrators should work together with the groups participating in rule development to identify the users and roles that might be needed in the project. Most likely, initial work will start with a group of rules authors. As work progresses, additional levels of control can be added, including rules managers to control status and editing of rules, and rules administrators to facilitate deployment of rules across environments. In production environments, execution of rules might be restricted to selected individuals with the data operator role.

Usage and audit trails


IBM InfoSphere Information Analyzer records several levels of information pertaining to the data rule definitions, data rules, rule sets, and other quality control components. This includes usage and audit trails of the particular quality control. Use these components to manage the data quality environment. You can use the Usage view of the Open Rule Set window to identify the components of a given quality control and where the object is used.

Figure 33. Example of the Usage view of the Open Rule Set window

For the rule and rule set definitions, you can see the local and global variables and terms used. You can also see which data rules and rule sets are based on which definition. For the rules and rule sets, you can also see the underlying definition, the sources bound to the variables, and what other rule sets and metrics are using the rules and rule sets. The Audit Trail view identifies when specific events occurred related to the quality control. This includes Export and Import, Create and Update, and for the executable objects, when they were Generated and Executed.

Figure 34. Example of the Audit Trail view of the Open Rule Set window

From an auditing standpoint, the specific historical executions also provide additional details. The results show:
v What the underlying definition was
v What data source was used and the total records processed
v When and how the job ran
- Start and end times
- Use of sampling
v What logic was enforced in the rule or rule set at the time of execution

Decisions and actions:

Designated individuals such as the project administrator, the rules administrator, or the rules manager will likely have responsibility to oversee the project environment. This can include periodic review of the usage and audit information of particular quality controls to gauge reuse or compliance with standards.

Contacting IBM
You can contact IBM for customer support, software services, product information, and general information. You also can provide feedback to IBM about products and documentation. The following table lists resources for customer support, software services, training, and product and solutions information.
Table 30. IBM resources
v IBM Support Portal: You can customize support information by choosing the products and the topics that interest you at www.ibm.com/support/entry/portal/Software/Information_Management/InfoSphere_Information_Server
v Software services: You can find information about software, IT, and business consulting services on the solutions site at www.ibm.com/businesssolutions/
v My IBM: You can manage links to IBM Web sites and information that meet your specific technical support needs by creating an account on the My IBM site at www.ibm.com/account/
v Training and certification: You can learn about technical training and education services designed for individuals, companies, and public organizations to acquire, maintain, and optimize their IT skills at www.ibm.com/software/sw-training/
v IBM representatives: You can contact an IBM representative to learn about solutions at www.ibm.com/connect/ibm/us/en/

Providing feedback
The following table describes how to provide feedback to IBM about products and product documentation.
Table 31. Providing feedback to IBM
v Product feedback: You can provide general product feedback through the Consumability Survey at www.ibm.com/software/data/info/consumability-survey
v Documentation feedback: To comment on the information center, click the Feedback link on the top right side of any topic in the information center. You can also send comments about PDF file books, the information center, or any other documentation in the following ways:
- Online reader comment form: www.ibm.com/software/data/rcf/
- E-mail: [email protected]

Accessing product documentation


Documentation is provided in a variety of locations and formats, including in help that is opened directly from the product client interfaces, in a suite-wide information center, and in PDF file books. The information center is installed as a common service with IBM InfoSphere Information Server. The information center contains help for most of the product interfaces, as well as complete documentation for all the product modules in the suite. You can open the information center from the installed product or from a Web browser.

Accessing the information center


You can use the following methods to open the installed information center.
v Click the Help link in the upper right of the client interface.
Note: From IBM InfoSphere FastTrack and IBM InfoSphere Information Server Manager, the main Help item opens a local help system. Choose Help > Open Info Center to open the full suite information center.
v Press the F1 key. The F1 key typically opens the topic that describes the current context of the client interface.
Note: The F1 key does not work in Web clients.
v Use a Web browser to access the installed information center even when you are not logged in to the product. Enter the following address in a Web browser: http://host_name:port_number/infocenter/topic/com.ibm.swg.im.iis.productization.iisinfsv.home.doc/ic-homepage.html. The host_name is the name of the services tier computer where the information center is installed, and port_number is the port number for InfoSphere Information Server. The default port number is 9080. For example, on a Microsoft Windows Server computer named iisdocs2, the Web address is in the following format: http://iisdocs2:9080/infocenter/topic/com.ibm.swg.im.iis.productization.iisinfsv.nav.doc/dochome/iisinfsrv_home.html.

A subset of the information center is also available on the IBM Web site and periodically refreshed at publib.boulder.ibm.com/infocenter/iisinfsv/v8r5/index.jsp.

Obtaining PDF and hardcopy documentation


v PDF file books are available through the InfoSphere Information Server software installer and the distribution media. A subset of the PDF file books is also available online and periodically refreshed at www.ibm.com/support/docview.wss?rs=14&uid=swg27008803.
v You can also order IBM publications in hardcopy format online or through your local IBM representative. To order publications online, go to the IBM Publications Center at www.ibm.com/shop/publications/order.


Providing feedback about the documentation


You can send your comments about documentation in the following ways:
v Online reader comment form: www.ibm.com/software/data/rcf/
v E-mail: [email protected]


Product accessibility
You can get information about the accessibility status of IBM products.

The IBM InfoSphere Information Server product modules and user interfaces are not fully accessible. The installation program installs the following product modules and components:
v IBM InfoSphere Business Glossary
v IBM InfoSphere Business Glossary Anywhere
v IBM InfoSphere DataStage
v IBM InfoSphere FastTrack
v IBM InfoSphere Information Analyzer
v IBM InfoSphere Information Services Director
v IBM InfoSphere Metadata Workbench
v IBM InfoSphere QualityStage

For information about the accessibility status of IBM products, see the IBM product accessibility information at http://www.ibm.com/able/product_accessibility/index.html.

Accessible documentation
Accessible documentation for InfoSphere Information Server products is provided in an information center. The information center presents the documentation in XHTML 1.0 format, which is viewable in most Web browsers. XHTML allows you to set display preferences in your browser. It also allows you to use screen readers and other assistive technologies to access the documentation.

IBM and accessibility


See the IBM Human Ability and Accessibility Center for more information about the commitment that IBM has to accessibility.


Notices and trademarks


This information was developed for products and services offered in the U.S.A.

Notices
IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service.

IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not grant you any license to these patents. You can send license inquiries, in writing, to:

IBM Director of Licensing
IBM Corporation
North Castle Drive
Armonk, NY 10504-1785 U.S.A.

For license inquiries regarding double-byte character set (DBCS) information, contact the IBM Intellectual Property Department in your country or send inquiries, in writing, to:

Intellectual Property Licensing
Legal and Intellectual Property Law
IBM Japan Ltd.
1623-14, Shimotsuruma, Yamato-shi
Kanagawa 242-8502 Japan

The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you.

This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice.

Any references in this information to non-IBM Web sites are provided for convenience only and do not in any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the materials for this IBM product and use of those Web sites is at your own risk.

IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you.

Licensees of this program who wish to have information about it for the purpose of enabling: (i) the exchange of information between independently created programs and other programs (including this one) and (ii) the mutual use of the information which has been exchanged, should contact:

IBM Corporation
J46A/G4
555 Bailey Avenue
San Jose, CA 95141-1003 U.S.A.

Such information may be available, subject to appropriate terms and conditions, including in some cases, payment of a fee.

The licensed program described in this document and all licensed material available for it are provided by IBM under terms of the IBM Customer Agreement, IBM International Program License Agreement or any equivalent agreement between us.

Any performance data contained herein was determined in a controlled environment. Therefore, the results obtained in other operating environments may vary significantly. Some measurements may have been made on development-level systems and there is no guarantee that these measurements will be the same on generally available systems. Furthermore, some measurements may have been estimated through extrapolation. Actual results may vary. Users of this document should verify the applicable data for their specific environment.

Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.

All statements regarding IBM's future direction or intent are subject to change or withdrawal without notice, and represent goals and objectives only. This information is for planning purposes only. The information herein is subject to change before the products described become available.

This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental.

COPYRIGHT LICENSE:

This information contains sample application programs in source language, which illustrate programming techniques on various operating platforms. You may copy, modify, and distribute these sample programs in any form without payment to IBM, for the purposes of developing, using, marketing or distributing application programs conforming to the application programming interface for the operating platform for which the sample programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these programs. The sample programs are provided "AS IS", without warranty of any kind. IBM shall not be liable for any damages arising out of your use of the sample programs.

Each copy or any portion of these sample programs or any derivative work, must include a copyright notice as follows:

(your company name) (year). Portions of this code are derived from IBM Corp. Sample Programs. Copyright IBM Corp. _enter the year or years_. All rights reserved.

If you are viewing this information softcopy, the photographs and color illustrations may not appear.

Trademarks
IBM, the IBM logo, and ibm.com are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at www.ibm.com/legal/copytrade.shtml.

The following terms are trademarks or registered trademarks of other companies:

Adobe is a registered trademark of Adobe Systems Incorporated in the United States, and/or other countries.

Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.

Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.

UNIX is a registered trademark of The Open Group in the United States and other countries.

Java and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both.

The United States Postal Service owns the following trademarks: CASS, CASS Certified, DPV, LACSLink, ZIP, ZIP + 4, ZIP Code, Post Office, Postal Service, USPS and United States Postal Service. IBM Corporation is a non-exclusive DPV and LACSLink licensee of the United States Postal Service.

Other company, product or service names may be trademarks or service marks of others.


Index

A
analysis functions and techniques analysis settings 50 analysis, planning 5 analytical function 5 asset rationalization project 4 9 data sample primary key analysis (multicolumn) 7 Data Sample Method option 53 Data Sample Size option 52 data samples 7 data sources 7 data type analysis 14 deploying metrics 94 deploying rule sets 94 deploying rules 94 deploying rules, rule sets, and metrics 94 deploying, user decisions and actions domain analysis 28 duplicate check analysis 41, 43 monitoring, user decisions and actions 94 monitoring, user responsibility 93

N
nullability analysis 22 nullability threshold 22 Nullability Threshold option 51

B
business case information analysis 2 business management practice information analysis 2

O
95 organizing data quality work 96

C
cardinality analysis 23 column analysis 7 cardinality analysis 23 column properties analysis 13 data quality controls analysis 25 data sample 7 data type analysis 14 frequency distribution 8, 9 length analysis 16 nullability analysis 22 precision analysis 18 scale analysis 20 column frequency distribution 7 column properties analysis 9, 13 completeness analysis 25 Composite Key Maximum Columns 53 Constant Threshold option 52 cross-table analysis 43 duplicate check analysis 43 foreign key analysis 46 referential integrity analysis 48 customer support 101

P
5 performance considerations 9 planning for analysis 5 precision analysis 18 primary key analysis (multicolumn) data sample 7 Primary Key Threshold option 52 product accessibility accessibility 105 project applications 4

E
enterprise data management projects

F
foreign key analysis 46 format analysis 34 frequency distribution column analysis 8, 9 frequency distributions 7

R
referential integrity analysis 48 rule analysis and monitoring 54 rule set definitions, user decisions and actions 87 rules methodology building data rules 70 business rules 66 creating a data rule 72 creating a data rule executable 73 creating a rule set 81 creating a rule set executable 82 data classes and data rule types 61 data quality, user decisions and actions 97 data rule analysis function 61 data rule analysis techniques 70 data rule analysis, additional capabilities 76 data rule analysis, defining benchmarks 77 data rule analysis, interpreting results 75 data rule analysis, threshold 77 data rule analysis, user responsibilities 74 data rule definitions, reuse 78 data rule definitions, user decisions and actions 79 data rule set, function 80 data rule sets 79 data rule types 63

I
import metadata 7 information analysis best practices 1 business case 2 business management practice data sources 7 functions and techniques 9 methodology 1, 5

D
data classification analysis 9, 10 data integration projects 4 data quality assessment 4 data quality column analysis 9 data quality controls analysis 25 completeness analysis 25 domain analysis 28 format analysis 34 data quality management 1 data quality, user decisions and actions 97 data rule definitions, user decisions and actions 79 data rule types rules methodology 63 data rules and rule sets, structuring 55

L
legal notices 107 length analysis 16

M
managing data quality rules environment 95 metadata import 7 methodology, rule versus rule set 56 metrics function 88 metrics techniques 89 metrics, user decisions and actions 92 metrics, user responsibility 91 monitoring function 92 monitoring results 92 monitoring technique 92


rules methodology (continued) data rules analysis 60 deploying rules, rule sets, and metrics 94 deploying, user decisions and actions 95 deploying, user responsibilities 95 global variables 57 local variables 57 logical definitions 55 managing data quality rules environment 95 metrics 88 metrics function 88 metrics techniques 89 metrics, interpreting results 91 metrics, user decisions and actions 92 metrics, user responsibility 91 monitoring function 92 monitoring results 92 monitoring technique 92 monitoring, user decisions and actions 94 monitoring, user responsibility 93 organizing data quality work 96 positive and negative expressions 71, 81 quality control rules 63 rule set analysis, interpreting results 84 rule set definitions, user decisions and actions 87 rule sets, user responsibilities 84 security, user decisions and actions 98 usage and audit trails 99 usage and audit trails, user decisions and actions 100 valid and invalid data 71, 81 rules methodology, naming standards 57

W
where clause 7

S
samples, data 7 scale analysis 20 security, user decisions and actions software services 101 support customer 101 98

T
table analysis 36 duplicate check analysis 41 primary key analysis 36 trend analysis 5

U
Uniqueness Threshold option 51 usage and audit trails 99 usage and audit trails, user decisions and actions 100

