Data Profiling White Paper
Figures
Figure 1: Metadata report on a character field.
Figure 2: Pattern frequency report for telephone numbers.
Figure 3: Statistics on a column of loan data.
Figure 4: Frequency distribution on state data.
Figure 5: Outlier report on product weight.
Figure 6: Results of primary key/foreign key analysis.
Executive summary
Current data quality problems cost U.S. businesses more than $600 billion per year.1 Not so long ago, the way to become a market leader was to have the right product at the right time. But the industrial and technological revolutions of the last century created a market with many companies offering the same products. The path to market leadership required companies to design and manufacture products cheaper, better and faster. And as more businesses entered a market with lower barriers to entry, products showed fewer of the distinguishing characteristics that define a market leader, resulting in a more commodity-based marketplace. With narrow margins and constant competition, organizations realized that a better product no longer guaranteed success.

In the last 10 years, organizations have concentrated on optimizing processes to bolster success. Profits are as much the result of controlling expenses as of generating additional revenue. To realize significant savings, companies throughout the world are implementing two primary enterprise applications: enterprise resource planning (ERP) and customer relationship management (CRM). Each of these applications focuses on driving increased efficiency from core business processes, with ERP systems holding expenses in check and CRM systems working to build more profitable relationships with customers. Successfully implemented, ERP systems help companies optimize their operational processes and reduce processing costs. On the opportunistic, customer-facing side of profit-seeking, companies realize that customers are expensive to acquire and maintain, leading to the deployment of CRM systems. At the same time, organizations have developed data warehouses in an effort to make more strategic decisions across the enterprise, spending less and saving more whenever possible.

But a new age in enterprise management is here. The very foundation of ERP and CRM systems is the data that drives these implementations. Without valid corporate information, enterprise-wide applications can only function at a "garbage in, garbage out" level. To be successful, companies need high-quality data on inventory, supplies, customers, vendors and other vital enterprise information; without it, their ERP or CRM implementations are doomed to fail. In its most recent Global Data Management Survey, PricewaterhouseCoopers writes that "the new economy is the data economy." The survey adds that companies "are entering a crucial phase of the data age without full control or knowledge of the one asset most fundamental to their success: data."2

The successful organizations of tomorrow are the ones that recognize that data (or, more accurately, the successful management of corporate data assets) will determine the market leaders of the future. If your data is going to make you a market leader, it must be consistent, accurate and reliable. Achieving this level of prosperity requires solid data management practices, including data profiling, data quality, data integration and data augmentation. And any data management initiative begins with profiling, where you analyze the current state of your data and begin to build a plan to improve your information. This paper discusses data profiling in detail: what it is and how it can be deployed at your organization. The paper will also look at how data profiling fits into the broader data management process of your organization.

1 Data Quality and the Bottom Line: Achieving Business Success through a Commitment to High Quality Data. The Data Warehousing Institute, Report Series, 2002.
2 Global Data Management Survey. PricewaterhouseCoopers, 2001.
Many business and IT managers face the same problems when sifting through corporate data. Often, organizations do not (and worse yet, cannot) make the best decision because they cannot get access to the right data. And just as often, a decision is made based on data that is faulty or untrustworthy. But regardless of the state of the information within your enterprise, the King in Alice's Adventures in Wonderland had the right idea: "Begin at the beginning."

Data profiling is a fundamental, yet often overlooked, step that should begin every data-driven initiative. Every ERP implementation, every CRM deployment, every data warehouse development and every application rewrite should start with data profiling. Industry estimates for ERP and data warehouse implementations show these projects fail or go over budget 65-75% of the time. In almost every instance, project failures, cost overruns and long implementation cycles are due to the same problem: a fundamental misunderstanding about the quality, meaning or completeness of the data that is essential to the initiative. These are problems that should be identified and corrected before the project begins. By identifying data quality issues at the front end of a data-driven project, you can drastically reduce the risk of project failure.

To address information challenges at the outset, data profiling provides a proactive approach to understanding your data. Data profiling, also called data discovery or data auditing, is specifically about discovering the data available in your organization and the characteristics of that data. Data profiling is a critical diagnostic phase that arms you with information about the quality of your data. This information is essential in helping you determine not only what data is available in your organization, but how valid and usable that data is.

Profiling your data is based on the same principle your mechanic uses when you take your car to the shop. If you take your car in and tell the mechanic that the car has trouble starting, the mechanic doesn't say, "Well, we'd better change the timing belt." The mechanic goes through a series of diagnostic steps to determine the problem: he checks the battery, checks the fluids, tests the spark plugs and checks the timing. After a thorough diagnostic review, the mechanic has validated the reliability of different parts of the engine, and he is ready to move forward with the needed changes.
Starting a data-driven initiative (ERP system, CRM system, data warehouse, database consolidation, etc.) without first understanding the data is like fixing a car without understanding the problems. You may get lucky, but chances are you will waste time and money doing work that is neither complete nor productive. And you are likely to become just another failure statistic for ERP, CRM or data warehousing implementations.

With proper data profiling methodologies, you can also gain valuable insight into your business processes and refine these procedures over time. For instance, a data analyst conducts a profiling routine on a CRM database and finds that over 50% of the product information is inaccurate, incorrect or outside of the standard parameters. The data analyst can then go to other departments, such as sales and business development, to find out how product data is entered into the system and find ways to refine and enhance this process.

To help you begin at the beginning, data profiling encompasses many techniques and processes that can be grouped into three major categories:

- Structure discovery: Does your data match the corresponding metadata? Do the patterns of the data match expected patterns? Does the data adhere to appropriate uniqueness and null value rules?
- Data discovery: Are the data values complete, accurate and unambiguous?
- Relationship discovery: Does the data adhere to specified required key relationships across columns and tables? Are there inferred relationships across columns, tables or databases? Is there redundant data?
The next section will look in detail at the structure discovery, data discovery and relationship discovery routines, and how you can use these profiling techniques to better understand your data.
Engaging in any data initiative without a clear understanding of these issues will lead to large development and cost overruns or potential project failures. GartnerGroup estimates that "through 2005, more than 50 percent of business intelligence and customer relationship management deployments will suffer limited acceptance, if not outright failure, due to lack of attention to data quality issues."3 From a real-world perspective, the effect can be incredibly costly; one company spent over $100,000 in labor costs identifying and correcting 111 different spellings of the company AT&T.

Because companies rely on data that is inconsistent, inaccurate and unreliable, large-scale implementations are ripe for failure or cost overruns. More disturbing, the organizations usually do not understand the magnitude of the problem or the impact that the problems have on their bottom line. Data problems within your organization can lead to lost sales and wasted expenses. Poor decisions. Sub-standard customer relations. And ultimately, failed businesses. Data profiling is the first step to help you diagnose (and fix) the problem. Now, let's look in more detail at the types of discovery techniques you should consider during the data profiling process.

3 A Strategic Approach to Improving Data Quality, by Ted Friedman. GartnerGroup, June 19, 2002.
Figure 1: Metadata report on a character field. Metadata analysis helps determine if the data matches the expectations of the developer when the data files were created. Has the data migrated from its initial intention over time? Has the purpose, meaning and content of the data been intentionally altered since it was first created? By answering these questions, you can make decisions about how to use the data moving forward.
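As a rough illustration (not a depiction of any particular product's behavior), the sketch below compares a field's documented metadata against what is actually observed in the data. The declared metadata and the sample values are hypothetical.

    # Minimal metadata-vs-actual check on one character field (illustrative only).
    # The declared metadata and sample values below are hypothetical.
    declared = {"name": "CUSTOMER_ID", "type": "CHAR", "length": 10, "nullable": False}

    values = ["C000123456", "C000987654", "", "C12345", None, "C000555001"]

    observed = {
        "max_length": max((len(v) for v in values if v), default=0),
        "null_or_blank_count": sum(1 for v in values if not v),
    }

    print("Declared length:", declared["length"],
          "| observed max length:", observed["max_length"])
    if observed["null_or_blank_count"] and not declared["nullable"]:
        print("Field is declared NOT NULL but has",
              observed["null_or_blank_count"], "missing values")

A check like this quickly reveals whether the data has drifted from the intent documented in the metadata.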
Pattern matching: Typically, pattern matching is used to determine whether the data values in a field are in the expected format. This technique can quickly validate that the data in a field is consistent across the data source, and that the information is consistent with your expectations. For example, pattern matching can verify whether a phone number field contains only phone numbers, or whether a Social Security number field contains only Social Security numbers. Pattern matching will also tell you if a field is all numeric, if a field has consistent lengths and other format-specific information about the data.

As an example, consider a pattern report for North American phone numbers. There are many valid phone number formats, but all valid formats consist of three sets of numbers (three numbers for the area code, three numbers for the exchange, four numbers for the station). These sets of numbers may or may not be separated by a space or special character. Valid patterns might include:

- 9999999999
- (999) 999-9999
- 999-999-9999
- 999-999-AAAA
- 999-999-Aaaa
In these examples, "9" represents any digit, "A" represents any uppercase alpha (letter) character and "a" represents any lowercase alpha character. Now, consider the following pattern report on a phone number field.
Figure 2: Pattern frequency report for telephone numbers.
The majority of the phone data in this field contains valid phone numbers for North America. There are, however, some data entries that do not match a valid phone pattern. A data profiling tool will let you drill through a report like this to view the underlying data, or generate a report containing the drill-down subset of data to help you correct those records.
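A minimal sketch of how a pattern frequency report like Figure 2 might be produced: each value is reduced to a pattern (digits become "9", uppercase letters "A", lowercase letters "a") and the resulting patterns are counted. The sample phone numbers are hypothetical.

    from collections import Counter

    def to_pattern(value: str) -> str:
        # Map every digit to "9", uppercase letters to "A", lowercase to "a";
        # keep punctuation and spaces as-is so the format stays visible.
        out = []
        for ch in value:
            if ch.isdigit():
                out.append("9")
            elif ch.isalpha():
                out.append("A" if ch.isupper() else "a")
            else:
                out.append(ch)
        return "".join(out)

    phones = ["(919) 555-0101", "919-555-0102", "9195550103", "555-CALL", "unknown"]
    for pattern, count in Counter(to_pattern(p) for p in phones).most_common():
        print(f"{pattern!r}: {count}")

Patterns with low counts are usually the first places to drill down for bad records.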
Basic Statistics: You can learn a lot about your data just by reviewing some basic statistics about it. This is true for all types of data, especially numeric data. Reviewing statistics such as minimum/maximum values, mean, median, mode and standard deviation can give you insight into the validity of the data. Figure 3 shows statistical data about personal home loan values from a financial organization. Personal home loans normally range from $20,000 to $1,000,000. A loan database with incorrect loan amounts can lead to many problems, from poor analysis results to incorrect billing of the loan customer. Let's take a look at some basic statistics from a loan amount column in the loan database.

Figure 3: Statistics on a column of loan data.
This report uncovers several potential problems with the loan amounts (see arrows above): the minimum value of a loan is negative; the maximum value for a loan is $9,999,999; there are two loans with missing values (Null Count); and the median and standard deviation are unexpectedly large. All of these indicate potential problems for a personal home loan data file.

Basic statistics give you a snapshot of an entire data field. As new data is entered, tracking basic statistics over time will give you insight into the characteristics of the new data entering your systems. Checking the basic statistics of new data before it enters a system can alert you to inconsistent information and help prevent adding problematic data to a data source.

Metadata analysis, pattern analysis and basic statistics are a few of the techniques that profiling tools use to discover potential structure problems in a data file. These problems appear in files for a variety of reasons. Many are caused by incorrectly entering data into a field (which is most likely the source of the negative value in the home loan data). Some occur because a correct value was unknown and a default or fabricated value was used (potentially the origin of the $9,999,999 home loan). Other structure problems are the result of legacy data sources that are still in use or have been migrated to a new application. During the data creation process for older mainframe systems, programmers and database administrators often designed shortcuts and encodings that are no longer used or understood, and IT staff would overload a particular field for different purposes. Structure analysis can help uncover many of these issues.
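As a rough sketch of how the figures in a report like Figure 3 can be computed, the example below profiles a hypothetical loan_amount column using only Python's standard library (None represents a missing value).

    import statistics

    loan_amounts = [185000, 243500, None, 9999999, -500, 310000, None, 97500]

    present = [v for v in loan_amounts if v is not None]
    report = {
        "count": len(loan_amounts),
        "null_count": loan_amounts.count(None),
        "minimum": min(present),
        "maximum": max(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "std_dev": statistics.stdev(present),
    }
    for name, value in report.items():
        print(f"{name:>10}: {value:,.2f}" if isinstance(value, float)
              else f"{name:>10}: {value:,}")

Even on this tiny sample, the negative minimum, the $9,999,999 maximum and the null count stand out immediately, which is exactly the kind of signal the report in Figure 3 provides.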
Values with the same meaning are often represented differently across a data source. The analytical and operational problems of this non-standard data can be very costly, because you cannot get a true picture of the customers, businesses or items in your data sources. For instance, a life insurance company may want to determine the top ten companies that employ its policyholders in a given geographic region. With this information, the company can tailor policies to those specific companies. If the employer field in the data source has the same company entered in several different ways, inaccurate aggregation results are likely.

In addition, consider a marketing campaign that personalizes its communication based on a household profile. If there are a number of profiles for customers at the same address because the addresses are represented inconsistently, the variations can have a nightmare effect on highly targeted campaigns, causing improper personalization or the creation of too many generic communication pieces. These inefficiencies waste time and money on both material production and the creative efforts of the group, while alienating customers by ineffectively marketing to their preferences. While these are simple data inconsistency examples, these and other similar situations are endemic to databases worldwide. Fortunately, data profiling tools can discover these inconsistencies, providing a blueprint for data quality technology to address and fix these problems.

Frequency Counts and Outliers: When there are hundreds or even thousands of records to profile, it may be possible for a business analyst to scan the file and look for values that don't look right. But as the data grows, this quickly becomes an immense task. Many organizations spend hundreds of thousands of dollars to pay for manual validation of data. This is not only expensive and time-consuming; manual data profiling is also inaccurate and prone to human error.
Frequency counts and outlier detection are techniques that can reduce the amount of manual fault detection required of business analysts. In essence, these techniques highlight the data values that need further investigation. You can gain insight into the data values themselves, identify values that may be incorrect and drill down into the data to make a more in-depth determination. Consider the following frequency distribution of a field containing state and province information.
Figure 4: Frequency distribution on state data.
The frequency distribution shows a number of correct state entries. But the report also shows data that needs to be corrected. Incorrect state spellings, invalid state abbreviations and multiple representations of states can all cause problems. California is represented as "CA", "CA.", "Ca." and "California". Non-standard representations will have an impact any time you try to do state-level analysis. The invalid state entries may prevent you from contacting certain individuals; the missing state values make communication even more problematic.

Outlier detection also helps you pinpoint problem data. Whereas frequency counts look at how often each value occurs, outlier detection examines the (hopefully) few data values that are remarkably different from the rest. Outliers show you the highest and lowest values for a set of data, and the technique is useful for both numeric and character data. Consider the following outlier report (showing the 10 minimum and 10 maximum values for the field). In Figure 5, the field is product weight, measured in ounces, for individual-serving microwaveable meals. A business analyst would understand that the valid weights are between 16 and 80 ounces.
Figure 5: Outlier report on product weight.
However, as you can see, there are many outliers on both the low end and the high end. On the low end, the values were probably entered in pounds instead of ounces. On the high end, these are potentially case or pallet weights instead of individual serving weights. Outlier detection allows you to quickly and easily determine whether there are gross inconsistencies in certain data elements, and data profiling tools can let you drill through to the actual records to determine the best mechanism for correction.
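A minimal sketch of both techniques: a frequency distribution over a state field (as in Figure 4) and a listing of the lowest and highest values in a numeric weight field (as in Figure 5). All sample values are hypothetical.

    from collections import Counter

    # Frequency distribution on a state field.
    states = ["NC", "CA", "CA.", "Ca.", "California", "NC", "TX", "", "N. Carolina"]
    for value, count in Counter(states).most_common():
        print(f"{value!r}: {count}")

    # Outlier listing on product weight (ounces): show the 3 lowest and 3 highest values.
    weights = [24, 32, 2, 48, 64, 1.5, 40, 2200, 36, 1850, 56, 28]
    ordered = sorted(weights)
    print("lowest:", ordered[:3])
    print("highest:", ordered[-3:])

The frequency report surfaces the multiple spellings of California, and the sorted extremes surface the pound-instead-of-ounce and pallet-weight entries without anyone reading the file row by row.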
Business Rule Validation: Every organization has basic business rules. These rules cover everything from basic lookup rules, such as salary ranges by grade:

Salary Grade    Salary Range Low    Salary Range High
20              $25,000             $52,000
21              $32,000             $60,000
22              $40,000             $80,000

to complex, very specific formulas:

Reorder_Quantity = (QuantPerUnit * EstUnit) [Unit_type] - Inventory_onHand

You can check many basic business rules at the point of data entry and, potentially, recheck these rules on an ad hoc basis. The problems that arise from a lack of validation can be extensive, including overpaying expenses, running out of inventory and undercounting revenue. Because business rules are often specific to an organization, you will seldom find data profiling technology that provides these checks out-of-the-box; pre-built rules typically cover generic validation such as domain checking, range checking, look-up validation or specific formulas. In addition to these canned data profiling validation techniques, a robust data profiling process must be able to build, store and validate against an organization's unique business rules. Applications today need the ability to store, access and implement these basic business rules for data validation, and data profiling should use these same rules to monitor and identify violations.
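Organization-specific rules like these can be expressed as simple predicates that a profiling job evaluates against every record, reporting violations. Below is a minimal sketch with two hypothetical rules (a salary-grade range lookup mirroring the table above, and a non-negative inventory check); the field names are illustrative assumptions, not part of any product.

    # Hypothetical lookup table: salary grade -> (low, high), mirroring the table above.
    SALARY_RANGES = {20: (25000, 52000), 21: (32000, 60000), 22: (40000, 80000)}

    RULES = {
        "salary_within_grade": lambda r: (SALARY_RANGES[r["grade"]][0]
                                          <= r["salary"]
                                          <= SALARY_RANGES[r["grade"]][1]),
        "nonnegative_inventory": lambda r: r["inventory_on_hand"] >= 0,
    }

    records = [
        {"grade": 21, "salary": 45000, "inventory_on_hand": 12},
        {"grade": 22, "salary": 95000, "inventory_on_hand": -3},  # violates both rules
    ]

    for i, record in enumerate(records):
        for name, rule in RULES.items():
            if not rule(record):
                print(f"record {i}: violates rule '{name}'")

Keeping the rules in a named collection like this is one way to store them once and reuse them for both initial profiling and ongoing monitoring.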
Relationship discovery provides you with information about the ways that data records relate to one another. These can be multiple records in the same data file, records across data files or records across databases. With relationship discovery, you can profile your data to answer the following questions:

- Are there potential key relationships across tables?
- If there is a primary/foreign key relationship, is it enforced?
- If there is an explicit or inferred key relationship, is there any orphaned data (data that does not have a primary key associated with it)?
- Are there duplicate records?
Relationship discovery starts with any available metadata about key relationships, and the documented relationships need to be verified. In the absence of metadata, relationship discovery should also determine which fields (and therefore, which records) have relationships. Once potential relationships are determined, further investigation is needed:

- Does the relationship provide a primary/foreign key?
- If so, is the primary key unique? If not, which records prevent it from being unique?
- For the key relationships, are there any outstanding records that do not adhere to the relationship?

Figure 6 shows the results of a primary key/foreign key analysis, where two products listed in the sales data did not exist in the products table.
Figure 6: Results of primary key/foreign key analysis. Data profiling has many different aspects. This section has covered some of the more basic types of profiling techniques. Any solid profiling initiative should cover the structure, data and relationship aspects and generate the reports and business rules you need to fully understand (and repair) your data.
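A minimal sketch of the check behind a report like Figure 6: find sales records whose product ID has no parent row in the products table. The table contents and column names are hypothetical.

    # Products table (primary key: product_id) and sales table (foreign key: product_id).
    products = [{"product_id": "P100"}, {"product_id": "P200"}, {"product_id": "P300"}]
    sales = [
        {"sale_id": 1, "product_id": "P100"},
        {"sale_id": 2, "product_id": "P999"},   # orphan: no matching product
        {"sale_id": 3, "product_id": "P300"},
        {"sale_id": 4, "product_id": "P777"},   # orphan: no matching product
    ]

    known_ids = {p["product_id"] for p in products}
    orphans = [s for s in sales if s["product_id"] not in known_ids]
    print("orphaned sales records:", orphans)

    # A primary key should also be unique; duplicate key values are another red flag.
    print("duplicate product_id values:", len(known_ids) != len(products))

The same set-membership idea scales to real tables through SQL anti-joins, but the logic of the audit is no more complicated than this.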
The most effective data management tools can address all of these initiatives. Data analysis reporting alone is just a small part of your overall data initiative. The results from data profiling serve as the foundation for data quality and data integration initiatives, allowing you to automatically transfer this information to other data management efforts without losing the context or valuable details of the data profiling.
The first part of the process to achieve a high degree of quality control is to perform routine audits of your data, as discussed in this paper. A list of these audits follows, along with examples of each.

- Domain checking: In a gender field, the value should be "M" or "F".
- Range checking: For age, the value should be less than 125 and greater than 0.
- Cross-field verification: If a customer orders an upgrade, make sure that customer already owns the product to be upgraded. If closed credit accounts must have a balance of zero, make sure there are no records where the closed account flag is true and the account balance is greater than zero.
- Address format verification: If "Street" is the designation for street, then make sure no other designations are used.
- Name standardization: If "Robert" is the standard name for Robert, then make sure that "Bob", "Robt." and "Rob" are not used.
- Reference field consolidation: If "GM" stands for General Motors, make sure it does not stand for General Mills elsewhere.
- Format consolidation: Make sure date information is stored yyyymmdd in each applicable field.
- Referential integrity: If an order shows that a customer bought product XYZ, then make sure that there actually is a product XYZ. If there is a defined primary key/foreign key relationship across tables, validate it by looking for records that do not have a parent.
- Basic statistics, frequencies, ranges and outliers: If a company has products that cost between $1,000 and $10,000, you can run a report for product prices that occur outside of this range. You can also review product information, such as SKU codes, to see whether the SKU groupings are correct and in line with the expected frequencies.
- Duplicate identification: If an inactive flag is used to identify customers that are no longer covered by health benefits, make sure all duplicate records are also marked inactive. If UPC or SKU codes are supposed to be unique, make sure they are not being reused.
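Several of the audits above reduce to simple record-level tests. A minimal sketch of three of them (domain checking, range checking and cross-field verification) against hypothetical customer records; the field names are assumptions for illustration only.

    customers = [
        {"gender": "M", "age": 42, "account_closed": False, "balance": 120.50},
        {"gender": "X", "age": 140, "account_closed": True, "balance": 35.00},  # fails all three
    ]

    def audit(record):
        problems = []
        if record["gender"] not in ("M", "F"):                   # domain checking
            problems.append("gender outside allowed domain")
        if not (0 < record["age"] < 125):                        # range checking
            problems.append("age out of range")
        if record["account_closed"] and record["balance"] != 0:  # cross-field verification
            problems.append("closed account with nonzero balance")
        return problems

    for i, customer in enumerate(customers):
        for problem in audit(customer):
            print(f"record {i}: {problem}")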
Auditing the data as it stands in your systems is not enough; data profiling needs to be a continuous activity. Your organization is dynamic and evolving. New business initiatives and new business rules continuously generate and incorporate new data into your systems, and each of these new elements brings the potential for more data problems and additional integration headaches. The rules that you create as part of your initial data profiling activities should be available throughout the data management processes at your organization. As you monitor the consistency, accuracy and reliability of your data over time, you need to apply these same rules to ad hoc data checks. As you investigate data profiling tools, look for tools that can integrate rules and technology into scheduled data profiling processes to track changes in data quality over time.

Finally, you must also make the most of the relationships between data elements, data tables and databases. After you get an overall view of the data within your enterprise, data management solutions must provide the ability to:

- Fix business rule violations.
- Standardize and normalize data sources.
- Consolidate data across data sources.
- Remove duplicate data and choose the best surviving information.
As part of your initial profiling activities, you can develop and implement all required business and integration rules. A robust data management tool will provide the ability to integrate the data validation algorithms as part of standard applications at your organization.
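One way to make profiling continuous, as described above, is to re-run the stored validation rules on a schedule and log the violation rate so data quality can be tracked over time. The sketch below is illustrative only; the rule, the records and the log file name are assumptions, and in practice they would come from your stored business rules and production data sources.

    import csv
    import datetime

    def violation_rate(records, rule):
        # Fraction of records that fail the rule; 0.0 if there are no records.
        return sum(1 for r in records if not rule(r)) / len(records) if records else 0.0

    def state_rule(record):
        # Hypothetical rule: the state field should be a two-character code.
        return len(record.get("state", "")) == 2

    records = [{"state": "NC"}, {"state": "California"}, {"state": "TX"}, {"state": ""}]

    with open("quality_history.csv", "a", newline="") as log:
        csv.writer(log).writerow([datetime.date.today().isoformat(),
                                  "state_is_2_char_code",
                                  f"{violation_rate(records, state_rule):.2%}"])

Run on a schedule, a log like this shows at a glance whether a given rule's violation rate is improving or degrading as new data enters the system.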
Data integration requires powerful matching technology that can locate less obvious members of a related group. Data integration technologies will recognize that Jim Smith at "100 Main Street" and Michelle Smith at "100 Main St" are members of the same household. A good solution will also recognize that two people with different last names living at the same address could be spouses or members of the same household. In addition, data integration technology can determine that two items are the same. A good data integration tool can determine that "1/4 x 3 wood screw zinc" and "screw, wood (zinc) x 3 inches" are the same product. Data integration gives you the ability to join data based on similar concepts as well as exact data matches.
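Matching engines use far more sophisticated parsing, standardization and phonetic logic than this, but the core householding idea can be sketched with a normalized string comparison from Python's standard library. The normalization rules, the similarity threshold and the sample addresses below are illustrative assumptions only.

    from difflib import SequenceMatcher

    def normalize(address: str) -> str:
        # Crude normalization: lowercase, drop periods, expand two common abbreviations.
        text = address.lower().replace(".", "")
        for short, full in (("st", "street"), ("ave", "avenue")):
            text = " ".join(full if word == short else word for word in text.split())
        return text

    def same_household(addr1: str, addr2: str, threshold: float = 0.85) -> bool:
        return SequenceMatcher(None, normalize(addr1), normalize(addr2)).ratio() >= threshold

    print(same_household("100 Main Street", "100 Main St"))    # True: likely one household
    print(same_household("100 Main Street", "42 Oak Avenue"))  # False

The design point is that matching works on normalized, similarity-scored values rather than exact strings, which is what lets "100 Main Street" and "100 Main St" resolve to the same household.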
Conclusion
So, you want to make data a strategic asset at your organization. You understand that your data must be consistent, accurate and reliable if you want your organization to be a leader. As the King said in Alice's Adventures in Wonderland, "Begin at the beginning." The most effective approach to consistent, accurate and reliable data is to begin with data profiling, and the most effective approach to data profiling is to use a tool that automates the discovery process. But data profiling, while a critical piece of your efforts to strengthen your data, is only the first step. You will also need a methodology that ties these process steps together in a cohesive fashion. A comprehensive strategy requires technology in the four building blocks of data management (data profiling, data quality, data integration and data augmentation) to achieve success.
Getting started
A pioneer in data management since 1997, DataFlux is a market leader in providing comprehensive, end-to-end data management solutions. DataFlux products are designed to significantly improve the consistency, accuracy and reliability of an organization's business-critical data, enhancing the effectiveness of both customer- and product-specific information. The process of data management begins with a discovery or data profiling phase that asks one critical question: what points of data collection might have relevant, useful information for your data-based applications and initiatives? Once you begin to understand your data, you can correct errors and use this information to build more productive CRM, ERP, data warehousing or other applications. DataFlux provides a total solution for your data management needs, which encompasses four building blocks:

- Data Profiling: Discover and analyze data discrepancies
- Data Quality: Reconcile and correct data
- Data Integration: Integrate and link data across disparate sources
- Data Augmentation: Enhance information using internal or external data sources
DataFlux's end-user product, dfPower Studio, brings industrial-strength data management capabilities to both business analysts and IT staff. dfPower Studio is completely customizable, easy to implement, intuitive and usable by any department in your organization. With dfPower Studio, you can identify and fix data inconsistencies, match and integrate items within and across data sources, and identify and correct duplicate data. dfPower Studio also provides data augmentation functionality that allows you to append existing data with information from other data sources, including geographic or demographic information. Blue Fusion SDK, a software developer kit, is a packaged set of callable libraries that easily integrates the core DataFlux data management technology into every aspect of your systems, including operational and analytical applications. dfIntelliServer™ is a software developer kit built on Blue Fusion SDK that provides a client/server architecture for real-time data validation and correction as data is entered into Web pages or other applications. Working independently or together, dfPower Studio, Blue Fusion SDK and dfIntelliServer ensure a comprehensive data management environment, allowing for better business decisions and improved data-driven initiatives. DataFlux is a wholly owned subsidiary of SAS, the market leader in providing business intelligence software and services that create true enterprise intelligence.
DataFlux and all other DataFlux Corporation product or service names are registered trademarks or trademarks of, or licensed to, DataFlux Corporation in the USA and other countries. ® indicates USA registration. Copyright © 2003 DataFlux Corporation, Cary, NC, USA. All Rights Reserved. 10/03