
Data Profiling, Data Integration and Data Quality:
THE PILLARS OF MASTER DATA MANAGEMENT

Business Intelligence Network™
Research Report Prepared for Siperian

By David Loshin

EXECUTIVE SUMMARY

The wave of workgroup and desktop computing in the 1980s led to distributed data management,
resulting in applications supporting line of business operations with similar requirements yet
variant models, representations and management of information objects. Data replication across
mainframes, servers and desktops has led to ambiguity in the representation and semantics associated with implementing business concepts.

Initiatives in centralization (such as data warehousing) intend to consolidate organizational data into an information asset to be mined for actionable knowledge. Although centralization of
information for analysis and reporting has great promise, a new challenge emerges: as data sets
are integrated and transformed for analysis and reporting, cleansing and corrections applied at
the warehouse imply that the analysis and reports may no longer be synchronized with the source
data, suggesting the necessity for having a single source of truth for all applications – not just
analysis and/or reporting.

Over the past ten years, data profiling, data cleansing and matching, and data integration tools
have matured in concert with a desire to aggregate and consolidate “master data,” but today’s
master data management (MDM) initiatives differ from previous attempts at enterprise data
consolidation. An MDM program creates a synchronized, consistent repository of quality master
data to feed enterprise applications. Successful MDM solutions require quality integration of
master data from across the enterprise, relying on:

• Inventory and identification of candidate master data objects;

• Resolution of semantics, hierarchies and relationships for master entities;

• Seamless standardized information extraction, sharing and delivery;

• A migration process for consolidating the “best records” for the master repository;

• A service-oriented approach for accessing the consolidated master directory;

• Managing enterprise data integration using a data governance framework.

Copyright © May 2007, Powell Media, LLC and David Loshin. All rights reserved.
These tasks depend on traditional data quality and integration techniques: data profiling for
discovery and analysis; parsing; standardization for data cleansing; duplicate
analysis/householding and matching for identity resolution; data integration for information
sharing; and data governance, stewardship, and standards oversight to ensure ongoing
consistency. Essentially, data profiling, data integration and data quality tools are the three pillars
upon which today’s MDM solutions are supported. Vendor and customer analyses indicate that:

• Many master data programs have evolved from customer data quality, product data
quality, data assessment and validation, and data integration activities.

• MDM solutions are triggered by the introduction of data quality activities to support
technical infrastructure acquired for a specific purpose (e.g., enterprise resource planning
or customer relationship management).

• Data governance is a common success theme for MDM.

During the conversations and interviews with both vendors and their customers, recurring themes
led us to draw some conclusions about the evolution of successful master data management
initiatives:

1. There is a significant bidirectional influence between data quality and master data
management.

2. Customer data still is the main focus of MDM activities, but product information
management is rapidly growing in importance.

3. Formalizing data governance is a critical success factor for MDM.

4. Master data management is not always about consolidation of data.

5. The need for semantic integration has driven users to adapt existing tools for broader
purposes than originally intended.

As organizations increasingly focus on master data integration, their reliance on readily available
technologies, couched within an enterprise governance framework, will continue to drive both
analytic and operational productivity improvement for the foreseeable future.

Data Profiling, Data Integration and Data Quality:
THE PILLARS OF MASTER DATA MANAGEMENT
By David Loshin

INTRODUCTION
Over the past ten years, data profiling, data cleansing and matching, and data integration tools
have matured in concert with a recognized desire on the part of senior managers to aggregate and
consolidate “master data” – replicated or duplicated copies of common or shared data objects
such as customers and products that are peppered across disparate or distributed enterprise
systems. What differentiates today’s master data management (MDM) initiatives from previous
attempts at enterprise data consolidation? One might question whether there is any significant
difference at all. On the surface, MDM appears to be yet another attempt at consolidating data
into one single “system of record.”

However, consider customer data integration, product data integration and enterprise dimension
analysis. All of these ideas have been introduced within the context of data warehousing,
business intelligence, sales force automation, customer relationship management, etc. Yet, to
some extent, the promise of many technical applications (such as CRM) has not been realized;
and, over time, there has been growing skepticism as to their success. For example, many have
been critical of the inability to effectively exploit a CRM system for its intended benefits. The
reasons for this may not lie in the technologies per se, but rather in the ability to embed the
technologies within a set of business processes guided by policies for data governance, data
quality and information sharing.

These kinds of projects focus on integrating data from multiple sources into a single core
repository, and each reflects some aspects of a master data management project. However, what
differentiates MDM from previous enterprise integration efforts is that rather than having
primarily a technology focus, MDM initiatives typically have a business focus, concentrating on
the process of entity identification and validation with the business clients. While MDM uses
tools and technology, the combination of that technology with sound business and data
management practices is what provides hope for a resounding success.

The intention of an MDM program is to create a single repository of high quality master data that
subsequently feeds applications across the organization with a synchronized, consistent view of
enterprise data. The most critical aspects of a successful MDM solution require high quality
integration of master data instances from across the enterprise, and this relies heavily on:

• Inventory of data objects used throughout the enterprise;

• Identification of key data objects that are candidates for migration into a master
repository;

• Resolution of semantics for these entities, as well as hierarchies and object relationships;

• The ability to seamlessly facilitate standardized information extraction, sharing, and delivery;

• A quality-directed migration process for consolidating the “best records” for the master
repository;

• A service-oriented approach to exposing the consolidated master directory for enterprise access; and

• A governance framework for managing continued integration of enterprise data into the
master repository.

Not surprisingly, the tasks enumerated here depend on the traditional data quality and data
integration tools and methods that most likely are already in place: data profiling for discovery
and analysis; parsing; standardization for data cleansing; duplicate analysis/householding and
matching for identity resolution; data integration for information sharing; and data governance,
stewardship, and standards oversight to ensure ongoing consistency.

In fact, many of the organizations that are implementing MDM programs have solutions that
have evolved out of these traditional techniques. Data profiling, data integration, and data quality
tools are essentially the three pillars upon which today’s MDM solutions are supported. In
discussions with vendors and their customers, we have seen some common themes:

• It is very common that their master data integration programs have evolved from
customer data quality, product data quality, data assessment and validation, and data
integration activities.

• While targeted solutions have been developed to support MDM functionality, they are
often triggered by the introduction of data quality activities to support technical
infrastructure acquired for a specific purpose (e.g., enterprise resource planning or
customer relationship management).

• A common success theme is the introduction of data governance across data integration
functions.

As organizations increasingly focus on master data integration, their reliance on readily available
technologies, couched within an enterprise governance framework, will continue to drive both
analytic and operational productivity improvement for the foreseeable future.

This research paper explores the origins of the need for master data integration, provides a
description of master data and master data management, explores the technical aspects of master
data management and their reliance on the three pillars, and discusses some of the oversight and
governance challenges. Lastly, the research paper explores case studies to demonstrate best
practices and to provide guidelines for evaluating different ways that MDM solutions employ
data governance strategies to meet an organization’s needs. The case study for LexisNexis, a
Siperian customer, as well as the Siperian Master Data Management Solution Overview are
included in subsequent sections of this report.

DATA AND MASTER DATA MANAGEMENT


The Origins of Master Data

In the mid 1980s, the introduction of workgroup computing, coupled with desktop applications,
ushered in an era of information management distribution. Administrative control of a business
application along with its required resources brought a degree of freedom and agility to business
managers. However, by virtue of that distribution, line-of-business managers could dictate their own processes and develop their own vertical applications without enterprise-level constraints, leading to variance in the ways that business concepts and objects are defined.

Not only that, the increase in both power and functionality at the desktop has engendered an even
finer granularity of data distribution, allowing greater freedom in describing and modeling
business information. Whether it is in the mainframe files, the database server, or in desktop
spreadsheets, we start to see a confusing jumble of concepts, along with creative ways of
implementing those concepts.

Over the past ten years or so, the pendulum has swung back to centralized computing (such as
data warehousing) for applications that help improve the business, with the intention of
consolidating the organization’s data into an information asset to be mined for actionable
knowledge. While the centralization of information for analysis and reporting has great promise,
it introduces a different challenge: as data sets are integrated and transformed for analysis and
reporting, cleansing and corrections applied at the warehouse imply that the analysis and reports
may no longer be synchronized with the source data. In essence, this just clarifies the benefit of
creating a single source of truth for all enterprise applications, not just for analysis or reporting,
and this embodies the concept of master data.

For example, consider your company’s customers. Each customer may participate in a number of
business operations: sales, support, billing, or service. In each of these contexts, the customer
may play a different role; and, in turn, the business may value some attributes over others
depending on the context. Clearly, you want to ensure that your business processes don’t fail
because the customer appears multiple times in different data sets. In addition, you want to be
confident that the customer’s activities are accurately portrayed in management reports.

In other words, different business applications record transactions or analysis regarding entities
and their activities, and it is desirable for all the business applications to agree on what those
entities and activities are. We can summarize two objectives:

• Integrate the multiple variations of the same business entities into a single (perhaps
virtualized) source of truth, and then

• Enable enterprise applications to share that same view of the business objects within the
enterprise.

Defining Master Data

In any organization, there is recognition that there are abstract “entities” that fuel the
organization’s operation: customers, products, suppliers, vendors, employees, finances, policies,
etc. The fact is, though, that over the past twenty years, as line-of-business silos have come to rely on workgroup application frameworks, disparity has crept into organizational systems, introducing duplication and distribution of variant representations of the exact same “things.” There has been
a growing desire for enterprise integration projects, such as those driven by the need for
customer relationship management (CRM), the “customer 360° view,” or various enterprise
reference repositories. In each of these instances, the underlying objective is to create a
synchronized, consistent view of an organization’s core business entities.

In this research paper, we have used terms such as “business concepts” or “business entities”
when referring to master data, but what are the characteristics of master data? Master data
objects are those core business objects that are used in the different applications across the
organization, along with their associated metadata, attributes, definitions, roles, connections, and
taxonomies. Master data objects are those “things” that we care about – the things that are logged
in our transaction systems, measured and reported on in our reporting systems, and analyzed in
our analytical systems. Common examples of master data include:

• Customers

• Suppliers

• Parts

• Products

• Locations

• Contact mechanisms

Of course, even though many MDM activities focus on customer or product data, there are many
other potential business concepts and terms that could also be master data.

What is Master Data Management?

Rather than being a technology or a shrink-wrapped product, master data management comprises the business applications, methods and tools that implement the policies, procedures and infrastructure to support the capture, integration and subsequent shared use of
accurate, timely, consistent, and complete master data. The key to an effective return on
investment for master data management lies in the recognition that MDM is a program, not a
project. Data integration may appear to be a one-shot deal, but consolidating enterprise data into
a single master data repository implies that the organization is committed to the ongoing
stewardship and management of the quality of that data. Master data management is a program
intended to, among other things:

• Assess the use of fundamental enterprise information objects, data value domains and
business rules in the range of applications across the enterprise,

• Determine whether data objects relevant to business success that are used in different application data sets would benefit from centralization,

• Define a canonical information model for managing those key data objects in a shared (and perhaps centralized) repository,

• Collect and harmonize unique entity instances to populate the shared repository,

• Integrate the harmonized view of data object instances with existing and newly developed
business applications via a service-oriented approach, and

• Institute the proper data governance policies and procedures at the corporate or
organizational level to ensure the continuous maintenance of the master data repository.

What has been borne out by our customer interviews and case studies is that most of these
processes are not new to MDM, but have been used, either together or separately, for many years
in many different applications. While almost all of the programs we reviewed qualify as master

data management (or customer data integration or product information management) systems,
many of them evolved out of initiatives to improve data quality, to eliminate duplicate data and
consolidate multiple data sets, or to transform data from multiple external data sources into a
canonical internal model. In reviewing the asset created through these processes, organizations have latched onto the single master repository and, by combining data governance with best practices, have created master data management success stories.

The Emergence of Process from Tools

When laying the groundwork for an MDM program, there are some fundamental questions that
need to be asked and comprehensively explored. These questions are associated with identifying
critical data elements, determining which data elements constitute master data, locating and
isolating master data objects that exist within the enterprise, and reviewing and resolving the
variances between the different representations in order to consolidate instances into a single
view.

Because of the ways that diffused application architectures have evolved across different project
teams and lines of business, it is likely that while there are a relatively small number of core
master objects used, there are going to be many different ways that these objects are named,
modeled, represented, and stored. For example, any application that must manage contact
information for individual customers will rely on a data model that maintains the customer’s
name, but one application may maintain the individual’s full name, while others might break up
the name into its first, middle, and last parts. Even when different models are conceptually
similar in structure, there will still be naming differences, such as CUST_NUM vs.
CUSTOMER_NUMBER vs. CUST_ID. In addition, slight variations will be manifested in data
types and lengths. What may be represented as a numeric field in one table may be alphanumeric
in another, or similarly named attributes may be of slightly different lengths. Therefore, an
important early stage of an MDM program is developing a process for creating the metadata
inventory as a prelude to identifying master data objects and sources.

Data quality and data integration tools have evolved from simple standardization and pattern-matching into suites of tools for complex automation of data analysis, standardization, matching,
and aggregation. For example, data profiling has matured from a simplistic distribution analysis
into a suite of complex automated analysis techniques that can be used to identify, isolate,
monitor, audit and help address anomalies that degrade the value of an enterprise information
asset. Early uses of data profiling for anomaly analysis have been superseded by more complex
uses that are integrated into proactive information quality processes. When coupled with other
data quality technologies, they provide a wide range of functional capabilities. In fact, a number
of the case studies as well as vendor products employ data profiling for identification of master
data objects in their various instantiations across the enterprise.

A core capability of any MDM program is the ability to consolidate multiple data sets
representing a master data object (such as “customer”) and resolve variant representations into a
single “best record,” which is promoted into being the master copy for all participating
applications. This capability relies on consulting metadata and data standards that have been
discovered through the data profiling and discovery process to parse, standardize, match and
resolve into that “best record.” More relevant is that the tools and techniques used to identify
duplicate data and to identify data anomalies are exactly the same ones used to facilitate an
effective MDM strategy. That these capabilities are available from traditional data cleansing vendors is reflected in the numerous consolidations, acquisitions and partnerships
between data integration vendors and data quality tools vendors. This part of the research paper
will look at how data quality tools are required for a successful MDM implementation.

Most important is the ability to transparently aggregate data to populate the central master repository and to provide the access applications need to interact with it. In the absence of a standardized
integration strategy (and its accompanying tools), the attempt to transition to an MDM
environment would be stymied by the need to modernize all existing production applications.
Data integration products have evolved to the point where they can adapt to practically any data
representation framework and can provide the means for transforming existing data into a form
that can be communicated into and out of a master data repository.

In each of the next sections, we review the capabilities of each tool class and explore how those
tools are used to achieve our MDM objectives.

ASSESSMENT: DATA PROFILING


Our first pillar of MDM, data profiling, originated as a set of algorithms for statistical analysis
and assessment of the quality of data values within a data set, as well as for exploring
relationships that exist between value collections within and across data sets. For each column in
a table, a data profiling tool will provide a frequency distribution of the different values,
providing insight into the type and use of each column. Cross-column analysis can expose
embedded value dependencies, while inter-table analysis explores overlapping value sets that
may represent foreign key relationships between entities. It is in this way that profiling can be
used for anomaly analysis and assessment. However, the challenges of master data integration
have presented new possibilities for the use of data profiling, not just for analyzing the quality of
source data, but especially with respect to the discovery, assessment, and registration of
enterprise metadata as a prelude to determining best sources for master objects, as well as
managing the transition to an MDM and its necessary data migration.
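To make this concrete, the following is a minimal column-profiling sketch in Python (assuming the pandas library and a hypothetical customers.csv extract; the measures shown are illustrative rather than any particular tool's feature set). It computes a per-column frequency distribution along with cardinality and null counts, the raw material from which type and usage are inferred.

```python
import pandas as pd

def profile_columns(df: pd.DataFrame, top_n: int = 5) -> dict:
    """Build a simple per-column profile: inferred type, null count,
    cardinality, and a top-N frequency distribution of values."""
    profile = {}
    for col in df.columns:
        values = df[col]
        profile[col] = {
            "inferred_type": str(values.dtype),
            "row_count": len(values),
            "null_count": int(values.isna().sum()),
            "distinct_count": int(values.nunique(dropna=True)),
            # The frequency distribution provides insight into how the column is used.
            "top_values": values.value_counts(dropna=True).head(top_n).to_dict(),
        }
    return profile

if __name__ == "__main__":
    # Hypothetical source extract; the file name is illustrative only.
    for column, stats in profile_columns(pd.read_csv("customers.csv")).items():
        print(column, stats)
```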

Profiling for Metadata Resolution

If the objective of an MDM program is to consolidate and manage a single centralized master
resource, then before we can materialize a single master record for any entity, we must be able
to:

1. Discover which enterprise data resources may contain entity information

2. Understand which attributes carry identifying information

3. Extract identifying information from the data resource

4. Transform the identifying information into a standardized or canonical form

5. Establish similarity to other standardized records

This entails cataloging the data sets, their attributes, formats, data domains, definitions, contexts,
and semantics, not just as an operational resource, but rather in a way that can be used to
automate master data consolidation and govern the ongoing application interactions with the
master repository. In other words, to be able to manage the master data, one must first be able to
manage the master metadata.

Addressing these aspects suggests the need to collect and analyze master metadata in order to
assess, resolve, and unify similarity in both structure and semantics. While many enterprise data
sets may have documented metadata (e.g., RDBMS models, COBOL copybooks) that reveal
structure, some of the data – such as fixed-format or character-separated files – may have little
or no documented metadata at all. The MDM team must be able to resolve master metadata in
terms of formats at the element level and structure at the instance level. In the majority of the cases we surveyed, this requirement is best addressed by creatively applying data profiling
techniques. To best collect comprehensive and consistent metadata from all enterprise sources,
the natural technique is to employ both the statistical and analytical algorithms provided by data
profiling tools to drive the empirical assessment of structure and format metadata while
simultaneously exposing embedded data models and dependencies.

Profiling is used to capture the relevant characteristics of each data set in a standard way,
including names, source data type (e.g., RDBMS table, VSAM file, CSV file), as well as the
characteristics of each of its columns/attributes (e.g., length, data type, format pattern, among
others). Creating a comprehensive inventory of data elements enables the review of meta-model
characteristics such as frequently used names, field sizes, and data types. Managing this
knowledge in a metadata repository makes it possible to again apply the statistical assessment capabilities of data profiling techniques, looking for common attribute names (e.g., “CUSTOMER”) and their
assigned data types (e.g., VARCHAR(20)) to identify (and potentially standardize against)

commonly used types, sizes, and formats. This secondary assessment will highlight differences
in the forms and structures used to represent similar concepts. Commonalities among data tables
may expose the existence of a master data object. For example, different structures will contain
names, addresses and telephone numbers. Iterative assessment using data profiling techniques
will suggest to the analyst that these data elements are common characteristics of what ultimately
resolves into a “party” or “customer” type.
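A hedged sketch of such a metadata inventory follows (Python, again assuming pandas; the source file names and the "recurs across sources" heuristic are assumptions for illustration, not a prescribed method). Each data set registers the names, types, and maximum lengths of its columns, and the inventory is then scanned for attribute names that recur across sources as hints of a shared master data object.

```python
import pandas as pd
from collections import defaultdict

def register_metadata(inventory: dict, source_name: str, df: pd.DataFrame) -> None:
    """Record element-level metadata (name, type, maximum length) for one data set."""
    for col in df.columns:
        max_len = int(df[col].astype(str).str.len().max()) if len(df) else 0
        inventory[col.upper()].append(
            {"source": source_name, "dtype": str(df[col].dtype), "max_length": max_len}
        )

def candidate_master_attributes(inventory: dict, min_sources: int = 2) -> dict:
    """Attribute names appearing in several sources hint at shared master data objects."""
    return {name: entries for name, entries in inventory.items()
            if len({e["source"] for e in entries}) >= min_sources}

if __name__ == "__main__":
    inventory = defaultdict(list)
    # Hypothetical source extracts; in practice the list comes from the data inventory.
    for source in ["crm_customers.csv", "billing_accounts.csv"]:
        register_metadata(inventory, source, pd.read_csv(source))
    for name, entries in candidate_master_attributes(inventory).items():
        print(name, entries)
```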

Profiling for Data Quality Assessment

The next use of data profiling as part of an MDM program is to assess the quality of the source
data that will feed the master repository. The result of the initial assessment phase will be a
selection of candidate data sources to feed the master repository, but it will be necessary to
evaluate the quality of each data source to determine the degree to which that source conforms to
the business expectations. This is where data profiling again comes into play. Column profiling
provides statistical information regarding the distribution of data values and associated patterns
that are assigned to each data attribute, including range analysis, sparseness, format and pattern
evaluation, cardinality and uniqueness analysis, value absence, abstract type recognition, and
attribute overloading analysis.

These techniques are used to assess data attribute value conformance to the quality expectations
for the consolidated repository. Profiling also involves analyzing dependencies across columns
(looking for candidate keys, looking for embedded table structures, discovering business rules, or
looking for duplication of data across multiple rows). When applied across tables, profiling
evaluates consistency of relational structure, analyzing foreign keys and ensuring that implied
referential integrity constraints actually hold. Data rules can be defined that reflect the expression
of data quality expectations, and the data profiler can be used to validate data sets against those
rules. Characterizing data quality levels based on data rule conformance provides an objective
measure of data quality that can be used to score candidates for suitability for inclusion in the
master repository.
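The scoring idea can be sketched as follows (Python; the three data rules shown are invented examples of the kinds of expectations profiling and business review might surface). Each rule is a predicate over a record, and a source's conformance score is simply the fraction of its records that satisfy every rule.

```python
import re

# Illustrative data rules; real rules are derived from profiling results and business input.
RULES = {
    "customer_id is populated": lambda r: bool(r.get("customer_id")),
    "zip matches 5-digit pattern": lambda r: bool(re.fullmatch(r"\d{5}", r.get("zip") or "")),
    "status within allowed domain": lambda r: r.get("status") in {"ACTIVE", "INACTIVE", "PROSPECT"},
}

def conformance_score(records: list[dict]) -> float:
    """Fraction of records satisfying every data rule -- an objective quality measure
    for ranking candidate sources for inclusion in the master repository."""
    if not records:
        return 0.0
    passing = sum(1 for r in records if all(rule(r) for rule in RULES.values()))
    return passing / len(records)

if __name__ == "__main__":
    sample = [
        {"customer_id": "C001", "zip": "10001", "status": "ACTIVE"},
        {"customer_id": "", "zip": "ABCDE", "status": "ACTIVE"},
    ]
    print(f"Conformance: {conformance_score(sample):.0%}")  # 50%
```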

Profiling as Part of Migration

The same rules that are discovered or defined during the data quality assessment phase can be
used for ongoing conformance as part of the operational processes for streaming data from
source data systems into the master repository. By using defined data rules to proactively
validate data, an organization can distinguish those records that conform to defined data quality
expectations and those that do not. In turn, these defined data rules can contribute to baseline
measurements and ongoing auditing for data stewardship and governance. In fact, embedding
data profiling rules within the data integration framework makes the validation process for MDM
relatively transparent.
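Embedded in the integration flow, the same rules can route records as they stream toward the master repository; a minimal sketch follows (Python, reusing the hypothetical RULES dictionary from the previous example).

```python
def validate_stream(records, rules):
    """Split an incoming feed into records that conform to the data rules (forwarded
    toward the master repository) and exceptions held for stewardship review,
    noting which rules each exception violated so the results can feed audit metrics."""
    conforming, exceptions = [], []
    for record in records:
        failed = [name for name, rule in rules.items() if not rule(record)]
        if failed:
            exceptions.append({"record": record, "violations": failed})
        else:
            conforming.append(record)
    return conforming, exceptions
```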

INFORMATION SHARING THROUGH DATA INTEGRATION
Approaches to information sharing via an MDM system differ depending on the specific business requirements. Some applications are more lenient with respect to
synchronization (i.e., consistency and timeliness between applications and the master repository)
and latency (i.e., how quickly master data is visible across the enterprise). However, a successful
MDM initiative is driven by information sharing. Data must flow and be consolidated into the
repository, and participating applications benefit from the use of master data. Information is
shared using our second pillar of MDM, data integration tools, via three aspects:

• Data Extraction and Consolidation: Core master data attributes are brought from the
source systems into the master repository

• Data Federation: Complete master records are materialized from the participating data
sources

• Data Propagation: Master data is synchronized and shared with participating applications

Data Extraction and Consolidation

The major aspect of the integration pillar is the ability to extract information from different data
sources and consolidate the data into a target architecture. Data extraction presupposes that the technology can access data in a variety of source applications and can select and extract data instance sets into a format suitable for exchange; essentially, the capability boils down to gaining access to the right data sources at the right times to facilitate ongoing information exchange. At the same time, data integration products seamlessly
incorporate:

• Data transformation: Between the time that the data is extracted from the data source
and delivered to the target location, data rules may be triggered to transform the data into
the format that is acceptable to the target architecture. These rules may be engineered
directly within the data integration tool, or may be provided by alternate technologies embedded within the tool.

• Data monitoring: A different aspect of applying business rules is the ability to introduce
filters to monitor the quality of the data as it moves from one location to another. The
monitoring capability provides a way to incorporate the types of data rules defined during
the data profiling phase to proactively validate data and distinguish those records that
conform to defined data quality expectations and those that do not, providing
measurements for the ongoing auditing necessary for data stewardship and governance.

• Data consolidation: As data instances from different sources are brought together, the
integration tools use the parsing, standardization, harmonization and matching
capabilities of the data quality technologies to consolidate data into unique records in the
master data model.

Each of these capabilities contributes to the creation and consistency of the master repository. As
many of the MDM programs in our survey were put in place either as a single source of truth for new applications or as a master index for preventing data instance duplication, extraction, transformation and consolidation were applied both at an acute level (i.e., at the point that each data source is introduced into the environment) and at an operational level (i.e., on a continuous basis, applied to a feed from the data source).
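The extract-transform-monitor pattern described above might be sketched as follows (Python; the field names, transformation rules, and quality check are illustrative assumptions rather than any vendor's API). Note that the extraction step reconciles the CUST_NUM/CUSTOMER_NUMBER naming variance discussed earlier.

```python
def extract(source_rows):
    """Select and reshape the master-relevant attributes from a source feed."""
    for row in source_rows:
        yield {
            "customer_id": row.get("CUST_NUM") or row.get("CUSTOMER_NUMBER"),
            "name": row.get("NAME", "").strip(),
            "phone": row.get("PHONE", ""),
        }

def transform(record):
    """Apply data rules that convert the record into the target format."""
    record["name"] = record["name"].title()
    record["phone"] = "".join(ch for ch in record["phone"] if ch.isdigit())
    return record

def monitor(record, metrics):
    """Monitoring filter: count records that fail quality expectations in flight."""
    if len(record["phone"]) != 10:
        metrics["bad_phone"] = metrics.get("bad_phone", 0) + 1
    return record

def integrate(source_rows, repository, metrics):
    """Move a feed through extraction, transformation, and monitoring into the target."""
    for record in extract(source_rows):
        repository.append(monitor(transform(record), metrics))
```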

Data Federation

Although the “holy grail” of MDM is a single master data source completely synchronized with
all enterprise applications, most MDM implementations actually combine a master repository of core identifying and relevant data attributes with an indexing capability that acts as a registry for the data that is distributed across the enterprise. Often, only demographic/identifying
data is stored within the master repository, implying that the true “master record” is a
compendium of attributes drawn from various sources indexed through the master registry. In
essence, accessing the master repository requires the ability to decompose the access request into
its component queries and assemble the results into the master view.

This is a type of federated information model, and is one that is often serviced via enterprise
application integration (EAI) or enterprise information integration (EII) styles of data integration
tools. This capability is important for materializing views on demand in MDM systems built on a registry framework, or any framework that does not maintain all attributes in the repository. This style of master data record materialization relies on the existence of a unique identifier within a master registry that carries both core identifying information and an index to
locations across the enterprise holding the best values for designated master attributes.
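One way to picture that materialization step is sketched below (Python; the registry layout and the source adapters are hypothetical constructs for illustration). The registry holds the identifying attributes plus pointers to the systems that own the best values for the remaining master attributes, so materializing a master record means decomposing the request into per-source lookups and assembling the results.

```python
def materialize_master_record(master_id, registry, source_adapters):
    """Assemble a full master view on demand from a registry-style hub.

    registry[master_id] -> {"identity": {...},
                            "attribute_sources": {"credit_limit": "erp", "email": "crm"}}
    source_adapters[name] -> callable(master_id, attribute) returning the owned value.
    """
    entry = registry[master_id]
    record = dict(entry["identity"])          # core identifying data lives in the hub itself
    for attribute, source_name in entry["attribute_sources"].items():
        fetch = source_adapters[source_name]  # component query against the owning system
        record[attribute] = fetch(master_id, attribute)
    return record
```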

Data Propagation

The last component of data integration – data propagation – is applied to support the
redistribution and sharing of master data back to the participating applications. Propagation may
be explicit, with replicated, read-only copies made available to specific applications, or may be
incorporated more strategically using a service-oriented approach. MDM applications that
employ the replication approach will push data from the master repository to one or more
replication locations or servers, either synchronously or asynchronously, using guaranteed
delivery data exchange. Again, EAI products are suitable to this aspect of data integration.

The alternate approach involves the creation of a service layer on top of the master repository to serve each application’s master data requirements. At one extreme, the master data is used only as a way to guard against the creation of duplicate data. At the other extreme, the applications
that participate in the MDM program yield their reliance on their own version of the data, and
instead completely rely on the data that has been absorbed into the master repository. This range
of capabilities requires that the access (both request and delivery) be provided via services,
which depends on the propagation of data out of the repository and delivery to the applications.

CONSOLIDATION VIA DATA QUALITY TECHNOLOGY


While the techniques used for data cleansing, scrubbing and duplicate elimination have been
around for a long time, their typical application was for static data cleanups, as part of
preparation for migration of data into a data warehouse or migrations for application
modernization. However, tools to help automate the determination of a “best record” clearly
meet the needs of a master data consolidation activity as well, which is the reason that data
quality tools have emerged as the third pillar of MDM.

The data quality tools that are of greatest value to the master data integration process are:

• Parsing and standardization: Data values are subjected to pattern analysis, and value
segments are recognized and then put into a standard representation

• Data transformation: Rules are applied to modify recognized errors into acceptable
formats

• Record matching and identity resolution: Used to evaluate “similarity” of groups of data instances to determine whether or not they refer to the same master data object.

In our case studies, the use of data quality technology was by far the most critical to the
successful implementation of a master data management program. As the key component for determining variance in representations associated with the same real-world object, data quality techniques were used both in launching the MDM activity and in maintaining its continued trustworthiness.

Parsing and Standardization

Data parsing tools enable the data analyst to define patterns that can be fed into rules engines that
are used to distinguish between valid and invalid data values. When a specific pattern is
matched, actions may be triggered. When a valid pattern is parsed, the separate components may
be extracted into a standard representation. When an invalid pattern is recognized, the
application may attempt to transform the invalid value into one that meets expectations.

Many data issues are attributable to situations where slight variance in representation of data
values introduces confusion or ambiguity. For example, consider the different ways telephone
numbers are formatted: some contain only digits, some include alphabetic characters, and a variety of special characters are used as separators, yet we recognize each one as a telephone number. However, in order to determine if these numbers are accurate (perhaps by comparing
them to a master customer directory) or to investigate whether duplicate numbers exist when
there should be only one for each supplier, the values must be parsed into their component
segments (area code, exchange and line) and then transformed into a standard format.
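A minimal parsing-and-standardization step for North American telephone numbers might look like the following (Python; the accepted patterns, the keypad letter mapping, and the canonical output format are assumptions for illustration).

```python
import re

# Keypad letter-to-digit mapping handles numbers written with alphabetic characters.
_KEYPAD = str.maketrans("ABCDEFGHIJKLMNOPQRSTUVWXYZ", "22233344455566677778889999")

def standardize_phone(raw: str) -> str | None:
    """Parse a telephone number into area code, exchange, and line,
    then return a single canonical representation (or None if unparseable)."""
    digits = re.sub(r"\D", "", raw.upper().translate(_KEYPAD))
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]                   # drop a leading country code
    if len(digits) != 10:
        return None                           # unrecognized pattern: route to exception handling
    area, exchange, line = digits[:3], digits[3:6], digits[6:]
    return f"+1-{area}-{exchange}-{line}"

# e.g. standardize_phone("(800) 555-CARS") == standardize_phone("800.555.2277") == "+1-800-555-2277"
```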

The human ability to recognize familiar patterns contributes to our ability to characterize variant
data values belonging to the same abstract class of values; people recognize different types of
telephone numbers because they conform to frequently used patterns. When an analyst can describe all of the format patterns that can be used to represent a data object (e.g., Person Name,
Product Description, etc.), a data quality tool can be used to parse data values that conform to
any of those patterns and even transform them into a single, standardized form that will simplify
the assessment, similarity analysis and cleansing processes. Pattern-based parsing can automate
the recognition and subsequent standardization of meaningful value components.

As data sources are introduced into the MDM environment, the analysts must assemble a
mechanism for recognizing the supplied data formats and representations, and transform them
into a canonical format in preparation for consolidation. Developed patterns are integrated with
the data parsing components, with specific data transformations introduced to effect the
standardization. Parsing and standardizing the data instance is essentially the first step in data
consolidation.

Data Transformation

Data standardization results from applying a mapping from the source data into a target
representation. Customer name data provides a good example – names may be represented in
thousands of semi-structured forms, and a good standardizer will be able to parse the different
components of a customer name (e.g., first name, middle name, last name, initials, titles,
generational designations, etc.) and then rearrange those components into a canonical
representation that can be manipulated by other data services.

Data transformation is often rule-based – transformations are guided by mappings of data values
from their position and values in the source into their position and values in the target.
Standardization is a special case of transformation, employing rules that capture context,
linguistics and idioms that have been recognized as common over time through repeated analysis
by the rules analyst or tool vendor.
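As an illustration of such rule-based standardization, here is a hedged sketch of a name standardizer (Python; the title and suffix vocabularies and the canonical layout are illustrative assumptions, as a vendor knowledge base would be far richer).

```python
# Illustrative vocabularies only; production tools carry much larger knowledge bases.
TITLES = {"MR", "MRS", "MS", "DR"}
SUFFIXES = {"JR", "SR", "II", "III", "IV"}

def standardize_name(raw: str) -> dict:
    """Parse a semi-structured personal name into components and a canonical form."""
    text = raw.upper()
    if "," in text:                                   # handle the "LAST, FIRST MIDDLE" layout
        last_part, rest = text.split(",", 1)
        text = f"{rest} {last_part}"                  # rotate into FIRST MIDDLE LAST order
    tokens = [t.strip(".,") for t in text.split() if t.strip(".,") not in TITLES]
    suffix = tokens.pop() if len(tokens) > 1 and tokens[-1] in SUFFIXES else ""
    while len(tokens) < 2:                            # guard against empty or single-token names
        tokens.append("")
    first, *middle, last = tokens
    return {"first": first, "middle": " ".join(middle), "last": last, "suffix": suffix,
            "canonical": " ".join(t for t in [first, *middle, last, suffix] if t)}

# e.g. "Dr. David M. Loshin" and "LOSHIN, DAVID M." both yield the canonical "DAVID M LOSHIN"
```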

Interestingly, data integration tools usually provide data transformation ability and could perhaps
even be used to support data quality activities. The similarity in function and purpose between data integration and standardization has not been lost on the marketplace. Since 2005, there has
been significant consolidation in the data quality tools market, with data quality companies
acquired by companies in the data integration and business intelligence space. The distinction
between data transformation engines and data parsing and standardization tools often lies in the
knowledge base that is present in the data quality tools that drive the data quality processes.

Identity Resolution and Matching

Record linkage and matching is employed in identity recognition and resolution, and
incorporates approaches used to evaluate “similarity” of records for use in duplicate analysis and
elimination, merge/purge, householding, data enhancement, cleansing and strategic initiatives
such as customer data integration or master data management. A common data quality problem
involves two sides of the same coin:

• Multiple data instances that actually refer to the same real-world entity.

• The perception by an analyst or application that a record does not exist for a real-world
entity when in fact it really does.

In the first situation, similar, yet slightly variant representations in data values may have been
inadvertently introduced into the system. In the second situation, a slight variation in
representation prevents the identification of an exact match of the existing record in the data set.

This is a fundamental issue for master data management, as the expectation is that the master
repository will hold a unique representation for every entity. This implies that the MDM
environment must include a service for analyzing and resolving object entities as part and parcel
of each participating application’s functionality. Both of these issues are addressed through a
process called similarity analysis, in which the degree of similarity between any two records is
scored, most often based on weighted approximate matching between a set of attribute values
between the two records. If the score is above a specific threshold, the two records are deemed to
be a match and are presented to the end client as most likely to represent the same entity. It is
through similarity analysis that slight variations are recognized and data values are connected –
and subsequently consolidated.
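The scoring step can be sketched in a few lines (Python; the attribute weights, the similarity measure, and the threshold are illustrative assumptions that a real matcher would tune against its own data).

```python
from difflib import SequenceMatcher

# Illustrative weights; identifying attributes typically carry more weight than others.
WEIGHTS = {"last_name": 0.4, "first_name": 0.2, "phone": 0.25, "zip": 0.15}
MATCH_THRESHOLD = 0.85

def field_similarity(a: str, b: str) -> float:
    """Approximate string similarity in the range [0, 1]."""
    return SequenceMatcher(None, a.upper(), b.upper()).ratio()

def record_similarity(r1: dict, r2: dict) -> float:
    """Weighted approximate match across the designated identifying attributes."""
    return sum(w * field_similarity(r1.get(f, ""), r2.get(f, "")) for f, w in WEIGHTS.items())

def is_match(r1: dict, r2: dict) -> bool:
    """Two records are deemed to represent the same entity above the threshold."""
    return record_similarity(r1, r2) >= MATCH_THRESHOLD
```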

Attempting to compare each record against all the others to provide a similarity score is not only
ambitious, but also time-consuming and computationally intensive. Most data quality tool suites
use advanced algorithms for blocking records that are most likely to contain matches into smaller
sets, whereupon different approaches are taken to measure similarity. Identifying similar records
within the same data set probably means that the records are duplicated, and may be subjected to
cleansing and/or elimination. Identifying similar records in different sets may indicate a link

across the data sets, which helps facilitate cleansing, knowledge discovery and reverse
engineering – all of which contribute to master data aggregation.
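A blocking step might be sketched as follows (Python; the blocking key shown, a last-name prefix combined with ZIP code, is a common but purely illustrative choice). The match predicate is passed in, so the similarity test from the previous sketch can be plugged in directly.

```python
from collections import defaultdict
from itertools import combinations

def blocking_key(record: dict) -> str:
    """Cheap grouping key for records that are likely to match."""
    return record.get("last_name", "")[:3].upper() + "|" + record.get("zip", "")

def candidate_pairs(records: list[dict], is_match) -> list[tuple[dict, dict]]:
    """Compare records only within each block instead of all-against-all,
    returning the pairs judged to represent the same entity."""
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record)
    return [(r1, r2)
            for block in blocks.values()
            for r1, r2 in combinations(block, 2)
            if is_match(r1, r2)]
```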

There are two basic approaches to matching. Deterministic matching, like parsing and
standardization, relies on defined patterns and rules for assigning weights and scores for
determining similarity. Alternatively, probabilistic matching relies on statistical techniques for
assessing the probability that any pair of records represents the same entity. Deterministic
algorithms are predictable in that the patterns matched and the rules applied will always yield the
same matching determination. Performance, however, is tied to the variety, number and order of
the matching rules. Deterministic matching works out of the box with relatively good
performance, but it is only as good as the situations anticipated by the rules developers.

Probabilistic matching relies on the ability to take data samples for “training” purposes – by
looking at the expected results for a subset of the records and tuning the matcher to self-adjust
based on statistical analysis. These matchers are not reliant on rules, so the results may be
nondeterministic. However, because the probabilities can be refined based on experience,
probabilistic matchers are able to improve their matching precision as more data is analyzed.

SURVEY FINDINGS
During the course of this research, approximately twenty case studies were evaluated to review
how organizations are evolving successful master data management programs. Each customer
case study focused on understanding:

• The motivations for deciding to assemble an MDM program;

• The kinds of data elements deemed to be master data objects;

• Approaches, architectural techniques and technologies used;

• Risk and success factors;

• Data governance activities performed in concert with the technical solution; and

• Next steps.

In addition, the customers were surveyed as to their general assessment of the quality of data
both before and after the deployment of the MDM program, the kinds of data quality oversight
introduced, and whether the organization changed or matured as a result. The customers were

also asked about the approaches used to assess business requirements, the determination of
potential vendors and the key decision factors in selecting a vendor solution.

Recurring themes led us to draw some conclusions about the evolution of successful master data
management initiatives:

• There is a significant bidirectional influence between data quality and master data
management.

• Customer data still is the main focus of MDM activities, but product information
management is rapidly growing in importance.

• Formalizing data governance is a critical success factor for MDM.

• Master data management is not always about consolidation of data.

• The need for semantic integration has driven users to adapt existing tools for broader purposes than originally intended.

The Influence of Data Profiling and Quality on MDM (and Vice-Versa)

Most of the customers interviewed cited data quality improvement as both a driver and a byproduct of their MDM or CDI initiatives; in almost every case study, improving data quality was named as a major motivation for the program. Consider these examples:

A large software firm’s customer data integration program is driven by the need for
improvement of customer data integrated from legacy systems or migrated from acquired
company systems. As customer data is brought into their CRM system, they employ data
profiling and data quality tools to understand what data is available, whether the data
meets business requirements and to resolve duplicate identities. In turn, the master
customer system is used as the baseline for matching newly created customer records to
determine potential duplication as part of a quality identity management framework.

An industry information product compiler discussed their need to rapidly and effectively integrate new data sources into their master repository, given that deploying a new data source could take weeks, if not months. By using data profiling tools, the customer can increase the speed of deploying a new data source. As a
byproduct, the customer states that one of the ways that they add value to the data is by
improving the quality of the source data. This improvement is facilitated when the company works with its clients to point out source data inconsistencies and anomalies, and then provides services to assist in root-cause analysis and elimination.

Product Data Grows in Importance

Even though a large number of MDM programs are for customer data, master product data is
growing in relevance. Traditional catalog and inventory applications drive existing product
master applications, but as online “e-tailing” business models continue to be developed, the need
for organizing product information should explode. Product data is interesting in that product
names often carry meaning, while (semi- and unstructured) product descriptions encompass
industry-specific lingo that is critical to semantic analysis. The difference between customer data
and product data lies in their respective uses – customer data is often used for resolution of
identity (e.g., matching), but product data is organized for consumption (e.g., e-commerce).
Some of the customers interviewed were particularly enthusiastic about their product information
integration and management initiatives, such as these examples:

A public sector organization is entrusted with managing the cross-organizational purchasing catalog. Managing the catalog involves incorporating vendor and product data into a comprehensive set of customized views that can be delivered based on specific purchasing requirements. In this situation, the customer's main role is to extract data from the vendor and product databases and transform it into different kinds of tailored data products, which involves a significant amount of conversion, filtering and
even rearranging data values as part of the standardization process for the purposes of
master data consolidation. To address this need, the organization employs data profiling
tools for standardizing the data extraction and catalog creation processes, integrating
business rules into cyclic or periodic catalog extractions, and essentially creating a
service layer for accessing the master copy of product information.

A business product retailer manages multiple suppliers with various similar product
lines. However, there is limited standardization of descriptions across suppliers, complicating the task of categorizing products for sale. To better enable the customer
experience, the retailer embarked on a product information management program to
rationalize and standardize product descriptions across multiple applications (ERP,
hosted catalogs, e-tailing) to make product presentation consistent and provide greater
precision in matching products to customer search requests. This rationalization employs
data profiling and quality tools to extract knowledge based on embedded semantics as a
prelude to developing the master product representation. These techniques are then used
to analyze parsed data and apply rules for transformation to correctly categorize the
data within the master repository and subsequent web-based catalog presentation.

Raised Awareness of Data Governance

Master data management involves integration and consolidation of data from across numerous
applications and data sources, and is managed as an enterprise program. Any enterprise initiative

relies on active collaboration among the participants, and this is both encouraged and
monitored via a data governance framework. MDM will only become relevant in an organization
if the lines of business participate – both as data suppliers and as master data consumers. In other
words, MDM needs governance to encourage collaboration and participation across the
enterprise, but also drives governance by providing a single point of truth. Ultimately, the use of
the master repository as an acknowledged source of truth is driven by transparent adherence to
defined information policies specifying the acceptable levels of data quality for shared
information. Almost all of the customer case studies included some layer of governance as part
of the MDM program, incorporating metadata analysis and registration, developing “rules of
engagement” for collaboration, definition of data quality expectations and rules, monitoring and
managing quality of data and changes to master data, stewardship to oversee automation of
linkage and hierarchies, and processes for researching root causes and subsequent elimination of
sources of flawed data.

MDM is Not Always about Consolidation

In some cases, the objective was to integrate master data as a way of managing separation of
information across different applications. Regulatory constraints, organizational approaches to
governance and specific business applications may require both the identity management aspects
of MDM in concert with a segregation of information based on the business context, as these
examples demonstrate:

A leading human therapeutics company is subject to numerous compliance challenges across multiple regulatory domains. While the company maintains a significant amount of customer information, different regulations apply to each customer repository based on customer type and business area, with the result that one business area may hold customer attributes that may not be shared with other business areas. At the same time,
their indirect sales model necessitates data synchronization with third parties. To address
this, the company has adopted a centralized master customer registry encompassing
sanitized attributes used for matching, but uses a distributed model to segregate master
attributes within each line of business. The distributed model of master data segregation
allows for effective governance for regulatory compliance purposes, as well as allowing
for semantic variance associated with critical data elements.

An electronic parts supplier’s inability to effectively catalog parts by their characteristic features hampered the company’s ability to respond to requests for quotes when the
requested parts could not be matched exactly. Customers submit millions of line-item
requests for product quotes every month in various formats; and while manual
processing would conceptually handle the requests, the volume is so great that it would
require a huge staff. Consequently, the company wanted to rely on automation, but their
match rate hovered around 50%, indicating an opportunity for improvement. The
company has embarked on a new product information master program that will support a

parametric search engine that enables guided part navigation. This requires that part
data be analyzed to identify key characteristics, archived within the master repository
according to a set of taxonomies and integrated with e-commerce applications organized
to optimize the process of searching for parts. This approach is expected to significantly
improve the match rate and demonstrates a way of applying the data quality techniques
within an MDM framework for segregating product data across one or more taxonomies
to address the organization’s business challenge.

Adapting Data Quality Tools to Metadata Analysis is Critical

Since the MDM program is intended to create a synchronized, consistent repository of quality
master data, quality integration incorporates the profiling and quality aspects to inventory and
identify candidate master object sets, but this must be done in a way that establishes resolution of
semantics, hierarchies, taxonomies and relationships of those master objects. On the other hand,
we have seen that the benefits of imposing an enterprise data governance framework include the
oversight of critical data elements, clear unambiguous definitions and collaboration among
multiple organizational divisions.

These two ideas converge when examining the use of data profiling and data quality tools for
semantic analysis and integration. Using profiling to assess and inventory enterprise metadata
provides an automated approach to the fundamental aspects of building the master data model.
Profiling and quality tools together are used to parse and monitor content within each data instance. This suggests that their use is not just a function of matching names, addresses or products, but rather of automating the conceptual understanding of the information embedded within the
representative record. This knowledge is abstracted in the metadata approach, with a semantic
registry serving as the focus of meaning for shared information.

For example, consider the approaches employed for customer match. In two of our case studies,
the search and match tools were used to support the definition and subsequent population of
master data models that could be adapted to different lines of business. In turn, these
organizations developed a service for analyzing proposed new records to determine if their
content reflected existing enterprise knowledge or could be tapped to increase that body of
knowledge.
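
A hedged sketch of such a service follows. The match logic is deliberately simplistic (exact match on a standardized name and postal code) and stands in for whatever identity-resolution engine an organization actually uses; the point is the decision the service makes, namely attaching the new record to existing enterprise knowledge or admitting it as new knowledge.

```python
def standardize(record):
    """Very rough stand-in for parsing/standardization of an incoming party record."""
    return (record["name"].strip().upper(), record["postal"].strip()[:5])

class MatchService:
    def __init__(self):
        self.masters = {}  # standardized key -> master id

    def evaluate(self, record):
        """Return ('matched', id) if the record reflects existing knowledge,
        or ('new', id) if it extends the body of enterprise knowledge."""
        key = standardize(record)
        if key in self.masters:
            return ("matched", self.masters[key])
        master_id = f"M-{len(self.masters) + 1:05d}"
        self.masters[key] = master_id
        return ("new", master_id)

svc = MatchService()
print(svc.evaluate({"name": "Jane Q. Public", "postal": "02139-1234"}))  # ('new', 'M-00001')
print(svc.evaluate({"name": "jane q. public ", "postal": "02139"}))      # ('matched', 'M-00001')
```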

SUMMARY

While MDM is seen as an emerging technology, it appears that many programs now referred to as master data management or customer data integration are actually mature versions of projects originally engineered as data quality improvement programs. The development of a master repository as a source of truth for enterprise data is not a new idea. However, the success of these programs has greatly improved as a result of two developments:

• The combination of data analysis via data profiling, the application of data quality techniques, and information sharing via data integration has lowered the barriers to MDM implementation, and

• The tight coupling of process and oversight with tools and techniques via a data
governance program has enabled the actualization of MDM as a viable enterprise
initiative that is both appealing to the business client and executable by the IT teams.

As organizations increasingly focus on master data integration, their reliance on readily available
technologies, couched within an enterprise governance framework, will continue to drive both
analytic and operational productivity improvement for the foreseeable future.

LEXISNEXIS CASE STUDY

LexisNexis, a leading provider of comprehensive information and business solutions to professionals in a variety of areas – legal, risk management, corporate, government, law
enforcement, accounting and academic – initiated a project in 2000 to consolidate their customer
views across the organization into a single customer repository.

The planned acquisition of a third-party application to provide customer relationship management (CRM) was initially thought to be sufficient for the task. However, after purchasing
a CRM application and attempting to model various enterprise customer data sets, it became
obvious that the CRM system was unable to adequately satisfy the requirements to consolidate
the organization’s master data.

The next approach for LexisNexis was to consider developing an internally funded application
for a central database with searching and matching logic. However, the cost estimate for building
this capability proved to be very high, and the internal development project was rejected.

By 2004, it became evident that a market was emerging for independent software vendor
providers of precisely the type of applications LexisNexis was seeking. The cost to purchase a
master data management (MDM) / customer data integration (CDI) solution, and the reliance on
a vendor for ongoing maintenance and innovation led LexisNexis to conduct some market
research. Following extensive product evaluations and proof-of-concept activities, LexisNexis
selected Siperian as their MDM/CDI vendor.

LexisNexis uses the Siperian MDM Hub solution to aggregate and manage integrated customer
views. The Hub incorporates data from book, CD, and other product sales and fulfillment
systems, as well as master data merged from online order and billing systems. The Siperian
solution is used to resolve customer master data from the various disparate source systems to
create the “best customer” view – which incorporates the best version of the numerous customer
data attributes normalized into a target data model. This data then feeds a front office system.

LexisNexis deployed Siperian Hub in an enterprise-MDM style, where the master data was
reconciled within the Hub, and a unique identifier was created to provide a link back to the
integrated systems, including the source CRM system. The result is the ability to create
actionable views of customer master data and integrate these views with applications in near
real-time.
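
The essence of that deployment style is the cross-reference: each master identifier is linked back to the contributing records in the integrated source systems so that consolidated views remain traceable in both directions. The structure below is a generic, hypothetical illustration, not LexisNexis’s or Siperian’s actual schema.

```python
# Hypothetical cross-reference structure for an enterprise-style hub: the master
# identifier is the join point between the reconciled "best customer" view and
# the records that contributed to it in each integrated source system.
xref = {
    "CUST-000123": [
        {"source": "crm",            "source_key": "A-88791"},
        {"source": "fulfillment",    "source_key": "F-10442"},
        {"source": "online_billing", "source_key": "OB-55317"},
    ]
}

def sources_for(master_id: str):
    """List the source-system records behind a consolidated master record."""
    return xref.get(master_id, [])

def master_for(source: str, source_key: str):
    """Reverse lookup: find the master id a given source record was linked to."""
    for master_id, links in xref.items():
        if any(l["source"] == source and l["source_key"] == source_key for l in links):
            return master_id
    return None

print(sources_for("CUST-000123")[0])
print(master_for("crm", "A-88791"))  # CUST-000123
```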

In addition, using Siperian has allowed LexisNexis to significantly increase the ability to match
customer records, leading to improved data reliability. This provides LexisNexis with the ability to increase sales, reduce lost sales opportunities and improve enterprise-wide data governance and data management.

The company is now turning its attention to incorporating other master data sources into the Hub
and linking the repository to additional applications in order to improve the analysis of product
and sales activity. LexisNexis is seeking to further extend the solution to reduce the duplication
of master data and improve enterprise communications.

SIPERIAN MASTER DATA MANAGEMENT
SOLUTION OVERVIEW

Siperian MDM Hub – Establishing the Single Face

With a strong focus on customer data integration and on establishing a framework for master data management (MDM) success across both the breadth of enterprise data systems and the depth of the information life cycle, Siperian truly understands the challenges of master data management.
While Siperian’s initial focus was primarily party data (e.g., customer, employee, supplier),
Siperian’s underlying infrastructure is robust enough to support any type of master data object.
As such, some customers are using Siperian Hub to support product, contract and location data.
Siperian MDM Hub can also be used to support organizational efforts associated with party
information and complex hierarchies, such as establishing a single face to vendors and
customers, understanding the relationship between customers and products purchased and
supporting compliance constraints. Siperian Hub is particularly well-suited as an early
component of system renovations and data migrations, especially into enterprise resource
planning (ERP) systems.

Siperian’s Operational Objectives

Our review of the Siperian MDM Hub suggests that the product has been developed as an extensible platform for integrating and managing master data, including the ability to interface with different kinds of data sources in different operational modes (e.g., batch vs. real-time). The components provide stewardship and governance support for continual data management across the entire master data life cycle. By allowing for reliable synchronization of master data across different applications, and by providing an intuitive interface that supports data exception handling, the Siperian MDM Hub is a platform that adapts rapidly to changes in the business environment.

The Hub meets a number of the main MDM challenges, namely:

• Providing identity recognition and identity management

• Capturing identifying attributes and related metadata for lineage and history

• Optionally reconciling and consolidating matched data into a best version of the truth

• Managing hierarchies and relationships between entities

• Designing and delivering unified views of transactions from derived, calculated or materialized values dynamically aggregated (non-persistent) with the master data

• Synchronizing master data changes reliably back to operational applications or sources

• Supplying master data to analytical applications such as data warehouses and operational
data stores

Siperian MDM Hub Architecture

The three components that compose the Siperian MDM Hub support a variety of master data management architectural styles, including the registry style, the master repository style and a master repository with operational views (a brief configuration sketch after the component list illustrates the style choice). The three components are:

• Master Reference Manager, which cleanses, matches, links and optionally merges
entities.

• Hierarchy Manager, which is used to view and manage relationships between entities.

• Activity Manager, which designs and delivers unified views of entities from disparate
sources and synchronizes master data changes across systems.
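
One way to picture the style choice is as configuration: the registry style keeps only links and match keys, the master repository style persists merged records, and the operational-view variant layers dynamic views on top. The enum and settings below are hypothetical and simply name the trade-off.

```python
from enum import Enum, auto

class HubStyle(Enum):
    REGISTRY = auto()               # hub stores links and match keys only
    MASTER_REPOSITORY = auto()      # merged "best version" records are persisted
    REPOSITORY_WITH_VIEWS = auto()  # persisted masters plus operational views

# Hypothetical deployment configuration expressing the style decision.
hub_config = {
    "style": HubStyle.MASTER_REPOSITORY,
    "persist_merged_records": True,      # False under the registry style
    "expose_operational_views": False,   # True under REPOSITORY_WITH_VIEWS
}
```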

Master Reference Manager

Siperian Master Reference Manager (MRM) delivers a master reference data foundation for
uniquely identifying and managing individual identities across all application touch points,
maintaining the best version of the truth for disparate and replicated entity information. MRM
becomes the single source of truth for both operational and analytical application systems. By
providing a means for consolidating and reconciling data from many different enterprise data
sources in many different formats in different operational modes (e.g., batch vs. real-time),
Siperian’s MDM Hub enables the determination of the most reliable information associated with each entity, as well as the ability to extend availability of the master data to new applications without custom coding.

As the foundation component of the Siperian MDM Hub, MRM provides a rules-based approach
to data life cycle management, with a flexible data model that captures rich metadata that is used
throughout the MDM life cycle. Consequently, businesses can model their business attributes and definitions their way, from the ground up, or they can start from industry-specific foundation templates and customize them. The data model definitions form the core of the model-driven framework that is used to establish the master record by ensuring the survivorship of the best data for each entity attribute. MRM complements this with fully integrated MDM capabilities, including:

• Cleansing – Integrating with the most popular data quality engines on the market today
such as Trillium, Pitney Bowes Group1, BusinessObjects/Firstlogic, etc.

• Matching – Siperian has partnered with the leading identity resolution and matching
vendor, Identity Systems, to embed their capabilities within its architecture to enable
flexible matching for any master data object type.

• Linking and/or Merging – Allowing the ultimate flexibility in persisted storage or dynamic resolution of a best version of the truth.

• SOA Compliant Web Services – Data and business services are automatically generated
based on the rich metadata that is captured in the Siperian Hub, enabling rapid
development of custom interfaces to access and add new data to the Hub in real-time.

All Hub capabilities are anchored by a powerful security framework and benchmarked in labs
and at customer sites for the highest performance and scalability. In addition, MRM facilitates
data governance by allowing stewards to manage data exceptions based on defined business
validation rules, along with built-in lineage, history and audit functionality. MRM can use the lineage and history to review the contributing data sources and evaluate the decisions made during the consolidation phase, and it logs all changes so that records can be unmerged when necessary.
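
To make the survivorship idea concrete, here is a minimal, hypothetical rules-based consolidation: for each attribute, the value from the most trusted source survives into the master record, and the contributing values are logged so the decision can later be reviewed or unmerged. This illustrates the general technique only; it is not Siperian’s actual rule engine, and the trust scores are invented.

```python
from datetime import datetime, timezone

# Hypothetical per-attribute trust scores for each contributing source: higher wins.
TRUST = {
    "name":  {"crm": 0.9, "billing": 0.6},
    "email": {"crm": 0.5, "billing": 0.8},
}

def consolidate(master_id, source_records):
    """Build a best-version-of-truth record plus a lineage log of the decisions made."""
    master, lineage = {"master_id": master_id}, []
    attributes = {a for rec in source_records for a in rec if a != "source"}
    for attr in sorted(attributes):
        candidates = [(TRUST.get(attr, {}).get(rec["source"], 0.0), rec["source"], rec[attr])
                      for rec in source_records if rec.get(attr)]
        if not candidates:
            continue
        trust, source, value = max(candidates)   # survivorship rule: highest trust wins
        master[attr] = value
        lineage.append({"attribute": attr, "winner": source, "value": value,
                        "considered": candidates,  # retained so the merge can be reviewed or undone
                        "at": datetime.now(timezone.utc).isoformat()})
    return master, lineage

records = [
    {"source": "crm",     "name": "Jane Q. Public", "email": "jqp@old.example.com"},
    {"source": "billing", "name": "J. Public",      "email": "jane@new.example.com"},
]
best, log = consolidate("M-00001", records)
print(best)  # name survives from crm (trust 0.9), email from billing (trust 0.8)
```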

Hierarchy Manager

The management and oversight of master identities is complemented by the capture and
management of data about the relationships that associate individual parties with the
organization, to each other or to other entities. Because relationship information is often scattered across different applications throughout the enterprise, it is difficult to capture, view or manage customer connectivity. Within each application, however, there are likely to be well-defined hierarchies that reflect the kinds of relationships that exist. Some examples include individuals within each customer account, customers associated with sales staff, products associated with customers and corporate hierarchies. In addition, relationships are created to represent reporting structures, such as sales by region.

Siperian’s Hierarchy Manager (HM) allows organizations to view, navigate and manage
relationships across different kinds of hierarchies. Hierarchy Manager provides unified views of
the different relationships among master entities (e.g., customers, products, organizations, etc.)
and provides visibility to application touch points for those relationships. Combining Hierarchy
Manager with Master Reference Manager provides a unified view of the entities and their
corresponding relationships from the many data sources that contribute to the master repository.

Hierarchy Manager extends the MRM data model to encapsulate relationship and hierarchy data, interacts with the MRM metadata repository to capture relationship data from enterprise sources, and provides a visual interface that allows the user to review different hierarchies at the same time or to examine relationships (at multiple levels of depth) in greater detail. Indexing and searching entities based on their attributes enables the creation and modification of relationships across multiple sources. The ability to review relationships and hierarchies in different modes, coupled with the ability to review the histories associated with defined relationships, helps data stewards better manage and maintain the integrity of managed hierarchies. Additionally, as with all Siperian capabilities, web services provide access to HM functions, so that custom applications can be rapidly created over the Hub and business users and data stewards can visualize their data their way, in their own business context and workflow.
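
A compact way to picture relationship and hierarchy management is a typed edge list over master entities that can be traversed to a chosen depth. The sketch below is a generic illustration with made-up entity identifiers and relationship types, not Hierarchy Manager’s internal model.

```python
from collections import defaultdict

# Typed relationships between master entities (accounts, contacts, sales staff,
# products), kept separately from the entities themselves.
relationships = defaultdict(list)

def relate(parent, child, rel_type):
    relationships[parent].append((rel_type, child))

relate("ACME Corp", "ACME East", "subsidiary")
relate("ACME East", "Jane Public", "contact")
relate("ACME Corp", "Rep-042", "assigned_sales_rep")

def walk(entity, depth=2, indent=0):
    """Print an entity's relationships down to a limited depth."""
    for rel_type, child in relationships.get(entity, []):
        print("  " * indent + f"{rel_type}: {child}")
        if depth > 1:
            walk(child, depth - 1, indent + 1)

walk("ACME Corp", depth=2)
```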

Activity Manager

Lastly, the need to synchronize master data from numerous production applications poses the challenge of both capturing and making available rapidly changing entity data. Whether the focus is operational or analytic applications, typical enterprise architectures do not account for the ability to reliably synchronize centrally stored master data with source data systems. While the use of a master hub alleviates some of this challenge, the hub should provide cross-referencing back to source systems as a way to both aggregate entity data and dynamically capture modifications to master data in a timely and accurate manner.

Siperian’s approach provides a mechanism for continual monitoring of master data changes as
“events” within applications to determine appropriate actions and to effectively manage
synchronization and write-back to the master data repository (MRM). Activity Manager (AM) dynamically aggregates and delivers a unified view of transactions involving core master data objects. It continually monitors any changes associated with master entities, whether they occur within the Hub or in other applications, and then synchronizes the changes (or provides notifications/alerts) based on defined operational rules.

Activity Manager essentially manages federation of transaction data combined with reliable
master data and relationship data via the Master Reference Manager and the Hierarchy Manager.
By continually monitoring and evaluating changes, user-defined rules may be triggered to take
actions to update the master data in the Hub, propagate master data modifications to dependent
applications and synchronize modifications among the different applications that are connected
to the Hub.
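
The pattern described here (monitor changes as events, evaluate rules, then synchronize or notify) can be sketched generically as follows. The event fields, rule conditions and subscriber mechanism are hypothetical and illustrate only the mechanics, not Activity Manager’s actual rule language.

```python
# Hypothetical event-driven synchronization: changes to master entities arrive as
# events, and simple rules decide whether to update the hub, propagate downstream,
# or raise an alert for a data steward.
subscribers = []   # downstream applications registered for propagation
hub = {}           # master_id -> current master attributes

def on_change(event):
    """Evaluate a change event against simple operational rules."""
    master_id, attr, value, origin = event["master_id"], event["attr"], event["value"], event["origin"]
    if attr in ("name", "email"):                       # rule: synchronize core attributes
        hub.setdefault(master_id, {})[attr] = value
        for notify in subscribers:
            notify(master_id, attr, value)              # propagate to connected applications
    elif attr == "credit_limit" and origin != "erp":    # rule: alert on a suspect source
        print(f"ALERT: credit_limit change for {master_id} from {origin} needs review")

subscribers.append(lambda m, a, v: print(f"sync -> billing: {m}.{a} = {v}"))
on_change({"master_id": "M-00001", "attr": "email", "value": "jane@new.example.com", "origin": "crm"})
on_change({"master_id": "M-00001", "attr": "credit_limit", "value": 50000, "origin": "web"})
```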

Summary

Our discussions with Siperian’s customers indicated significant satisfaction with the Siperian
MDM Hub, particularly in addressing customer data integration needs – especially in enhancing
cross-functional customer-centric business initiatives. Customers selected Siperian’s solution largely due to its high level of user management and data stewardship capabilities, fully
integrated end-to-end MDM life cycle management, configurable rules-based approach and
model-driven framework that facilitates improved master data and metadata quality.
