
A DataFlux White Paper

Prepared by: David Loshin

Semantics, Metadata and Identifying Master Data

Once you have determined that your organization can achieve the benefits of
integrating data quality and data governance through introducing a master data
management (MDM) program, some typical early questions emerge, such as “What
architectural approaches will we take to deploy our MDM solution?” or “What are the
business approaches for acquiring the appropriate tools and technologies required for
MDM success?” These are good questions, but they are often asked prematurely. Even
before determining how to manage the enterprise master data asset, there are more
fundamental questions that need to be asked and comprehensively explored, such as:

• What data elements constitute our “master data?”

• How do we locate and isolate master data objects that exist within the
enterprise?

• How do we assess the variances between the different representations in order
to consolidate instances into a single view?

Because of the ways that diffused application architectures have evolved within
different organizations, it is likely that while there are a relatively small number of
core master objects used, there are many different ways that these objects are
modeled, represented and stored. For example, any application that must manage
contact information for individual customers will rely on a data model that maintains
the customer’s name. Yet one application will track an individual’s full name, while
others will break up the name into its first, middle and last parts. And even those
applications that track the given and family names of a customer will do so
differently – perform a quick scan of the data sets within your own organization and
you are likely to find “LAST_NAME” attributes with a wide range of field lengths.

Figure 1: Isolating master data from different data sets.


The challenges are not limited to determining what master objects are used. Indeed,
the core requirement is to find where master objects are used and to chart a strategy
for standardizing, harmonizing and consolidating them into a master repository or
registry. When the intention is to create an organizational asset that is not just
another data silo, it is imperative that your organization provide the means for both
the consolidation and integration of master data – and facilitate the most effective and
appropriate sharing of that master data.

What is Master Data?


What are the characteristics of master data? So far, the industry has been better at
describing master data than at actually defining it. As a
description, master data objects are those core business objects that are used in the
different applications across the organization, along with their associated metadata,
attributes, definitions, roles, connections, and taxonomies. Master data objects are
those “things” that we care about – the things that are logged in our transaction
systems, measured and reported on in our reporting systems, and analyzed in our
analytical systems. Common examples of master data include:

• Customers

• Suppliers

• Parts

• Products

• Locations

• Contact mechanisms

For example, consider the following transaction: “David Loshin purchased seat 15B on
flight 238 from BWI to SFO on July 20, 2006.” Some of the master data elements in
this example and their types are shown in Table 1.

Master Data Object    Value

Customer              David Loshin
Product               Seat 15B
Flight                238
Location              BWI
Location              SFO

Table 1: Master data elements for a typical airline reservation.

Aside from the above description, master data objects share certain characteristics:

• The real-world objects modeled within the environment as master data objects
tend to be referenced in multiple business areas. For example, the concept of
a “vendor” may exist in the finance application at the same time as in the
procurements application.
• Master data objects are referenced in both transaction and analytic system
records. While the sales system may log and process the transactions initiated
by a “customer,” those same activities may be analyzed for the purposes of
segmentation and marketing.

• Master data objects may be classified within a semantic hierarchy, with
different levels of classification, attribution and specialization applied
depending on the application. For example, we may have a master data
category of “party,” which in turn comprises “individuals” or
“organizations.” Those parties may also be classified based on their roles, such
as “prospect,” “customer,” “supplier,” “vendor,” or “employee.”

• Master data objects may require specialized application functions to create
new instances, as well as manage the updating and removal of instance
records. Each application that involves “supplier” interaction may have a
function enabling the creation of a new supplier record.

• They are likely to have models reflected across multiple applications, possibly
embedded in legacy data structure models.

While we may see a natural hierarchy across one dimension, the taxonomies that are
applied to our data instances may actually cross multiple hierarchies. For example, a
party may be an individual, a customer and an employee – simultaneously. In turn,
the same master data categories and their related taxonomies would be used for
transactions, analysis and reporting. For example, the headers in a monthly sales
report may be derived from the master data categories (sales by customer by region
by time period). Enabling the transactional systems to refer to the same data objects
as the subsequent reporting systems ensures that the analysis reports are consistent
with the transaction systems.

Centralizing Semantic Metadata


Master data may be sprinkled across the application environment. The objective of a
master data management program is to facilitate the effective management of the set
of master data instances as a single centralized master resource. But before we can
materialize a single master record for any entity, we must be able to:

1. Discover which data resources may contain entity information

2. Understand which attributes carry identifying information

3. Extract identifying information from the data resource

4. Transform the identifying information into a standardized or canonical form

5. Establish similarity to other standardized records
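The last two steps – standardization and similarity – can be sketched in a few lines of Python. The rules and name values below are illustrative assumptions for the purpose of this paper, not a description of any particular tool:

```python
import re

def standardize_name(raw: str) -> str:
    """Reduce a free-form name to a canonical form:
    uppercase, punctuation stripped, whitespace collapsed."""
    cleaned = re.sub(r"[^A-Za-z ]", " ", raw)
    return " ".join(cleaned.upper().split())

def is_similar(a: str, b: str) -> bool:
    """Crude similarity test on standardized names: tokens match
    exactly, or one token is the other's initial ('D' ~ 'DAVID')."""
    ta, tb = a.split(), b.split()
    if len(ta) != len(tb):
        return False
    return all(x == y or x == y[0] or y == x[0] for x, y in zip(ta, tb))

# Two hypothetical source records for the same customer
a = standardize_name("David Loshin")
b = standardize_name("D. Loshin")
print(a, "~", b, "->", is_similar(a, b))  # DAVID LOSHIN ~ D LOSHIN -> True
```

Production matching uses far richer parsing, phonetic and probabilistic techniques, but the shape of the process – canonicalize, then compare – is the same.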

This entails cataloging the data sets, their attributes, formats, data domains,
definitions, contexts and semantics, not just as an operational resource, but rather in
a way that can be used to automate master data consolidation as well as governing
the ongoing application interactions with the master repository. In other words, to be
able to manage the master data, one must first be able to manage the master
metadata. But as there is a need to resolve multiple variant models into a single view,
the interaction with the master metadata must facilitate the resolution of three critical
aspects:
• Format at the element level

• Structure at the instance level

• Semantics across all levels

Figure 2: Preparation for a master data integration process must resolve the
differences between the syntax, structure, and semantics of different source
data sets.

These are effectively three levels of integration that need to dovetail as a prelude to
any kind of enterprise-wide integration, and they introduce three corresponding
challenges for master metadata management:

1. Collecting and analyzing master metadata

2. Resolving similarity in structure

3. Understanding and unifying master data semantics

Challenge 1: Collecting and Analyzing Master Metadata


One approach is to analyze and document the metadata associated with all data
objects across the enterprise and use that information to guide analysts seeking out
master data. Many of the data sets may have documented some of the necessary
metadata. For example, relational database systems allow for querying table structure
and data element types, and COBOL copybooks reveal some structure and potentially
even some alias data. Some of the data may have little or no documented metadata,
such as fixed-format or character-separated files.

If the objective is to collect comprehensive and consistent metadata, as well as ensure
that the data appropriately correlates to its documented metadata, we can use data
profiling as the tool of choice. Because of its ability to apply both statistical and
analytical algorithms for characterizing data sets, data profiling can drive the empirical
assessment of structure and format metadata while simultaneously exposing
embedded data models and dependencies.
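As a simple illustration of this kind of empirical assessment, a minimal column profiler might derive length and type characteristics directly from the values themselves; this sketch stands in for what a full profiling tool automates:

```python
def profile_column(values):
    """Empirically derive format metadata for one column: observed
    max length, null count, distinct count, and a coarse inferred type."""
    non_null = [str(v) for v in values if v not in ("", None)]
    max_len = max((len(v) for v in non_null), default=0)
    profile = {
        "max_length": max_len,
        "null_count": len(values) - len(non_null),
        "distinct": len(set(non_null)),
    }
    # Infer a coarse type from the observed values alone
    if non_null and all(v.lstrip("-").isdigit() for v in non_null):
        profile["inferred_type"] = "INTEGER"
    else:
        profile["inferred_type"] = "VARCHAR(%d)" % max_len
    return profile

# A hypothetical LAST_NAME column with one missing value
print(profile_column(["Loshin", "Smith", "", "Jones"]))
```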

Our consolidated metadata repository will eventually enumerate the relevant
characteristics associated with each data set in a standardized way, including the data
set name, its type (e.g., RDBMS table, VSAM file, CSV file) and the characteristics of
each of its columns/attributes (e.g., length, data type or format pattern).

At the end of this process, we will have more than simply a comprehensive catalog of
all data sets. We will also be able to review the frequency of meta-model
characteristics, such as frequently-used names, field sizes, and data types. Capturing
these values with a standard representation allows the metadata characteristics
themselves to be subjected to the kinds of statistical analysis that data profiling
provides. For example, we can assess the dependencies between common attribute
names (e.g., “CUSTOMER”) and their assigned data types (e.g., VARCHAR[20]) to
identify (and potentially standardize against) commonly-used types, sizes and
formats.
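For example, once each attribute's name and declared type are captured in a standard form, a simple frequency count over the repository surfaces the dominant conventions. The repository rows below are invented for illustration:

```python
from collections import Counter

# Invented rows harvested from several source systems:
# (data set, attribute name, declared type)
repository = [
    ("crm.clients",   "CUSTOMER",  "VARCHAR(20)"),
    ("sales.orders",  "CUSTOMER",  "VARCHAR(20)"),
    ("billing.accts", "CUSTOMER",  "VARCHAR(30)"),
    ("crm.clients",   "LAST_NAME", "VARCHAR(25)"),
]

# Frequency of each (name, type) pairing; the most common pairing
# is a candidate standard for the consolidated model.
pairings = Counter((name, dtype) for _, name, dtype in repository)
standard, count = pairings.most_common(1)[0]
print(standard, count)  # ('CUSTOMER', 'VARCHAR(20)') 2
```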

Challenge 2: Resolving Similarity in Structure


Despite the expectation that there are many variant forms and structures for your
organization’s master data, the different underlying models of each master data object
are bound to share many commonalities. For example, the structure for practically any
“residential” customer table will contain a name, an address and a telephone number.
On the other hand, almost any vendor table will probably also contain a name, an
address and a telephone number. A closer look might suggest considering an
underlying model concept of a “party,” used as the basis for both customer and
vendor. In turn, the analyst might review any model that contains those same
identifying attributes as a structure type that can be derived or is related to a party
type.

There are two aspects to structure similarity for the purpose of tracking down master
data instances. The first is seeking out overlapping structures, in which the core
attributes determined to carry identifying information for one data object overlap with
a similar set of attributes in another data object. The second is identifying derived
structures, in which one object’s set of attributes are completely embedded within
other data objects. Both cases indicate a structural relationship, and when related
attributes carry identifying information, the analyst should review those objects to
determine if they indeed represent master objects.
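Both tests reduce to simple set comparisons over each object's identifying attributes; a sketch, with invented attribute sets:

```python
# Invented identifying-attribute sets for three data objects
customer = {"NAME", "ADDRESS", "PHONE", "CUST_ID"}
vendor   = {"NAME", "ADDRESS", "PHONE", "TAX_ID"}
party    = {"NAME", "ADDRESS", "PHONE"}

def overlapping(a, b):
    """Overlapping structures: the identifying attributes shared
    by two data objects."""
    return a & b

def is_derived(base, obj):
    """Derived structure: every attribute of the base object is
    completely embedded within the other object."""
    return base <= obj

print(sorted(overlapping(customer, vendor)))  # ['ADDRESS', 'NAME', 'PHONE']
print(is_derived(party, customer), is_derived(party, vendor))  # True True
```

Here both customer and vendor embed the party attributes, which is exactly the signal that suggests an underlying "party" master object.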

Challenge 3: Unifying Semantics


The third challenge focuses on the qualitative difference between pure syntactic or
structural metadata (as we can discover through the profiling process), and the
underlying semantic metadata. This involves more than just analyzing structure
similarity. It involves understanding what the data means, how that meaning is
conveyed, how that meaning “connects” data sets across the enterprise, and
approaches to capturing semantics as an attribute of your metadata framework.

As a data set’s metadata is collected, the semantic analyst must approach the
business client to understand that data object’s business meaning. One step in this
process involves reviewing the degree to which semantic consistency in data element
naming is related to overlapping data types, sizes and structures. The next step is to
document the business meanings assumed for each of the data objects, which involves
asking questions like:

• What are the definitions for the data elements?

• Or for the data sets themselves?

• Are there authoritative sources for the definitions?

• Do similar objects have different business meanings?

The answers to these questions not only help in determining which data sets truly
refer to the same underlying real-world objects, they also contribute to an
organizational resource that can be used to standardize a representation for each data
object as its definition is approved through the data governance process. Managing
that semantic metadata as a central asset enables the metadata repository to grow in
value as it consolidates semantics from different enterprise data collections.

Identifying and Qualifying Master Data


Once the semantic metadata has been collected and centralized, the analyst’s task of
identifying master data should be simplified. As more metadata representations of
similar objects and entities populate the repository, the frequency of specific models
will provide a basis for assessing whether the attributes of a represented object qualify
the data elements represented by the model as master data. By adding further
characterization data to each data set’s metadata profile, we bring more knowledge
to the analyst’s task of determining which sources can feed a master data repository.

One approach is to characterize the value set associated with each column in each
table. At the conceptual level, designating a value set using a simplified classification
scheme reduces the level of complexity associated with data variance, and allows for
loosening the constraints when comparing multiple metadata instances. For example,
we can limit ourselves to six data value classes, such as these:

1. Boolean or Flag – There are only two valid values, one representing “true” and
one representing “false.”

2. Time/Date Stamp – A value that represents a point in time.

3. Magnitude – A numeric value on a continuous range, such as a quantity or an
amount.

4. Code Enumeration – A small set of values, either used directly (e.g., using the
colors “red” and “blue”) or mapped as a numeric enumeration (e.g., 1 = “red,”
2 = “blue”).

5. Handle – A character string with limited duplication across the set that may
be used as part of an object description (e.g., name or address_line_1 fields
contain handle information).

6. Cross-Reference – An identifier that either is uniquely assigned to the record
or provides a reference to that identifier in another dataset.
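A rough heuristic classifier for these six value classes might look like the following sketch; the thresholds and patterns are illustrative assumptions, and real data would need richer rules:

```python
import re

def classify_values(values):
    """Assign a column's observed values to one of the six value
    classes; thresholds and patterns are illustrative heuristics."""
    vals = [str(v) for v in values if v not in ("", None)]
    distinct = set(vals)
    if len(distinct) == 2:
        return "Boolean/Flag"
    if all(re.fullmatch(r"\d{4}-\d{2}-\d{2}.*", v) for v in vals):
        return "Time/Date Stamp"
    if all(re.fullmatch(r"-?\d+(\.\d+)?", v) for v in vals):
        # All-unique numerics look like identifiers, not quantities
        return "Cross-Reference" if len(distinct) == len(vals) else "Magnitude"
    if len(distinct) <= 10:  # small fixed vocabulary
        return "Code Enumeration"
    return "Handle"  # descriptive strings with limited duplication

print(classify_values(["Y", "N", "Y"]))                    # Boolean/Flag
print(classify_values(["19.95", "5.00", "5.00", "7.25"]))  # Magnitude
print(classify_values(["1001", "1002", "1003"]))           # Cross-Reference
```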

The Fractal Nature of Metadata Profiling

At this point, each data attribute can be summarized in terms of a small number of
descriptive characteristics: data type, length, data value class, etc. In turn, each data
set can be described as a collection of its component attributes. Because we are
looking for similar data sets with similar structures, formats and semantics, our job is
to assess each data set’s “identifying attribution,” try to find the collections of data
sets that share similar characteristics, and determine if they represent the same
objects.

Let’s summarize:

• We are using our tools to assess data element structure

• We are collecting this information into a metadata repository

• We use our tools to look for data attributes that share similar characteristics

• We use our tools to seek out attributes with similar names

• We analyze the data value sets and assign them into value classes

• We use our tools to detect similarities between representative data meta-models

In essence, the techniques and tools we can use for determining the sources of master
data objects are the same types of tools we use for consolidating the data into a
master repository! Using data profiling, parsing, standardization and matching, we can
facilitate the process of identifying which data sets (tables, files, spreadsheets, etc.)
represent which master data objects.

Standardizing the Representation


The analyst now has a collection of master object representations. But as a prelude to
developing the consolidation road map, decisions must be made as part of the
organization’s governance process. To consolidate the variety of diverse master object
representations into a single repository, the relevant stakeholders need to agree on a
common representation as well as the underlying semantics for that representation. It
is critical that a standard representation be defined and agreed to so that the
participants expecting to benefit from the data in the master repository can effectively
share the data. And because MDM is a solution that integrates tools with policies and
procedures for data governance, there should be a process for defining and agreeing
to data standards.

Summary: Metadata Profiling Drives the Process


In effect, we have described a process for analyzing similarity of syntax, structure and
semantics as a prelude to identifying enterprise sources of master data. And since the
objective in identifying and consolidating master data representations requires
empirical analysis and similarity assessment as part of the resolution process, it is
comforting to know that the same kinds of tools and techniques that will subsequently
be used to facilitate data integration can also isolate and catalog organizational master
data.
