Identifying Master Data
• How do we locate and isolate master data objects that exist within the
enterprise?
Because of the ways that diffused application architectures have evolved within
different organizations, it is likely that while there are a relatively small number of
core master objects used, there are many different ways that these objects are
modeled, represented and stored. For example, any application that must manage
contact information for individual customers will rely on a data model that maintains
the customer’s name. Yet one application will track an individual’s full name, while
others will break up the name into its first, middle and last parts. Even those
applications that track a customer’s given and family names will do so differently:
perform a quick scan of the data sets within your own organization and you are likely
to find “LAST_NAME” attributes with a wide range of field lengths.
• Customers
• Suppliers
• Parts
• Products
• Locations
• Contact mechanisms
For example, consider the following transaction: “David Loshin purchased seat 15B on
flight 238 from BWI to SFO on July 20, 2006.” Some of the master data elements in
this example and their types are shown in Table 1.
Aside from the above description, master data objects share certain characteristics:
• The real-world objects modeled within the environment as master data objects
tend to be referenced in multiple business areas. For example, the concept of
a “vendor” may exist in the finance application at the same time as in the
procurements application.
• Master data objects are referenced in both transaction and analytic system
records. While the sales system may log and process the transactions initiated
by a “customer,” those same activities may be analyzed for the purposes of
segmentation and marketing.
• They are likely to have models reflected across multiple applications, possibly
embedded in legacy data structure models.
While we may see a natural hierarchy across one dimension, the taxonomies that are
applied to our data instances may actually cross multiple hierarchies. For example, a
party may be an individual, a customer and an employee – simultaneously. In turn,
the same master data categories and their related taxonomies would be used for
transactions, analysis and reporting. For example, the headers in a monthly sales
report may be derived from the master data categories (sales by customer by region
by time period). Enabling the transactional systems to refer to the same data objects
as the subsequent reporting systems ensures that the analysis reports are consistent
with the transaction systems.
This entails cataloging the data sets, their attributes, formats, data domains,
definitions, contexts and semantics, not just as an operational resource, but rather in
a way that can be used to automate master data consolidation as well as to govern
the ongoing application interactions with the master repository. In other words, to be
able to manage the master data, one must first be able to manage the master
metadata. But as there is a need to resolve multiple variant models into a single view,
the interaction with the master metadata must facilitate the resolution of three critical
aspects:
• Format at the element level
Figure 2: Preparation for a master data integration process must resolve the
differences between the syntax, structure, and semantics of different source
data sets.
These are effectively three levels of integration that need to dovetail as a prelude to
any kind of enterprise-wide integration, and they introduce three corresponding
challenges for master metadata management:
At the end of this process, we will have more than simply a comprehensive catalog of
all data sets. We will also be able to review the frequency of meta-model
characteristics, such as frequently-used names, field sizes, and data types. Capturing
these values with a standard representation allows the metadata characteristics
themselves to be subjected to the kinds of statistical analysis that data profiling
provides. For example, we can assess the dependencies between common attribute
names (e.g., “CUSTOMER”) and their assigned data types (e.g., VARCHAR[20]) to
identify (and potentially standardize against) commonly-used types, sizes and
formats.
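As a rough illustration of this kind of meta-model frequency analysis, the sketch below counts how often each data type and size is assigned to a given attribute name. The catalog rows, the data set names and the helper functions `type_frequencies` and `dominant_type` are all hypothetical, invented here for the example; a real catalog would be populated from the metadata collected during the inventory.

```python
from collections import Counter, defaultdict

# Hypothetical catalog rows: (data set, attribute name, data type, size).
# These values are made up for illustration only.
catalog = [
    ("sales.orders", "CUSTOMER", "VARCHAR", 20),
    ("crm.accounts", "CUSTOMER", "VARCHAR", 20),
    ("billing.invoices", "CUSTOMER", "VARCHAR", 35),
    ("crm.contacts", "LAST_NAME", "VARCHAR", 30),
    ("hr.staff", "LAST_NAME", "CHAR", 25),
]

def type_frequencies(rows):
    """Count how often each (type, size) pair is assigned to each attribute name."""
    freq = defaultdict(Counter)
    for _dataset, name, dtype, size in rows:
        freq[name.upper()][(dtype.upper(), size)] += 1
    return freq

def dominant_type(freq, name):
    """Return the most common (type, size) for a name and its share of all uses."""
    counts = freq[name.upper()]
    (dtype, size), n = counts.most_common(1)[0]
    return dtype, size, n / sum(counts.values())

freq = type_frequencies(catalog)
print(dominant_type(freq, "CUSTOMER"))
```

A name whose dominant type covers most of its uses is a candidate for standardization, while a name with many competing types and sizes (as with the “LAST_NAME” field lengths mentioned earlier) signals variance that will need to be resolved.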
There are two aspects to structure similarity for the purpose of tracking down master
data instances. The first is seeking out overlapping structures, in which the core
attributes determined to carry identifying information for one data object overlap with
a similar set of attributes in another data object. The second is identifying derived
structures, in which one object’s set of attributes is completely embedded within
other data objects. Both cases indicate a structural relationship, and when related
attributes carry identifying information, the analyst should review those objects to
determine if they indeed represent master objects.
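The two structural checks can be sketched with plain set operations. This is a minimal illustration, assuming that attributes are compared by exact name and that overlap is scored with a Jaccard ratio; both are simplifications chosen for the example rather than a prescribed method. The object definitions and the function names `overlap` and `is_derived` are hypothetical.

```python
def overlap(a, b):
    """Jaccard overlap between two sets of identifying attribute names."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def is_derived(inner, outer):
    """True when every attribute of `inner` is embedded within `outer`."""
    return set(inner) <= set(outer)

# Hypothetical identifying attributes for three data objects.
customer = {"first_name", "last_name", "birth_date", "postal_code"}
patient = {"first_name", "last_name", "birth_date", "blood_type"}
employee = customer | {"employee_id", "hire_date"}

print(overlap(customer, patient))      # overlapping structures
print(is_derived(customer, employee))  # derived structure
```

In practice, attribute names rarely match exactly across applications, so the comparison would likely be made on standardized names or on the value classes of the underlying columns rather than on raw names.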
As a data set’s metadata is collected, the semantic analyst must approach the
business client to understand that data object’s business meaning. One step in this
process involves reviewing the degree to which semantic consistency in data element
naming is related to overlapping data types, sizes and structures. The next step is to
document the business meanings assumed for each of the data objects, which involves
asking questions like:
The answers to these questions not only help in determining which data sets truly
refer to the same underlying real-world objects, they also contribute to an
organizational resource that can be used to standardize a representation for each data
object as its definition is approved through the data governance process. Managing
that semantic metadata as a central asset enables the metadata repository to grow in
value as it consolidates semantics from different enterprise data collections.
One approach is to characterize the value set associated with each column in each
table. At the conceptual level, designating a value set using a simplified classification
scheme reduces the level of complexity associated with data variance, and allows for
loosening the constraints when comparing multiple metadata instances. For example,
we can limit ourselves to six data value classes, such as these:
1. Boolean or Flag – There are only two valid values, one representing “true” and
one representing “false.”
4. Code Enumeration – A small set of values, either used directly (e.g., using the
colors “red” and “blue”) or mapped as a numeric enumeration (e.g., 1 = “red,”
2 = “blue”).
5. Handle – A character string with limited duplication across the set may be
used as part of an object description (e.g., name or address_line_1 fields
contain handle information).
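To make the idea of value classes concrete, here is a minimal sketch that assigns a column’s values to one of the classes shown here (Boolean or Flag, Code Enumeration, Handle). It covers only those three classes; anything else is reported as unclassified. The function name `classify_values` and the thresholds `enum_max`, `enum_ratio` and `handle_ratio` are assumptions chosen for illustration and would need tuning against real data.

```python
def classify_values(values, enum_max=10, enum_ratio=0.5, handle_ratio=0.9):
    """Assign a column's values to a simplified value class.

    The thresholds are illustrative assumptions, not established cutoffs.
    """
    non_null = [v for v in values if v is not None]
    if not non_null:
        return "Unclassified"
    distinct = set(non_null)
    uniqueness = len(distinct) / len(non_null)
    # Exactly two distinct values: treat as a true/false style flag.
    if len(distinct) == 2:
        return "Boolean or Flag"
    # A small value set that repeats heavily across the rows.
    if len(distinct) <= enum_max and uniqueness <= enum_ratio:
        return "Code Enumeration"
    # Character strings with limited duplication across the set.
    if all(isinstance(v, str) for v in non_null) and uniqueness >= handle_ratio:
        return "Handle"
    return "Unclassified"

# Hypothetical sample columns.
flags = ["Y", "N", "Y", "Y", "N"]
colors = ["red", "blue", "green", "red", "blue", "red", "green", "blue"]
names = ["David Loshin", "Ann Lee", "Raj Patel", "Mei Chen"]

for column in (flags, colors, names):
    print(classify_values(column))
```

Because the classes are deliberately coarse, two columns with different names, types or lengths can still be recognized as comparable when they fall into the same class, which is the loosening of constraints described above. Note that the heuristic is sensitive to sample size: a two-row column of names would be labeled as a flag, so classification should be run over a reasonably large set of values.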
Let’s summarize:
• We use our tools to look for data attributes that share similar characteristics
• We analyze the data value sets and assign them into value classes
In essence, the techniques and tools we can use for determining the sources of master
data objects are the same types of tools we use for consolidating the data into a
master repository! Using data profiling, parsing, standardization and matching, we can
facilitate the process of identifying which data sets (tables, files, spreadsheets, etc.)
represent which master data objects.