Modeling Unstructured Data Web
Managing data and metadata is becoming increasingly challenging when unstructured data is added to the mix.
Family photographs are one example of unstructured data. Many of us have shoeboxes full of them and, with the adoption of
digital cameras, hard drives packed with them too! To manage the photos on my hard drive, I have software that allows
me to attach tags to each photograph. Instead of manually scanning photographs, I can now run detailed
queries. But over time, even that is not enough! Our organizations have a similar never-ending cycle of managing
changing data. As the price of storage space drops and CPUs become faster and faster, our business intelligence users
ask more and more complex queries.
See Figure 1.
[Figure 1. A cycle between "I can get more" and "I want to do more".]
As data requirements become more sophisticated, as megabytes become cheaper, and as CPUs become faster,
we as analysts and modelers will face more and more complex requirements. Many of these requirements
will exist in unstructured forms such as documents, images, and sound. This white paper will explain structured,
semi-structured, and unstructured data, and increase your awareness of the complex content and requirements
environment.
OVERVIEW
Data models are maps of our information landscape containing entities, relationships, and data elements:
• Entity. Something of interest to the business represented as a rectangle on the model. Examples include
Customer, Order, and Survey. Entity instances are the occurrences or values of a particular entity. The entity
Customer can have instances Bob, Joe, and Jane.
• Relationship. Business rules represented as lines on the model. Cardinality is represented by the symbols on both ends
of a relationship, which define the number of instances of each entity that can participate in the relationship. For
example, ‘Each Customer can purchase many Products.’
• Data element. A property of importance to the business whose values contribute to identifying or describing
instances of an entity. The data element Student Last Name, for example, describes the last name of each Student.
• Domain. The complete set of all possible values that a data element can be assigned. Here are some examples of
domains:
– Order Status Code: {O,S,R,C}
– Book Cover: {*.jpg, *.pdf, *.tiff}
– Image Quality: {Between 150 and 300 dots per inch}
• Class word. The last term in a data element name. Here are some examples of class words:
– Amount. Numeric value expressing a quantity of monetary currency.
– Object. Image, document, multimedia.
– Text. Information, primarily in the form of words, stored as a unit.
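The three example domains above can be read as membership tests. The following is a minimal sketch in Python, not part of any modeling tool; the function names are hypothetical, and treating the 150 and 300 dots-per-inch limits as inclusive is an assumption.

```python
from pathlib import PurePath

# Domains taken from the examples above.
ORDER_STATUS_CODES = {"O", "S", "R", "C"}
BOOK_COVER_EXTENSIONS = {".jpg", ".pdf", ".tiff"}
MIN_DPI, MAX_DPI = 150, 300  # assumed inclusive on both ends


def in_order_status_domain(code: str) -> bool:
    """True if the code is one of the allowed Order Status Code values."""
    return code in ORDER_STATUS_CODES


def in_book_cover_domain(filename: str) -> bool:
    """True if the file extension is an allowed Book Cover type."""
    return PurePath(filename).suffix.lower() in BOOK_COVER_EXTENSIONS


def in_image_quality_domain(dpi: float) -> bool:
    """True if the resolution falls within the Image Quality range."""
    return MIN_DPI <= dpi <= MAX_DPI
```

A value outside the domain, such as an Order Status Code of 'X', is rejected by the corresponding check.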
EXPLANATION OF STRUCTURED, SEMI-STRUCTURED, AND UNSTRUCTURED DATA
We typically operate in a very structured world. Yet by understanding that other types of information exist in our
organizations, we can create much more robust and successful applications. Data can be organized in either a
structured, semi-structured, or unstructured state:
• Structured data is data whose structure can be understood by some external mechanism by exclusively looking
at the meta data. Each structured data element must have an atomic data type and therefore have a name that
ends in any class word except for ‘Object’ and ‘Text’. For example, Gross Sales Amount ends in the atomic class
word ‘Amount’ and is therefore structured. It is important to note that structured data has nothing to do with
whether the data can be physically stored in a database.
• Semi-structured data is data whose structure can be understood by some external mechanism by looking at the
data. As with structured data, it must also be one of the atomic data types. A column heading in Excel, for
example, is typically a semi-structured data element. The only difference between structured and semi-structured
data is that semi-structured values can be understood only by examining the contents, not just the
meta data. Semi-structured data is one small step away from structured data and has the same inherent
characteristics of structured data.
• Unstructured data is data whose structure cannot be understood by some external mechanism by looking at the
meta data or content. Examples of unstructured data are documents, music, and images. There is substantially
more unstructured data than structured data. The two class words ‘Object’ and ‘Text’ are unstructured data types.
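The class word rule described above can be sketched as a small classifier. This is an illustration under stated assumptions, not a definitive implementation: the function names and the `described_by_metadata` flag are hypothetical, and the element names other than Gross Sales Amount and Account Open Date are made up for the example.

```python
# Per the definitions above, these two class words mark unstructured data.
UNSTRUCTURED_CLASS_WORDS = {"Object", "Text"}


def class_word(element_name: str) -> str:
    """Return the last term of a data element name."""
    return element_name.split()[-1]


def characteristic(element_name: str, described_by_metadata: bool = True) -> str:
    """Classify a data element by its class word.

    described_by_metadata is a hypothetical flag: True when the structure
    can be understood from the meta data alone, False when the contents
    must be examined (the semi-structured case).
    """
    if class_word(element_name) in UNSTRUCTURED_CLASS_WORDS:
        return "unstructured"
    return "structured" if described_by_metadata else "semi-structured"


print(characteristic("Gross Sales Amount"))          # atomic class word
print(characteristic("Customer Photograph Object"))  # hypothetical element
print(characteristic("Account Open Date", described_by_metadata=False))
```

The name alone separates unstructured elements from atomic ones; telling structured from semi-structured needs the extra fact of whether meta data is available.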
Figure 2 shows a relational example with structured data. Figure 3 shows the dimensional example.
Figure 3. Dimensional example.
The relational example captures that a Customer can own many Accounts and that an Account can belong to more
than one Customer. There are some properties of both Customer and Account that are modeled including Customer
Last Name and Account Open Date. The dimensional example captures the measure Account Balance Amount and
identifies all of the different levels of granularity at which this measure can be viewed. Each of the data elements that
appears on these models is structured because its atomic type can be determined solely by examining the meta
data (i.e. the class word suffix on each data element name). Account Open Date is a date, and Customer First Name is
a name, for example.
However, quite a bit of information is missing that would fit within the realm of unstructured data. Examples
include a scanned bitmapped image of the New Account Application, a photograph of the Customer, an audio recording of the
actual conversation between the Customer and Bank Teller, and a comments section containing text summarizing the
customer’s experience, positive or negative, with opening the account.
The analyst must both identify the content and describe the requirements as part of any application development
effort. Traditionally, functional or technical analysts identify the content, and business or functional analysts describe
the requirements. Identifying the content means that the data elements that currently exist in the environment must
have their meta data understood and documented. For example, a data element such as Customer Last Name might
be provided externally from the customer through a particular website and therefore exist as electronic text, whereas
a land survey might exist as an internal document in the form of a piece of paper. These data element properties exist
independent of what the business needs from a particular application. The requirements, on the other hand, are
described with a particular use in mind. Returning to Customer Last Name, for example, the requirements might
dictate that this data element be stored in a structured state within a dimensional model. The land survey element
might reside in its original unstructured form as a PDF file in a relational model.
When focusing purely on structured data, the activities around content and requirements are simpler and therefore
usually performed in tandem. For example, “I need Customer Last Name, where do I go to get Customer Last Name?”
When unstructured and semi-structured data are brought into the mix, however, the content and requirements are
each much richer in format, and all of the different combinations of content and requirements add a substantial
amount of complexity to the analysis process. To understand both this richness and complexity, I developed the
Content Cube and the Requirements Cube.
The Content Cube has three dimensions: Source, Medium, and Format. Source can be either internal, meaning created
within the organization, or external, meaning created outside the organization. Medium can be either paper or
electronic, and Format can be either rich (e.g. image, video, and audio) or plain (i.e. text only). An email you receive advertising store specials has a Source of external, a
Medium of electronic, and most likely a Format of rich consisting of both images and text. An insurance quote you
receive in the mail has a Source of external, a Medium of paper, and a Format of plain. A photograph taken with your
cell phone has a Source of internal, a Medium of electronic, and a Format of rich.
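The three examples above can be written down as cells of the Content Cube. The sketch below is one possible Python representation; the type and variable names are hypothetical.

```python
from enum import Enum
from typing import NamedTuple


class Source(Enum):
    INTERNAL = "internal"
    EXTERNAL = "external"


class Medium(Enum):
    PAPER = "paper"
    ELECTRONIC = "electronic"


class Format(Enum):
    RICH = "rich"    # e.g. image, video, and audio
    PLAIN = "plain"  # text only


class ContentCell(NamedTuple):
    """One cell of the Content Cube: a Source, a Medium, and a Format."""
    source: Source
    medium: Medium
    format: Format


# The three examples from the text.
promotional_email = ContentCell(Source.EXTERNAL, Medium.ELECTRONIC, Format.RICH)
insurance_quote = ContentCell(Source.EXTERNAL, Medium.PAPER, Format.PLAIN)
phone_photo = ContentCell(Source.INTERNAL, Medium.ELECTRONIC, Format.RICH)
```

Because each dimension has two settings, the cube has eight cells in total.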
The Content Cube allows the analyst to identify challenges and opportunities independent of specific operational
or reporting requirements. For example, a Source of external might uncover timing or data quality issues. A Source of
internal might be critical for auditing as mandated by Sarbanes-Oxley or Basel II. The Medium could uncover storage
space limitations (both filing cabinets and computer megabytes) or data quality (e.g. dots per inch) challenges. The
Format can reveal the degree of difficulty required to parse and access the element.
Figure 5. The Requirements Cube.
The Characteristic setting indicates the degree of structure: structured, semi-structured, or unstructured. The
Representation setting is how the data should be shown on a data model: relational or dimensional. The State setting
indicates whether the content will be stored in its original state or modified for a particular requirement.
The Requirements Cube allows the analyst to understand the requirements independent of the content of the
elements. There could be any combination of content with requirements, leading to 96 possible combinations
between the two cubes! For example, let’s return to the email message which was identified as a Source of external, a
Medium of electronic, and a Format of rich. We can map this content to any combination on the Requirements Cube.
For example, if our requirement is to monitor the quantity of spam emails from specific email addresses that contain
the word ‘Viagra’ in the Subject Line or Message Body, we can have a Representation of dimensional, a Characteristic
of structured, and a State of modified. See Figure 6 for this model.
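The count of 96 follows from multiplying the settings on each cube: 2 x 2 x 2 = 8 content cells and 3 x 2 x 2 = 12 requirements cells, giving 8 x 12 = 96 pairings. A minimal sketch that enumerates them, with hypothetical variable names:

```python
from itertools import product

CONTENT_CUBE = {
    "Source": ("internal", "external"),
    "Medium": ("paper", "electronic"),
    "Format": ("rich", "plain"),
}
REQUIREMENTS_CUBE = {
    "Characteristic": ("structured", "semi-structured", "unstructured"),
    "Representation": ("relational", "dimensional"),
    "State": ("original", "modified"),
}

# Every cell of each cube, then every pairing of a content cell with a
# requirements cell.
content_cells = list(product(*CONTENT_CUBE.values()))
requirement_cells = list(product(*REQUIREMENTS_CUBE.values()))
combinations = list(product(content_cells, requirement_cells))

# The spam email example: external, electronic, rich content mapped to a
# structured, dimensional, modified requirement.
spam_email = (
    ("external", "electronic", "rich"),
    ("structured", "dimensional", "modified"),
)

print(len(content_cells), len(requirement_cells), len(combinations))
print(spam_email in combinations)
```

The spam email example is one of these 96 pairings.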
The analyst will continue to document the requirements and map to the source for these requirements, and the
data modeler will continue to use the analyst’s work as input to a creative design solution. Unstructured data will
change these roles both in the quantity of requirements and complexity of content (i.e. the two cubes mentioned
earlier). The sheer quantity of new information is going to substantially increase the analyst’s workload. Merrill Lynch
estimates that 85% of all data exists in an unstructured state. In addition, the analyst, who traditionally creates
source/target mappings from source system to proposed application, will now need to create much more complex
mappings, especially when the State value from the Requirements Cube is modified. As an example, consider Table 1
which contains a partial mapping based on Figure 7, the Spam Dimensional Model.
Table 1. Partial mapping for the Spam Dimensional Model.
Source                         Rules                                       Target
Sender email                   Parse the sender’s email address to obtain  From Domain Name
                               the characters after the last period.
                               Validate this against a known list
                               including ‘com’, ‘net’, and ‘gov’.
Date sent                      Parse the year from date sent.              Year Code
Date sent                      Parse the month from date sent.             Month Code
Subject Line and Message Body  Search for the word ‘Viagra’ in both the    Spam Email Quantity
                               Subject Line and Message Body. Sum the
                               number of emails containing ‘Viagra’ by
                               Sender email and Month.
This is an overly simplified mapping example, yet it is still a complex mapping document. There is
parsing to create the domain name, validation of that domain against a known list, and searching and summarizing to
create the Spam Email Quantity.
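To give a feel for the work hidden in those rules, here is a minimal Python sketch of the mapping. It is an illustration only, under several assumptions: the `Email` type, the function names, and the sample messages are hypothetical; the known-domain list holds only the three examples named in the table; the keyword search is case-insensitive; an email counts if the word appears in either the Subject Line or the Message Body; and counts are keyed by sender, year, and month.

```python
from collections import Counter
from datetime import date
from typing import Iterable, NamedTuple, Optional

# Only the three examples named in the mapping; a real list would be longer.
KNOWN_DOMAINS = {"com", "net", "gov"}


class Email(NamedTuple):
    sender: str
    date_sent: date
    subject: str
    body: str


def from_domain_name(sender: str) -> Optional[str]:
    """Characters after the last period, or None if not a known domain."""
    domain = sender.rsplit(".", 1)[-1].lower()
    return domain if domain in KNOWN_DOMAINS else None


def spam_email_quantity(emails: Iterable[Email], keyword: str = "Viagra") -> Counter:
    """Count emails containing the keyword by (sender, year, month)."""
    needle = keyword.lower()
    counts: Counter = Counter()
    for email in emails:
        if needle in email.subject.lower() or needle in email.body.lower():
            key = (email.sender, email.date_sent.year, email.date_sent.month)
            counts[key] += 1
    return counts


# Hypothetical sample messages.
sample = [
    Email("promo@example.com", date(2008, 4, 1), "Cheap Viagra", "Hi"),
    Email("promo@example.com", date(2008, 4, 15), "Hello", "Buy viagra now"),
    Email("friend@example.net", date(2008, 4, 2), "Lunch?", "Noon works"),
]
counts = spam_email_quantity(sample)
```

Even this small sketch has to settle questions the mapping leaves open, such as case sensitivity and what to do with an unrecognized domain.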
Steve Hoberman is a world-recognized thought-leader in the field of data modeling. He is a popular presenter at
conferences, and the author of Data Modeler's Workbench and Data Modeling Made Simple.
Today’s technology and regulatory environment have added additional pressures to information architects to include
more and more unstructured data. Providing data services around unstructured data is a challenge because of the very
nature of this data being difficult to classify. In this paper we are reminded of the fundamentals of data classification,
and are shown how that relates to structured, semi-structured and unstructured data. This paper uses clear examples
of the need to align the content to the classification needed to make the content useful.
Sybase PowerDesigner has recognized the need to understand the impact of unstructured and semi-structured data
on the overall information architecture, and the impact on the analysis and design of information structures that will
be used to create relevant classifications. PowerDesigner includes all modeling elements needed to capture the essence
of the business. From requirements models to use cases and domain models, PowerDesigner ensures that the business
needs around unstructured information are known. PowerDesigner also carries a rich information architecture stack:
with canonical data models in XML mapped to conceptual, logical, and physical data models, ideas about information
lead directly to implementation.
One of the keys in modeling unstructured and semi-structured information is to understand how to classify it. As
this paper shows, there are really many different combinations of approaches that can be considered depending on the
nature of the source, the needs of the organization and the use of that source data in the final analysis. PowerDesigner
is easily customized and extended, not just to adapt to the specifics unique to one organization’s approach, but to
do so using simple VBScript and an easy-to-use customization interface that streamlines the process of teaching
PowerDesigner your methods and standards.
SYBASE, INC. WORLDWIDE HEADQUARTERS, ONE SYBASE DRIVE, DUBLIN, CA 94568 USA 1 800 8 SYBASE Copyright © 2008 Sybase, Inc.
All rights reserved. Unpublished rights reserved under U.S. copyright laws. Sybase and the Sybase logo are trademarks of Sybase, Inc. or its subsidiaries. All
other trademarks are the property of their respective owners. ® indicates registration in the United States. Specifications are subject to change without
notice. www.sybase.com L03069 04-08