Modeling Unstructured Data Web

Data models are maps of our information landscape containing entities, relationships, and data elements. This white paper will explain structured, semi-structured, and unstructured data. Steve Hoberman: data models can be used to identify and manage unstructured information.

MODELING UNSTRUCTURED DATA: AN INDEPENDENT PERSPECTIVE BY STEVE HOBERMAN,

DATA MODELING EXPERT

Managing data and metadata is becoming increasingly challenging when unstructured data is added to the mix.

Family photographs are one example of unstructured data. Many of us have shoeboxes full of them and, with the
adoption of digital cameras, hard drives packed with them too! To manage the photos on my hard drive, I have software
that allows me to attach tags to each photograph. Instead of manually scanning photographs, I can now run detailed
queries. But over time, even that is not enough! Our organizations face a similar never-ending cycle of managing
changing data. As the price of storage space drops and CPUs become faster and faster, our business intelligence users
ask more and more complex queries.

See Figure 1.

Figure 1. The Insatiable Human Appetite. [Figure labels: “I can get more” and “I want to do more”]

As data requirements become more sophisticated, as storage becomes cheaper, and as CPUs become faster,
we as analysts and modelers will face more and more complex requirements. Many of these requirements
will exist in unstructured forms such as documents, images, and sound. This white paper will explain structured,
semi-structured, and unstructured data, and increase your awareness of the complex content and requirements
environment.

OVERVIEW

Data models are maps of our information landscape containing entities, relationships, and data elements:

• Entity. Something of interest to the business represented as a rectangle on the model. Examples include
Customer, Order, and Survey. Entity instances are the occurrences or values of a particular entity. The entity
Customer can have instances Bob, Joe, and Jane.
• Relationship. A business rule represented as a line on the model. Cardinality is shown by the symbols on both ends
of a relationship, which define the number of instances of each entity that can participate in the relationship. For
example, ‘Each Customer can purchase many Products.’
• Data element. A property of importance to the business whose values contribute to identifying or describing
instances of an entity. The data element Student Last Name, for example, describes the last name of each Student.
• Domain. The complete set of all possible values that a data element can be assigned. Here are some examples of
domains:
– Order Status Code: {O,S,R,C}
– Book Cover: {*.jpg, *.pdf, *.tiff}
– Image Quality: {Between 150 and 300 dots per inch}
• Class word. The last term in a data element name. Here are some examples of class words:
– Amount. Numeric value expressing a quantity of monetary currency.
– Object. Image, document, multimedia.
– Text. Information, primarily in the form of words, stored as a unit.
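
The three example domains above can be checked mechanically. The following is a minimal Python sketch, not part of the original paper: the function names are hypothetical, and reading "Between 150 and 300 dots per inch" as an inclusive range is an assumption.

```python
from fnmatch import fnmatchcase

# The three example domains from the list above. How each one is encoded
# (set, file-name patterns, numeric range) is an assumption of this sketch.
ORDER_STATUS_CODES = {"O", "S", "R", "C"}
BOOK_COVER_PATTERNS = ("*.jpg", "*.pdf", "*.tiff")
IMAGE_QUALITY_DPI = range(150, 301)  # assumed inclusive: 150 through 300


def in_order_status_domain(code: str) -> bool:
    """True if the code is one of the enumerated Order Status Code values."""
    return code in ORDER_STATUS_CODES


def in_book_cover_domain(file_name: str) -> bool:
    """True if the file name matches one of the allowed Book Cover patterns."""
    lowered = file_name.lower()  # compare case-insensitively
    return any(fnmatchcase(lowered, pattern) for pattern in BOOK_COVER_PATTERNS)


def in_image_quality_domain(dots_per_inch: int) -> bool:
    """True if the resolution falls inside the Image Quality range."""
    return dots_per_inch in IMAGE_QUALITY_DPI
```

An enumerated domain maps naturally to a set, a pattern domain to file-name matching, and a range domain to a numeric interval; other encodings would work equally well.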

EXPLANATION OF STRUCTURED, SEMI-STRUCTURED, AND UNSTRUCTURED DATA

We typically operate in a very structured world. Yet by understanding that other types of information exist in our
organizations, we can create much more robust and successful applications. Data can be organized in one of three
states: structured, semi-structured, or unstructured.

• Structured data is data whose structure can be understood by some external mechanism by exclusively looking
at the meta data. Each structured data element must have an atomic data type and therefore have a name that
ends in any class word except for ‘Object’ and ‘Text’. For example, Gross Sales Amount ends in the atomic class
word ‘Amount’ and is therefore structured. It is important to note that structured data has nothing to do with
whether the data can be physically stored in a database.
• Semi-structured data is data whose structure can be understood by some external mechanism by looking at the
data. As with structured data, it must also be one of the atomic data types. A column heading in Excel, for
example, is typically a semi-structured data element. The only difference between structured and semi-structured
data is that semi-structured values can be understood only by examining the contents rather than the meta data
alone. Semi-structured data is one small step away from structured data, and has the same inherent
characteristics as structured data.
• Unstructured data is data whose structure cannot be understood by some external mechanism by looking at the
meta data or content. Examples of unstructured data are documents, music, and images. There is substantially
more unstructured data than structured data. The two class words ‘Object’ and ‘Text’ are unstructured data types.
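
The class-word rule in the three definitions above can be sketched in a few lines of Python. This is an illustration only: the function name and the `has_meta_data` flag are hypothetical, introduced here to separate structured from semi-structured elements.

```python
# Class words that the paper treats as unstructured data types.
UNSTRUCTURED_CLASS_WORDS = {"Object", "Text"}


def classify_data_element(name: str, has_meta_data: bool = True) -> str:
    """Classify a data element by the last term of its name (its class word).

    has_meta_data is an assumption of this sketch: it says whether the
    structure is known from meta data alone (structured) or only by
    examining the contents (semi-structured).
    """
    class_word = name.split()[-1]
    if class_word in UNSTRUCTURED_CLASS_WORDS:
        return "unstructured"
    return "structured" if has_meta_data else "semi-structured"
```

For example, `classify_data_element("Gross Sales Amount")` yields `"structured"`, while a name ending in Object or Text is reported as unstructured regardless of the flag.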

A bank example to illustrate all three types

A new account has been opened at the local bank. This information has been captured for many years in the structured
world through relational and dimensional modeling. Relational modeling captures how the business works and
dimensional modeling captures navigation paths and how measures can be viewed at different levels of granularity.

Figure 2 shows a relational example with structured data. Figure 3 shows the dimensional example.

Figure 2. Relational example.

Figure 3. Dimensional example.

The relational example captures that a Customer can own many Accounts and that an Account can belong to more
than one Customer. Some properties of both Customer and Account are modeled, including Customer
Last Name and Account Open Date. The dimensional example captures the measure Account Balance Amount and
identifies all of the different levels of granularity at which this measure can be viewed. Each of the data elements that
appears on these models is structured because its atomic type can be determined solely by examining the meta
data (i.e. the class word suffix on each data element name). Account Open Date is a date, and Customer First Name is
a name, for example.
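
Because the figures are not reproduced here, the following Python sketch shows one possible physical form of the relational example, using the standard-library `sqlite3` module. The table names, key columns, the associative table `account_ownership`, and the sample rows are all assumptions made for illustration; only Customer First Name, Customer Last Name, and Account Open Date come from the text.

```python
import sqlite3

# In-memory database; every table and column name below is illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (
    customer_id         INTEGER PRIMARY KEY,
    customer_first_name TEXT,
    customer_last_name  TEXT
);
CREATE TABLE account (
    account_id        INTEGER PRIMARY KEY,
    account_open_date TEXT  -- ISO date string
);
-- Associative table resolving the many-to-many relationship:
-- a Customer can own many Accounts, an Account can have many owners.
CREATE TABLE account_ownership (
    customer_id INTEGER REFERENCES customer(customer_id),
    account_id  INTEGER REFERENCES account(account_id),
    PRIMARY KEY (customer_id, account_id)
);
""")
conn.executemany("INSERT INTO customer VALUES (?, ?, ?)",
                 [(1, "Bob", "Jones"), (2, "Jane", "Jones")])
conn.executemany("INSERT INTO account VALUES (?, ?)",
                 [(10, "2008-01-15"), (11, "2008-02-01")])
conn.executemany("INSERT INTO account_ownership VALUES (?, ?)",
                 [(1, 10), (2, 10), (1, 11)])


def accounts_per_customer(connection):
    """Return (first name, number of accounts owned) for each customer."""
    query = """
        SELECT c.customer_first_name, COUNT(o.account_id)
        FROM customer AS c
        JOIN account_ownership AS o ON o.customer_id = c.customer_id
        GROUP BY c.customer_id
        ORDER BY c.customer_id
    """
    return connection.execute(query).fetchall()


def owners_of_account(connection, account_id):
    """Return the first names of every customer who owns the given account."""
    query = """
        SELECT c.customer_first_name
        FROM customer AS c
        JOIN account_ownership AS o ON o.customer_id = c.customer_id
        WHERE o.account_id = ?
        ORDER BY c.customer_id
    """
    return [row[0] for row in connection.execute(query, (account_id,))]
```

Every column here holds an atomic value whose type follows from its class word, which is what makes the elements structured in the paper's sense.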

However, there is quite a bit of information missing that would fit within the realm of unstructured data. Examples
include a scanned bitmapped image of the New Account Application, a photograph of the Customer, an audio recording
of the actual conversation between the Customer and Bank Teller, and a comments section containing text summarizing
the customer’s experience, positive or negative, with opening the account.

THE CONTENT AND REQUIREMENTS CUBES

The analyst must both identify the content and describe the requirements as part of any application development
effort. Traditionally, functional or technical analysts identify the content, and business or functional analysts describe
the requirements. Identifying the content means that the data elements that currently exist in the environment must
have their meta data understood and documented. For example, a data element such as Customer Last Name might
be provided externally from the customer through a particular website and therefore exist as electronic text, whereas
a land survey might exist as an internal document in the form of a piece of paper. These data element properties exist
independent of what the business needs from a particular application. The requirements, on the other hand, are
described with a particular use in mind. Returning to Customer Last Name, for example, the requirements might
dictate that this data element be stored in a structured state within a dimensional model. The land survey element
might reside in its original unstructured form as a PDF file in a relational model.

When focusing purely on structured data, the activities around content and requirements are simpler and therefore
usually performed in tandem. For example, “I need Customer Last Name, where do I go to get Customer Last Name?”
When unstructured and semi-structured data are brought into the mix, however, the content and requirements are
each much richer in format, and all of the different combinations of content and requirements add a substantial
amount of complexity to the analysis process. To understand both this richness and complexity, I developed the
Content Cube and the Requirements Cube.

The Content Cube

The Data Content Cube contains all of the different variations of how data can exist independent of application
requirements. Traditionally we view data as internal electronic text from one of our source applications. But in reality
there are eight different variations of how data can exist, based upon Source, Format, and Medium. Figure 4 contains
the Content Cube.

Figure 4. The Content Cube.

Source can be either internal (created within the organization) or external (created outside the organization).
Medium can be either paper or electronic, and Format can be either rich (e.g. image, video, and audio) or plain
(i.e. text only). An email you receive advertising store specials has a Source of external, a
Medium of electronic, and most likely a Format of rich consisting of both images and text. An insurance quote you
receive in the mail has a Source of external, a Medium of paper, and a Format of plain. A photograph taken with your
cell phone has a Source of internal, a Medium of electronic, and a Format of rich.
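
The eight variations follow from two values on each of three axes (2 × 2 × 2). A minimal Python sketch, with hypothetical type and variable names, enumerates them and places the three examples above on the cube:

```python
from itertools import product
from typing import NamedTuple

# The three axes of the Content Cube and their two values each.
SOURCES = ("internal", "external")
MEDIA = ("paper", "electronic")
FORMATS = ("plain", "rich")


class Content(NamedTuple):
    """One cell of the Content Cube (type name is illustrative)."""
    source: str
    medium: str
    format: str


# Every combination of Source, Medium, and Format: 2 x 2 x 2 cells.
CONTENT_CUBE = [Content(*cell) for cell in product(SOURCES, MEDIA, FORMATS)]

# The three examples from the text, expressed as cells of the cube.
store_specials_email = Content("external", "electronic", "rich")
insurance_quote = Content("external", "paper", "plain")
cell_phone_photo = Content("internal", "electronic", "rich")
```

The store email and the cell phone photograph differ only in Source, which is the axis the text ties to timing, data quality, and auditing concerns.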

The Content Cube allows the analyst to identify challenges and opportunities independent of specific operational
or reporting requirements. For example, a Source of external might uncover timing or data quality issues. A Source of
internal might be critical for auditing as mandated by Sarbanes-Oxley or Basel II. The Medium could uncover storage
space limitations (both filing cabinets and computer megabytes) or data quality (e.g. dots per inch) challenges. The
Format can reveal the degree of difficulty required to parse and access the element.

The Requirements Cube

The Data Requirements Cube, on the other hand, captures the business needs for a particular application. There are
12 different variations on this cube, each a combination of Characteristic, Representation, and State. Figure 5
contains the Requirements Cube.

Figure 5. The Requirements Cube.

The Characteristic setting indicates the degree of structure: structured, semi-structured, and unstructured. The
Representation setting is how the data should be shown on a data model, relational or dimensional. The State setting
indicates whether the content will be stored in its original state or modified for a particular requirement.

The Requirements Cube allows the analyst to understand the requirements independent of the content of the
elements. Any combination of content can be paired with any combination of requirements, leading to 96 possible
combinations between the two cubes (8 content variations × 12 requirements variations)! For example, let’s return
to the email message, which was identified as a Source of external, a
Medium of electronic, and a Format of rich. We can map this content to any combination on the Requirements Cube.
For example, if our requirement is to monitor the quantity of spam emails from specific email addresses that contain
the word ‘Viagra’ in the Subject Line or Message Body, we can have a Representation of dimensional, a Characteristic
of structured, and a State of modified. See Figure 6 for this model.
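
The counts quoted above can be verified by enumeration. In this Python sketch the dictionary layout and the helper name `cells` are assumptions; the axis names and values are taken from the text.

```python
from itertools import product

# Axes of the two cubes, as described in the text.
CONTENT_AXES = {
    "source": ("internal", "external"),
    "medium": ("paper", "electronic"),
    "format": ("plain", "rich"),
}
REQUIREMENT_AXES = {
    "characteristic": ("structured", "semi-structured", "unstructured"),
    "representation": ("relational", "dimensional"),
    "state": ("original", "modified"),
}


def cells(axes):
    """Return every combination of axis values as a list of dicts."""
    return [dict(zip(axes, combo)) for combo in product(*axes.values())]


content_cells = cells(CONTENT_AXES)          # 2 x 2 x 2 cells
requirement_cells = cells(REQUIREMENT_AXES)  # 3 x 2 x 2 cells
pairings = list(product(content_cells, requirement_cells))

# The spam-monitoring example: rich external email content mapped to a
# structured, dimensional, modified requirement.
spam_content = {"source": "external", "medium": "electronic", "format": "rich"}
spam_requirement = {
    "characteristic": "structured",
    "representation": "dimensional",
    "state": "modified",
}
```

The spam example is one pairing among the 96; the same email content could equally be paired with an unstructured, relational, original requirement if the goal were simply to archive the message.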

Figure 6. Spam Dimensional Model.

OUR EXPANDING ROLES

The analyst will continue to document the requirements and map to the source for these requirements, and the
data modeler will continue to use the analyst’s work as input to a creative design solution. Unstructured data will
change these roles both in the quantity of requirements and complexity of content (i.e. the two cubes mentioned
earlier). The sheer quantity of new information is going to substantially increase the analyst’s workload. Merrill Lynch
estimates that 85% of all data exists in an unstructured state. In addition, the analyst, who traditionally creates
source/target mappings from source system to proposed application, will now need to create much more complex
mappings, especially when the State value from the Requirements Cube is modified. As an example, consider Table 1,
which contains a partial mapping based on Figure 6, the Spam Dimensional Model.

| Source | Rules | Target |
| --- | --- | --- |
| Sender email | Parse the sender’s email address to obtain the characters after the last period. Validate this against a known list including ‘com’, ‘net’, and ‘gov’. | From Domain Name |
| Sender email | Straight map. | From Email Name |
| Date sent | Parse the year from date sent. | Year Code |
| Date sent | Parse the month from date sent. | Month Code |
| Subject Line and Message Body | Search for the word ‘Viagra’ in both the Subject Line and Message Body. Sum the number of emails containing ‘Viagra’ by Sender email and Month. | Spam Email Quantity |

Table 1. Partial mapping based on Spam Dimensional Model.

This mapping example is deliberately simplified, yet it is still a complex mapping document. There is
parsing to create the domain name, checking the result against a list of valid domains, and searching and
summarizing to create the Spam Email Quantity.
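
The rules in Table 1 can be sketched in Python as follows. This is one possible reading of the mapping, not the author's implementation: the `Email` type, the function names, the case-insensitive search, and the choice to key the counts by sender, year, and month are all assumptions.

```python
from collections import Counter
from datetime import date
from typing import NamedTuple, Optional

# The "known list" from Table 1; a real list would be longer.
KNOWN_DOMAIN_NAMES = {"com", "net", "gov"}
SEARCH_WORD = "viagra"  # matched case-insensitively (an assumption)


class Email(NamedTuple):
    """Source record; field names are illustrative."""
    sender_email: str
    date_sent: date
    subject_line: str
    message_body: str


def from_domain_name(sender_email: str) -> Optional[str]:
    """Characters after the last period, or None if not on the known list."""
    suffix = sender_email.rsplit(".", 1)[-1].lower()
    return suffix if suffix in KNOWN_DOMAIN_NAMES else None


def contains_search_word(email: Email) -> bool:
    """True if the word appears in the Subject Line or the Message Body."""
    text = f"{email.subject_line} {email.message_body}".lower()
    return SEARCH_WORD in text


def spam_email_quantity(emails) -> Counter:
    """Count matching emails by (From Email Name, Year Code, Month Code)."""
    counts = Counter()
    for email in emails:
        if contains_search_word(email):
            key = (email.sender_email, email.date_sent.year, email.date_sent.month)
            counts[key] += 1
    return counts
```

Given two matching emails from the same sender in April 2008 and one in May 2008, `spam_email_quantity` would report counts of 2 and 1 under the respective keys. Even this small sketch needs parsing, validation, search, and aggregation, which is the point the mapping makes about a State of modified.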

ABOUT THE AUTHOR

Steve Hoberman is a world-recognized thought-leader in the field of data modeling. He is a popular presenter at
conferences, and the author of Data Modeler's Workbench and Data Modeling Made Simple.

PERSPECTIVE FROM SYBASE

Today’s technology and regulatory environment have put additional pressure on information architects to include
more and more unstructured data. Providing data services around unstructured data is a challenge because the data is,
by its very nature, difficult to classify. In this paper we are reminded of the fundamentals of data classification,
and are shown how they relate to structured, semi-structured, and unstructured data. This paper uses clear examples
to show the need to align the content to the classification needed to make the content useful.

Sybase PowerDesigner recognizes the need to understand the impact of unstructured and semi-structured data
on the overall information architecture, and on the analysis and design of the information structures that will
be used to create relevant classifications. PowerDesigner includes all modeling elements needed to capture the essence
of the business. From requirements models to use cases and domain models, PowerDesigner ensures that the business
needs around unstructured information are known. PowerDesigner also carries a rich information architecture stack:
with canonical data models in XML mapped to conceptual, logical, and physical data models, ideas about information
lead directly to implementation.

One of the keys in modeling unstructured and semi-structured information is to understand how to classify it. As
this paper shows, there are many different combinations of approaches that can be considered depending on the
nature of the source, the needs of the organization, and the use of that source data in the final analysis. PowerDesigner
is easily customized and extended to adapt to the specifics of one organization’s approach, using simple VBScript
and an easy-to-use customization interface to streamline the process of teaching
PowerDesigner your methods and standards.

SYBASE, INC. WORLDWIDE HEADQUARTERS, ONE SYBASE DRIVE, DUBLIN, CA 94568 USA 1 800 8 SYBASE www.sybase.com
Copyright © 2008 Sybase, Inc. All rights reserved. Unpublished rights reserved under U.S. copyright laws. Sybase and
the Sybase logo are trademarks of Sybase, Inc. or its subsidiaries. All other trademarks are the property of their
respective owners. ® indicates registration in the United States. Specifications are subject to change without
notice. L03069 04-08
