NoSQL and SQL Data Modeling - Bringing Together Data, Semantics, and Software
First Edition
Ted Hills
Published by:
Technics Publications, LLC
2 Lindsley Road
Basking Ridge, NJ 07920 USA
https://www.TechnicsPub.com
Cover design by John Fiorentino
Technical reviews by Laurel Shifrin, Dave Wells, and Steve Hoberman
All rights reserved. No part of this book may be reproduced or transmitted in any form or by any means,
electronic or mechanical, including photocopying, recording or by any information storage and retrieval
system, without written permission from the publisher, except for the inclusion of brief quotations in a
review.
The author and publisher have taken care in the preparation of this book, but make no expressed or implied
warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for
incidental or consequential damages in connection with or arising out of the use of the information or
programs contained herein.
All trade and product names are trademarks, registered trademarks, or service marks of their respective
companies, and are the property of their respective holders and should be treated as such.
Copyright © 2016 by Theodore S. Hills, [email protected]
ISBN, print ed. 9781634621090
ISBN, Kindle ed. 9781634621106
ISBN, ePub ed. 9781634621113
ISBN, PDF ed. 9781634621120
First Printing 2016
Library of Congress Control Number: 2016930173
To my wife Daphne Woods, who
has always believed in me, and
gave me the space and support
I needed to write this book.
“I know, but it can’t be helped,” Sam explained. “You see, although the change
is easy, we have to be really careful we don’t mess up any downstream product
flows that could be inadvertently affected by this change. And that takes time to
figure out.”
“It takes six months to look at these drawings and figure out what the impact of
the change is?” Joe asked, somewhat incredulously.
Sam’s poker face began to show a little discomfort. “Well, that’s the problem,”
Sam said. “You see, the drawings engineering used weren’t up to date, so we
have to check them against the actual piping, and update them, and then look at
the change request again.”
Joe wasn’t just going to accept this as the final verdict. “Why do you have to
look at the actual piping? Why not pull out the latest drawings that engineering
should have used, and compare to them?”
Sam began to turn a little red. “I’m not quite sure how to say this, but
engineering did use the latest drawings we have on file. The problem is that they
don’t match what’s actually been implemented in the plant.”
Joe felt the tension rising, and realized that now was the time to pull out all his
diplomatic skills, to avoid a confrontation that could hide the truth. He paused a
moment, looked down at the counter to collect his thoughts, put on his best
“professor” demeanor, and then looked up at Sam. “So I guess what you’re
saying is that changes were made in the field, but the drawings weren’t updated
to reflect them.”
“That’s right,” Sam said quietly. “The project office doesn’t like us spending
time on drawings when we should be out in the field fixing things, and no one
ever asks us for the drawings, so we just do stuff to make the plant run better and
the drawings stay in the filing drawer.”
Joe was surprised and a bit distressed, but kept his voice level. “Interesting.
What kinds of changes do you do out in the field that don’t require engineering’s
involvement?”
“We’ve got this great guy—Manny. He’s worked here for 30 years, and knows
where every pipe goes and how every fitting fits together. When something goes
wrong, we call Manny, and he usually fixes the problem and finds an
improvement that the engineering guys overlooked. So we discuss it and then
implement the improvement, and everything runs better.”
“But no one updates the drawings,” Joe said quietly.
“Well, yeah,” Sam muttered embarrassedly, looking away from Joe.
“And no one tells engineering what changed,” Joe added. Sam didn’t say
anything. “Well, Sam, thanks for explaining the situation. I’ll go back to the
project office and we’ll see if we can figure out any way to update the drawings
with the current process flows in less than six months.” Joe turned to go, but
then hesitated and turned back. “Could Manny work with the engineers to
document his changes? I presume that would be faster than having someone
check every single connection.”
Sam turned white. He didn’t want to break this news. “Manny doesn’t work here
anymore.”
Joe’s shoulders slumped. “What happened to him?”
“He retired last month.”
WHY MODEL?
Creating a model is not an absolutely necessary step before implementing a
database. So, why would we want to take the time to draw up a data model at all,
rather than just diving in and creating the database and storing data? A data
model describes the schema of a database or document, but if our NoSQL
DBMS is “schema-less” or “schema-free”, meaning that we don’t need to dictate
the schema to the DBMS before we start writing data, what sense does it make to
model at all?
The advantage of a schema-less DBMS is that one can start storing and
accessing data without first defining a schema. While this sounds great, and
certainly facilitates a speedy start to a data project, experience shows that a lack
of forethought is usually followed by a lot of afterthought. As data volumes
grow and access times become significant, thought needs to be given to
reorganizing data in order to speed access and update, and sometimes to change
tradeoffs between the speed, consistency, and atomicity of various styles of
access. It is also commonly the case that patterns emerge in the data’s structure,
and the realization grows that, although the DBMS demands no particular data
schema, much of the data being stored has some significant schema in common.
So, some vendors say, just reorganize your data dynamically. That’s fine if the
volume isn’t too large. If a lot of data has been stored, reorganization can be
costly in terms of time and even storage space that’s temporarily needed,
possibly delaying important customers’ access to the data. And schema changes
don’t just affect the data: they can affect application code that must necessarily
be written with at least a few assumptions about the data’s schema.
A data model gives the opportunity to play with database design options on
paper, on a whiteboard, or in a drawing tool such as Microsoft Visio, before one
has to worry about the syntax of data definitions, data already stored, or
application logic. It’s a great way to think through the implications of data
organization, and/or to recognize in advance important patterns in the data,
before committing to any particular design. It’s a lot easier to redraw part of a
model than to recreate a database schema, move significant quantities of data
around, and change application code.
If one is implementing in a schema-less DBMS without a model, then, after
implementation is complete, the only ways to understand the data will be to talk
to a developer or look at code. Being dependent on developers to understand the
data can severely restrict the bandwidth business people have available to
propose changes and expansions to the data, and can be a burden on developers.
And although you might have friendly developers, trying to deduce the structure
of data from code can be a very unfriendly experience. In such situations, a
model might be your best hope.
Besides schema-less DBMSs, some NoSQL DBMSs support schemas. Document
DBMSs often support XML Schema, JSON Schema, or other schema languages.
And, even when not required, it is often highly desirable to enforce conformance
to some schema for some or all of the data being stored, in order to make it more
likely that only valid data is stored, and to give guarantees to application code
that a certain degree of sanity is present in the data.
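Here is a minimal sketch in Python of what such enforcement can look like in
practice, assuming the third-party jsonschema package is available; the schema
and its field names are invented for illustration, not taken from any particular
DBMS.

```python
# A minimal sketch of schema enforcement for a document store, assuming the
# third-party "jsonschema" package is installed. The schema and field names
# are hypothetical examples.
from jsonschema import validate, ValidationError

person_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "phone_numbers": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["name"],
}

good_doc = {"name": "Ada Lovelace", "phone_numbers": ["+1-555-0100"]}
bad_doc = {"phone_numbers": [42]}  # missing "name", wrong item type

validate(instance=good_doc, schema=person_schema)  # passes silently

try:
    validate(instance=bad_doc, schema=person_schema)
except ValidationError as e:
    print("rejected:", e.message)  # invalid data is kept out of the database
```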
And this is not all theory. I have seen first-hand the failure of projects due to a
lack of a data model or a lack of data modeling discipline. I have also seen
tremendous successes resulting from the intelligent and disciplined application
of data modeling to database designs.
Here are some stories of failure and of success:
A customer data management system was designed and developed
using the latest object-oriented software design techniques. Data
objects were persisted and reconstituted using an object/relational
mapping tool. After the system was implemented, performance was
terrible, and simple queries for customer status either couldn’t be
done or were nearly impossible to implement. Almost everyone on
the project, from the manager to the lowliest developer, either left the
company or was fired, and the entire system had to be re-
implemented, this time successfully using a data model and
traditional database design.
Two customer relationship management (CRM) systems were
implemented. One followed the data model closely; the other deviated
from the model to “improve” things a bit. The CRM system that
deviated from the model had constant data quality problems, because
it forced operational personnel to duplicate data manually, and of
course the data that was supposed to be duplicated never quite
matched. It also required double the work to maintain the data that, in
the original model, was defined to be in only one place. In contrast,
the CRM system that followed the data model had none of these data
quality or operational problems.
A major financial services firm developed a database of every kind of
financial instrument traded in every exchange around the world. The
database was multi-currency and multi-language, and kept historical
records plus future-dated data for new financial instruments that were
going to start trading soon. The system was a success from day one.
A model-driven development process was used to create the system,
and to maintain it as additional financial instrument types were added,
so that all database changes started with a data modeler and a data
model change. Database change commands were generated directly
from the model. The system remained successful for its entire
lifetime.
A business person requested that the name of a product be changed on
a report, to match changes in how the business was marketing its
products. Because the database underlying the report had not been
designed with proper keys, the change, which should have taken a
few minutes, took several weeks.
You see, developing a data model is just like developing a blueprint for a
building. If you’re building a small building, or one that doesn’t have to last,
then you can risk skipping the drawings and going right to construction. But
those projects are rare. Besides, those simple, “one-off” projects have a tendency
to outlive early expectations, to grow beyond their original requirements, and
become real problems if implemented without a solid design. For any significant
project, to achieve success and lasting value, one needs a full data model
developed and maintained as part of a model-driven development process. If you
skip the modeling process, you risk the data equivalents of painting yourself into
corners, disabling the system from adapting to changing requirements, baking in
quality problems that are hard to fix, and even complete project failure.
WHY COMN?
There are many data modeling notations already in the world. In fact, Part II of
this book surveys most of them. So why do we need one more?
COMN’s goal is to be able to describe all of the following things in a single
notation:
the real world, with its objects and concepts
data about real-world objects and concepts
objects in a computer’s memory whose states represent data about
real-world objects and concepts
COMN connects concepts, real-world objects, data, and implementation in a
single notation. This makes it possible to have a single model that represents
everything from the nouns of requirements all the way down to a functional
database running in a NoSQL or SQL database management system. This gives
a greater ability to trace requirements all the way through to an implementation
and make sure nothing was lost in translation along the way. It enables changes
to be similarly governed. It enables the expression of reverse-engineered data
and the development of logical and conceptual models to give it meaning. It
enables the modeling of things in the Internet of Things, in addition to modeling
data about them. No other modeling notation can express all this, and that is why
COMN is needed.
Book Outline
The book is divided into four parts. Part I lays out foundational concepts that are
necessary for truly understanding what data is and how to think about it. It peels
back the techno-speak that dominates data modeling today, and recovers the
ordinary meanings of the English words we use when speaking of data. Do not
skip part I! If you do, the rest of the book will be meaningless to you.
Part II reviews existing data modeling, semantic, and software notations, and
object-oriented programming languages, and creates the connections between
those and the COMN defined in this book. If you are experienced with any of
those notations, you should read the relevant chapter(s) of part II. COMN uses
some familiar terms in significantly different ways, so it is critical that you learn
these differences. Those chapters about notations you are not familiar with are
optional, but will serve as a handy reference for you when dealing with others
who know those notations.
Part III introduces the new way of thinking about data and semantics that is the
essence of this book and of the Concept and Object Modeling Notation. Make
sure you’ve read part I carefully before starting on Part III.
Part IV walks through a realistic data modeling example, showing how to apply
COMN to represent the real world, data design, and implementation. By the time
you finish this part, you should feel comfortable applying your COMN
knowledge to problems at hand.
Each chapter ends with a summary of key points and a glossary of new terms
introduced. There is a full glossary at the end, along with a comprehensive
index. In addition, an Appendix provides a quick reference to COMN. You can
download the full reference and a Visio stencil from http://www.tewdur.com/.
This will enable you to experiment with drawing models of your own data
challenges while you read this book.
Book Audience
Each person who picks up this book comes to it with a unique background,
educational level, and set of experiences. No book can precisely match every
reader’s needs, but this book was written with the following readers in mind in
order to come as close as possible.
Data Modeler
You might be an experienced data modeler, already using an E-R or fact-based
modeling notation, or the UML, to design your databases. But there are some
niggling design problems that you always felt should not be so hard to tackle.
Or, you might want to fold semantics into your data models but aren’t sure how
to do that. You’ll find that learning COMN will build on the data modeling
knowledge you’ve already acquired, and expand how far you can use that
knowledge to include NoSQL databases, semantics, and some aspects of
software development. Make sure you read part I, then the relevant chapters in
part II, before you dig into part III and learn how to think differently about data
and data models. After you’ve learned COMN, you’ll find that much of the
advice on data modeling you’ve already learned is still valuable, but you’ll be
able to apply it much more effectively than before.
Software Developer
You might be a software developer who knows that there’s more to data than
meets the eye, and has decided to set aside some time to think about it. This
book will help you do just that. Make sure you read part I, then chapter 9 on
object-oriented programming languages. Chapter 9 will be especially relevant
for you, as it will draw connections between data and the object-oriented
programming that you’re already familiar with. If you design software with the
Unified Modeling Language (UML), you should also read chapter 6.
Ontologist
You’ve begun to apply semantic languages like OWL to describing the real
world. However, you find the mapping from semantics to data tedious and also
incomplete. It’s difficult to maintain a mapping between a model of real-world
things and a model of data. COMN is a tool you can use to express that mapping.
Make sure you read part I carefully, and chapter 8 on semantic notations, before
continuing on to part III.
Key Points
The Concept and Object Modeling Notation (COMN, pronounced “common”) can represent data designs and
their connections to the real world, to meaning (semantics), to database implementations, and to software.
A data model is essential to any successful database design project, and helps to meet requirements, build in
flexibility, and avoid quality problems and project failure.
Everyone should read all of part I of this book.
Part II contains chapters relevant to those who already know the notations and languages discussed.
The meat of the book is in part III, but will only make sense to those who read part I and the relevant chapters
of part II.
This book should deliver value to NoSQL and SQL database developers, new and experienced data modelers,
software developers, and ontologists.
Part I
Real Words in the Real World
In designing databases and data systems, we seek to accurately represent the real
world and data about the real world. But our ability to think about the real world
is hampered by the special meanings we have attached to ordinary words,
making it difficult or impossible to reason without inadvertently carrying along
the intellectual baggage of a particular technical view of reality.
Part I of this book returns us to the ordinary English meanings of words that we
have co-opted for special purposes in the field of information technology. By the
end of this section, your mind will be refreshed to remember the way we use
these words in ordinary speech. This will prepare you to learn new, more precise
meanings for these words that will make them powerful tools in analysis and
design.
Chapter 1
It’s All about the Words
“When I use a word,” Humpty
Dumpty said in rather a scornful
tone, “it means just what I choose it
to mean—neither more nor less.”
“The question is,” said Alice,
“whether you can make words mean
so many different things.”
“The question is,” said Humpty
Dumpty, “which is to be master—
that’s all.”
[Carroll 1871]
Key Points
The word “entity” is the technical term for “thing”.
Entities come in two flavors, conceptual and objective.
Objective entities include material things called objects. All objects, except the elementary particles, are
composed of other objects.
Conceptual entities are concepts or ideas.
Unlike objects, widely shared concepts have no time or place.
We will look at overloaded technical definitions of these words in later chapters. For now, we will use their
natural-language definitions.
Chapter Glossary
entity : something that has separate and distinct existence and objective or conceptual reality (Merriam-Webster)
object : something material that may be perceived by the senses (Merriam-Webster)
concept : something conceived in the mind : thought, notion (Merriam-Webster)
composite : made up of distinct parts (Merriam-Webster)
component : a constituent part (Merriam-Webster)
Chapter 3
Containment and Composition
We saw in the previous chapter that all material objects, except the elementary
particles, are composed of other material objects. We’ll take a closer look at how
composition works, but first we’ll look at the idea of objects that contain other
objects without being composed of them. Once again, one of our goals is to
recover the ordinary meanings of words that have been overloaded with
technical meanings.
CONTAINMENT
Suppose I go to a grocery store and
buy a dozen eggs. I carry the eggs
home in a carton that is made of
Styrofoam, fiberboard, or some
other material that protects the
fragile eggs from breaking. Over the
course of a week I eat the eggs, and
when the last egg is gone I throw
away the carton.
Each egg is an object—a material
thing—and the carton is an object, but they are different kinds of objects. The
carton was specially designed to hold up to twelve eggs. The carton is a
container and the eggs are its contents. When I brought the carton home it was
full. As soon as I took the first egg out of the carton it was no longer full. Once I
took the last egg out of the carton it was empty. So the state of the container—
full, partially full, empty—varied over the course of the week. However, despite
its changing state, the composition of the carton never changed. I would never at
any time say that the carton was composed of eggs. It was composed of
Styrofoam or fiberboard.
In general, a container is designed so that contents can easily be added and
removed. These operations change the state of the container, but do not change
its composition—that is, what it is made of. If I took a few eggs out of the
carton and made a cake from them, it would be correct to say that the eggs were
in the cake, but not in the same sense as being in the carton. In the cake the eggs
have lost their integrity and can never be removed from it again. Unlike the egg
carton, the cake is composed of eggs, and flour and milk and sugar and other
ingredients, blended together.
Observe that containment is exclusive, in two senses. First, an egg is either in a
carton or not in a carton. It cannot be partially contained. Second, if I had
another egg carton, it would be impossible for me to have a particular egg in
both cartons simultaneously.
Some containers can nest, like
Russian matryoshka dolls. Each
container can contain, not only
whatever fits, but also another,
smaller container, which can
contain another smaller container,
and so on. With nesting
containers, it is possible to say
that something in a smaller
container is also in the larger
container that contains the smaller
container, but the contents of the
smaller container are not in the
large container directly. An object can only be directly in one container at a time.
If the carton of eggs is in a grocery bag, we can say that the eggs are in the bag,
but it is more complete to say that the eggs are in the carton in the bag.
Suppose I bring home the dozen eggs from the grocery store, in their protective
container, but my refrigerator is so full that I cannot fit the carton of eggs inside.
To solve the problem, I remove the eggs from the carton, tuck each of the twelve
eggs into little spaces that can accommodate an egg-sized object, and throw the
carton away. Even though I have destroyed the container, the twelve eggs
continue to exist in the refrigerator. This shows that, in ordinary parlance, a
container and its contents can exist independently.
COMPOSITION
We saw in the previous section that containment is not composition; that is, a
container is not composed of its contents. We also saw one kind of composition,
where a cake is composed of its ingredients blended together in such a way that
they can’t be separated again. There are a few more modes of composition—
ways in which objects can be composed of smaller objects—that are relevant to
our ultimate purpose of representing data, software, and semantics.
Remember that an object is composed of
other objects in some kind of relatively
static spatial relationship. Certainly a
cake is an object, because it is composed
of eggs, milk, flour, sugar, and other
objects in a relatively static spatial
relationship: they are all blended together
and will remain that way until the cake is
consumed.
Now let’s think about a frosted cake.
Frosting is applied to the top of the cake
and between the layers. This isn’t quite
blending, because the integrity of the cake and the frosting is still preserved. You
can still see the difference between them, though it would be difficult to separate
them once again. This kind of object composition is called aggregation. An
object is formed from other objects in a way that the components keep their
integrity, but it would be difficult to extract the components after they’ve been
joined together.
For those who know the UML, please note that, in ordinary English and in
COMN, composition is the over-arching term, and aggregation is one particular
kind of composition. Likewise, a component is that which is part of any kind of
a composite.
For those who are familiar with dimensional modeling, please note that what is
called aggregation in that discipline is blending in ordinary English and in
COMN.
In contrast to aggregation, we can have assembly. This is a mode of composition
where components retain their integrity and can even be removed from the
object which they compose, if so desired. A real-world example of an assembly
is an engine. Its parts are connected with screws and other connectors that can be
disconnected and reconnected at will.
Another important mode of
composition is juxtaposition, where
objects are arranged in a fixed spatial
relationship to each other without
being blended and without being
connected to each other. For instance,
dinner plates and silverware are
juxtaposed on a dining table to form a
place setting.
Key Points
A container is an object that can hold other objects in such a way that they can be easily added to and removed
from the container.
Adding objects to and removing objects from a container changes the container’s state but not its composition.
We never say that a container is composed of its contents.
Containment is exclusive. An object can only be in one container at a time, and is either entirely in or entirely
out of the container.
Containers can nest: A container may contain another container.
All objects, except the elementary particles, are composed of other objects.
Four modes of composition important to us are:
juxtaposition
blending
aggregation
assembly
In any given real-world object, it is likely that many modes of composition are present at once.
Chapter Glossary
container : an object that can contain other objects (like an egg carton)
contents : the objects inside a container (like the eggs in an egg carton)
juxtaposition : arranging objects in a fixed spatial relationship without connecting them (like a place setting)
blending : combining two or more objects in such a way that they lose their integrity (like eggs, flour, milk, and sugar in a cake)
aggregation : combining two or more objects in such a way that they retain their integrity, but it is difficult or impossible to separate them again (like a layer cake)
assembly : combining two or more objects in such a way that they retain their integrity, and it is relatively easy to separate them again (like an engine)
Chapter 4
Types and Classes in the Real World
In this chapter we will examine some of the most fundamental concepts that are
essential to the tasks of data modeling and implementation. Once again, we’ll
endeavor to recover the ordinary, non-technical meanings of some words that
have become fuzzy technical terms.
COLLECTIONS OF OBJECTS
Art museums usually contain paintings. Based on the previous chapter, we can
recognize that a museum is a container, and the paintings are its contents.
Art museum curators
often speak of their
collections of paintings.
For example, an art
museum may say that it
has a collection of Monet
paintings, a collection of
Morisot paintings, and a
collection of Renoir
paintings. Unlike
containers and their
contents, collections are
not so strictly tied to
physical relationships.
Let’s see how this works.
It is common in the art world for museums to share their collections with each
other, as a benefit to the art world and the public at large. For example, suppose
there is an art museum on the East Coast of the United States that has a
wonderful collection of oil paintings by the French impressionist Monet. This
East Coast museum will package up part of its collection of Monet paintings and
send them to a museum on the West Coast, for display there for some months,
before they are shipped back to the museum that owns the collection.
Now, while those paintings are on the West Coast, they are still considered part
of the collection belonging to the East Coast museum, even though they are not
physically contained in that museum. So we can see that a collection can exist
inside or outside any particular container.
The paintings on loan to the West Coast museum might be displayed side-by-
side with paintings belonging to the West Coast museum, but they would never
be considered to be part of any collection of the West Coast museum. The East
Coast museum always owns its collection, no matter where the members of that
collection might be. This is evidence that our concept of collection, like our
concept of containment, involves exclusivity. We often describe this exclusivity
in terms of ownership. Something owned by one person or group is not owned
by any other person or group. Something that is a member of one collection may
not also be a member of another collection—except transitively: a collection
may belong to another collection.
We can see from this example that the objects belonging to a collection may or
may not be in the same container at any one time. Although a container, being an
object, always has but one location at any one time, a collection of objects is not
necessarily localized. This gives us a clue that, while a container is an object, a
collection is not an object; a collection is merely a concept.
Our hypothetical East Coast art museum has paintings and other drawings of
many types, including oil paintings, watercolors, charcoal and pencil sketches,
pastels, and engravings. The paintings can have many subjects, including
landscapes, still-lifes, and portraits. The museum curators often speak of the
paintings in their collection according to any of these characteristics, and will
call these collections, too. A painting by Monet might be a landscape and an oil
painting, so that, when a curator speaks of “our Monet collection,” “our
collection of landscapes,” and “our collection of oil paintings,” the oil landscape
by Monet is included every time. Thus, we can see that, in ordinary English, an
object can be in multiple collections at the same time, provided that all of the
collections have the same owner.
SETS OF CONCEPTS
We have seen that a collection is conceptual, even when the members of the
collection are objects. It is also possible to have a collection of concepts. In such
a case, both the collection and its members are conceptual. However, we don’t
usually use the word “collection” in connection with concepts. We will usually
say that we have a set of concepts.
We know that numbers are concepts. Mathematicians have a special notation
that they’ve developed just so that they can talk about sets of numbers (and other
things). It is called set notation. Very simply, a list of numbers is enclosed in
curly braces, as in {1, 2, 3}
The whole expression is called a set. The set just given consists of the numbers
one, two, and three.
One of the interesting things about sets of conceptual entities, such as sets of
numbers, is that you can destroy the set notation that describes the set, but that
doesn’t destroy the set itself, nor its members. A set of numbers, and the
numbers themselves, don’t exist just because they are written down. This is in
contrast to collections of objects. A collection of objects can be destroyed in a
number of ways:
The objects themselves can be destroyed.
The collection can be destroyed by ceasing to consider it as existing.
For instance, the East Coast art museum might give away all of its
Monet paintings to other museums. The paintings continue to exist,
but the collection is destroyed.
Membership of concepts in sets is not exclusive. A single concept can be in
multiple sets at the same time. Consider as an example the number 2. It is in all
these sets simultaneously:
the set of natural numbers
the set of integers
the set of even numbers
the set of prime numbers
In fact, we could go on inventing sets ad infinitum for 2 to be part of.
Although membership in a set is not exclusive, two sets can be exclusive of each
other. For example, the set of even numbers is exclusive of the set of odd
numbers. Any given integer is a member of only one of those two sets. But, as
we have seen, an integer can be a member of many non-exclusive sets at the
same time.
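Here is a small Python sketch of the same point, using invented sets; it is only
an analogy, since these programmatic sets are representations of sets, not the
sets themselves.

```python
# A small sketch restating set membership with Python sets.
naturals_up_to_10 = set(range(1, 11))
evens = {n for n in naturals_up_to_10 if n % 2 == 0}
odds = naturals_up_to_10 - evens
primes = {2, 3, 5, 7}

# Membership in a set is not exclusive: 2 is in several sets at once.
print(all(2 in s for s in (naturals_up_to_10, evens, primes)))  # True

# But two sets can be exclusive of each other: no integer is both even and odd.
print(evens & odds)  # set()
```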
SETS OF OBJECTS
We have covered collections of objects and sets of concepts. We don’t generally
speak of collections of concepts. But we can speak of sets of objects. A set of
objects is very similar to a collection of objects, except without the concept of
ownership. For instance, we could speak of the set of cars in a parking lot at a
given moment, and the set would be understood, even though the cars had no
one owner. A set, like a collection, is a concept, even though members of the set
may be objects.
And, as with sets of concepts, some sets may be exclusive of each other. A given
painting may not simultaneously be a pastel and watercolor, and it may not
simultaneously be a portrait and a landscape. But a painting may simultaneously
be in the pastel set and the landscape set.
The worlds of database and software development have heavily overloaded two
particular words related to classification, namely the words “type” and “class”,
giving them specialized meanings that are actually at the core of one set of
problems plaguing computer science. In our refined terminology we will use
these two words with great caution and specificity. Since “type” and “class” are
synonyms in ordinary English, this choice of specialized meanings is arbitrary
from an English point of view. We will see in part III how these choices are
influenced by object-oriented programming languages, but how quite different
definitions for these terms clarify our thinking.
We will use the word type to mean something that designates a set, usually a set
of concepts but also possibly a set of objects. We will use the word class to
mean a description of the structural and/or behavioral characteristics of
potential or actual objects. In cases where we don’t have enough context to
choose between “type” and “class”, the word “kind” will be used, meaning
“some kind of category, but we’re not sure whether it’s a type, a class, or
something else”.
Types Designate Sets
What does it mean for a type to “designate” a set? We
mean that there is some means by which we can identify the members of the set,
and distinguish those from things that are not in the set. It turns out that there are
many ways to designate sets. Here are some examples of types that designate
sets of concepts:
by naming; for example, “natural numbers” is the name of a well-
known set and therefore that phrase designates the set and is a type.
by selection using some condition; for example, all those natural
numbers which are divisible by two.
by enumeration—in other words, by listing the names of the
members of the set; for example, {1, 2, 3}.
There are additional ways to designate sets which we’ll see later in the book.
It’s important to keep in mind that a type—a designation of a set—is not the set
itself. For example, the phrase “natural numbers” consists of words, but the set it
designates is quite different, consisting only of numbers.
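As an informal Python sketch (invented for illustration, and not COMN
notation), the same three ways of designating a set might look like this:

```python
# An informal sketch of three ways a type can designate a set.

# By enumeration: list the members explicitly.
by_enumeration = {1, 2, 3}

# By selection using some condition: natural numbers up to 20 divisible by two.
by_selection = {n for n in range(1, 21) if n % 2 == 0}

# By naming: a name that stands for a well-known set. Here the name is bound
# to a predicate, since the natural numbers themselves are infinite.
def is_natural_number(n):
    return isinstance(n, int) and n >= 1

print(2 in by_enumeration, 2 in by_selection, is_natural_number(2))  # True True True
```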
Here are some examples of types that designate sets of objects:
by selection using some condition; for example, “all red cars”.
by enumeration; for example, I could designate a set of objects by
listing their names; let’s say, “scissors, ruler, matches, twist tie”.
by location: This is a special case of selection. For example, I could
designate a set of objects as “those objects found in the junk drawer
in my kitchen.” In fact, this set could be the same set as I just
designated by enumeration. It is significant that objects can be
designated by their location in space and time. Concepts cannot be so
designated.
Again, there are additional ways to designate sets of objects, and again, the
designation of the set is not the set itself.
Three Aspects of Types and Classes
With regard to types and classes, we have three things separately:
the type or class, which designates, through some means (condition,
description, enumeration, etc.), the things that are members of the set
the actual members of the set designated by the type or class
the name of the type or class; for example, “natural number”,
“elephant”.
Chapter Glossary
type : something that designates a set
class : a description of the structural and/or behavioral characteristics of potential or actual objects
collection : a set of objects having a single owner
Key Points
Objects may belong to collections. An object may belong to several collections, but only if all the collections
have the same owner.
The objects belonging to a collection need not be in the same container or even in the same vicinity.
A collection is a concept, even though it consists of objects.
We generally don’t speak of collections of concepts. We speak of sets of concepts. A concept may be a
member of more than one set at a time.
We may also have sets of objects.
Some sets, of objects or concepts, may be exclusive of each other.
Sets and collections of objects may be destroyed by destroying the objects themselves, or by simply ceasing to
consider the set or collection to exist.
Sets of concepts are not destroyed merely by destroying some representation of them.
The terms “type” and “class” are synonyms in English, but are not synonyms in information technology, nor
in COMN.
We will use the term “kind” when we don’t care to distinguish between type and class.
Types designate sets; classes describe objects.
A type may designate a set through many means, including naming, selection, enumeration, and (for sets of
objects only) location.
A class indirectly designates the set of all potential or actual objects which match its description.
A type or class is not the same as the set it designates.
Part II
The Tyranny of Confusion
Our thinking is dominated by words. When those words have become
overloaded with multiple ill-defined and contradictory meanings, we cannot
think clearly. The words themselves keep us in a state of confusion. How will we
break out of this tyranny? By simplifying and clarifying our terminology.
Part I of this book returned us to the ordinary English meanings of words that we
have co-opted for special purposes in the field of information technology. For
those with knowledge of established modeling notations and/or programming
languages, it will be important to re-interpret those notations using the clarified
vocabulary of everyday English. This next section contains a chapter on each of
five major modeling notations. Each chapter provides a brief overview of the
notation, and focuses on what those notations really mean in ordinary English.
This will bring out a number of intellectual short circuits inherent in each
notation that limit our ability to analyze requirements and design solutions.
If you know one of these notations, it is very important that you read the relevant
chapter, so that you can make the translations necessary from the terminology
you are familiar with to the terminology of COMN. By learning the refined
terminology, new vistas of analysis and design will open up to you. You may be
surprised at all the ideas you took for granted that turn out to have more to them
than the notation teaches. But you will only be able to gain these insights if you
can rise above the terminology and related concepts that are integral to the
notation you already know. The chapters in this section are intended to help you
do that.
You may read just the chapters that apply to the notations with which you are
familiar. You may read all the chapters if that suits your interest. And, if you
don’t have a background in any of these notations, feel free to skip this entire
section.
The same example data modeling problem is used in each of these chapters so
that if you are reading more than one chapter it will be easy to compare the
notations to each other.
If you are an enthusiastic user or supporter of one of the notations discussed in
the following chapters, please keep in mind that each chapter is not intended to
be a complete presentation of the notation. Rather, it is intended to orient the
reader who is already familiar with the notation to how the same concepts are
represented in COMN, and to highlight the areas where COMN can represent
things that the subject notation cannot.
Chapter 5
Entity-Relationship Modeling
Entity-relationship (E-R) modeling was formally proposed by Peter Chen in
1975 [Chen 1976], and is almost certainly the dominant form of data modeling in use
today. The notation of E-R modeling has evolved significantly since Chen’s
paper, and has forked into several notations, including Integration DEFinition for
Information Modeling (IDEF1X), Barker-Ellis, and Information Engineering (IE).
IE notation is common but not standardized, and exists in several variants. For
the purposes of this chapter, we will use the variant of IE
notation implemented in a Microsoft Visio drawing tool stencil.
E-R modeling defines three stages of data modeling: conceptual, logical, and
physical. We will start our review of E-R modeling with logical data models,
where the focus is on the design of data structures to hold data relevant to the
problem to be solved.
The relationship lines connecting the rectangles use what is called “crow’s feet”
notation, and indicate how many records matching a foreign key are found in the
table at each end of a relationship. Two lines crossing a relationship line mean
“one and only one” record at that end. In Figure 5-1, the Person ID of each
Person Address record references only one Person record. This makes sense,
since Person ID is the primary key of the Person table. A primary key value is
always unique and will always reference exactly one record. Reading the same
relationship line in the other direction, we see a circle and then what look like
three lines fanning out to the Person Address record. The circle indicates
optionality, and the three lines indicate “many”. This indicates that one Person
record may be referenced by any number of Person Address records, including
none.
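Although Figure 5-1 is not reproduced here, a minimal sketch of the structure it
implies, assuming table and column names such as person and person_address, can
be written against SQLite:

```python
# A sketch of the one-to-many relationship described above, using SQLite.
# The table and column names are assumptions based on the text.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled

con.execute("""
    CREATE TABLE person (
        person_id INTEGER PRIMARY KEY,
        name      TEXT NOT NULL
    )""")
con.execute("""
    CREATE TABLE person_address (
        person_address_id INTEGER PRIMARY KEY,
        person_id         INTEGER NOT NULL REFERENCES person(person_id),
        street            TEXT,
        city              TEXT
    )""")

con.execute("INSERT INTO person VALUES (1, 'Pat Example')")
# One Person record referenced by many Person Address records (including none).
con.executemany(
    "INSERT INTO person_address (person_id, street, city) VALUES (?, ?, ?)",
    [(1, "10 Main St", "Basking Ridge"), (1, "2 Lindsley Rd", "Basking Ridge")],
)
print(con.execute("SELECT COUNT(*) FROM person_address WHERE person_id = 1").fetchone())  # (2,)
```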
It should be pointed out that the model of Figure 5-1 is not ideal, because it
creates the possibility that the same data might be stored repeatedly in a
database. For instance, two people who live at the same address will have Person
Address records that are identical except for the Person ID foreign key values.
But this design has been chosen to illustrate issues relevant to COMN, so, for
now, please ignore these otherwise important design issues.
MULTIPLE LEVELS OF ABSTRACTION
It is customary to use E-R data models at three levels of abstraction:
conceptual (the highest level)
logical
physical (the lowest level)
Each rectangle in a conceptual data model may relate to one or more rectangles
in a logical data model, and each rectangle in a logical data model may relate to
one or more rectangles in a physical data model.
A conceptual data model is similar to a logical data model, in that each rectangle
represents a logical record type and eventually a table whose records conform to
that type. But a conceptual data model simplifies a logical data model in two
important ways:
1. It is customary to hide the display of data attributes within the
rectangles of a conceptual data model, in order to allow the modeler
to focus more on the relationships between entities.
2. A conceptual data model will depict many-to-many relationships
between entities with a simple relationship line. In a logical data
model, such many-to-many relationships are “resolved” by inserting
an associative entity between the pair.
A conceptual data model is usually developed as a first-order approximation of a
logical data model. Data-related details are omitted in order to support the early
stages of model development, where too much attention to data-related details
could distract from the task of documenting the data requirements that are
present in a given set of business requirements. The conceptual data model is
then used to derive a logical data model, where details such as data attributes and
associative entities are added.
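A minimal sketch of such a resolution, assuming an invented many-to-many
relationship between person and address, might look like this in SQLite:

```python
# A sketch of resolving a many-to-many relationship with an associative entity.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE person  (person_id  INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE address (address_id INTEGER PRIMARY KEY, street TEXT);
    -- The associative entity: one row per (person, address) pairing.
    CREATE TABLE person_address (
        person_id  INTEGER REFERENCES person(person_id),
        address_id INTEGER REFERENCES address(address_id),
        PRIMARY KEY (person_id, address_id)
    );
""")
# Two people can now share one address, and one person can have many addresses,
# without repeating either the person data or the address data.
```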
With just a little training, non-technical personnel can interpret and interact with
conceptual data models. When interviewing business stakeholders, a conceptual
E-R data model is a useful tool for capturing data requirements, validating
terminology, and scoping application efforts.
Lower in the layers of abstraction, a physical data model enables the expression
of the details of how a data design will be implemented in a particular database
management system (DBMS). A rectangle in a physical data model no longer
represents a logical record type, but rather a physical record type as the layout of
a row in a table. A list of names and data types inside the rectangle expresses the
names and types of the table’s columns. The names of tables and columns are
spelled in a way that is dictated by the technical requirements of the chosen
DBMS; the most noticeable difference from names in the logical data model is
that physical data names do not usually contain spaces.
Just as a logical data model can contain associative entities that are not depicted
in a conceptual data model, a physical data model can contain tables that are not
represented in a logical data model; for example, a physical data model might
include tables that combine data from multiple logical data model entities for
faster read access. In a disciplined data modeling process, such tables are limited
solely to those needed for some implementation purpose, and do not represent a
short circuiting of the logical data modeling process.
This approach to modeling at three levels of abstraction can be thought of as a
stack of five planes. The uppermost plane is where most of the things mentioned
in business requirements are to be found. This plane is called the real world.
The next plane down is inhabited by the conceptual data model. Many entities in
the conceptual data model correspond to real-world entities, and are named so as
to reflect that fact. The middle plane is inhabited by the logical data model. Most
of the entities and their relationships on this plane correspond one-for-one to
those on the conceptual data model plane. As mentioned above, some logical
data model entities correspond to many-to-many relationships from the
conceptual data model. The next plane down is inhabited by the physical data
model. Again, most things correspond one-for-one to the things on the next
higher plane. Finally, the bottom-most plane is the database implementation. In a
fully model-driven database implementation, everything in the database has a
one-for-one correspondence to something in the physical data model. In fact,
there are data modeling software tools available that support the generation and
maintenance of SQL databases directly from physical data models. These tools
make possible the ideal of model-driven development, where a connection can
be maintained between the business requirements captured in the conceptual data
model, through the several planes of modeling, all the way to the tables of the
database.
NoSQL Arrays and Nested Data Structures
Logical and physical E-R notation
assumes that database implementation will be in a DBMS, such as a SQL
DBMS, that stores data in tables. It does not, therefore, have any way of
expressing two modes of data storage that NoSQL databases make possible:
arrays
nested data structures (often called “nested documents” by NoSQL
DBMSs)
Arrays can be useful ways for storing some small and simple data structures. For
example, it might be preferable to store a person’s list of telephone numbers
directly as an array attribute on the Person entity of Figure 5-1. This avoids the
overhead of a separate table and foreign key just to store phone numbers. (If one
is to do this, one must consider whether the telephone numbers need an index to
support fast searching, and, if so, whether the NoSQL DBMS chosen for
implementation can index array-type data attributes. We’ll look at that question
in greater depth in chapter 17.) E-R notation has no way to express an array.
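A minimal sketch of the array alternative, with invented field names, might look
like this as a document:

```python
# A sketch (field names assumed) of storing phone numbers as an array attribute
# directly on the Person entity, as a document DBMS would allow.
person_doc = {
    "person_id": 1,
    "name": "Pat Example",
    "phone_numbers": ["+1-555-0100", "+1-555-0199"],  # array attribute, no child table
}

# The SQL-style alternative needs a separate table and a foreign key, roughly:
# person_phone(person_phone_id PK, person_id FK, phone_number)
```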
We’ll look at nested data structures in the next section.
Both arrays and nested data structures are modeled in E-R notation in a way that
corresponds to their necessary implementation in a SQL database. This is what is
shown earlier in Figure 5-1. The array or nested data structure is split out into its
own table. That table has a foreign key back to the table from which it was split.
Additionally, that table must have its own primary key.
NoSQL DBMSs support the direct aggregation of arrays and nested data
structures in enclosing data structures, without the use of keys. E-R notation has
no way to show data structures that are related to each other without keys. As a
result, E-R notation cannot be used for NoSQL database design.
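A minimal sketch of such key-less aggregation, again with invented field names:

```python
# A sketch of a nested data structure: address sub-documents aggregated
# directly inside the enclosing person document, with no foreign keys
# relating the two.
person_doc = {
    "name": "Pat Example",
    "addresses": [                       # nested documents, not a separate table
        {"street": "10 Main St", "city": "Basking Ridge"},
        {"street": "2 Lindsley Rd", "city": "Basking Ridge"},
    ],
}
```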
Some organizations have nonetheless used E-R notation for NoSQL designs, and
use notes or other means to communicate to humans that a relationship line
between two entities / logical record types does not represent a foreign key
relationship. Such models can be useful, but cannot be used in a model-driven
development of a NoSQL database, because the modeling tool can’t tell the
difference between foreign-key relationships and aggregation of arrays or nested
data structures in a single “entity”. Such models also require extra discipline on
behalf of the data modeler, to make sure that the meaning of the non-foreign-key
relationships is preserved as the model is updated, since the modeling tool used
can’t know the difference.
Modeling the Real World
As we have seen, the so-called entities of conceptual
and logical E-R data models are typically given names that imply the real-world
entity types that they are about. Consider our logical data model above which
has a logical record type named “Person”. The name of the logical record type
clearly implies that records of this type hold data about the real-world entity type
known as “person”.
Sometimes data modelers will add so-called entities to a conceptual model to
depict things in the problem space about which no data will be stored. These are
in fact not logical record types at all, but are actually real-world entity types.
Data modeling tools enable a data modeler to tell the tool not to carry such
“entities” forward to the logical data model, but no visual indication is given that
the symbol does not represent a logical record type. This shows that it would be
very useful if E-R notation could graphically distinguish real-world entity types
from logical record types.
Data in Software
E-R notation was developed in order to support the design of
databases. As such, it did not take into account any of the needs of software
development. Software developers cannot use E-R notation to represent their
software designs. This leaves quite a gap between the modeling notation used by
database developers and the modeling notations or languages of software
developers.
TERMINOLOGY
Let’s review the terms that E-R modeling has specialized, and compare them to
their ordinary English meanings and their use in COMN.
Entity
As we have seen above, the E-R term “entity” can mean any of the following
things:
a logical record type
a set of records that conform to the logical record type
(in a conceptual model) a real-world entity type
Calling a logical record type an entity is convenient shorthand, and can’t be
called incorrect, since the term “entity” just means “thing”, and everything is a
thing. But it only works because E-R notation cannot express the idea of
individual things, but only types of things—specifically, types of logical records.
Taking the ordinary term for thing—entity—and using it to mean a type of
logical record or a set of records makes it difficult or impossible to talk about
individual records.
In a conceptual model, an E-R entity may represent a type of real-world thing.
Again, this makes it difficult to model or discuss an individual thing.
As will be seen in Part III of this book, it can be very valuable to be able to talk
about individual things, not just types of things.
As mentioned above, the presence of a so-called “entity” in an E-R data model
implies that there is or soon will be a table of records in a database
corresponding to the entity of the model. Thus, the rectangle of an E-R model
has a number of explicit and implicit meanings, depending on the kind of model
in which it is found and the context in which it is discussed. E-R notation does
not make it possible to indicate the exact meaning using a graphical symbol.
Conceptual
As mentioned, the term “conceptual” in “conceptual data model” is used to mean
“first approximation” of a logical data model. This is analogous
to the use of “concept” in an “artist’s concept drawing” of a building: it’s just
supposed to give the viewer a preliminary idea or “concept” of the final result.
Using “conceptual” this way makes it more difficult to talk of “concepts” in
distinction to “objects”. Both are important when discussing data, as data exists
both as concepts (at the logical level of abstraction) and as objects (at the
physical level of abstraction). When using the word “conceptual”, we must pay
close attention to the context. A concept can be a very precise thing, and treating
“conceptual” as a synonym for “approximate” can prevent us from seeing the
intended precision.
E-R Terms Mapped to COMN Terms
A mapping from E-R terms to the corresponding COMN terms is given in the table
below. Where more than one COMN term is given for a single E-R term, it
indicates that the E-R term is ambiguous.

E-R Term → COMN Term
(no equivalent) → composite type
conceptual → approximate
The lines between the class rectangles in Figure 6-1 express what the UML calls
associations. They indicate that objects of the classes will have “connections” to
each other. Just as objects are instances of classes, links are instances of
associations. Just as an object has a slot to hold the value of each class attribute,
a link has a slot to hold a reference to each object at the ends of the association.
For example, a link that is an instance of the association between a Person object
and one or more Address objects would have exactly one reference to the Person
object and one or more references to Address objects.
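A rough Python analogy, with invented classes, may make the distinction
concrete: the association corresponds to the reference-holding attribute, and a
link is the particular set of references held by particular objects.

```python
# A rough Python analogy (not UML notation) for associations and links.
class Address:
    def __init__(self, street):
        self.street = street

class Person:
    def __init__(self, name, addresses):
        self.name = name
        self.addresses = addresses  # the association end: one or more Address objects

home = Address("10 Main St")
office = Address("2 Lindsley Rd")
pat = Person("Pat Example", [home, office])

# The link instance: this particular Person object holding references
# ("slots" in UML terms) to these particular Address objects.
print(pat.name, [a.street for a in pat.addresses])
```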
in the UML are used for higher-level database design, and then database-specific
details, including keys, are specified using the Information Engineering (IE)
variant of entity-relationship (E-R) data model notation.
Thus, as a graphical notation for database design, the UML cannot stand on its
own.
Middling Level of Abstraction
The UML is aimed at just about the same level
of abstraction as an object-oriented program. The classes of the UML and of a
program are both analogous to similarly named real-world entity types (concepts
and real-world objects).
One can use the UML to denote real-world classes and real-world objects,
provided that one makes it clear in notes on a diagram as to which classes and
objects should be interpreted as existing in the real-world and not in a
computer’s memory.
The UML depends on the notion of a “slot” which it does not define. The UML
also does not enable the depiction of a “slot” in any of its graphical symbols.
This is a pretty clear indication that the UML considers lower-level physical
implementation details to be taken care of by things that should not be
diagrammed. This approach makes it difficult to use the UML to express
implementation details with the rigor and completeness necessary for model-
driven development. It also requires the assistance of other notations, such as E-
R, for complete specification of a database design.
Lack of Concept
The UML defines an object as “a discrete entity with a well-
defined boundary and identity that encapsulates state and behavior; an instance
of a class” [Rumbaugh 1999, p. 360]. It defines a class as “the descriptor for a
set of objects.” [ibid., p. 185]
This is all well and good, but the UML lacks any ability to describe entities that
do not have state or behavior; that is, concepts. Concepts are expressible in the
UML, but only implicitly and only in connection with classes, objects, or other
things that the UML can express.
Concepts appear frequently in requirements, and an inability to model them
directly means that a model can only represent things related to a concept. For
example, an order is a concept. A model often focuses on the record of an order,
which can be represented in the UML, but the order itself is just the idea that a
customer has made a request of a supplier, and the order might not even be
recorded—it might merely be spoken. Another important concept to represent is
that of a role played by an actor. In the examples given in writings about the
UML, a role is a structural piece of some object, rather than a concept
independent of any object. Actors, such as humans, can take on and shed many
roles, and the inability to model this apart from an object seems rather limiting.
If one needs to represent a concept and how it, and not a record of it, relates to
other concepts in the problem space, one will need to use stereotyping. It seems
that something as basic as “concept” ought to have a direct representation in a
modeling notation.
UML Terms Mapped to COMN Terms
A mapping from UML terms to the corresponding COMN terms is given in the table below. Where two COMN terms are given for a single UML term, it indicates that the UML term is ambiguous.
UML Term                     COMN Term
class                        class or type
implementation class         class
attribute                    component of a type or class; possibly a data attribute
type stereotype of class     type
data type                    type where the members of the type are simple concepts
“slot”                       object component of a class
relationship                 no direct equivalent; see the various kinds of UML relationships listed below
association                  relationship
no UML equivalent            composition, which is the over-arching term for the formation of composite things from component things
composition                  assembly with the additional constraint that destruction of one component leads directly to destruction of all components
aggregation                  ill-defined in the UML, so no COMN equivalent
no UML equivalent            aggregation, which is the form of composition of the components (UML attributes) of a type or class
Key Points
The UML was designed to support the specification of software systems, and it does this well. However, it
lacks a few features needed for data modeling.
The UML lacks the concept of a key, which is essential to data modeling. It can only express the identification
of objects by their physically distinct existence.
The UML aims at a middling level of abstraction. It can represent types and classes, and objects in the real
world. It cannot represent many things at a lower, physical implementation level, making it difficult to use for
fully specifying a database design.
The UML lacks direct support for modeling concepts as distinct from objects.
The UML does not distinguish between subclassing and subtyping.
REFERENCES
[Rumbaugh 1999] Rumbaugh, James, Ivar Jacobson, and Grady Booch. The Unified Modeling Language
Reference Manual. Reading, Massachusetts: Addison-Wesley, 1999.
[Blaha 2013] Blaha, Michael. UML Database Modeling Workbook. Westfield, New Jersey: Technics
Publications, LLC, 2013.
Chapter 7
Fact-Based Modeling Notations
While working at Control Data Corporation in the Netherlands in the early
1970s, Dutch computer scientist Sjir Nijssen developed what came to be known
as the Natural-language Information Analysis Methodology, or NIAM, which
incorporates fact-based modeling. The unique central aspect of fact-based
modeling is an approach where modeling starts with statements of facts about a
problem domain, provided by domain experts in their own language. The data analyst deduces patterns, called fact types, from these fact statements. A fact type
is a statement in natural language that has one or more blanks or “roles” to be
filled in. The roles are played either by object types or by label types.
Several very similar graphical notations, and associated methodologies, have
been developed to support fact-based modeling, including Object Role Modeling
(ORM) and Fully Communication-Oriented Information Modeling (FCO-IM).
The examples in this section were drawn in ORM notation using the NORMA tool and Microsoft Visual Studio [NORMA].
Not shown in Figure 7-1 are additional constraints that can be imposed on any of
the relationships in a model. Fact-based modeling has a full set of constraint
symbols that allow the constraints of reality and of business requirements to be
expressed. This captures more meaning in the model and increases the likelihood
that the implementation will meet requirements.
Incompleteness
The latest edition of Halpin and Morgan’s book [Halpin 2008] positions ORM as a tool that should be used to ensure that a conceptual model is valid before proceeding to use E-R modeling or UML modeling to express physical database design details. In this approach, the details of the mappings from ORM to the final database design can be lost between the models.
The FCO-IM book [Bakema 2002] does not recommend the use of other established notations.
ORM has a special place for measures, which are quantities of some units; for
example, centimeters or kilograms. Measures in COMN are discussed as special
kinds of composite types in chapter 12.
The view of relationships and roles in fact-based modeling is very similar to the
view in COMN. This should become evident as you read chapter 15 on
relationships and roles.
Doubles and Quadruples
Not every statement that we would like to make about things comes in the form of a triple. Sometimes we need to be able to say something that has four parts and can’t be sensibly subdivided.
For instance, consider this statement: John threw the ball to Mary.
There is no good way to reduce this statement to three parts. There really are
four parts, and dropping any part leaves out some important meaning.
Here are two possible approaches to reducing this four-part statement to triples.
In the first approach, we place the direct object (the ball) and the indirect object (Mary) in their own triple, and then make that triple the object of another triple that includes the subject and verb (predicate). This can be expressed in pseudo-code using functional syntax as:
// triple #1
thrownToSomeone(ball, Mary)
// triple #2
threw(John, thrownToSomeone(ball, Mary))
You can see that the second triple incorporates the first triple as its third part.
The second approach places the subject, predicate, and direct object in a triple, and then makes that triple the subject of another triple.
// triple #1
someoneThrewSomething(John, ball)
// triple #2
thrownTo(someoneThrewSomething(John, ball), Mary)
Logical predicates don’t care how many arguments they take: any number greater than zero will do. A logical predicate corresponding to the above statement in functional notation might look like this:
threw(< Person_t who, Object_t what, Person_t toWhom >)
The above statement would appear in the same functional notation as:
threw(John, ball, Mary)
Forcing an extra level of factoring of such statements into triples could be disabling to some Big Data applications.
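To make the contrast concrete, here is a small Java sketch, with made-up names, of the same statement held first as nested triples and then as a single n-ary fact. It illustrates the point only and is not part of RDF or COMN.

// A sketch only: nested triples versus a direct n-ary fact.
public class ThrowStatements {
    // A generic triple: subject, predicate, object. The object may itself be a triple.
    public record Triple(Object subject, String predicate, Object object) {}

    // A direct four-part fact: the predicate "threw" is carried by the record's name,
    // and who, what, and toWhom fill its three roles.
    public record ThrewFact(String who, String what, String toWhom) {}

    public static void main(String[] args) {
        // First approach from the text: the inner triple becomes the object of the outer one.
        Triple inner = new Triple("ball", "thrownToSomeone", "Mary");
        Triple outer = new Triple("John", "threw", inner);

        // Direct n-ary form: no extra level of factoring is needed.
        ThrewFact fact = new ThrewFact("John", "ball", "Mary");

        System.out.println(outer);
        System.out.println(fact);
    }
}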
There is a lesser problem in the other direction, when we have only a subject and
a verb/predicate; for example, Horses exist.
Unicorns do not exist.
These statements could be represented as triples as long as there is a placeholder
for the missing object. Such statements do not occur as frequently as those in the
form of triples and quadruples, and the extra overhead of the missing object
placeholder is probably not a performance problem.
OWL
The Web Ontology Language, or OWL, is a language for expressing ontologies.
It has its own implicit ontology, described in the abstract syntax of the language.
COMN can be used to represent ontologies, because its symbology enables the
depiction of real-world things, their relationships, and their properties. However,
COMN has at its foundation a strong distinction between things that are concepts
and things that are material objects. This distinction is present in order to ensure
that COMN can represent not only real-world things, but also the real-world
material objects of which computers are made, and can show how the
meaningless states of those objects can be used to represent meaning.
This strong distinction in COMN leads to very different uses of words like type,
class, and object than in OWL. Despite these differences, there is nothing in
COMN that is incompatible with the abstract syntax of OWL. Consult the
terminology mapping table in the Terminology section below for guidance.
statement             an ordered list of three values. The second value (the RDF predicate) identifies a logical predicate with two variables. The first and third values (the RDF subject and RDF object, respectively) supply the values for the predicate’s two variables. The statement forms a logical proposition.
no RDF equivalent     logical predicate: a logical formula having one or more variables which, when the variables are bound, forms a proposition
class                 type
no OWL equivalent     class: a description of the structure and/or behavior of material objects
property              attribute
Key Points
The field of semantics today is dominated by the Resource Description Framework (RDF) and the Web
Ontology Language (OWL).
RDF statements and triples are inefficient for representing information that is not in the form of a logical predicate with two variables.
COMN uses words like type, class, and object differently than OWL, but their abstract syntaxes are
compatible.
COMN offers the field of semantics a single modeling notation that can represent the real world,
representations of the real world in data, and the static structure of software. This can help ensure a complete
and correct translation of an ontology into a running system.
Chapter 9
Object-Oriented Programming Languages
Programming languages have undergone almost continuous evolution since they
were first introduced as a way to express through symbols what instructions
should be given to computers. A major change in programming occurred when
the Simula programming language introduced the idea of objects in the late
1960s. The concepts of objects and classes were further developed in Smalltalk, C++, and other programming languages.
Today, most programming languages in wide use (other than C) are object-
oriented and don’t make much of a fuss about it. This chapter will focus on two
of the currently most popular object-oriented programming languages, namely
Java and C#.
variable of class type or of interface type     a computer object whose class is a pointer or reference to a computer object
variable of array type                          a computer object whose class is a pointer or reference to a computer object representing an array
no C# equivalent                                variable: a symbol which may or may not be represented by a computer object in a compiled program
object                                          computer object
value                                           value
no C# equivalent                                state: the meaningless physical state of a computer object
Key Points
Object-oriented programming languages inherited types from early programming languages that specified
both value sets and memory structure.
COMN separates the designation of a set of values from the description of computer object structure and
behavior. Types designate sets without specifying memory structure. Classes describe computer objects in
terms of their structure in memory and the routines (methods) exclusively authorized to operate on them. The
otherwise meaningless physical states of objects only have meaning if their classes represent types.
Part III
Freedom in Meaning
Part I of this book returned us to the ordinary English meanings of words that we
have co-opted for special purposes in the field of information technology. Each
chapter in part II reviewed a modeling notation or language, in order to prepare you to see the issues in those notations that COMN addresses, issues that may not otherwise be evident.
Part III introduces COMN in earnest. It is the knowledge in part III, built on the
foundation of the clear and simple meanings of words introduced in part I, that
will enable you to use COMN to develop models of the real world, of data, and
of software that are complete and precise, and that can become, with proper tool
support, the basis of a highly efficient and accurate model-driven development
process of data and software systems.
Chapter 10
Objects and Classes
Recall from chapter 2 that we have restored the words entity, object, and concept
to their ordinary English meanings—meanings that these words possessed for
centuries before computing machines were even imagined, let alone constructed.
For your reference, here are the definitions again that give those meanings, all of
which are quoted from Merriam-Webster’s Online Dictionary.
entity 2 : something that has separate and distinct existence and objective or conceptual reality
object 1a : something material that may be perceived by the senses
concept 1 : something conceived in the mind : thought, notion
In this chapter we will take steps toward using these ordinary English words to describe
software and data, but without distorting their ordinary meanings. It is the
distortions that have made it so difficult for us to think clearly about real-world
problems and their solutions in computer systems.
We will go down to a very low level of abstraction, specifically the level of
computer hardware. We don’t want to stay there, because designing data one
storage location at a time or developing software one computer instruction at a
time is a laborious and inefficient way to work. But we do have to glance at this
basement level, because that’s where the foundation is. Understanding that
everything rests on very physical objects and their very physical states gives the
more abstract things we do a solid foundation.
If you are familiar with any of the notations or languages discussed in part II of
this book, make sure you refer frequently to the relevant terminology maps at the
end of each chapter in part II, to keep your mind clearly focused on the simpler,
more natural terminology of COMN.
MATERIAL OBJECTS
Figure 10-1 below shows a fundamental example of an object in the ordinary
English sense of the word “object”. The object pictured is a rock. It is certainly
material, and it can certainly be perceived by the senses.
Figure 10-1. A Rock; a Material Object
From this point onwards, unless it is already clear from the context, I will qualify
the word “object” with the adjective “material” to mean an object in the natural
language sense.
Meaning of States
Some material objects have states with intrinsic meaning. Consider the lighted
sign in Figure 10-3.
This stateful material object does have intrinsic meaning. If this sign is over a
doorway to a studio, and it is lit, then it indicates that there are live microphones
inside the studio transmitting all sounds to a recording device. This object has
two states and two meanings, and the meanings are fixed to the states.
In contrast, consider the flashlight again. I have conducted a fun little experiment
with several groups of people. I hold a flashlight in front of the group, and,
without any other preamble, ask everyone in the group to call out the state of the
flashlight as it changes. I then begin to operate the flashlight’s switch, and
everyone dutifully calls out, in unison, “on … off … on … off”.
I then stop the experiment and tell everyone that this time I want them to call
out, not the flashlight’s state, but rather its value. As I begin to operate the
switch, unanimity is gone. I still hear a few calling out “on … off”, but then I
hear some calling out “one … zero” (those are the programmers), some calling
out “useful … useless”, some calling out “light … dark”, and some saying
nothing except by the confused looks on their faces.
The point is this: There is widely accepted nomenclature for the states of a
flashlight (on and off), and no explanation is needed in order for a person
familiar with natural language to name each state. But there is not such a widely
accepted nomenclature for the values of these states. That is because state and
value are very different things. At least in the case of a flashlight, meanings must
be assigned to the states—or not: if one wishes to use the flashlight simply to see
in the dark, no assignment of meaning is required.
In summary, we have learned the following about objects and their states:
The states of some objects have intrinsic meaning, while the states of
other objects have no intrinsic meaning.
It is not always necessary to assign meanings to the states of an object
in order for the object to be useful.
Methods
Object-oriented technologists talk much about methods, which, in terms of
material objects, are mechanisms that are part of those objects that enable one to
change their states. Let us consider the methods that are part of the material
objects we have considered so far.
rock: no methods (which makes sense, since it has but one state)
flashlight: one method, the on-off switch
lighted sign: a method to turn the sign on or off
Old North Church: a method to light either lantern
Just to keep you nimble, here is one more material object to consider: a tricycle.
Figure 10-5: A Tricycle
Summary
In summary then,
1. A material object is an object in the natural-language sense of the
word; in other words, something you can see and touch.
2. Some objects have states and methods to change those states (for
example, a flashlight), and some do not (for example, a rock).
3. Objects capable of having more than one state are called stateful.
Objects having only one state are called stateless.
4. The states of some objects have intrinsic meaning, while the states of
other objects have no intrinsic meaning.
5. It is not always necessary to assign meanings to the states of an object
in order for the object to be useful.
6. We sometimes have stateful objects with more states than meanings.
7. Some material objects have methods but not states.
8. For practical purposes, the meaningless physical states of material
objects are often numbered. The states of objects with two states are
often numbered 0 and 1.
9. For different purposes, at different times we may assign different
meanings to the same states of an object.
10. Objects are often combined into a composite object. In general, the
composite object has a number of states which is the product of the
number of states of its component objects.
SUMMARY
Computer objects are entirely physical. Hardware objects have physical states
that, for the most part, have no meaning. We refer to the states of these hardware
objects using numbers, but that doesn’t necessarily mean that the states represent
numbers. They may; they may not.
Software objects can be constructed from hardware objects and other software
objects in a tree-like fashion, but—at least as far as we know at this point—the
composite states of software objects have no more intrinsic meaning than the
states of the hardware objects of which they are composed.
Objects as seen in this light may have states that are useful even though they
have no meaning. Think of the flashlight whose “on” state is useful for seeing in
the dark, but which has no meaning. When one assigns meaning to an object’s
state for some signaling purpose, the state itself still does not express the
meaning. A British soldier could stare all night at the two lanterns in the Old
North Church tower and never discover the meaning assigned to them. In
general, the meanings of an object’s states must be supplied from some source
outside the object itself. In the next chapter we’ll see how meaning is supplied.
In addition to the states of objects having no intrinsic meaning, so far the concepts of “value” and “data” are also not associated with objects. This may be shocking to many in the industry, and it is a major departure from established thought and terminology, but it will be justified in the next chapter.
Key Points
A material object—that is, an object in the natural-language sense of the word—is something you can see and
touch.
A stateful material object is an object that has more than one state. A stateful material object may have
mechanisms to change its state.
The states of material objects may or may not have any meaning. Their states may be assigned meaning. Their
states might be useful apart from any meaning.
Computers are composed of stateful material objects which we call hardware objects.
Software objects are composed of hardware objects and/or other software objects, in a tree.
In general, the states of software objects have no more meaning than the states of the hardware objects of
which they are composed. In general, meaning must be assigned to states by something other than the objects
having those states.
Chapter Glossary
computer object : a stateful material object whose state can be read and/or modified by the execution of computer instructions
hardware object : a computer object which is part of the physical composition of a computer
software object : an object composed of hardware objects and/or other software objects by exclusively authorizing only certain routines to access the component objects
method : a routine authorized to operate on the components of software objects of the class of which it is a part
encapsulate : to authorize only a certain set of routines (called methods of the class) to operate on the components of objects of a class
state : the physical condition of an object
stateful : having more than one state
stateless : having only one state
value : a concept that is fully specified by a symbol for the concept; also, a symbol for such a concept
Chapter 11
Types in Data and Software
Now that we’ve established that computers are composed of material objects,
most of which have meaningless physical states, we need to find a way to
express meaning. In this chapter we’ll learn how types provide meaning. When
we have a good handle on types, we’ll realize that that’s where we focus our data
analysis and logical data design efforts, and we’ll know how to express that in
COMN.
As object-oriented programming developed, the word “class” came to mean something different than “type”. “Type” for the most part retained its early meaning of set of values plus storage specification.
term “type” was adopted early in the history of programming language
development, types were generally quite simple or “primitive”, specifying little
more than sets of letters and/or numbers. In fact, in some contexts the terms
“primitive type” and “data type” are considered synonyms, and the adjective
“primitive” considered unnecessary. In contrast, the “class” of object-oriented
programming is associated with the enablement of programmers to define
structures of arbitrary complexity, leading to a terminology that considers
“classes” to be more powerful in their descriptive capabilities than mere
“[primitive/data] types”. Both class and type retained their use as specifying
storage in addition to designating a set. In programming and database
development (though not in data modeling), both terms lost their meaning
related to classification.
Unfortunately, this vocabulary leaves the programmer or database developer
with several problems. One problem is that it becomes difficult for the analyst or
designer to talk of types and classes of things in the real world without confusing
those terms with the very different meanings of “type” and “class” in data and
software. Modeling the real world, and translating those models into their
representations in software and data, can get quite confusing. Making this worse
is the fact that the fields of semantics and philosophy use the terms “type” and
“class” differently than their ordinary English meanings and differently than
their programming-language and DBMS meanings.
Another problem is that there is no substantial difference between the
programming / DBMS meanings of “type” and “class” other than degree of
complexity. We are left with two words for things that appear, at least on the
surface, to be very similar.
SIMPLE TYPES
We have seen how hardware objects are simple objects, having no components
(from the point of view of software), and how we will rarely deal with hardware
objects directly. We leave that difficult and tedious work to compilers and
DBMSs.
Not so with simple types. Database designers must deal with simple types, and
composite types, throughout the analysis, design, and implementation phases of
any project.
The implementers of DBMSs and programming languages have done us a great
favor by creating large collections of so-called “types”—which we now think of
as classes representing types—that name and describe particular
implementations of representations of values. We can use these implementations
to build our systems. But if, at analysis time, we ignore these implementations
and focus only on specifying the sets of values to be represented—types in the
COMN sense—we can specify our systems’ requirements—the “what”—
without even a glance at what particular implementation systems provide for us.
For example, if we need some variable to range between -1 and 100,000, we can
specify that as a type, and defer until later the exact choice of an implementation
of some class whose objects can represent just those values. We can also specify
that type without recourse to the arbitrarily distinct idea of a so-called “domain”
supported by some E-R modeling tools. These modeling tools need the concept
of “domain” in addition to the concept of “type” because they’ve hard-wired
“type” to the fixed set of mostly simple types provided by DBMS
implementations. If, instead, types have nothing to do with implementations,
then a type is a type is a type, whether it is directly supported by an
implementation out of the box or will require some programming. The E-R
modeling concept of “domain” is just redundant.
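As an illustration only (COMN does not prescribe any particular programming construct for this), the type “the integers from -1 through 100,000” can be specified purely as a set, with the choice of an implementing class deferred until later. The class name and structure below are assumptions made for the sketch.

// A sketch: the type is just the designated set of values; the storage decision
// (here, a Java int inside a wrapper class) is a separate, later choice.
public final class BoundedCount {
    public static final int MIN = -1;
    public static final int MAX = 100_000;

    // Membership test for the type, independent of any storage decision.
    public static boolean inType(long candidate) {
        return candidate >= MIN && candidate <= MAX;
    }

    private final int value;   // implementation choice: a 32-bit int comfortably represents the set

    public BoundedCount(long value) {
        if (!inType(value)) {
            throw new IllegalArgumentException("outside the designated set: " + value);
        }
        this.value = (int) value;
    }

    public int value() { return value; }
}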
In addition to the simple type starter kits provided to us by DBMSs and
programming languages, we often need to make up our own simple types. One
of the most common of these is an enumeration. An enumeration is a type that is
specified by listing the names of the members of the set it designates. Here are
some example enumerations:
account status: open, closed, suspended, abandoned
organization type: corporation, government entity, non-profit
order status: ordered, shipped, back-ordered, canceled
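In a programming language, the account status enumeration might look roughly like this. It is a sketch only, using the values listed above.

// Each member of the enumeration is a simple value: a name for a concept with no components.
public enum AccountStatus {
    OPEN,
    CLOSED,
    SUSPENDED,
    ABANDONED
}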
In general, enumerations have no components. Now, their representations do:
the example enumerations listed above represent enumeration values with words
and phrases which are composed of letters and punctuation. But what these
representations represent have no components. For instance, an account status of
“open” can’t be broken down into any constituent parts. Likewise, an order
status of “shipped” has no components. Don’t confuse the value, which is
simple, with information about what these values represent. For instance, we can
learn of the date on which an account was opened, or the reason an order was
canceled. But the enumerated values that these data are about, “open” and
“canceled”, are simple values.
Figure 11-2 below shows a COMN diagram for account status. Such a drawing
is most useful for enumerated types that designate relatively small and stable sets
of values. Stable enumerated types of those sorts can be extremely important in a data design, as they enable distinct parts of a system to communicate with each other. For larger and/or more fluid enumerated types, the type names are often
kept in a database table. (There are well documented standard techniques for
managing such lists of reference values in databases.) For the more fluid
enumerated types, a model will typically just show the type rectangle and omit
the enumerated values.
The rectangles and rounded rectangles in Figure 11-2 are dashed because they
represent concepts, and are in bold outline because they represent the concepts in
the real world, not as expressed in data. The lines crossing through the shapes
indicate that these are a simple type and simple values, having no components.
Figure 11-3 is just a small representative sample of how COMN supports high-
level analysis, decisions about representations for information, and physical
design decisions. We will look at representation more closely as part of chapter
12, when we look at composite types.
As we will see in chapter 13, the type/class split makes it easier to work with
subtypes, which are a powerful tool for analysis and design.
Key Points
Classification is an innate human activity. When stripped of their technical meanings, the English words
“type” and “class” are synonyms, and are used to designate sets of things with similar characteristics. We say
that types designate sets.
The word “type” was co-opted by the information technology industry to express both a potential set of values
and memory storage requirements for representations of those values.
The word “class” grew up later to describe more complex structures than those that could be described directly by the types of earlier decades. “Type” alone took on the connotation of being simple or “primitive”; data types were also considered primitive.
In COMN, we keep the programming-language concept of a class, which is very physical. We strip any notion
of physicality from the concept of a type, and use types solely to designate sets.
Classes may optionally declare that they represent types.
Our type/class split enables us to specify systems in terms of types without reference to any default or implicit
representations or implementations. This enables us to specify systems in highly portable and machine-
independent ways, and defer all implementation considerations to a later stage of design.
Chapter Glossary
simple type : a type that designates a set whose members have no components
composite type : a type that designates a set whose members have components
Chapter 12
Composite Types
In the previous chapter we have seen how very basic types, such as integer types,
are simple—having no components—but classes describing software objects are
always composite. In this chapter we will dig into types that have components—
so-called composite types—which actually dominate the work of data modeling.
The name of the first component of the UK NINO Record type, the Person
National Insurance Number, is followed by the letters “PK” in parentheses. This
means that it is a component (in this case, the only component) of the primary
key of the record type. A key is a component or set of components whose values
are always unique in any set of records of the type. Without a key, records in a
set of records can be difficult or impossible to distinguish from each other.
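As a rough sketch of how a key works in code, the following Java example keeps a collection of records keyed on the Person National Insurance Number so that key values stay unique. The record’s components other than the key are invented for illustration and are not taken from the book’s figure.

import java.util.HashMap;
import java.util.Map;

// A sketch only: a logical record type whose primary key is the
// Person National Insurance Number, and a collection that keeps key values unique.
public class UkNinoRecordCollection {
    public record UkNinoRecord(String personNationalInsuranceNumber,  // PK
                               String personName,                     // illustrative component
                               String dateOfBirth) {}                 // illustrative component

    private final Map<String, UkNinoRecord> recordsByKey = new HashMap<>();

    public void add(UkNinoRecord record) {
        // Two records may never share a key value.
        if (recordsByKey.putIfAbsent(record.personNationalInsuranceNumber(), record) != null) {
            throw new IllegalArgumentException(
                "duplicate key: " + record.personNationalInsuranceNumber());
        }
    }

    public UkNinoRecord findByKey(String nino) {
        return recordsByKey.get(nino);
    }
}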
The bottom section of the rectangle has lines crossing through it. This notation
asserts that this type has no methods. When a composite type has no methods, it
does not encapsulate its components. They are visible and directly manipulatable
by all. Now, encapsulation is incredibly valuable. By controlling what routines
can access or modify the component objects of a software object, encapsulation
makes software much simpler, and therefore easier to write with fewer bugs. Encapsulation has led to a significant increase in
the reliability of software, and a concomitant decrease in the cost of software
development. But upon reflection, one realizes that the value of encapsulation is
related to the encapsulation of mechanisms. We want to limit the routines which
can operate the internal mechanisms of an object. But in the case of data, we
actually want data to be visible to others—we don’t want to hide it as we want to
hide internal mechanisms. “Information hiding” à la David Parnas [Parnas 1972] should be about hiding information about mechanisms, not about hiding information per se.
So, when we define a logical record type, we typically don’t define any
particular methods. We might do so at a higher level than an individual logical
record type. For example, we might have a higher-level type that references
several different record types, and that provides mechanisms to manipulate
groups of records of those various types in ways that ensure that their values
remain consistent with each other. By encapsulating access to groups of related
records of different types, and allowing only the methods of the encapsulating
type to access them, we achieve all the benefits of encapsulation, without forcing
the overhead of encapsulation on the very methods designed to manipulate
instances of those record types. We also enable powerful relational operations
(see chapter 16) which are not possible on encapsulated data. This form of
composition of records and objects through reference is composition by
assembly.
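Here is a rough sketch of that idea in Java. The order and order-line record types are invented for illustration and are not the book’s example: a higher-level type encapsulates two related record collections and exposes only methods that keep them consistent, while the record types themselves remain unencapsulated.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A sketch of composition by assembly with encapsulation at a higher level:
// the order records and order-line records are not individually encapsulated,
// but only the methods of OrderStore may change the two collections together.
public class OrderStore {
    public record OrderRecord(String orderId, String customerId) {}
    public record OrderLineRecord(String orderId, int lineNumber, String productId, int quantity) {}

    private final Map<String, OrderRecord> orders = new HashMap<>();
    private final List<OrderLineRecord> lines = new ArrayList<>();

    public void addOrder(OrderRecord order) {
        orders.put(order.orderId(), order);
    }

    // Adding a line checks that its order exists, keeping the two collections consistent.
    public void addLine(OrderLineRecord line) {
        if (!orders.containsKey(line.orderId())) {
            throw new IllegalStateException("no such order: " + line.orderId());
        }
        lines.add(line);
    }
}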
We’ve seen rectangles in COMN before, but the UK Person Type rectangle is
drawn differently: it has a solid bold outline. A solid outline indicates that the
type represented designates objects, not concepts, and people are material
objects, because they can be perceived by the senses. The bold outline indicates
that these objects exist in the real world and not in the computer. A person is a
material object but not a computer object. We named this type UK Person Type
because it’s not the type of all persons, but only the type of persons known to the
UK Department for Work and Pensions.
In between the UK NINO Record Type and the UK Person Type we have a
dashed hexagon with a shadow labeled the UK NINO Record Collection. A
dashed hexagon is the COMN symbol for a variable, and the unadorned line
connecting it to the UK NINO Record Type tells us that that is the type of the
variable. The shadow indicates a collection, so the shadowed hexagon represents
a collection of variables. Since we’re focused on logical data design and not
implementation, we haven’t decided yet whether the collection will be
represented by a table in a SQL database, a collection of documents in a NoSQL
document database, or something else. In any case, we think of this collection of
records as a collection of variables, where each variable has the UK NINO
Record Type. Each variable will eventually be a table row, a document, or
something similar.
The representation line from the UK NINO Record Collection to UK Person
Type indicates that the record collection represents the UK Person type. This
means that each value of a record in the UK NINO Record Collection represents
a UK Person. Since we know that the key of a UK NINO Record—the Person
National Insurance Number—is always unique, we know that in fact it is the
Person National Insurance Number that represents, or identifies, a UK Person.
It is not true that every NINO identifies a UK Person. It is only a NINO that has
been assigned to a person, as recorded in a UK NINO Record, that identifies a
UK Person. This is where the NI Number Type on the right of Figure 12-2
comes in. The NI Number Type designates the full set of NINO numbers,
whether assigned or not. The full set is defined as all those strings of characters
starting with two letters, then six decimal digits, and finally one suffix letter,
minus certain prohibited combinations as defined by the UK Department for
Work and Pensions. NINOs to be assigned to people must be drawn from this
set.
Let’s look more closely at the relationship from the UK NINO Record Type to
the NI Number Type. The definition of the UK NINO Record Type includes a
component, the Person National Insurance Number, whose type is NI Number
Type. The UK NINO Record Type incorporates the NI Number Type by
aggregation; in other words, the NI Number Type is part of the UK NINO
Record Type and can’t be separated from it, although it remains a recognizably
separate part. The line with the solid arrowhead pointing from the UK NINO
Record Type to the NI Number Type indicates this. As is standard in COMN
notation, arrowheads always point in the direction of reference. The UK NINO
Record Type mentions the NI Number Type, and not the other way around.
We can now see that we have two subtly different sets of values that have two
different functions. The NI Number Type designates a set of strings of characters
in a certain format. The UK NINO Record Collection includes a subset of all
possible NI Number Type strings, specifically only those NI Number Type
values actually assigned to identify persons. This subset of NI Number Type
values is what represents or identifies the UK Person Type.
In striving toward our goal of efficient development of reliable systems, both the
UK NINO Record Type and the NI Number Type are valuable. The NI Number
Type can be used at points of data entry to ensure that character strings entered
as purported NINOs are at least in the right format, whether or not they are
assigned to anybody. This level of type checking is a valuable first line of
defense in ensuring high data quality, and might be all that’s possible if access to
the UK government’s authoritative NINO database isn’t readily available. To be
really sure that a NINO identifies a UK Person, one must look in the
authoritative UK NINO Record Collection to find a matching value there.
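A sketch of that first line of defense might look like this in Java. The pattern encodes only the general format described above (two letters, six digits, one suffix letter) and does not attempt to encode the Department for Work and Pensions’ list of prohibited combinations.

import java.util.regex.Pattern;

// A sketch only: format checking for the NI Number Type.
// Passing this check means the string is in the right format,
// not that it has actually been assigned to a person.
public final class NiNumberFormat {
    private static final Pattern FORMAT = Pattern.compile("[A-Z]{2}[0-9]{6}[A-Z]");

    private NiNumberFormat() {}

    public static boolean looksLikeNino(String candidate) {
        return candidate != null && FORMAT.matcher(candidate).matches();
    }
}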
In place of the logical record type rectangle, the record collection shadowed
hexagon, and the UK Person type rectangle, an E-R data model would show a
single rectangle. In an E-R data model the rectangle would be called an “entity”,
but it would in fact represent three things simultaneously: the record type, the
actual collection of records, and the real-world objects represented by the
identifiers on the records. An equivalent model could be drawn in COMN using
a single shadowed hexagon with components recorded directly in it, as long as
the details of the representation relationships were not important—and we will
do exactly this in chapter 13. But for the representation of reusable composite
types such as measures, this separation is essential. We will also see in chapter
15 how separating the three things can give us important insights into our data.
On the far right we show the integer type of the Ordinal component of ASCII
Type, but this is for illustration purposes only. The model is complete without
this rectangle.
Since ASCII Type is a type and not a class, no storage allocation has been
specified. We need a class before there’s anything to implement in a computer.
A class implementing ASCII Type would quite reasonably store each character
code in a byte, but the methods of the class would limit the byte to entering only
128 of its 256 possible states. The other 128 states would have no meaning in
this usage. If we wished to show this level of detail, we would draw a class that
represents the ASCII Type, and show that its only component is an integer class
having a byte component.
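A sketch of one such class might look like the following; it is illustrative only and is not the book’s figure. The stored component is a byte, but the methods admit only 128 of its 256 possible states.

// A sketch only: a class representing ASCII Type.
public final class AsciiChar {
    private final byte code;   // component object: one byte of storage (256 possible states)

    public AsciiChar(int ordinal) {
        // Only 128 of the byte's states are given meaning in this usage.
        if (ordinal < 0 || ordinal > 127) {
            throw new IllegalArgumentException("not an ASCII ordinal: " + ordinal);
        }
        this.code = (byte) ordinal;
    }

    public int ordinal() { return code; }

    public char asCharacter() { return (char) code; }
}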
By reusing the Currency Amount Type, the designer of the Foreign Exchange
Transaction Record Type does not have to define four variables as components
of Foreign Exchange Record, but only two. The designer does not have to look
up how many decimal digits the organization wishes to use when recording
currency amounts. That standard has already been built into the Currency
Amount Type, and the designer merely has to reference the composite type in
order to incorporate that standard. Finally, the names of the two variables reflect
their role in the record type, and aren’t lengthened or complicated by the names
and types of the individual components of Currency Amount Type. All these
benefits are in addition to the additional type safety gained by encapsulating the
components of the measure.
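A rough sketch of this reuse, with illustrative names and simplified components, might look like this in Java:

import java.math.BigDecimal;

// A sketch only: a reusable Currency Amount Type incorporated twice,
// by aggregation, into a Foreign Exchange Transaction Record Type.
public class ForeignExchangeExample {
    public record CurrencyCode(String alpha3) {          // e.g. "USD": three characters (ISO 4217)
        public CurrencyCode {
            if (alpha3 == null || alpha3.length() != 3) {
                throw new IllegalArgumentException("currency code must be three characters");
            }
        }
    }

    // The organization's standard number of decimal digits would be enforced here, once.
    public record CurrencyAmount(CurrencyCode currency, BigDecimal amount) {}

    // The component names reflect their roles in the record type.
    public record ForeignExchangeTransactionRecord(CurrencyAmount boughtAmount,
                                                   CurrencyAmount soldAmount) {}
}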
In a large or even a medium-sized enterprise, there are typically hundreds of
composite types that need standardization and thousands of opportunities to re-
use those standard types. Examples of such composite types include postal
addresses, personal names, telephone numbers, and various identifiers: the list
could go on for quite some time. Most data modeling notations either can’t
express reusable composite types, or can but insist that their components always
be encapsulated. (For the reasons why, review the relevant chapters in part II). In
COMN, expressing these things is straightforward.
Incidentally, the world of database design has standardized on names to use with
measures, to make it easier to judge from the name of a component what it
indicates. By convention, a measure’s name ends in one of the three words count, quantity, or amount, as follows:
count: an integral number of things that are counted; for example,
Access Attempt Count
quantity: a possibly fractional number of things that are measured, not
counted, including the result of statistical functions on counts; for
example, Order Item Quantity, Average Children Per Family
Quantity, Fuel Capacity Gallon Quantity, Distance Kilometer
Quantity
amount: a quantity of some currency; for example, Price Amount
NESTED TYPES
Types that represent other types are only useful if they are incorporated into
composite types as the types of some components. Measures are composite types
that are most useful when incorporated into other composite types by
aggregation, as the Currency Amount Type was incorporated twice into the
Foreign Exchange Transaction Record Type.
There is nothing that says that this composition by aggregation must be limited
to a single level. It can go on for as many levels as are useful. We call this
nesting of types. In Figure 12-6 below, we have nesting to four levels, as
follows:
ASCII Type is nested three times inside Char{3}.
Char{3} is nested inside ISO 4217 Currency Code Type.
ISO 4217 Currency Code Type is nested inside Currency Amount
Type.
Currency Amount Type is nested (twice) inside Foreign Exchange
Transaction Record Type.
Figure 12-6. Nested Types
MODELING DOCUMENTS
Some vendors offer what they call “document databases”, which are presumably
structured in such a way that they can efficiently store the electronic equivalents
of what we would recognize in printed form as documents: contracts, tax forms,
papers, even entire books. A document in this parlance is a composite type, and
should be modeled in COMN as such. Documents often include nested types,
and as we have just seen, these can be modeled in a straightforward manner in
COMN.
The eXtensible Markup Language (XML) is a common form for exchanging
documents in electronic form. See Figure 12-7 for a snippet of an XML
document. The names enclosed in angle brackets are called tags, and constitute
the markup of what is otherwise plain text. Most tags come in pairs with text
between the start tag and end tag, and the whole construction is called an
element. For example, in Figure 12-7 the plain text “Chapter 1” is surrounded
by the start tag <title> and the end tag </title>. Elements can nest. For example, the Chapter 1 title is nested inside a <chapter> element. The same <chapter>
element also contains two <para> elements. The <chapter> element is nested
inside the <book> element.
<?xml version="1.0" encoding="UTF-8"?>
<book xml:id="simple_book" xmlns="http://docbook.org/ns/docbook" version="5.0">
  <title>Very simple book</title>
  <chapter xml:id="chapter_1">
    <title>Chapter 1</title>
    <para>Hello world!</para>
    <para>I hope that your day is proceeding <emphasis>splendidly</emphasis>!</para>
  </chapter>
</book>
{
  "firstName": "John",
  "lastName": "Smith",
  "isAlive": true,
  "age": 25,
  "address": {
    "state": "NY",
    "postalCode": "10021-3100"
  },
  "phoneNumbers": [
    { "type": "home" },
    { "type": "office" }
  ],
  "children": [],
  "spouse": null
}
Figure 12-8. A JSON Text
As with XML, a JSON text may or may not have its type described by some
other document using a schema language such as JSON Schema. As with XML,
COMN can directly express exactly what a JSON schema language can express,
using nested composite types—something that, again, cannot be done in E-R
notations or in fact-based modeling.
JSON is often compared to XML as a more efficient language with the same
expressive power. This is not quite accurate. The confusion has arisen because
XML has been heavily used as a data interchange language, although that was
not its original design intent. XML is a markup language, which means that it
is focused primarily on adding annotations to human-readable text; those
annotations are most often used to express the meaning or significance of the
text that they mark up. In contrast, JSON is a language for expressing data,
which might include human-readable text as data but not marked-up text in the
same sense as XML. It is unfortunate that the term “document” is commonly
used to describe a piece of JSON text. The JSON spec never uses that term, and
simply refers to “a JSON text”.
Notwithstanding the confusion between “a JSON text” and “document”, COMN
can be used to model a JSON text’s type as a composite type.
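For instance, the type of the JSON text in Figure 12-8 could be sketched as nested composite types along these lines. This is a sketch only; the element type of the empty children array and the type of the null spouse are assumptions.

import java.util.List;

// A sketch only: nested composite types mirroring the structure of Figure 12-8.
public class PersonJsonTypes {
    public record Address(String state, String postalCode) {}

    public record PhoneNumber(String type) {}

    public record Person(String firstName,
                         String lastName,
                         boolean isAlive,
                         int age,
                         Address address,                 // a nested composite type
                         List<PhoneNumber> phoneNumbers,  // an array of a composite type
                         List<Person> children,           // assumed element type
                         Person spouse) {}                // null in the example text
}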
ARRAYS
An array is a special kind of composite type. It consists of some non-negative
integral number (possibly zero) of variables all of a single type. Each variable is
called an element of the array. The entire collection of variables is known by the
name of the array, and each element within the array is identified by an integer
known as its element number or index.
An array type is defined by the element type plus the range of possible numbers
of elements that may be possessed by a variable of the array type. The possible
numbers are called the multiplicity of the array. (The actual number of elements
in any particular array variable or value is called its cardinality.) Here are some
example array multiplicities and the COMN notation for expressing them:
a plus sign (“+”), indicating that one to any integral number of
elements may occur
an asterisk (“*”), indicating that zero to any integral number of
elements may occur
integer expressions enclosed in a pair of curly braces (“{” and “}”)
giving the possible numbers of elements. The expressions can take
the following forms:
a single positive integer, indicating exactly that many elements
will occur; for example, “{3}”
a range of integers specified as two non-negative integers
separated by a hyphen; for example, “{0-2}”
a comma-separated list of non-negative integers giving allowable
numbers of elements; for example, “{0, 2, 4, 6}”
any combination of number ranges and non-negative integers; for
example, “{0, 2-5, 9}”
Arrays can be represented in COMN diagrams in two ways:
When a type or class is depicted with a rectangle having three
sections, the multiplicity of a component can be indicated using one
of the above expressions after the element’s type.
When one type or class is composed of another by either aggregation
or assembly, the arrowhead pointing to the element type or class may
have a multiplicity expression next to it, at the element type end.
One kind of array we can’t live without is the character string. We’ve already
seen a three-character array type, Char{3}, as a component of the ISO 4217
Currency Code Type. It is quite common to use variable-length character strings
to represent human-readable text in various contexts. For example, you might
see character string components defined like this:
Person Last Name: ASCII Type{1-200}
Product Name: Unicode Type{1-1000}
Postal Code: Unicode Type{2-50}
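As a sketch of what such a multiplicity means for one of these components, the check below simply enforces the {1-200} length range on Person Last Name; the restriction of the element type to ASCII characters is omitted.

// A sketch only: a variable-length character string component with multiplicity {1-200}.
public final class PersonLastName {
    private final String value;

    public PersonLastName(String value) {
        if (value == null || value.isEmpty() || value.length() > 200) {
            throw new IllegalArgumentException("Person Last Name must be 1 to 200 characters");
        }
        this.value = value;
    }

    public String value() { return value; }
}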
These simple arrays are heavily used in data design. However, an array’s
element type can be of arbitrary complexity. We can have arrays of measures
(perhaps a series of sensor readings), arrays of records (hmm, that sounds like a
table!), and, since an array is a composite type, we can have arrays of arrays if
we find that useful.
Key Points
Logical record types are composite types.
We must be careful not to encapsulate every logical record type with methods. It may be better to encapsulate
a higher-level logical record type that has exclusive access to logical records it references.
By separately representing a logical record type, a collection of records of that type, and real-world entities
represented by the collection of records, we can see how identification really works. We can see that it is the
set of identifier values in a collection of records that identifies real-world entities, and not the type of the
identifier itself.
COMN supports stepwise refinement, which is the gradual addition of detail to a model.
It is common to use one type to represent another type.
Composite types are a wonderful means to standardize representations, ensure correct operations on values of
the types, and enable reuse of the correct and standard representations.
It is normal for types to nest, though E-R notations and SQL cannot express nesting.
Documents are nested types.
Chapter Glossary
logical record type : a composite type that is intended to be used as the type of data records stored singly or in a collection of records
measure : a composite type consisting of a number and a type of thing being measured or counted
identifier : any value that represents exactly one member of a designated set
array : a collection of some integral number of variables or objects of the same type or class
REFERENCES
[Parnas 1972] Parnas, D. L. “On the Criteria to be Used in Decomposing Systems into Modules.”
Communications of the ACM, 15, 12. New York: Association for Computing Machinery, 1972, pp. 1053-
1058.
Chapter 13
Subtypes and Subclasses
Before the type/class split, we could consider the terms “subtype” and “subclass”
to be synonyms. But now that a type designates a set while a class describes a
computer object, these two terms take on distinct meanings. Both meanings are
quite useful, separately and together.
SUBTYPES
The modern era of biological classification started in about the Sixteenth
Century, as biologists began to recognize common characteristics across classes
of animals, and began to create “superclasses” of animals. For instance,
elephants, lions, and tigers were all classified as “mammals”. Lions, tigers,
jaguars, and other similar mammals were recognized as a subclass of mammals
called “cats”. A full taxonomy (system of classification) was developed by a
number of scientists, and refined over time.
There are many taxonomies that classify many things besides animals. For
example, there are systems for classifying currency (for example, hard and soft),
crimes (for example, misdemeanors and felonies), and passenger cars (for
example, 2-door, 4-door, and SUV). There can even be multiple classification
systems for a single set of things. For example, here are just a few ways the
cards in a standard deck of 52 playing cards can be classified:
by suit: diamonds, hearts, spades, clubs
by suit color: red, black
by rank: face card, number card
Many of the most fascinating card games classify playing cards by complex
criteria, and even by dynamically changing classification criteria. For example,
some games allow a player to designate a suit of cards to be “trump”, making all
cards of that suit rank higher than any other card for the duration of one round
(hand) of the card game. In the next round, the trump might be different.
A subset is a set of things drawn from a larger set of things. Just as a type
designates a set, a subtype designates a subset. A subtype is always related to
the type that designates the larger set; that is its supertype.
As an example, let’s consider the nature of some forms of government. Figure
13-1 shows a brief taxonomy of forms of government. All of the shapes and
connecting lines are both dashed and bold. They are dashed to show that they are
about concepts, not material objects. They are bold because they are about real-
world concepts, not data concepts.
Figure 13-1. A Taxonomy of Forms of Government
You’ve seen the type rectangles before. The new shape in this figure is the
pentagon, which depicts a restriction relationship. The wider side of each
pentagon is towards the type that designates the set with more members; in other
words, the supertype. The pointed side of each pentagon is towards the type that
designates the set with fewer members; in other words, the subtype. For each
subtype, there is some restricting condition, not directly modeled, that
determines whether a member of the supertype is included in the subtype. For
instance, it’s clear from the labeling of the type rectangles that, out of the set of
all possible forms of government, only those governments where the individual
is considered to be greater than the state are in the set designated by the type,
“Form of Government Where Individual Greater Than State”.
The X in each pentagon indicates that the subtypes connected through it to a
common supertype are exclusive of each other. In other words, a given member
of the set designated by one subtype is not designated by the other subtype.
It is very common to define subtypes as restrictions on supertypes. But we can
go the other way around, too. For example, we can define the type called “alphanumeric character” as a supertype of the types “letter” and “digit”.
Figure 13-2 is a model of the alphanumeric character type. The symbols are
mostly the same as in Figure 13-1. Let’s look at the differences.
Figure 13-2. Alphanumeric Character as a Supertype of Letter and Digit.
First of all, the shapes and connecting lines in this figure are solid and bold.
They are solid because characters are material objects. An individual character
doesn’t exist unless it can be seen. It has to exist as some relatively stable
configuration of matter. It could be ink on paper, or liquid crystals on a computer
display, or even objects on a flat surface juxtaposed to form characters. The
COMN shapes are bold because the material objects being described exist
outside a computer’s memory. Something we refer to as a character that’s inside
a computer’s memory exists only as a representation of a character, and not a
character itself. Those kinds of characters don’t get bold outlines.
The second difference is that there is an arrowhead on the line from
Alphanumeric Character to the pentagon. Arrowheads on lines in COMN always
indicate a direction of reference. This arrow says that the Alphanumeric
Character type references the Letter and Digit types, and not the other way
around. This is what tells us that Alphanumeric Character is defined in terms of
its subtypes. Such a supertype is called a union type, because the set it
designates is the union of the sets designated by its subtypes. When a type is
defined like this, it is more accurate to call the relationship represented by the
pentagon an inclusion relationship, especially since there is no restriction in
effect. But the sub/supertype relationships that result are the same as those in a
restriction relationship.
Figures 13-1 and 13-2 show simple strict type hierarchies, where each type has
only one supertype, except for the type at the top, which has no supertype. But
not everything is that simple, and it is a major mistake in analysis to force all
things into a single type hierarchy with a single root. To illustrate, let’s go back
to our deck of playing cards. Realize that any standard deck of playing cards can
be divided in half based on the color of the suits (excluding jokers): all cards in a
red suit (hearts and diamonds) can be put in one pile, and all cards in a black suit
(spades and clubs) can be put in the other pile. We could then further subdivide
the two piles by the four suits. Figure 13-3 shows this type hierarchy. The shapes
and lines are in bold outline because they describe real-world things. Since the
things they describe are material objects, they are in solid outline. The rectangle
at the top represents any playing card, regardless of suit, color, or rank. Each of
the middle two rectangles represents the class of card whose suit is in one of the
two colors. The rectangles at the bottom reflect the four suits.
This is not the only way to divide a deck of cards. Figure 13-4 shows an
alternative way of classifying playing cards, by type of rank: face card (king,
queen, and jack) and number card (ace through ten).
Figure 13-3. Playing Cards Divided into Suits
Figure 13-5 shows a deck of cards classified by all of the criteria above. Now we
can see that, although a given classification system may be complete, there can
be multiple classification systems side-by-side.
The symbols at the bottom of this figure deserve some attention. Two particular
cards are shown: jack of hearts and nine of diamonds. But each is shown twice,
once as a type and once as an object. That’s to emphasize that “jack of hearts”
and “nine of diamonds” are still types of cards, not individual cards. Yes, it’s
true that in a single deck of cards there will be only one card which is a jack of
hearts and one card which is a nine of diamonds. But there is more than one deck
of cards in the universe, and each deck contains one of those cards, so “jack of
hearts” and “nine of diamonds” identify, not just one card, but a potentially
unlimited set of cards. If we want to talk about hypothetical single cards, we
show them using object symbols with the name beginning with the indefinite
article, “a” (or “an”): “a jack of hearts” and “a nine of diamonds”.
Figure 13-5. Playing Cards Divided by Multiple Criteria
Each of the types “jack of hearts” and “nine of diamonds” is connected to two
type hierarchies, the hierarchy that classifies by suit and the hierarchy that
classifies by rank. This shows that a type can have more than one supertype.
This is sometimes referred to as multiple inheritance, but we’ll avoid that term
for now.
These illustrations of subtype/supertype relationships specified by restriction and
inclusion have used real-world concepts and objects as examples, rather than
data and computer objects. Subtype/supertype relationships work the same with
data and computer objects. Since subtype/supertype relationships have to do
with selecting members from supersets or combining members from subsets, no
alterations to the structure of the types occur as a result of these operations.
Because of this, no issues arise when an individual object (or concept) has
multiple types. The problems of multiple inheritance only arise when one is
dealing with subclassing, which we’ll review in the next section.
Restriction is Subtyping
Recall that we said in chapter 4 that a type could be defined in a number of
ways, including:
by selection
by enumeration
by generation
By far the most common method of defining a subtype is by selection from a
supertype using some criteria. The criteria restrict which members of the
supertype may be members of the subtype.
It turns out that every restriction defines a subtype. If you scan through any data
system design, you will find many data definitions that are defined as restrictions
on more general data definitions. These are all subtypes. Here are some
examples:
An edit control on a user interface page normally accepts any
character, but code on the page restricts the control to accepting only
decimal digits. The restricted edit control only accepts values of type
“numeric string”, which is a subtype of “string”.
A field in a database is defined with the database’s built-in type of
“string”, but a constraint is defined on the field such that only strings
matching a certain regular expression can be stored in it. The regular
expression defines a subtype of string.
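As a minimal sketch of this idea (Python is used here purely for illustration; the names and the particular regular expression are assumptions, not anything the examples above prescribe), the regular expression designates the subtype "numeric string", while the built-in string type is the supertype:

import re

# Supertype: any string. Subtype: strings matching the restricting regular
# expression -- here, strings consisting only of decimal digits.
NUMERIC_STRING = re.compile(r"^[0-9]+$")

def is_numeric_string(value: str) -> bool:
    """True if value is a member of the subtype 'numeric string'."""
    return bool(NUMERIC_STRING.match(value))

assert is_numeric_string("0042")      # a member of the subtype
assert not is_numeric_string("4a2")   # a member of the supertype 'string' only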
Here is a powerful analysis technique: When analyzing any system or its
requirements, look for expressions that restrict possible values, and label each such expression a subtype. Then identify the set of values being restricted, and label the expression defining that set the supertype.
SUBCLASSES
We know that types designate sets while classes describe objects. Let’s see how
the physicality of objects affects what it means to have a subclass.
Recall from chapter 10 that the world of object-oriented programming defined
the term “class” as a description of something in the memory of a computer. In
object-oriented programming, a subclass is derived from a base class. The
subclass includes (“inherits”) all of the components defined by the base class,
and adds its own components to them. The methods of the base class continue to
be valid for operating on objects of the subclass, but since they were written
without knowledge of the subclass, they won’t operate on any components of
objects that are only defined in the subclass. It is common practice for a subclass
to override many of the methods inherited from the base class, in order to extend
them to operate on components added by the subclass in addition to those of the
base class.
This description of the structure of subclasses is complete without making any
reference to meaning. The world of object-oriented programming has put a
strong and fixed set of ideas on the meaning of subclasses, but we are going to
keep those ideas on the side for now, and come back to them in the next section
of this chapter. For now, we will focus only on the physicality of objects and the
mechanism of subclassing. This is consistent with COMN’s view that, at their
base, computer objects and their states have no intrinsic meaning.
A type designates a set. A class describes objects, but by so doing also
designates a set, which is the potential and/or actual set of objects described by a
class. The burning question, then, is, is a subclass equivalent to a subtype as
described in the previous section?
To examine this question, let us consider the following base class and subclass:
The base class is Circle. This class describes an object representing a
circle drawn on a graphical display. It holds three values: the X and Y
coordinates of the center of the circle, and the radius of the circle.
This is all the data that is needed to draw the circle on the display.
The Circle class does not support the concept of color. A circle is
always drawn in white on a black background. The methods of the
class are simply setPosition (X, Y) and setRadius(r).
The derived class is Colored Circle. This class inherits the X, Y, and
radius components of Circle, and adds a fourth component, which is
color. The setPosition and setRadius methods still work, but now
there’s an additional setColor() method.
Figure 13-6 shows this design. The triangle between Circle and Colored Circle
indicates extension, where the extending class adds components to those already
in the base class, adds its own methods, and may override methods in the base
class[2]. The way to remember the direction is that the wider side of the triangle
is toward the class that adds components, while the narrower side of the triangle
is toward the base class.
Figure 13-6. Circle and Colored Circle
To the left of the Coffee Shop Person rectangle we have a dashed hexagon
representing the collection of logical records that will hold data about these
persons. This is our first example of a hexagon divided into two parts. The top
part contains the name of the collection of records, and the bottom part contains
the names and types of the components of each record. This form of COMN
enables one to model a set of composite variables or objects without an
explicitly named type.
Each record in the Person Record Collection has three components: the
internally generated meaningless number, called Person ID, which we use to
distinctly identify each person; each person’s name as Person Name; and the
role(s) played by each person. The component name Person ID is followed by
the letters PK in parentheses, indicating that it is a key to any set of Person data.
The relationship line from Person to Coffee Shop Person is a representation
relationship. It is the values of the key of Person Record Collection, namely
Person ID values, which represent, or identify, Coffee Shop Persons.
The Person Role component of a person record is defined as an array of one to
two variables of type Person Role Type. This tells us all we need to know about
the Person Role component, but we’d like to know more about the Person Role
Type, so that has been drawn, connected to the Person Record Collection with a
line having a solid arrowhead, which indicates aggregation. When variables are
defined in-line in a data structure, they are juxtaposed, and their types are
effectively aggregated together in the structure’s type.
There are two values depicted as values of the Person Role type, namely
Customer Role and Employee Role. Each of the one or two elements of the
Person Role component of a person record can be bound to either of these role
values. We presume that there is a business rule that eliminates the possibility
that a person could be recorded as playing the employee role twice or as playing
the customer role twice. This rule is not expressed directly in this model, but it is
expressed indirectly through the relationships to the customer and employee
record collections, as we shall see.
Below the Person Record Collection we have two more record collections, one
each for Customers and Employees. These are connected to the Person logical
record type via a triangle, which is COMN’s symbol for extension. The wider
base of the triangle is on the side towards the types that extend the Person type,
in an analogy to the fact that these types have more components than Person.
Don’t read the triangle as indicating direction of reference. The direction of
reference is given by the arrowhead pointing to Person.
The multiplicity of the extension relationship is given differently above and
below the triangle symbol. Below the triangle, the text “{0-1}” indicates that a
customer record might or might not extend a person record, and an employee
record might or might not extend a person record. Above the triangle, the text
“{1-2}” indicates that a person record will be extended at least once and possibly
twice. The model therefore indicates that a person record must not be created
without creating at least one extending record. This prevents the system from
keeping records of data about persons who are neither customers nor employees
—probably a good thing.
The Customer and Employee logical record types include their own IDs, which
are in addition to Person ID. You can imagine customers with key-ring cards
with bar codes on them holding their customer IDs, and employees with
identification badges showing their employee numbers. Customers and
employees will typically not know their Person IDs.
It is an implementation question whether Person ID values will be carried on
Customer and Employee records, or whether two or three records of these types
will be aggregated, such that Person ID is immediately accessible. This design
decision is influenced by whether we implement in a SQL or NoSQL database.
We can document the logical design, and defer that physical question. We will
examine physical database design issues in chapter 18.
The Customer and Employee logical record types each hold data that is
applicable only to customers and employees, respectively. Only customers have
a Last Purchase Date, and only employees have a Hire Date. We speak of extension
whenever a composite type or a class is extended with additional components,
such that instances of the extending type or class may have more values or
states, respectively, than the base type or class.
This picture is completed by the representation lines drawn from the Customer
and Employee record collections to these Person Playing Role types. We see that
the Customer Record Collection represents the type of Person Playing Role of
Customer, and the Employee Record Collection represents the type Person
Playing Role of Employee. A Customer ID identifies a customer, and an
Employee ID identifies an employee.
Thus we see that extension and subtyping are very different but often exist side-
by-side.
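As an illustration only (Python dataclasses; the field names mirror the logical record types above, while the Python types and defaults are assumptions), the extension of the person record by the customer and employee records might be sketched like this. Whether Person ID is physically carried on the extending records remains the deferred physical question:

from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional

@dataclass
class PersonRecord:
    person_id: int                       # PK: internally generated, meaningless
    person_name: str
    person_roles: List[str] = field(default_factory=list)  # one or two roles

@dataclass
class CustomerRecord(PersonRecord):      # extends the person record
    customer_id: str = ""                # the ID printed on the key-ring card
    last_purchase_date: Optional[date] = None

@dataclass
class EmployeeRecord(PersonRecord):      # extends the person record
    employee_id: str = ""                # the ID shown on the badge
    hire_date: Optional[date] = None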
INHERITANCE
Both subtyping and extension are valuable because they enable us to define data
and write software that inherits components and/or methods of a supertype, base
type, or base class. The derived type or class can then be defined merely in terms
of what’s different from or in addition to the base. This makes the task of writing
software more efficient. It also makes software more reliable, because it is easier
to verify the correctness of a base type or class, and then separately verify the
correctness of a derived type or class assuming that the base is correct, than it is
to verify two separate but redundant implementations, especially when one
implementation—the one that would have been derived—is more complex than
the other.
Inheritance multiplies reuse when you realize that variables, values, and objects
of derived types or classes can sometimes be used in places where only the base
types or classes were contemplated in the original design. These are extremely
powerful mechanisms for expanding software and data reuse. But this kind of
reuse is subject to certain limitations—limitations that, it turns out, guarantee the
correctness of the reuse. Let’s look at these more closely.
Key Points
Subtyping and extension work in different ways and are often used together.
Just as a type designates a set, a subtype designates a subset.
Anything that creates a restriction on values, by format, size, or other means, defines a subtype.
A major mistake in analysis is to force things into a single hierarchy with a single root. There are no problems
with multiple type hierarchies and multiple supertypes of a single type when subtyping is only of the
restriction variety.
Extension adds components and/or methods to a base class or type. It only applies to composite classes and
types.
We avoid the terms “subclass” and “superclass” to avoid confusion with “subtype” and “supertype”. Instead,
we say “extending class” and “base class”. Similarly, when speaking of the extension of composite types, we
say “extending type” and “base type”.
Chapter Glossary
subtype : something that designates a subset of the set designated by another type, called the supertype
restriction relationship : a relationship between two types, where one type, called the subtype, is defined
in terms of a restriction on members of the set designated by the other type, called the supertype; inverse of
inclusion relationship
inclusion relationship : a relationship between types, where the supertype is defined as a union of its
subtypes; inverse of restriction relationship
extending class (or type) : a class (or type) that is defined in terms of another class (or type), called a base
class (or type), by adding components and/or methods to those already available in the base class (or
type)
extension : the addition of components to a base class or type; its inverse is projection
Chapter 14
Data and Information
We have been using the terms “data” and “information” throughout this book,
but do we know what they really are? This chapter lays the foundation for
understanding how data combines with predicates to form information, which is
the topic of the next chapter. Predicates generalize information. Meaning is
given by information, and meaning is the focus of semantics.
INFORMATION
When we attempt to define the words “information” and “data” in ways that are
more precise and yet compatible with natural language, we encounter problems
right away. Consider these definitions from Merriam-Webster.
information : FACTS, DATA
fact : a piece of information presented as having objective reality
Here we have a circularity: Information consists of facts, and a fact is a piece of information! Let's see if the word "data" can help us escape the circle.
data : information in numerical form that can be digitally transmitted or processed
So data is a kind of information: it's numerical information. That's fine, but we still don't know what information is!
Of course, I have deliberately selected these definitions from several possibilities
Merriam-Webster gives for each of these words, in order to show that, to some
extent at least, a good definition of “information” is hard to find. A study of the
alternative definitions of these words begins to widen the circle to include the
word “knowledge”, among others, but there is no strong definition of
“information” in this dictionary. Our task, then, is to develop a definition of
“information” that is precise, consistent, and useful as a building block.
It is at least indisputable that the word “information” refers to a mass quantity. It
is much like when we refer to “water”: we aren’t referring to any particular
quantity of water, and we certainly aren’t counting water molecules: we just
mean water en masse.
The human race did not need to know about water molecules before we could
benefit from or harness the power of water, but our understanding of the physical
world took a great leap forward when we learned of the existence of water
molecules, and in fact this understanding helped to usher in the modern era
where we have much greater control over the physical world. If we are to
understand information in a deep way, and truly gain control over it, it is
essential that we understand information at the molecular level, so to speak. We
already get value out of information, but by understanding information at the
molecular level, we will enable even greater insights and accomplishments. We
need to answer the question, What is the fundamental piece of information?
For the answer, I will look to the field of mathematical logic, specifically
propositional logic and first-order predicate logic, for the terms proposition and
predicate. We gain a tremendous advantage by linking the definition of
information to the field of logic, because we can harness all of the proven
techniques of formal logic systems to assist in information analysis and
processing. Propositional logic and predicate logic are what link the fields of
data and semantics.
Merriam-Webster’s defines the word proposition as follows: proposition 2 a :
an expression in language or signs of something that can be believed, doubted,
or denied or is either true or false For example, the statement, “It is raining
outside right now,” is a proposition, because at this very moment the statement is
either true or false—or at least one may argue about whether it is true or false.
(Perhaps it is only drizzling.)
A proposition is the most fundamental piece of information. A collection of
propositions constitutes information.
Let’s see what this means in practice. Here is a series of propositions, which I
believe most of us intuitively consider to be, collectively, information.
The Dow Jones Industrial Average is down 100 points today,
finishing the week 1% lower, at 10,194.
The Secretary of State will be representing the United States at this
year’s G8 Summit in Paris.
Average SAT scores were up this year, reversing a five-year trend.
According to a recent Gallup poll, over 45% of Americans are in
church every Sunday.
(In fact, each of these propositions is a compound proposition, because each
asserts more than one claimed truth. We will save consideration of decomposing
propositions—moving from the molecular level to the atomic level—until a later
time.)
Is Information Always True?
In natural language we sometimes speak of “false information”. The above
definition of information, as a collection of propositions, allows for the
possibility that information is false, since a proposition may be true or false.
Thus, our definition of information enables us to use the word in this natural-
language way. In contrast, the word “fact” carries with it the notion of truth.
When a supposed statement of fact turns out to be false, we don’t call it a “false
fact”; instead, we say that it is not a fact.
Given this definition of the word:
fact 5 : a piece of information presented as having objective reality — in fact : in truth (Merriam-Webster)
we can say that a fact is a proposition (a piece of information) that is true.
Those familiar with fact-based modeling, which was reviewed in chapter 7, will
recognize that the facts of fact-based modeling are propositions.
Data en Masse
We tend to deal with data en masse. That’s because the value of data processing
lies in the capability of computers to process large quantities of data. This is the
reason we see the singular form of the word, datum, so seldom. It is also the
reason that the word “data” has come to be treated as a mass noun, like “water”
and “information”: we treat it not as a plural noun (“the data are . . .”), but as a
singular noun (“the data is . . .”). In this perfectly legitimate usage, we ignore
that any quantity of data is composed of many elemental particles, in the same
way that we ignore that any quantity of water is composed of many molecules.
We need to accept both the singular and plural usages of the word “data”,
reserving the plural usage for more technical contexts where we are paying
attention to the fact that data is composed of multiple atoms, each of which is a
datum.
In order to deal with data en masse, we separate data from information. That is,
we reduce a number of propositions of the same form to a single predicate and a
set of data per proposition. We then store the data in a database management
system, which is a computer system specifically designed to manage large
quantities of data. In order to recover the original information, we must retrieve
the data from the database system and marry it to its associated predicate. This
latter operation is rarely done in an automated fashion. That is, it is usually done
by humans, outside any computer system. For instance, a worker in a human
resources department might bring up an employee’s record, and see, on a screen,
values labeled Employee ID, Salary, and Department. Because the values are
appropriately labeled on the screen, the human computer user understands the
implicit predicate and re-constitutes the original proposition in his mind
(“Employee #956 works in Department 4567 and earns a salary of $4000 per
month.”). Rarely does any so-called information system represent this whole
proposition. (Rarely does any so-called information system even represent the
predicate, but that is a topic for a later section.)
Variable Names
To make things
easy for ourselves, we humans typically try to choose variable names that
remind us of what the variables stand for—in the example above, we are
reminded by the variable names EmpId, DeptNr, and SalaryMnthUsdAm that
these variables stand for employee ID number, department number, and monthly
salary, respectively. But the computer attaches no such meaning to variable
names; in fact, it attaches no meaning to them at all. As far as the computer is
concerned, the predicate could be "Employee #X works in Department Y and earns a salary of Z per month," and everything would be just fine.
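A minimal sketch of this separation, assuming Python (the predicate wording and variable names come from the example above; everything else is illustrative): the predicate is stored once, the data is stored separately, and marrying them re-constitutes the propositions.

# The predicate, stored once, with named variables.
PREDICATE = ("Employee #{EmpId} works in Department {DeptNr} "
             "and earns a salary of ${SalaryMnthUsdAm} per month.")

# The data: one set of values per proposition, kept apart from the predicate.
rows = [
    {"EmpId": 956, "DeptNr": 4567, "SalaryMnthUsdAm": 4000},
    {"EmpId": 957, "DeptNr": 4567, "SalaryMnthUsdAm": 4200},
]

# Marrying the data back to its predicate re-constitutes the propositions.
for row in rows:
    print(PREDICATE.format(**row))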
It’s Just Data The term “data” is sometimes used as a pejorative term, to imply
that there is insufficient meaning or value in some data or information, and that a
context must be supplied for the data, or further analysis of the so-called data is
required. It is sometimes said, “This is just data; we need information.”
That which is referred to as “just data” might really be data in the strict sense, in
which case a context is definitely needed (more precisely, a predicate) in order to
understand what the data indicates. By the definition above, data is separate from
the context (a predicate) which gives it meaning.
If that which is referred to as “just data” is in fact information in the strict sense
—a set of propositions—then the complaint is saying either that not enough
supporting information has been supplied in order for the information to be
useful, or that analysis of the information (usually a large quantity of
information) is required in order to extract valuable insights from it.
DATA OBJECT
Whether or not something is a datum depends on the use to which the entity is
put. As the example above showed, the number 39 is just a number unless it is
known that it is intended to be substituted for a variable in a predicate. Strictly
speaking, then, there is no special “data object”. An object is a data object only if
its states represent values intended for use in a variable that is part of a predicate.
One may construct objects for dealing with data in general, but then such objects
will likely deal, not with individual objects representing individual values, but
rather with more complex objects representing logical records, tables, and other
data structures. Such objects are then indeed “data objects”, but modeling them
generically and distinctly from non-data objects probably has no value unless
one is designing a database management system. A value is data only if it plays
the role of data. In our next chapter we will focus on roles played by data.
Key Points
Data is dehydrated information—values separated from the variables of the predicates that give the data
meaning.
Propositional logic and predicate logic link data to semantics.
Calling some values “data” indicates that the values play a certain role. They are intended to be bound to the
variables of a predicate.
“Data” and “information” are mass nouns, like “water”. Mass nouns indicate some unknown plural quantity
but are treated as if they were singular.
Chapter Glossary
proposition : an expression in language or signs of something that can be believed, doubted, or denied or is either true or false (Merriam-Webster)
information : a collection of propositions
fact : a proposition that is true or believed to be true
predicate : short for logical predicate
logical predicate : a statement containing variables which, when the variables are bound, yields a proposition
datum : that which is intended to be given to a predicate as a value for one of its variables
data : plural of datum
analytics : information derived from other information or data
insight : information derived from analytics
structured data : collections of data items stored in a database that imposes a strict structure on that data
unstructured data : data representing text, audio, video or other data which have no structure imposed on what they represent
semi-structured data : collections of data items stored in a way that supports but does not enforce a structure
Chapter 15
Relationships and Roles
In this chapter we’re going to learn how COMN models express relationships,
how data plays roles, and how expressing these relationships as predicates makes
the connection from data to semantics. We’re going to start with a design that’s
not entirely clear, and then straighten it out based on what we’ve learned about
subtypes and predicates.
Aha, you say, a flight schedule! There’s just one very important piece of
information missing: are these departures or arrivals? You typically have to
search for a title over the display screen for that information. This shows that
two or more sets of data can have the same structure even though they are meant
for substitution into different predicates (that is, the logical predicates of chapter
14). Even though the structure of departure and arrival data is the same, there is
a great difference in meaning between the proposition that flight 351 is departing
for Charlotte at 11:05 AM and the proposition that flight 351 is arriving from
Charlotte at 11:05 AM. If you’re a passenger, confusing those two meanings can
result in a missed flight, and if you’re an air traffic controller, confusing those
two meanings can result in pandemonium!
Figure 15-1 shows the common Flight Schedule Record Type as the type of both
the Departures and Arrivals record collections.
Figure 15-1. Departures and Arrivals
Since we know that not every 3- or 4-digit number is an actual flight number, nor every string of up to 200 characters an actual city name, we want to keep
collections of the legitimate values. Before a record of data is stored in the
Departures or Arrivals collection, we can check these collections to ensure that
the flight number and city name are known. The Flight Number and City Name
collections are shown at the bottom of Figure 15-1. Flight Numbers are just
integers, and integers are simple types; hence the crossed lines through the
hexagon. City Names are character strings, which are arrays of characters and
therefore composite types, and so there are no crossed lines. (A single character
is a simple type.)
The Departures and Arrivals collections reference the Flight Numbers and City
Names collections, as shown by the relationship lines with open arrowheads
indicating the direction of reference. This kind of relationship expresses
reference without any kind of composition.
Logically—ignoring implementation details—the references to Flight Numbers
are made by the Flight Number components of Departures and Arrivals, and the
references to City Names are made by the City components of Departures and
Arrivals. In a SQL database, the Flight Number and City components are called
foreign keys. A foreign key is a component that can only take on values that are
found as key values in the referenced table. In this way, a flight schedule’s flight
number is restricted to known flight numbers and not just any 3- or 4-digit number, and a flight schedule's city name is restricted to known city names and not just any string of characters. No relationship is shown to a table of times. We'd rather not keep a table of all 1440 minutes in a day! Instead, we depend on the Time of Day Type to enforce that times reference hours and minutes in a day in a customary format.
By this point in the book your COMN antennae may have gone up when you read the word "restricted", and you might suspect that there are subtypes somewhere here—and in fact there are. Let's examine the Flight Number component. Its format is defined to be a 3- or 4-digit number. But only some 3- or 4-digit numbers are listed in the Flight Numbers collection, and only those are legitimate flight numbers. Thus, we have two sets:
the set of all 3- and 4-digit numbers
the subset of 3- and 4-digit numbers found in the Flight Numbers
table.
The foreign key constraint that restricts a Flight Schedule’s Flight Number
component is in fact a type, because it designates a set, that set being the flight
numbers listed in the Flight Number table.
This shows that it is always true that a foreign key constraint is a subtype. The
supertype is the underlying type of the key. The subtype designates the set of
key values actually in the referenced table.
This is a really radical observation, because up until now we have always said
that the lines in a data model represent relationships, and now we’re learning
that some of the lines in fact represent types! To be specific, a reference to a
collection without aggregation or assembly, shown in COMN with an open
arrowhead, amounts to the definition of a subtype, if the collection has a key.
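Here is a small sketch of that observation (Python; the collections and key values are illustrative assumptions). The first test designates the supertype, and the foreign-key check, which is simply membership in the referenced collection, designates the subtype:

# The supertype: all 3- or 4-digit numbers.
def is_three_or_four_digit(n):
    return 100 <= n <= 9999

# The subtype: the key values actually present in the Flight Numbers collection.
flight_numbers = {351, 445, 1092}        # illustrative key values

def is_known_flight_number(n):
    """The foreign-key constraint: membership in the referenced collection."""
    return is_three_or_four_digit(n) and n in flight_numbers

assert is_known_flight_number(351)       # in the subtype
assert not is_known_flight_number(999)   # in the supertype, but not the subtype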
Many document DBMSs, and other NoSQL DBMSs, don’t have the concept of a
foreign key, and will allow flight number and city name fields to accept any digit
strings or character strings. But the fact that these DBMSs don’t enforce the type
constraints doesn’t mean that the types don’t logically exist. For your NoSQL
database design, you should still document that these types exist at the logical
level. When you move to physical database design, and have no way to represent
or enforce these logical types physically, it will be important to document that
these fields are wide open, and the database is dependent on the application code
to store only data that is of the correct logical type.
Within the Flight Schedule Record Type rectangle, we’ve changed the types of
the components Flight Number and City. The type of Flight Number is now
given as FK(Flight Numbers), meaning that a Flight Number’s type is the set of
all key values found in the Flight Numbers collection. “FK” stands for foreign
key. Likewise, the type of City has been changed to FK(City Names). If we
wanted to know the underlying types of Flight Numbers and City Names—that
is, the types of the keys, which are the supertypes involved here—we’d have to
look at the logical record types for Flight Numbers and City Names. They aren’t
shown in this model just to keep the clutter down. This change further reduces
redundancy in the model, because now those underlying types are defined only
once, rather than both at their origin and at their point of reference.
At the bottom of Figure 15-2 you’ll see the role boxes used in fact-based
modeling. Each group of small role box rectangles represents a predicate, and
the number of boxes in the group gives the number of variables in the predicate.
The phrase near a group spells out the predicate, and uses logical record type
component names (underlined) as variable names.
The role boxes illustrate that, in a record-oriented design, relationships exist
between components of a single logical record. Foreign-key relationships are
really subtype specifications.
If you’re lucky enough to have a design where all the important relationships
correspond to the subtype/foreign key relationships, you can skip the role boxes
and label those relationship lines. But you can see from this example that you
might not always be so lucky.
Which notation should one use? That is almost entirely an issue of preference.
There is a large community of data modelers and business people who are
comfortable with E-R notations, and the very similar UML notation. The
community of data modelers and business people familiar with fact-based
modeling is much smaller. This would tilt the preference for notation in favor of
foreign-key/subtype relationship lines over small-box relationship notation.
However, if one wished to show the relationships that exist between components
of a single logical record type when no foreign key is involved, or when the
diagram doesn’t show the referenced logical record type as a rectangle, role-box
notation can be used.
Furthermore, role-box notation can show relationships involving three or more
data items—predicates with three or more variables—without forcing the
inclusion of a so-called “associative entity” in the diagram.
Key Points
Relational database foreign key constraints are actually subtypes, because they designate the set of key values
in the referenced table, which are a subset of the values designated by the key’s type.
Foreign-key relationship lines in E-R diagrams depict subtype relationships.
Relationships exist between the attributes of a single logical record.
Role boxes can be used to express relationships between components of a single logical record type.
Data are values that play roles in predicates, as values for predicate variables.
A logical predicate is a relationship type, because it designates a potential set of relationship propositions.
Chapter Glossary
relationship : a proposition concerning two or more entities
relationship type : a logical predicate
Chapter 16
The Relational Theory of Data
We’ve gotten quite far in our investigation of data models, including discussions
of what data and information are, and what relationships are. It’s finally time to
take a look at the fundamental theory behind all of this.
We’ve already established, and the world has already proven, that you can do a
lot with data without understanding relational theory. It’s also true that you can
do a lot with water power without understanding the water molecule, H2O.
WHAT IS A RELATION?
In overly simplistic terms, a relation is a table. So why do we use this fancy
word “relation”, instead of the more easily understood term “table”? Because
there are some important differences, namely:
1. The order of rows in a relation has no significance whatsoever, while
the order of rows in a table may carry information.
2. The repetition of rows in a relation has no significance whatsoever,
while the repetition of rows in a table may carry information.
Below I will explain these differences, and why they matter.
When cash registers were mechanical devices that printed item prices on paper,
cash register receipts showed the prices of grocery items in the order in which
the cashier rang them up. This custom has been preserved even in the age of
computerized registers and bar code scanners. Thus, not only does the printing
on the paper register tape show each item purchased together with its price, but it
also shows the order in which the items passed by the scanner. If, for some
reason, we wished to reorder the items in the list, say by sorting them so that the
least expensive item appeared first, we would lose track of the order in which
they were rung up or scanned. We observe that there is information carried in the
order of the items in the list. In contrast, consider a printed telephone directory.
See Figure 16-2 for an example of part of a telephone directory.
Doe Jack 123 Main St. 555-1234
This list also has an order. It is apparent that the order of this list is determined
by sorting the names of telephone subscribers in alphabetical order. The
telephone directory publisher established this order so that it is easy to find
listings by name: one simply searches alphabetically through the directory for
the name of the desired listing.
If you took a telephone directory, cut off the binding while leaving the contents
of the pages intact, and shuffled the pages like a deck of cards, you would not
lose any information. Every listing would still be in those pages, though it would
take much longer to find any one listing because of the loss of order. With
enough time and patience, one could re-sort the pages to alphabetical order, by
consulting the listing at the top of each page. Because the order can be restored
using the information still on each page, one can see that the directory’s physical
property of alphabetical order carries no information. However, that order is
valuable for efficiency when searching the directory.
Consider again the cash register receipt. To record all of the information from
the receipt explicitly, including the significance of the order of the rows, and
preserve the information when the rows are reordered, we would have to add a
column to the table that lists the sequence in which items were rung up, as
shown in Figure 16-3 below.
12 TOMATO CAN 16 OZ 1.69
Figure 16-3. Part of a Grocery Store Cash Register Receipt with Explicit Order
The printer’s error is to print one listing three times, which is entirely
unnecessary and a waste of paper and ink. No matter how many times John
Doe’s listing is printed, it is still only true once, so to speak, that his telephone
number is 555-8877. This repetition does not indicate that, for instance, he has
three telephones in his house. He might have four, or only one: the directory
does not carry such information.
In contrast, the cash register receipt of Figure 16-1 has two rows that are
identical, and this repetition carries information, specifically that two 16-ounce
cans of tomatoes were purchased. Thus, a telephone directory is much more like
a relation than a cash register receipt is, because neither the order of its rows nor
any repetition of rows carries any information.
We say that a table depicts a relation because it is like a picture of a relation. A
table is a physical representation of a relation. It is not the relation itself. You
can’t see the relation itself, because the relation is a concept. Just as no one has
ever seen the number one, even though we’ve all seen thousands of
representations of the number one in words and symbols, no one has ever seen
or will ever see a relation.
The column headings create an expectation as to what data we will find in the
corresponding fields of each row. For example, we would be very surprised to
find the words “FACIAL TISSUE” in the Price column. In fact, if that occurred
on a cash register receipt or in a corresponding table in a computer’s memory,
we would consider it to be an error.
By definition, each column in a relation can only carry one type of data; that is,
data drawn from only one set of values. For instance, values in the Price column
of a cash register receipt can only be drawn from a set of numbers that represent
prices, with two digits to the right of the decimal point. Further, the values in
each column carry a significance which is usually indicated by the column
heading. For instance, we expect that values in the Item Description column
describe the items purchased.
We can describe the form of the cash register receipt and the telephone directory
using a logical record type. We know from the previous chapter that
relationships exist between the components of a logical record type, so this
means that relationships must exist between the columns of these tables.
Summary
A relation is a conceptual record of data, where there is no significance to the
order of rows, nor to repetition of data. We represent relations as tables, which
do have order to their rows, and which can repeat row values. However, by
avoiding the use of the physical order of data in a database to record
information, data can be reordered in order to make retrieval faster, without
losing information. In any relation, relationships between data exist between (not
necessarily adjacent) columns.
This table shows many data attribute values. Here are a few examples:
<Departure City, FK(City Names), Charlotte>
<Departure City, FK(City Names), Chicago>
<Arrival Time, Time of Day Type, 4:30 PM>
<Flight Number, FK(Flight Numbers), 445>
The angle brackets (< >) indicate that the order of terms inside them is
significant. This is important: we don’t want to confuse the name of a role that
data plays with the name of the set from which it is drawn. For instance, it is
important not to confuse a particular Flight Number with the set of possible
Flight Numbers.
A set of data attribute values, taken together, is called a tuple value, or, more
simply, a tuple. This strange name comes from the names we use for sets of
particular numbers of things: single, double, triple, quadruple, quintuple,
sextuple, septuple, . . . . (You can pronounce “tuple” to rhyme with “couple” or
to rhyme with “scruple”; both are acceptable.) Table 16-2 shows four tuples,
each one as a row of the table. Here is the tuple represented by the first row of
the table.
{
<Flight Number, FK(Flight Numbers), 351>,
<Departure City, FK(City Names), Charlotte>,
<Departure Time, Time of Day Type, 11:05 AM>,
<Arrival City, FK(City Names), Philadelphia>,
<Arrival Time, Time of Day Type, 12:40 PM>
}
The outer braces around this list of data attributes indicate that they are members
of a set, and therefore the order of items in the list is insignificant. I listed the
data attribute values in the same order in which they were depicted in Table 16-2
—in column order—but since each data attribute value carries its role name
(which equals the column name), the order of data attribute values in the set is
irrelevant. One could rearrange the data attribute values within the tuple without
losing any information.
Technically, a relation is a set of tuples that all have the same set of role names
and types in their data attribute values. Table 16-2 depicts a relation with four
tuples as four rows. Each field in a row, at the intersection of a row and a
column, depicts a data attribute value. Each column name is the role name of the
data attribute value.
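A brief sketch, assuming Python (the data echoes the first tuple above; the representation itself is only illustrative), shows why neither order nor repetition carries information when tuples and relations are modeled as sets:

# A data attribute value is a <role name, type, value> triple.
tuple_1 = frozenset({
    ("Flight Number", "FK(Flight Numbers)", 351),
    ("Departure City", "FK(City Names)", "Charlotte"),
    ("Departure Time", "Time of Day Type", "11:05 AM"),
    ("Arrival City", "FK(City Names)", "Philadelphia"),
    ("Arrival Time", "Time of Day Type", "12:40 PM"),
})

# A relation is a set of tuples that share the same role names and types.
relation = {tuple_1}

# Sets ignore order and repetition: adding the same tuple again changes nothing.
relation.add(tuple_1)
assert len(relation) == 1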
One can see that writing out each set of data attribute values—each tuple—is a
very inefficient way of displaying the data in a relation. If one were to show a
relation as a set of tuples, it would take a great deal of space, indeed. (This, by
the way, is unfortunately the manner in which XML depicts tuple values, and it
is so inefficient that it disallows XML from use in many demanding applications
that involve structured data.) We prefer the compact depiction in a table such as
Table 16-2. The only disadvantage to the table notation is that it provides only
two of the three parts of a data attribute value, namely, the role name and the
value: nowhere is the data attribute type specified. That is a problem easily
addressed by adding an additional header row to the table; see Table 16-3, where
the types are given as non-bold headers directly below the column names.
Flight Number        Departure City     Departure Time      Arrival City       Arrival Time
FK(Flight Numbers)   FK(City Names)     Time of Day Type    FK(City Names)     Time of Day Type
If we store many, usually related, relations in one place, we have what is called a
relational database. In relational theory, a database is a collection of relation
variables, where each relation variable can take on the value of some relation.
The logical record type collections, represented in COMN as shadowed and
dashed hexagons, are relation variables. In practice, a relational database is
implemented using a physical table for each relation variable.
With a quick glance at the table heading we observe that there are three name
columns that, taken together, give an employee’s full name. Each of the three
name columns references a Personal Names table, so that, as employees' names
are added to the table, a system can check whether a certain name has ever been
seen before. If a name is not found, the system can ask the data entry person,
“Are you sure this is a personal name?”, allowing entry, and adding the name to
the Personal Names table, only after the data entry person has confirmed the
spelling. This helps ensure high data quality on personal names, which are so
varied as to be difficult to check in any other way.
However, with the table as defined, we have no means to deal with an
employee’s full name as a whole. Rather, we are forced to deal with an
employee’s name as three separate components.
We can improve on this situation. See Table 16-5 below.
In this version of the Employee Data table, we have collected three data
attributes under a new heading, Employee Name. Note the change in the role
names under Employee Name. The word “Employee” has been removed from
the role names of the sub-attributes. It is the “big attribute” that tells us that this
is a name of an employee. We intuitively recognize that the sub-attributes—Last
Name, First Name, Middle Name—are applicable not just to the names of
employees, but also to the names of customers, parents, taxpayers, etc.
How do we understand this in relational terms? Table 16-5 introduces a new
tuple scheme, with three data attributes: Last Name, First Name, and Middle
Name. All three data attributes have the same type, FK(Personal Names), but
this is purely coincidental and not significant. The important aspect is that we are
now using a tuple scheme as the type for the data attribute Employee Name. We
call data attributes like Employee Name, whose types are tuple schemes,
composite data attributes. We recognize that a composite data attribute is
merely an attribute whose type is a composite type; that is, a type defined using a
scheme.
Now, we haven’t given this new tuple scheme a name: it is anonymous. But it
would make perfect sense to call this tuple scheme Person Name, and then we
could re-use this tuple scheme to represent the names of persons in many
different contexts. Table 16-6 depicts this same table with the additional tuple
scheme type shown as Personal Name Type, and Figure 16-5 shows the COMN
logical data model corresponding to this table.
Table 16-6. The Employee Data Table with a Sub-Scheme with Explicit Type
This is classic type nesting. We do this all the time in the context of
programming languages, where the components of a class may be other classes,
to any level. We do this in XML, where an XML element can contain other
XML elements, nested to any level. We do this in JSON, where an object or an
array can contain other objects or arrays, nested to any level. We now
understand type nesting as it relates to tables and relations. This is made possible
by two aspects of COMN:
separation of the idea of a composite type from a record collection
that conforms to that type
recognition that a foreign key constraint is a subtype
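As a hedged sketch of this type nesting (Python dataclasses; the names mirror the tables above, and the member types are assumptions):

from dataclasses import dataclass

@dataclass(frozen=True)
class PersonalNameType:                  # the reusable inner tuple scheme
    last_name: str
    first_name: str
    middle_name: str

@dataclass
class EmployeeData:                      # the outer scheme nests the inner one
    employee_id: int
    employee_name: PersonalNameType      # a composite data attribute

emp = EmployeeData(956, PersonalNameType("Doe", "Jane", "Q"))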
Figure 16-5. A COMN Model for the Employee Data Table
RELATIONAL OPERATIONS
There are nine relational operators that return relations as results: select (or
restrict), join, project, union, intersection, difference, extend, rename, and divide.
These relational operators show up directly or indirectly in SQL, and are often
present in NoSQL DBMSs as well, of necessity. For instance, an operation that
selects documents from a document database based on the value of a particular
document element is performing the relational operation of restriction.
Encapsulating data in classes—a common practice in object-oriented
programming—disables the relational operators. Relational operators need free
access to the data attributes of relations in order to recombine them in useful
ways. SQL DBMSs, and other DBMSs that implement the relational operations,
provide powerful means to manipulate large quantities of data very efficiently
and with minimal programming.
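To make two of these operators concrete, here is a sketch (Python; tuples are represented as dictionaries purely for illustration, not as any DBMS's API) of restrict and project over a small relation:

# A small relation, represented for illustration as a list of dictionaries.
departures = [
    {"Flight Number": 351, "Departure City": "Charlotte", "Departure Time": "11:05 AM"},
    {"Flight Number": 445, "Departure City": "Chicago", "Departure Time": "4:30 PM"},
]

def restrict(relation, predicate):
    """Keep only the tuples for which the predicate holds."""
    return [t for t in relation if predicate(t)]

def project(relation, attribute_names):
    """Keep only the named data attributes, dropping duplicate tuples."""
    seen, result = set(), []
    for t in relation:
        projected = tuple((name, t[name]) for name in attribute_names)
        if projected not in seen:
            seen.add(projected)
            result.append(dict(projected))
    return result

charlotte_flights = restrict(departures, lambda t: t["Departure City"] == "Charlotte")
departure_cities = project(departures, ["Departure City"])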
The NoSQL community is in danger of leaving that efficiency and
expressiveness behind, and manually replicating the same operations repeatedly
at the application level. This is costly and inefficient, and requires that what is
(or should be) essentially the same logic be tested over and over. It’s important
to be aware that relational theory and relational operations are not tied to SQL or
any particular physical implementations, and are best implemented once in a
DBMS for all to use.
TERMINOLOGY
Relational Term             COMN Term
attribute                   data attribute
(no relational equivalent)  attribute: an inherent characteristic
attribute value             data attribute value
tuple scheme                composite type; a logical record type if intended to be used as such
tuple variable              a variable having a composite type; a logical record if intended to be used as such
tuple, tuple value          value of a composite type
relation scheme             composite type for a collection of logical records
relation variable           a variable having a relation scheme as its type
relation, relation value    value of a relation variable
Key Points
The relation of relational theory is not a relationship. This is a great terminological tragedy that has impeded
comprehension of relational theory.
A relation is like a table where the order of the rows is irrelevant, and any repeated row values are irrelevant.
Making order irrelevant enables data independence, so that data can be reordered for faster access.
Type nesting is compatible with relational theory, but the lack of support for type nesting in SQL and E-R
notations has led to the belief that one must abandon relational theory in order to gain type nesting.
Relational operators are powerful means for recombining data in useful ways. Encapsulating data disables
these operators, and leaving them out of NoSQL DBMSs forces them to be implemented repeatedly in
applications, with resulting higher costs, lower quality, and lower reliability.
Chapter Glossary
attribute : an inherent characteristic (Merriam-Webster)
data attribute : a <name, type> pair. The name gives the role of a value of the given type in the context of
a tuple scheme or relation scheme.
data attribute value : a <name, type, value> triple
tuple : a tuple value
tuple value : a set of data attribute values
tuple scheme : the specification of the data attributes of a tuple, together with any constraints referencing a tuple value as a whole
relation : a relation value
relation value : a set of tuple values all having the same tuple scheme; informally, a table without significance to the order of or repetition of the values of its rows
relation scheme : the specification of the data attributes of a relation, together with other information such as keys and other constraints on a relation value as a whole
tuple variable : a symbol which can be bound to a tuple value
relation variable : a symbol which can be bound to a relation value
data independence : the ability to reorder data without losing any information
Chapter 17
NoSQL and SQL Physical Design
As you can see, the bulk of this book is spent explaining concepts of analysis
and design, and teaching you how to represent things in the real world, and data
about those things, in COMN models. Now we get to the final step that follows
from analysis and design: expressing a physical database design in a COMN
model. There are several goals for such a model:
1. We want the model to be as complete and precise as an actual
database implementation, so that there’s no question how to
implement a database that follows from the model. This is especially
important if we want to support model-driven development.
2. We want assurance that the physical design exactly represents the
logical design, without loss of information and without errors
creeping in.
3. We want the database implementation to perform well, as measured
by all the criteria that are relevant to the application. We want queries
by critical data to be fast. We want updates to complete in the time
allowed and with the right level of assurance that the data will not be
lost. The physical design process is the place where performance
considerations enter in, after all the requirements have been captured
and the logical data design is complete.
DATABASE PERFORMANCE
Our task as physical database designers is to choose the physical data
organization that best matches our application’s needs, and then to leverage the
chosen DBMS’s features for the best performance and data quality assurance.
This might lead us to choose a DBMS based on the data organization we need.
However, sometimes the DBMS is chosen based on other factors such as
scalability and availability, and then we need to develop a physical data design
that adapts our logical data design to the data organizations available in the
target DBMS.
The task of selecting the best DBMS for an application based on all the factors
below and the bewildering combinations of features available in the
marketplace, at different price points and levels of support, is far beyond the
scope of this book to address. This chapter will equip you to understand the
significance of the features of most of the DBMSs available today, and then—
more importantly for the topic of this book—show you how to build physical
models in COMN that express concrete representations of your logical data
designs.
Physical design is all about performance, and there are several critical factors to
keep in mind when striving for top performance:
scalability: Make the right tradeoffs between ACID and BASE,
consulting the CAP theorem as a guide. Know how large things could
get—that is, how much data and how many users. You will need to
know how much of each type of data you will accumulate, so that you
can choose the right data organization for each type.
indexing: Indexing overcomes what amounts to the limitations of the
laws of physics on data. If a field is not indexed, you will have to
scan for it sequentially, which can take a very long time. Add indexes
to most fields which you want to be able to search rapidly, and
consider the various kinds of indexes the DBMS offers you. But be
aware of the tradeoffs that come with indexes.
correctness: Make sure the logical design is robust before you embark
on the physical design. There’s nothing worse than an implementation
that is fast but does the wrong thing. In this context, “robust” means
complete enough that we don’t expect that evolving requirements will
require much more than extension of the logical design.
ACID
Almost all SQL DBMSs, plus a few NoSQL DBMSs, implement the four
characteristics that are indicated by the acronym ACID: atomicity, consistency,
isolation, durability. These four characteristics taken together enable SQL
databases to be used in applications where the correctness of each database
transaction is absolutely essential, such as financial transactions (think
purchases, payments, and bank account deposits and withdrawals). In those kinds
of applications, getting just one transaction wrong is not acceptable.
Here is a guide to the four components of ACID.
Atomicity
A DBMS transaction that is atomic will act as though it is indivisible. An update
operation that might update a dozen tables will either succeed completely or fail
completely. Either all affected tables will be updated, or, if for some reason the
update cannot completely succeed, tables that were updated will be rolled back
to their pre-update state. A transaction might not complete because of errors in
the update, or because a system crashed. It does not matter: An atomic
transaction will appear to a DBMS user as if it either completely succeeded or
completely failed.
Consistency
A DBMS can enforce many constraints on data given to it to store. The most
fundamental constraints are built-in type constraints: string fields can only
contain strings, date fields can only contain dates, numeric fields can only
contain numbers, etc. Additional constraints can be specified by a database
designer. We’ve talked a lot about foreign-key constraints, where a column of
one table may only contain values found in a key column of another table. There
can be what are called check constraints, which are predicates that must be true
before data to be stored in a table is accepted.
Consistency is the characteristic that a DBMS exhibits where it will not allow a
transaction to succeed unless all of the applicable constraints are satisfied. If, at
any point in a transaction, a constraint is found which is not met by data in the
transaction, the transaction will fail, and atomicity will ensure that all partial
updates that may have already been written by the transaction are rolled back.
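A minimal sketch of declared constraints in generic SQL DDL follows; the table and column names are illustrative assumptions, not designs from this book.

    CREATE TABLE Product (
        Product_ID INTEGER PRIMARY KEY,
        Unit_Price DECIMAL(9,2) NOT NULL
            CHECK (Unit_Price >= 0)              -- a check constraint
    );

    CREATE TABLE Order_Item (
        Order_ID   INTEGER NOT NULL,
        Product_ID INTEGER NOT NULL
            REFERENCES Product (Product_ID),     -- a foreign-key constraint
        Quantity   INTEGER NOT NULL
            CHECK (Quantity > 0)
    );

    -- This insert violates the foreign-key constraint (no Product row has
    -- Product_ID 999), so the DBMS rejects it, and any partial updates in
    -- the enclosing transaction are rolled back.
    INSERT INTO Order_Item (Order_ID, Product_ID, Quantity)
    VALUES (1, 999, 2);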
Isolation
Isolation is the guarantee given by a DBMS that one transaction will not see the
partial results of another transaction. It is strongly related to atomicity.
Isolation ensures that the final state of the data in a database reflects only
the results of successful transactions, and that, if a DBMS is processing
multiple transactions simultaneously, each transaction sees only the results of
previously completed, successful transactions. The transactions won't interfere
with each other.
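Standard SQL lets an application ask for a particular degree of isolation, as sketched below; the Account table is again hypothetical, and the placement of the SET TRANSACTION statement varies by DBMS.

    -- Request the strictest isolation for the next transaction.
    SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;

    START TRANSACTION;
    -- This read sees only data committed by earlier transactions, never
    -- the partial results of transactions still in progress.
    SELECT SUM(Balance) FROM Account;
    COMMIT;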
Durability
Durability is the guarantee that, once a transaction has successfully completed,
its results will be permanently visible in the database, even across system restarts
and crashes.
Key/Value DBMS
A system that organizes data as key/value pairs is arguably the simplest means
of managing data that can justify calling the system a database management
system. The main focus of a key/value DBMS is on providing sophisticated
operations on key values, such as searching for exact key values and ranges of
key values, searching based on hashes of the key values, searching based on
scores associated with keys, etc. Once the application has a single key value, it
can speedily retrieve or update the “value” portion of the key/value pair.
Key/value terminology is somewhat problematic, since the keys themselves have
values. In reality, the “value” portion of key/value is just an object of some class
that is unknown to the DBMS.
Because of the simplicity of key/value DBMSs, it is relatively easy to achieve
high performance. The tradeoff is that the work of managing the “value” is left
to the application. Some DBMSs are beginning to provide facilities for
managing the “value” portion as JSON text or other data structures.
Mapping a logical record type to a model like this involves the following
considerations:
Each physical record class can have only one key, which could
consist of one component or of a set of components treated as a unit
(in other words, a composite key). This will be the only component
that can be searched rapidly. If records need to be found by the value
of more than one component, the data might have to be split into
several physical record classes.
To the DBMS, the “value” component is a blob, but to the application
it’s quite important. Therefore, it will behoove the designer and
implementer to fully define the “value” components of the logical
record.
Because key/value DBMSs don’t support foreign key constraints, a greater
burden is put on the application to ensure that only correct data is stored in a
key/value database.
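Key/value DBMSs differ widely in their APIs, so as a neutral illustration of the mapping considerations above, here is a sketch that emulates two key/value collections as two-column SQL tables; all of the names are hypothetical.

    -- One collection per key we must search by. The "value" is an opaque
    -- blob whose internal structure only the application understands.
    CREATE TABLE Customer_By_ID (
        Customer_ID VARCHAR(36) PRIMARY KEY,   -- the single (possibly composite) key
        Value_Blob  BLOB NOT NULL              -- serialized customer record
    );

    -- To find customers by email as well, the data is split into a second
    -- collection, because each record class supports only one fast key.
    CREATE TABLE Customer_ID_By_Email (
        Email_Address VARCHAR(255) PRIMARY KEY,
        Customer_ID   VARCHAR(36) NOT NULL
    );
    -- No foreign-key constraint ties the two collections together; the
    -- application alone must keep them consistent.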
Graph DBMS
A graph DBMS supports the organization of data into a graph. There are only
two kinds of model entities in a graph: nodes and edges. Graph data is usually
drawn using ellipses or rectangles to represent nodes and lines or arcs to
represent edges. In contrast to common practice in most applications of data
modeling, graph data models usually depict entity instances rather than entity
types. Figure 17-1 shows some graph data expressed using the COMN symbols
for real-world objects (Sam), simple real-world concepts (Employee Role), and
data values (2016-01-01).
Figure 17-1. A Data Graph
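Graph DBMSs each have their own query languages, so as a hedged, DBMS-neutral sketch, the node-and-edge organization can be emulated with two SQL tables. The labels echo Figure 17-1, but the table and column names are assumptions, not part of COMN or of any particular product.

    -- Every vertex of the graph is a row in Node; every arc is a row in Edge.
    CREATE TABLE Node (
        Node_ID INTEGER PRIMARY KEY,
        Label   VARCHAR(100)            -- e.g. 'Sam', 'Employee Role', '2016-01-01'
    );

    CREATE TABLE Edge (
        From_Node_ID INTEGER NOT NULL REFERENCES Node (Node_ID),
        To_Node_ID   INTEGER NOT NULL REFERENCES Node (Node_ID),
        Relationship VARCHAR(100),      -- e.g. 'plays', 'effective on'
        PRIMARY KEY (From_Node_ID, To_Node_ID, Relationship)
    );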
Document DBMS
At its interface, a document DBMS stores and retrieves textual documents. A
document DBMS usually supports some partial structuring of documents using a
markup language such as XML. Some so-called document DBMSs support
JSON texts, often mistakenly called documents. A document is a primarily
textual composite record type with possibly deep nesting. Usually, many of a
document’s components are optional. It is straightforward to model any
document’s structure in COMN as nested composite types; however, a non-
trivial document might involve many types and might need to be split across a
number of diagrams.
As with any database, speedy access to data will depend on important
components being indexed. A database index is built and maintained by a
DBMS by taking a copy of the values (or states) of all instances of a specified
component or components, and recording in which documents those values are
found. This is represented in COMN as a projection of data from the record or
document collection to the index, and then a pointer back to the collection.
DBMSs usually offer indexes in many styles. The particular index style can be
indicated in the title bar of the index collection, in guillemets, as in «range
index». This notation gives the class (or type) from which the current symbol
inherits or is instantiated. See Figure 17-3 for an example. The Employee ID
Index is defined as a physical record collection that is a projection from the full
Employee Resume Collection of just the Employee ID component. The index is
also an instance of a unique index, a class supported by the DBMS. The pointer
back to the collection is implicitly indicating a one-to-one relationship from each
index record to a document in the collection, which is true of a unique index. A
non-unique index would require a “+” at the collection end of the arrow,
indicating that one index record could indicate multiple documents or records in
the collection.
Figure 17-3. A Document Database Design with a Unique Index
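To make the projection-plus-pointer pattern of Figure 17-3 concrete, here is a rough SQL analogue; real document DBMSs declare indexes through their own APIs, and the names below are assumptions.

    -- The document collection, with the Employee ID component that will be
    -- projected into the index stored alongside the document text.
    CREATE TABLE Employee_Resume_Collection (
        Employee_ID INTEGER NOT NULL,
        Resume_Doc  CLOB    NOT NULL    -- the (possibly deeply nested) document
    );

    -- The unique index is, in effect, a copy of just the projected
    -- component, with a pointer back to exactly one document per index record.
    CREATE UNIQUE INDEX Employee_ID_Index
        ON Employee_Resume_Collection (Employee_ID);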
Columnar DBMS
A columnar DBMS presents data at its query interface in the form of tables, just
as a SQL DBMS does. The difference is in how the data is stored.
Traditional table storage assumes that most of the fields in each row will have
data in them. Rows of data are stored sequentially in storage, and indexes are
provided for fast access to individual rows.
Sometimes it would be better if the data were sliced vertically, so to speak, and
each column of data were stored in its own storage area. A traditional query for a
“row” would be satisfied by rapidly querying the separate column stores for data
relevant to the row, and assembling the row from those column queries that
returned a result. However, columnar databases optimize for queries that retrieve
just a few columns from most of the rows in a table. Such queries are common in
analytical settings where, for example, all of the values in a column are to be
summed, averaged, or counted. Columnar databases optimize read access at the
cost of write access (it can take longer to write a “row” than in a row-oriented
database), but this is often exactly what is needed for analytical applications.
Consider, for example, a database of historical stock prices. These prices, once
written, do not change, but will be read many times for analysis. In a row-
oriented database, each row would repeat a stock’s symbol and exchange, as
well as the date and open, high, low, and close prices on each trading day. A
columnar database stores the date, open, high, low, and close each in its own
column, making time-series analysis of the data much faster.
Representing a columnar design in COMN involves modeling the columns,
which are physical entities. This is quite straightforward to do, as a column is a
projection of a table. Figure 17-4 below shows our example design, where the
physical table class that represents the logical record type is shown projected
onto a row class and five columnar classes.
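The stock-price example can be sketched as follows; the table is ordinary SQL DDL (column names assumed), and the query shows the narrow, many-row access pattern that a columnar store serves especially well.

    CREATE TABLE Stock_Price (
        Symbol      VARCHAR(10) NOT NULL,
        Exchange    VARCHAR(10) NOT NULL,
        Trade_Date  DATE        NOT NULL,
        Open_Price  DECIMAL(12,4),
        High_Price  DECIMAL(12,4),
        Low_Price   DECIMAL(12,4),
        Close_Price DECIMAL(12,4),
        PRIMARY KEY (Symbol, Exchange, Trade_Date)
    );

    -- This query touches only two columns but most of the rows. A columnar
    -- store reads just the Symbol and Close_Price columns; a row store must
    -- read every full row.
    SELECT Symbol, AVG(Close_Price) AS Avg_Close
    FROM   Stock_Price
    GROUP BY Symbol;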
Tabular DBMS
Let’s not forget about traditional tables as a data organization. Traditional tables
can, of course, be used in SQL DBMSs, but increasingly NoSQL DBMSs
support them, too.
The physical design of tabular data emphasizes several aspects for performance
and data quality:
indexing: Probably the most important thing to pay attention to in a
tabular database design is to ensure that all critical fields are indexed
for fast access. Add indexes carefully, because each index speeds
access to the indexed data, but also increases the database size and
slows updates.
foreign keys: Foreign key constraints are valuable mechanisms for
ensuring high data quality. They were covered in chapter 15.
partitioning: Many DBMSs enable the specification of table partitions
(and even index partitions). The idea is that each partition, being
smaller than the whole table, can be searched and updated more
quickly. Any query is first analyzed to determine which table
partition(s) it applies to, and then the query is run just against the
relevant partition.
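Partitioning syntax varies considerably between DBMSs; the following sketch uses PostgreSQL-style declarative range partitioning purely as an illustration, with hypothetical table and column names.

    -- The parent table declares how rows are divided among partitions.
    CREATE TABLE Order_Fact (
        Order_ID     INTEGER      NOT NULL,
        Order_Date   DATE         NOT NULL,
        Total_Amount DECIMAL(9,2)
    ) PARTITION BY RANGE (Order_Date);

    -- Each partition holds one year of data and can be indexed and
    -- searched on its own.
    CREATE TABLE Order_Fact_2016 PARTITION OF Order_Fact
        FOR VALUES FROM ('2016-01-01') TO ('2017-01-01');
    CREATE TABLE Order_Fact_2017 PARTITION OF Order_Fact
        FOR VALUES FROM ('2017-01-01') TO ('2018-01-01');

    -- The DBMS analyzes this query and runs it only against the 2016 partition.
    SELECT COUNT(*)
    FROM   Order_Fact
    WHERE  Order_Date >= DATE '2016-01-01'
      AND  Order_Date <  DATE '2017-01-01';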
Figure 17-4. A Columnar Data Design
SUMMARY
COMN’s ability to express all the details of physical design for a variety of data
organizations means that the physical implementation of any logical data design
can be fully expressed in the notation. COMN enables a direct connection to be
modeled between a logical design and a physical design in the same model,
enabling verification that the implementation is complete and correct. The
completeness of COMN means that data modeling tools could use the notation
as a basis for generating instructions to various DBMSs to create and update
physical implementations. This makes model-driven development possible for
every variety of DBMS.
REFERENCES
[Brewer 2012] Brewer, Eric. "CAP Twelve Years Later: How the Rules Have
Changed." IEEE Computer (New York: Institute of Electrical and Electronics
Engineers), February 2012, pp. 23-29.
Key Points
Physical database design is the place where performance becomes paramount.
Physical design should not begin until a robust logical data design has been completed.
There are many physical data organizations available for implementation, including key/value, document,
graph, columnar, and row-oriented tabular. Some DBMSs are specialized for exactly one form of data
organization, and some are hybrid, supporting multiple organizations.
DBMSs vary in the extent to which they can be scaled in size, and in the extent to which they support ACID and
BASE transactional characteristics.
DBMS selection must sort through the various intersections of data organization, ACID/BASE, scaling,
performance, price, and support that appear in the marketplace. Matching a DBMS’s data organization style to
an application’s needs is just one important aspect of DBMS selection.
Many DBMSs achieve speed and scale by omitting features that applications have often used to achieve high
data quality, such as type safety and foreign key constraints. Be sure to consider these aspects when selecting
a DBMS.
Data to be stored in a key/value DBMS should be modeled in its entirety, even though the DBMS sees the
“value” only as a blob, because the application will need to know the structure of the “value”.
Graph data can be modeled in COMN using value and object symbols. A graph schema can be modeled with a
graph of COMN type and class symbols.
Database indexes, whether for documents or tables, are modeled in COMN as projections of the main record
collection. A DBMS-specific index class can be noted in a shape’s title, surrounded by guillemets (« »).
A columnar data organization is modeled as projections of the record class onto multiple column classes.
A horizontally partitioned row-oriented table is shown as a set of exclusive subsets of the main table.
Part IV
Case Study
In this section we will walk through a case study to document a sample business
as it would exist in the real world, design the data needed to operate the
business, and design a hybrid physical implementation that uses a document
database and a tabular database. The result will be a single COMN model
documenting data requirements, logical data design, and physical database
implementation.
Chapter 18
The Common Coffee Shop
You’ve very excited to have been brought into a new specialty coffee shop
business. The owner is starting small, but has high hopes of taking his chain to
an international level. You’ve been selected to design the information system
and its database that will enable the chain to operate a few stores in one locality
but eventually expand to operations in several countries.
So far, so good. However, we have a few gaps. We have not yet decided how our
orders will identify our products, our employees, and our customers. We don't
have a means for identifying the coffee shop within which each order is placed.
Identification is an issue centered mostly in data. We’re going to need more
space in our diagram to develop our identification schemes. Since we’ve
confirmed that our data model accurately represents the real world, in Figure 18-
4 we’ve dropped the real entity types, and redrawn just the logical record
collection hexagons and logical types of Figure 18-3 in the three-section form.
This gives us the room we need to show the components of the records in the
collections.
This diagram will illustrate the significant difference in a logical data design
between composition and reference. An order is composed of order items, but
merely references a customer, an employee, and a coffee shop. What’s the
difference here?
Data we can’t identify separately can only exist as a component of some other
data. In our data model we’ve decided that we must be able to reference data
about employees, customers, products, orders, and coffee shops, because, in the
real world, they all stand apart from each other. For one instance of data to
reference another instance of data, it must have some value by which to identify
the referenced data. From relational theory we know that the data attribute or
attributes of some data record that distinguish it from all other data records in a
set are its key, and that the value of a key is an identifier of a particular record in
that set. This is a true statement whether the data in question is stored in tables,
in graph nodes, in documents, or in some other form. Relational theory is not
limited to describing the storage of data in tables. In fact, we need to understand
when we need keys in our data before we get to the question of how we’ll store
the data.
So, in order for us to enable an Order to reference a customer, an employee, and
a coffee shop, we must have keys for the three corresponding record collections.
They are indicated in each record collection rectangle as components with a
(PK) suffix. “PK” stands for primary key. A composite data type may have more
than one key. Each additional key is identified by “AK” for alternate key, and
can be numbered, as in AK1, AK2, etc. No alternate keys are used in this design.
We also want to be able to reference orders by some kind of identifier. In a busy
coffee shop, employees might need to communicate to each other about which
order they are working on. And since we plan for the business to grow, we know
we’re going to want to analyze order data that’s been collected across many
coffee shops and many days. So orders need keys, too. The key for the Order
data type is particularly interesting. It is composed of two components (two data
attributes). We call such a key a composite (or compound) key. A key with a
single data attribute is a simple key. Since an order is always placed within a
single coffee shop, we have designed orders to be identified by a simple integer
sequence, Order ID, but qualified by the Coffee Shop ID—the data attribute
playing the role of describing In (which) Coffee Shop the order was placed. This
design allows a database in each coffee shop to assign key values—identifiers—
to orders that don’t overlap with other coffee shops’ key values, simply because
each coffee shop has its own unique Coffee Shop ID.
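As a hedged illustration only (the chapter's operational design actually uses document collections, and the column types here are assumptions), the Order key design might be rendered in SQL DDL like this:

    CREATE TABLE Order_Collection (
        Coffee_Shop_ID INTEGER NOT NULL,   -- (PK) qualifier: which shop took the order
        Order_ID       INTEGER NOT NULL,   -- (PK) per-shop sequence number
        Customer_ID    INTEGER NOT NULL,   -- reference to a customer, not composition
        Employee_ID    INTEGER NOT NULL,   -- reference to an employee, not composition
        PRIMARY KEY (Coffee_Shop_ID, Order_ID)   -- a composite (compound) key
    );

    -- Two shops can both issue Order_ID 1 without conflict, because their
    -- Coffee_Shop_ID values differ.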
Figure 18-4. Coffee Shop Order Data Types with Components
The Customer «unique index» symbol shows that only the Customer ID is
included. This use of projection of a record collection expresses physical data
copying. Fortunately, the DBMS ensures that the copied data in the index
always stays in sync with the original data in the document collection. The
non-dashed arrow pointing to the Customer «document collection» indicates that
each record in the index has a one-to-one physical relationship to a record in the
document collection. If an index were non-unique, then it would have a plus sign
at the document collection end of the arrow, showing that a single index entry
might reference multiple records, but would always reference at least one.
Most, but not all, of the logical record collections and types are represented by
document collections. We intend each coffee shop to have its own copy of the
database, and we don’t see the need to store data about all the other coffee shops
in each coffee shop’s database, so there’s no document collection representing
the coffee shop record collection. The Order Item Type is aggregated into the
Order record collection, so it doesn’t need a separate representation by its own
document collection. All the rest of our logical record collections are represented
by document collections.
The unique indexes on the primary keys of the document collections enable fast
lookup of customers, employees, and products, which is all we need for fast
order entry in the shop. But when we want to analyze this data later, we will
need much more flexible navigation through the data. Figure 18-6 shows a data
warehouse designed to represent the same logical data with an entirely different
physical structure. This is a dimensional warehouse design to be implemented in
a SQL database. In this design, the Order Item «SQL Table» is the central fact
table, and there are separate tables for the Customer, Employee, Product, Order,
and Coffee Shop dimensions. Unlike our operational database, Coffee Shop data
is now represented, as we’re collecting data in our warehouse from all of our
coffee shops.
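A rough sketch of such a dimensional design in SQL DDL follows; the column names are assumptions, and the dimension tables for Customer, Employee, Product, Order, and Coffee Shop are omitted for brevity.

    -- The central fact table; each column ending in _ID would be a foreign
    -- key to the corresponding dimension table (not shown here).
    CREATE TABLE Order_Item_Fact (
        Coffee_Shop_ID INTEGER      NOT NULL,
        Order_ID       INTEGER      NOT NULL,
        Customer_ID    INTEGER      NOT NULL,
        Employee_ID    INTEGER      NOT NULL,
        Product_ID     INTEGER      NOT NULL,
        Quantity       INTEGER      NOT NULL,
        Item_Price     DECIMAL(9,2) NOT NULL
    );

    -- The kind of cross-shop analysis the warehouse exists to answer:
    -- units sold of each product, by coffee shop.
    SELECT Coffee_Shop_ID, Product_ID, SUM(Quantity) AS Units_Sold
    FROM   Order_Item_Fact
    GROUP BY Coffee_Shop_ID, Product_ID;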
Following the physical data modeling pattern presented for the coffee shop
operational database design, we can use this diagram to confirm that we’ve
represented multiplicities correctly, then create another diagram that drops the
logical data and expands the tables to show their components. We can show
unique and non-unique indexes on the tables. Following traditional techniques
for dimensional data warehouse modeling, we can add additional fact tables and
additional dimensions.
Using the expressive Concept and Object Modeling Notation, we can model the
types of entities—types of concepts and types of objects—that are present in a
problem domain, model at a logical level the data we will use to identify and
describe those types of entities, and model at a physical level how we will
arrange our representations of that data.
Figure 18-6. The Order Item Fact Table in the Coffee Shop Data Warehouse
We can model how the data will represent the real-world entities in the problem
domain, and confirm at every stage of modeling that we have preserved the same
relationships in the data as exist in the real-world entities. We can take the
physical model to a level of detail where implementation in a NoSQL database
and/or a SQL database can be mechanically derived from the physical model, a
derivation that could be fully automated. COMN enables model-
driven development from the identification of problem-space entity types all the
way through to multiple physical implementations of the same data.
APPENDIX
COMN Quick Reference
A complete reference for the Concept and Object Modeling Notation can be found
at the author’s Web site at http://www.tewdur.com/. This appendix provides a
quick reference for use with this book.
Figure A-1 shows the hexagons, rectangles, and rounded rectangles used by
COMN to represent entities, types of entities, and the states or values of entities.
It also illustrates the meaning of the four different types of outline. A shadow on
a shape represents a collection of that which is represented by the shape. The
symbols as shown represent composite entities. Add crossed lines through a
shape to indicate that it is simple, representing entities that have no components.
Figure A-2 shows the kinds of relationship lines that do not express composition,
and Figure A-3 shows the kinds of relationship lines that express composition.
Arrowheads on the lines indicate the direction of reference (and not data flow).
Arrows that are part of the label text indicate reading direction. Labels on the
lines having arrowheads are unnecessary, as such lines only ever have a single
meaning. However, labels are strongly recommended to provide the names and
readings of relationships, and to identify the roles played by participants in the
relationships. Labels on unadorned lines are always necessary unless the lines
are between dissimilar polygons, in which case they have only one meaning.
Line weights (normal, bold) and style (solid, dashed) indicate whether the
relationship is in the computer (normal weight) or the real world (bold weight)
and whether the relationship is physical (solid line) or conceptual (dashed line).
Figure A-4 shows the pentagon symbol for restriction (that is, subtyping) and the
triangle for extension. Extension can only apply to composite base types or
classes. The relationships may be read in either direction. The labels on the lines
are unnecessary, as the meaning of the lines is fixed when connecting these
kinds of symbols. An arrowhead may be placed on a line to indicate direction of
reference. An X in the center of the pentagon or triangle indicates exclusivity.
Restriction and extension also apply to variables and objects.
Figure A-1. COMN Polygons
Figure A-5 shows the symbols for composite types, variables, and concepts,
when it is desired to list their components. The outlines of these symbols may be
varied to depict real-world (bold) and/or physical (solid) entity types, entities,
and concepts or states. If a type or class does not encapsulate its components,
then the bottom section for methods is crossed out.
Figure A-5. Symbols for Composite Entities Showing Components
Glossary
Definitions marked (Merriam-Webster) are taken from Merriam-Webster’s Online Dictionary at
http://www.merriam-webster.com/dictionary/.
aggregation : combining two or more objects in such a way that they retain their integrity, but it is difficult or impossible to separate them again (like a layer cake)
analytics : information derived from other information or data
array : a collection of some integral number of variables or objects of the same type or class
assembly : combining two or more objects in such a way that they retain their integrity, and it is relatively easy to separate them again (like an engine)
attribute : an inherent characteristic (Merriam-Webster); see also data attribute
blending : combining two or more objects in such a way that they lose their integrity (like eggs, flour, milk, and sugar in a cake)
class : a description of the structural and/or behavioral characteristics of potential or actual objects
collection : a set of objects having a single owner
component : a constituent part (Merriam-Webster)
composite : made up of distinct parts (Merriam-Webster)
composite type : a type that designates a set whose members have components
computer object : a stateful material object whose state can be read and/or modified by the execution of computer instructions
concept : something conceived in the mind : thought, notion (Merriam-Webster)
conceptual : relating to a concept or concepts
container : an object that can contain other objects (like an egg carton)
contents : the objects inside a container (like the eggs in an egg carton)
data : plural of datum
data attribute : a <name, type> pair. The name gives the role of a value of the given type in the context of a tuple scheme, relation scheme, or composite type. See also attribute.
data attribute value : a <name, type, value> triple
data independence : the ability to re-order data without losing any information
datum : that which is intended to be given to a predicate as a value for one of its variables
encapsulate : to authorize only a certain set of routines (called methods of the class) to operate on the components of objects of a class
entity : something that has separate and distinct existence and objective or conceptual reality (Merriam-Webster)
extending class (or type) : a class (or type) that is defined in terms of another class (or type), called a base class (or type), by adding components and/or methods to what are already available in the base class (or type)
extension : the addition of components to a base class or type; its inverse is projection
fact : a proposition that is true or believed to be true
fact type : see relationship type
hardware object : a computer object which is part of the physical composition of a computer
identifier : any value that represents exactly one member of a designated set
inclusion relationship : a relationship between types, where the supertype is defined as a union of its subtypes; inverse of restriction relationship
information : a collection of propositions
insight : information derived from analytics
juxtaposition : arranging objects in a fixed spatial relationship without connecting them (like a place setting)
logical predicate : a statement containing variables which, when the variables are bound, yields a proposition
logical record type : a composite type that is intended to be used as the type of data records stored singly or in a collection of records
measure : a composite type consisting of a number and a type of thing being measured or counted
method : a routine authorized to operate on the components of software objects of the class of which it is a part
object : something material that may be perceived by the senses (Merriam-Webster)
predicate : short for logical predicate
projection : the removal of components from a base class or type; its inverse is extension
proposition : an expression in language or signs of something that can be believed, doubted, or denied or is either true or false (Merriam-Webster)
relation : a relation value
relation scheme : the specification of the data attributes of a relation, together with other information such as keys and other constraints on a relation value as a whole
relation value : a set of tuple values all having the same tuple scheme; informally, a table without significance to the order of or repetition of the values of its rows
relation variable : a symbol which can be bound to a relation value
relationship : a proposition concerning two or more entities
relationship type : a logical predicate
restriction relationship : a relationship between two types, where one type, called the subtype, is defined in terms of a restriction on members of the set designated by the other type, called the supertype; inverse of inclusion relationship
semi-structured data : collections of data items stored in a way that supports but does not enforce a structure
simple type : a type that designates a set whose members have no components
software object : an object composed of hardware objects and/or other software objects by exclusively authorizing only certain routines to access the component objects
state : the physical condition of an object
stateful : having more than one state
stateless : having only one state
structured data : collections of data items stored in a database that imposes a strict structure on that data
subtype : something that designates a subset of the set designated by another type, called the supertype
tuple : a tuple value
tuple scheme : the specification of the data attributes of a tuple, together with any constraints referencing a tuple value as a whole
tuple value : a set of data attribute values
tuple variable : a symbol which can be bound to a tuple value
type : something that designates a set
unstructured data : data representing text, audio, video or other data which have no structure imposed on what they represent
value : a concept that is fully specified by a symbol for the concept; also, a symbol for such a concept
Photo and Illustration Credits
p. 2, oil refinery: Philadelphia Energy Solutions
http://pes-companies.com/wp-content/themes/responsive/_images/home_page_hero.jpg
p. 7, Figure 1, Defects per Object: courtesy of Ron Huizenga
p. 17, Figure 2-1, elementary particles: https://commons.wikimedia.org/wiki/File:Standard_Model_of_Elementary_Particles.svg
p. 30, Russian matryoshka dolls: https://commons.wikimedia.org/wiki/File:Russian-Matroshka2.jpg
p. 31, engine: https://commons.wikimedia.org/wiki/File:Mercedes_V6_DTM_Rennmotor_1996.jpg