Guest Editor's Introduction
E. H. SIBLEY
Department of Information Systems Management, University of Maryland, College Park, Maryland 20742,
and National Bureau of Standards, Washington, D.C. 20284
I should like to thank Elliott Organick for inviting me to serve as Guest Editor for this issue of COMPUTING SURVEYS, which has become our industry's most effective educational journal. I believe this issue deals with the most important topic in computing today: data-base technology.

Here we attempt an integrated approach to a very disoriented field. Problems abound: differences in terminology, differences in modeling, and differences in implementation confuse the potential user, who is faced with almost unanswerable questions, such as:
• Should I wait for the dust to settle, or start to use data-base technology now?; and
• If I go data-base, which type of system do I choose?

Such problems are found in any evolving technology, especially when it is associated with a fast developing industry, such as computing; while some problems appear more philosophical than real, others arise from a poor understanding of new concepts. Here we shall try to answer some of the questions and reduce the confusion.

But obviously, one issue of COMPUTING SURVEYS cannot be all encompassing; data technology is a field which already boasts hundreds of articles, and textbooks by the dozen. Thus, this issue confines itself to an explanation of various models of data-base systems, showing their differences and similarities, while trying to relate the models to their implementation in current commercial and experimental systems.

We were faced with one major problem in trying to provide an integrated issue: every model uses its own terminology. We therefore decided to attempt to use both a common terminology and a single example wherever possible. This is by no means a simple task; we must not only define terminology and apply it to descriptions of various models, but also show how these terms differ from those used by others in discussing the same ideas. There is no standard, and we are forced to be arbitrary.

It was not possible to discuss every model or implementation in one issue; in fact, it is difficult to deal with anything but basic concepts. We admit to errors and omissions, and apologize for them; a real attempt was made to solicit aid from a wide variety of experts. We issue them a blanket vote of thanks and apologize for inadvertent omissions.

OUTLINE OF THE ISSUE

Figure 1 provides a graphic overview of this issue for readers. The first article, by Fry and Sibley, could be called an "entry" to the issue. Dependent on this are two articles, essentially at the same "level"; one by Chamberlin, the other by Taylor and Frank. They discuss two different and independent approaches. The article by Tsichritzis and Lochovsky, using the common terms and example of the first paper and discussing
Copyright © 1976, Association for Computing Machinery, Inc. General permission to republish,
but not for profit, all or part of this material is granted provided that ACM's copyright notice is
given and that reference is made to the publication, to its date of issue, and to the fact that reprinting
privileges were granted by permission of the Association for Computing Machinery.
differences between the hierarchic and other approaches, depends on all three previous papers. Finally, the paper by Michaels, Mittman, and Carlson compares the approaches described in the second and third papers; it is consequently dependent on both of these.

PRESSURES TOWARD DATA-BASE TECHNOLOGY

Data-base technology is one of the most rapidly growing areas of computer and information science. In less than twenty years, with the greatest part of the development in the past eight years, data-base systems have come from nothing to be a major topic of current interest. Top management of major corporations have grown to appreciate the importance of their data bases; government regulatory agencies are already worrying about the implementation of privacy and freedom of information acts and their relation to data banks. This proliferation of interests started somewhere; thus, our first question is:
• Why did it all happen?

The pressures of the "Computer Age" arise from a fascinating new technology supplied with inexpensive equipment. The efficient, fast, accurate, and economical way a computer can perform numeric and logical operations (compared to the slow, inaccurate human counterpart) has forced automation of many previously manual operations in universities, government, and the private sector. In the early days of computers, automation merely entailed conversion from manual operations, with little attempt to integrate any resulting system. The typical data-processing operation of the fifties and sixties was created in this manner. Data* was used in the same way as before; the inefficient duplication of data and effort was continued with duplication of computerized data and procedures. The operation seemed successful: it reduced overall cost, and therefore little thought was given to further improvement of the system. In time, astute data processing managers recognized a problem: while stored within the data vaults of the organization (probably in the form of bits on a tape), data was essentially unavailable. The programs that input, stored, and used the data were essentially the owners of the data. Any other user found it difficult to obtain, integrate, or transform the "available" data for use in another program. Thus, every new need for data involved writing a new program to obtain the data before it could be processed by yet another program, and even this was difficult--the data formats were "locked" in the original programs, and sometimes the original object code had been lost!

* In COMPUTING SURVEYS, the word data is used as a collective noun. Although "datum" may be correct for hard-line grammarians, it will not be found here.

This essential unavailability of otherwise transferable data gave rise to the question:
• Why not integrate the data?

This led to the thought that integration, were it possible, could be achieved by defining the data format, storing it as a "data definition," and allowing general-purpose "data-base management" software to access it. And this gave rise to the prime concept of the generalized data-base management system.

In dealing with a system to store and access data for a set of different programs, and consequently for a set of different types of users, two further questions arose. First:
• Can we access this data through our current computer languages?

This involves either specification of additional commands in conventional programming languages, or the provision of calls to special subroutines which allow access to the data base. And second:
• Why not allow a higher-level language for ad hoc use of the data base?

This can be achieved by providing a special query language as an interface. While inefficient, the first prototype systems appeared very successful to users, and the first commercial systems started to appear. Then the first problems arose.

DISADVANTAGES OF DATA INTEGRATION

The industry congratulated itself on reducing data redundancy and improving its availability, but it also introduced the potential for disaster. The first problem with integration arises because the data base is now more vulnerable to destruction through machine malfunction, personal error, or deliberate human tampering. The loss of "quality" in a data base (including total destruction) by any of these means may be considered a threat to the organization, because data is one of its most valuable assets. "Integrity" techniques are therefore a necessity.

The other threat to the integrated data base relates to its "security" and accuracy. Most enterprises have some secret processes or private material which should be protected against theft or access by unauthorized people. To achieve this protection, the system must ensure that it is secure, i.e., that any information which is private to the organization is safe from unauthorized dissemination or tampering. The problem with erroneous information is that it might result in an incorrect (adverse) decision about a person or enterprise. Government and other enterprises are vitally interested in both of these aspects, which have been termed "information privacy." Traditionally, privacy has been defined as the right of an individual or organization to be "left alone"--it usually is considered the right to retain certain non-public information without threat of disclosure. An integrated data base threatens privacy: it becomes easier to collect, and to unwittingly divulge, information of a confidential nature (which may have been legally obtained from the individual or enterprise) to some other unauthorized person or agency. Today, information privacy has been expanded to include the right of the individual or organization to know what "personal" information is retained on any data base. It also implies that the individual or enterprise has the right to challenge the data, causing either its correction, or the addition of a statement that the fact is under dispute. As an example: a person may wish to ensure that data given
confidentially to an agency for the purpose of obtaining a credit card is not divulged capriciously to a neighbor; the person may also wish to know what other information is on file at a credit bureau, and be able to correct or dispute any potentially threatening fact.

When data is integrated and readily available through either program or ad hoc query interfaces to a community of users, the possibility of loss of integrity, and the ability to penetrate the security (thereby threatening privacy) obviously increases. Early data-base management systems often needed to be retrofitted to ensure reasonable integrity and security.

The placing of controls within the system brought another new issue into view--the question of who was to make the policy decisions, who would issue passwords and who was really allowed to make decisions on data formats. A new post was needed.

THE DATA ADMINISTRATION CONCEPT

Data administration is really a special form of managerial control which includes both authority over data integrity and security, and responsibility for overall efficiency. Because the data base for the enterprise was growing in total size and complexity, new measures for improving efficiency were possible. The question arose:
• Should programmers be allowed to define their own data?

The implementation of central authority for data definition made it impossible to allow the programmer this flexibility. By introducing a single data authority, and by providing information about the community of users, the "best" data structure could be defined: this data structure is efficient for the community of users rather than for any one particular user.

As a consequence of the advent of a new technology and new managerial control mechanisms, the computing operation started to take on a new importance. Government and business agencies found themselves with expensive data-processing operations which were being run by relatively low-level management. Some commercial organizations found that equipment requiring presidential (or even board) approval was being run by managers at the fourth or fifth level of pecking order. This had distorted the organization, and it has already led to organizational changes: in fact, we now see "Vice President of Information Systems"--a far cry from the lowly "Data Processing Operations Manager" of yesterday.

DATA BASE VERSUS DATA PROCESSING

We have seen that data which had previously been duplicated and spread throughout the organization is now being drawn into a unified and sometimes monolithic system. This represents the concentration of a valuable asset. In the past, the only major identifiable, tangible asset of a corporation was its money, or objects immediately convertible to money. Accountants had learned how to audit and control the flow of money. The advent of data management suggested the possibility of the audit and control of data.

In exactly the same way that an accountant determines the accuracy of money flow, a data-base management auditor could determine accuracy, quality, and privacy aspects of the data. This gives rise to the concept of data auditing. New and pending state and federal government regulations on privacy and freedom of information would make the enterprise legally accountable for its data. Consequently, the matter of ad hoc use and poor control over the application and dissemination of information is suddenly a real concern. Someday soon a high-level administrator will be sentenced (fined and maybe even given a prison term) for contravening these regulations--and then the entire industry will tighten its controls--almost overnight. Thus three factors have combined to imply a need for more effective and automated control mechanisms to be built into the automated systems, with reasonable safeguards against unauthorized access. The three motivating factors are:
• the regulations of government to insure privacy;
• the aspirations of management for more effective control over its operations; and
• the understanding of auditors of the need to retain quality in all financial and nonfinancial data.

The data-base management system today
The essential concepts of the relational data model are defined, and normalization,
relational languages based on the model, as well as advantages and
implementations of relational systems are discussed.
Keywords and Phrases: Data base, data-base management, data independence,
data model, relational systems
CR Categories: 8.5I, ~.3~,~.~
44 • Donald D. Chamberlin
An excellent introduction to relational concepts can also be found in Date's recent textbook [Z11].

In mathematics, the term relation may be defined as follows: Given sets D1, D2, ..., Dn (not necessarily distinct), a relation R is a set of n-tuples each of which has its first element from D1, second element from D2, etc. The sets Di are called domains. The number n is called the degree of R, and the number of tuples in R is called its cardinality.

It is customary (though not essential) when discussing relations to represent a relation as a table in which each row represents a tuple. An example of this representation is shown in Figure 1, which illustrates a relation describing Presidential elections. In the tabular representation of a relation, the following properties, which derive from the definition of a relation, should be observed:
1) no two rows are identical;
2) the ordering of rows is not significant; and
3) the ordering of columns is significant (i.e., the meanings of the tuples (1972, Nixon, McGovern) and (1972, McGovern, Nixon) are quite different).

When a relation is represented as a table, its degree is the number of columns and its cardinality is the number of rows.

In the tabular representation of a relation, it is customary to name the table and to name each column, as shown in Figure 1. The columns of the table are called attributes. (Sometimes the name of a column is referred to as a role name.) It is important to distinguish between attributes and domains. For example, in the ELECTIONS relation of Figure 1, the second and third columns are both based on the same domain: the set of names of Presidential candidates. However, each column has a different role-name to describe its meaning in this particular relation: WINNER-NAME and LOSER-NAME.

The individual entries in each tuple are called its components. Thus, we may say that in the tuple whose YEAR-component is "1952," the LOSER-NAME-component is "Stevenson."

A column or set of columns whose values uniquely identify a row of a relation is called a candidate key (often shortened to simply key) of the relation. In Figure 1, YEAR is a key for ELECTIONS since no two rows have the same YEAR. It is possible for a relation to have more than one key. For example, if the ELECTIONS relation had an additional column ADMINISTRATION-NUMBER, it would also be a key. When a relation has more than one key, it is customary to designate one as the primary key.

Often a column or set of columns in one relation will correspond to a key of another relation. For example, consider the PRESIDENTS relation of Figure 2, whose key is NAME. The values of WINNER-NAME in the ELECTIONS relation correspond to values of the key-column NAME in PRESIDENTS. Consequently, WINNER-NAME in ELECTIONS is called a foreign key. Two facts should be noted: 1) a foreign key need not be (and often is not) a key of its own relation; and 2) the foreign key need not have the same role-name (e.g., WINNER-NAME) as the corresponding key in the other relation (e.g., NAME).

In an integrated data-base management system, different users may have a need to
see different subsets of the universe of data. The term data model denotes the universe of data--the complete set of relations stored in the system. A schema is a set of declarations which describe the data model. The term data submodel denotes the set of relations which is available to a particular user, and a subschema is a set of declarations for the data submodel. A complete data-management system must provide a means for defining the schema and a subschema for each distinct class of users of the system.

NORMALIZATION

The issue of designing a schema and subschemas for a data base leads us to a discussion of normalization. The concept of normalization was introduced by Codd in [M2] and dealt with more rigorously in his later papers [N1] and [N2]. A number of other authors have also made contributions to the theory of normalization (see bibliography).

Normalization theory begins with the observation that certain collections of relations have better properties in an updating environment than do other collections of relations containing the same data. The theory then provides a rigorous discipline for the design of relations which have favorable update properties. The theory is based on a series of normal forms--first, second, and third normal form--which provide successive improvements in the update properties of a data base. We will discuss these normal forms on an intuitive basis; for a thorough treatment, see [N1], [N8], or [Z11].

Almost all references to relations implicitly deal with relations in first normal form. A relation in first normal form is a relation in which each component of each tuple is nondecomposable; i.e., the component is not a list or a relation. Relations in first normal form are sometimes called "flat tables". If we look carefully at the relation in Figure 1, we see that it is not in first normal form. This is because an election, while it has only one winner, may have several losing candidates. Thus, for example, the tuple for the election of 1968 contains the component {"Humphrey", "Wallace"}. In fact, the LOSER-NAME component of each election tuple is a list whose length depends on the number of votes a candidate must receive to merit inclusion in the data base.

We can convert the ELECTIONS relation into first normal form by breaking it up into two relations, one containing information on winning candidates and the other on losing candidates. This also gives us a good opportunity to record other attributes of interest about the candidates, such as their party and number of votes received. This leads us to the data base shown in Figure 3, which is in first normal form. The key of ELECTIONS-WON is YEAR; the key of ELECTIONS-LOST is (YEAR, LOSER-NAME).

To illustrate the advantages of the higher normal forms, we need to make updates to the data base by inserting new tuples, deleting existing tuples, and making changes to existing tuples. These updates are not particularly well motivated for our example data base, in which data is mostly static and unchanging. Of course, in an operational data base describing, for example, the inventory of a store, updates would be very frequent. For the sake of consistency, we will continue with our Presidential example. (You may imagine that some data was found to be in error and is being updated to correct the data base.)

Relations in first normal form may be used with any of the relational languages which are described in the next section.
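The conversion to first normal form and the notion of a key can be sketched in a few lines of Python. This is only an illustrative sketch: the rows are a small hand-written sample rather than the contents of Figure 1, and the function names `to_first_normal_form` and `is_key` are hypothetical helpers introduced here.

```python
# Illustrative sample rows (not the contents of Figure 1).
# LOSER-NAME is list-valued, so this relation is NOT in first normal form.
elections = [
    {"YEAR": 1964, "WINNER-NAME": "Johnson", "LOSER-NAME": ["Goldwater"]},
    {"YEAR": 1968, "WINNER-NAME": "Nixon", "LOSER-NAME": ["Humphrey", "Wallace"]},
    {"YEAR": 1972, "WINNER-NAME": "Nixon", "LOSER-NAME": ["McGovern"]},
]


def to_first_normal_form(rows):
    """Split the list-valued LOSER-NAME into two flat relations.

    Each relation is modelled as a set of tuples, so duplicate rows are
    impossible and row order carries no meaning.
    """
    won = {(r["YEAR"], r["WINNER-NAME"]) for r in rows}
    lost = {(r["YEAR"], loser) for r in rows for loser in r["LOSER-NAME"]}
    return won, lost


def is_key(relation, positions):
    """True if the columns at `positions` uniquely identify every tuple."""
    projected = [tuple(t[i] for i in positions) for t in relation]
    return len(projected) == len(set(projected))


elections_won, elections_lost = to_first_normal_form(elections)

degree = len(next(iter(elections_lost)))   # number of columns
cardinality = len(elections_lost)          # number of rows

print(is_key(elections_won, [0]))      # YEAR alone identifies a winning row
print(is_key(elections_lost, [0]))     # YEAR alone does not (1968 appears twice)
print(is_key(elections_lost, [0, 1]))  # (YEAR, LOSER-NAME) does
```

Using Python sets for the flat relations mirrors the properties listed earlier: no two rows are identical and row ordering is not significant, while the position of a value within a tuple still matters.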
However, a relation in first normal form may exhibit three kinds of misbehavior, which are called update anomalies, insertion anomalies, and deletion anomalies. All these anomalies arise because more than one "concept" may be mixed together in the same tuple. Consider the ELECTIONS-WON relation of Figure 3. Mixed together in one tuple of this relation are facts about candidates (e.g., "Eisenhower came from Texas") and facts about elections (e.g., "In 1952 Eisenhower received 442 electoral votes"). In some applications it may be important that each of these facts be independently updated, inserted, and deleted. This gives rise to the three anomalies, which we can now illustrate by the following examples.

Update anomalies: Suppose the fact that "Eisenhower's home state is Texas" is found to be in error, and his home state must be changed to Nebraska. Since Eisenhower appears in more than one tuple of ELECTIONS-WON, this erroneous fact may be represented many times (in general, a time-varying number of times). This makes it difficult to update this particular fact, since all tuples where it is represented must be searched out and updated. Even worse, it leads to the possibility that different tuples may contain inconsistent values of HOME-STATE for the same President.

Insertion anomalies: Suppose we wish to insert a fact about a candidate which is independent of any election, e.g., "Dewey was a Republican." This is difficult in our example data base because there is no relation for candidates. We are forced to invent a tuple in ELECTIONS-LOST (or ELECTIONS-WON?) having null values for YEAR and the other irrelevant attributes. In many systems we would be unable to store this fact because null values are not permitted in the primary key.

Deletion anomalies: Suppose we wish to delete the information about elections as they fall beyond a certain number of years in the past. When we delete the 1952-tuple from ELECTIONS-WON, we still retain the fact that Eisenhower was a Republican. But when we delete the 1956-tuple, all facts about Eisenhower are lost. In some applications, this might have very serious consequences. For example, consider a relation describing orders for various items, shown in Figure 4. As orders are filled we delete their tuples from the relation. When we have deleted the last order for toasters,
[Figure 3 (fragment): the ELECTIONS-LOST relation, with columns YEAR, LOSER-NAME, LOSER-VOTES, and PARTY.]
[Figure 4 (fragment): the ORDERS relation, with columns ITEM, PRICE, DATE, and QUANTITY-ORDERED.]
we find we no longer have any information about the price of toasters--possibly an unintended result. This kind of relation burdens the user with the responsibility of making sure that the tuple he deletes is not the last tuple of some "category" (e.g., toasters), and therefore the sole bearer of information about that category (e.g., price).

An important objective of normalization is the elimination of the update, insertion, and deletion anomalies. The most widely-known result of normalization theory is third normal form. Since second normal form is of little significance except as a stopping-off place on the way to third, we will proceed directly to the definition of third normal form.

In order to understand how third normal form avoids the three anomalies, we must discuss the concept of functional dependence among the attributes of a relation. We say that an attribute B of relation R is functionally dependent on attribute A if, at every instant of time, each A-value in R is associated with only one B-value. We express this relationship by the notation A → B, and say "A determines B" or "B depends on A." Similarly, a set of attributes in R may be functionally dependent on another attribute or set of attributes. The attribute (or set of attributes) on the left side of the arrow (A in our example) is called the determinant.

Clearly, from our definition of key in the previous section, every relation contains at least one functional dependence: all attributes of the relation are dependent on the key. (The dependence may be trivial if the relation contains only a key.) If a relation has more than one key, then all its attributes are dependent on each key.

Third normal form has been defined in a variety of ways. The original definition was given by Boyce and Codd in [N1]. Later writers, including Kent [N8], Codd [M14], and Sharman [N15], proposed alternate definitions which framed the same concept in simpler terminology. We present two of these equivalent definitions:

Definition, Boyce and Codd [M14]:
A relation R is in third normal form if it is in first normal form and, for every attribute collection C of R, if any attribute not in C is functionally dependent on C, then all attributes in R are functionally dependent on C.

Definition, Sharman [N15]:
A relation is in third normal form if every determinant is a key.

Both definitions are formal ways of expressing a very simple idea: that each relation should describe a single "concept," and if more than one "concept" is found in a relation, the relation should be split into smaller relations. The result of applying this "splitting" process to the sample data base of Figure 3 is shown in Figure 5. A moment's examination will show that the update, insertion, and deletion anomalies we discussed are not present in the data base of Figure 5.

The design of a data base in third normal form depends on knowledge of the functional dependencies among the attributes of the data. This knowledge cannot be discovered automatically by a system (unless the data base is completely static), but must be furnished by a data-base designer who understands the semantics of the information. In fact, there is not a unique third normal form representation for a given data base. In [N1] Codd briefly addressed the problem of choosing an "Optimal Third Normal Form" from among the various alternatives.
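Functional dependence and Sharman's "every determinant is a key" test can be illustrated with a small Python sketch. The function `functionally_determines` is a hypothetical helper, and the three rows are an assumed sample of a first-normal-form ELECTIONS-WON relation, not the data of Figure 3. Note that the check only inspects one snapshot of the data: it can refute a dependence but cannot establish one, which is why the designer must supply the dependencies.

```python
def functionally_determines(rows, lhs, rhs):
    """Check whether lhs -> rhs holds in this snapshot of `rows`.

    `rows` is a list of dicts; `lhs` and `rhs` are lists of attribute names.
    Returns False as soon as one lhs-value is paired with two rhs-values.
    """
    seen = {}
    for row in rows:
        a = tuple(row[c] for c in lhs)
        b = tuple(row[c] for c in rhs)
        if seen.setdefault(a, b) != b:
            return False
    return True


# Assumed sample rows for a first-normal-form ELECTIONS-WON relation.
elections_won = [
    {"YEAR": 1952, "WINNER-NAME": "Eisenhower", "PARTY": "Republican"},
    {"YEAR": 1956, "WINNER-NAME": "Eisenhower", "PARTY": "Republican"},
    {"YEAR": 1960, "WINNER-NAME": "Kennedy", "PARTY": "Democrat"},
]

# YEAR determines every other attribute, so YEAR behaves as a key here.
year_is_key = functionally_determines(
    elections_won, ["YEAR"], ["WINNER-NAME", "PARTY"])

# WINNER-NAME determines PARTY, so WINNER-NAME is a determinant ...
name_is_determinant = functionally_determines(
    elections_won, ["WINNER-NAME"], ["PARTY"])

# ... but it does not determine YEAR, so it is not a key.
name_is_key = functionally_determines(
    elections_won, ["WINNER-NAME"], ["YEAR", "PARTY"])

print(year_is_key, name_is_determinant, name_is_key)
```

Under Sharman's definition, a determinant (WINNER-NAME) that is not a key signals that the relation mixes two "concepts" and is a candidate for splitting.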
1952    Stevenson    89
1956    Stevenson    73
1960    Nixon        219
1964    Goldwater    52
1968    Humphrey     191
1968    Wallace      46
1972    McGovern     17

Stevenson    Democrat
Nixon        Republican
Goldwater    Republican
Humphrey     Democrat
Wallace      Am. Indep.
McGovern     Democrat

FIGURE 5. Data base in third normal form.
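The effect of the "splitting" process on the deletion anomaly can be sketched as follows. The sample rows, the two-way split into an elections relation and a candidates relation, and the helper names `split` and `delete_elections_before` are assumptions made for illustration; they are not taken from Figure 5.

```python
# Assumed first-normal-form relation mixing two "concepts":
# facts about elections (YEAR, WINNER-NAME) and facts about people (PARTY).
first_nf = [
    (1952, "Eisenhower", "Republican"),
    (1956, "Eisenhower", "Republican"),
    (1960, "Kennedy", "Democrat"),
]


def split(relation):
    """Split into one relation per concept (each a set of tuples)."""
    elections_won = {(year, name) for year, name, _party in relation}
    presidents = {(name, party) for _year, name, party in relation}
    return elections_won, presidents


def delete_elections_before(elections, cutoff_year):
    """Drop election tuples older than `cutoff_year`."""
    return {t for t in elections if t[0] >= cutoff_year}


# Unsplit relation: deleting old elections also deletes Eisenhower's party.
remaining = [t for t in first_nf if t[0] >= 1960]
lost_in_unsplit = not any(t[1] == "Eisenhower" for t in remaining)

# Split relations: the same deletion touches only the elections relation.
elections_won, presidents = split(first_nf)
elections_won = delete_elections_before(elections_won, 1960)
kept_in_split = ("Eisenhower", "Republican") in presidents

print(lost_in_unsplit, kept_in_split)
```

Because the fact "Eisenhower was a Republican" is stored exactly once in the split design, the same layout also removes the update anomaly: a correction changes a single tuple.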
tots can serve both as a data sublanguage and as a query language.

This section will explore the approach taken by various relational languages to providing facilities for query, data manipulation (e.g., insertion, deletion, and update of tuples), data definition (e.g., creation of new relations and other structures), and data control (e.g., authorization and control of data integrity). We will then briefly consider some ways in which languages can be evaluated and compared, and discuss the role of natural language as a data-base interface.

Query Facilities

Query, or retrieval of information from the data base, is perhaps the aspect of relational languages which has received the most attention. We will illustrate the variety of approaches to query by presenting examples of four classes of languages: relational calculus, relational algebra, mapping-oriented languages, and graphics-oriented languages. Although we deal only with query facilities in this section, all the languages discussed have facilities for update and other operations in addition to query.

Relational Calculus

Codd's 1970 paper [M2] laid the groundwork for two families of relational languages which came to be called the relational calculus and the relational algebra. The relational calculus family grew from the observation that a first-order applied predicate calculus can be used as a data sublanguage for normalized relations. In [L1] Codd presented the details of such a calculus-based sublanguage, called ALPHA.

A typical query in ALPHA has two parts: a target, which specifies the particular attributes of the particular relation which are to be returned, and a qualification, which selects particular tuples from the target relation by giving a condition which they must satisfy. We will illustrate ALPHA (and other languages) by some sample queries based on the data base of Figure 5.

In Q1) below, the RANGE statement declares P to be a variable ranging over the rows of the PRESIDENTS relation. The next statement retrieves into workspace W the HOME-STATE of row P whenever the NAME of row P is "KENNEDY."

The qualification part of an ALPHA query may be quite complex and may use the universal and existential quantifiers: "for all" (∀), and "there exists" (∃). For example, see display Q2) below.

Various other languages based, like ALPHA, on the relational calculus, have been proposed. This class of languages includes QUEL [S15], COLARD [L3], and RIL [L7].

Relational Algebra

A second major class of languages is based on the relational algebra, which was introduced by Codd in [M2] and refined in [M3]. The relational algebra is a collection of operators that deal with whole relations, yielding new relations as a result. The major operators of relational algebra include the following:
• Projection: The projection operator returns only the specified columns of the given relation, and eliminates duplicates from the result. For example, to find all the unique (party, home-state) pairs in the PRESIDENTS relation,
Q2) List the election years in which a Republican from Illinois was elected.
RANGE PRESIDENTS P
RANGE ELECTIONS-WON E
GET W E.YEAR: ∃P (P.NAME = E.WINNER-NAME &
P.PARTY = 'REPUBLICAN' & P.HOME-STATE = 'ILLINOIS').
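For readers more familiar with a general-purpose language, the target and qualification of Q2) can be mimicked in Python, with `any(...)` playing the role of the existential quantifier. This is a rough sketch, not ALPHA semantics in full; the function name `q2` and the sample rows are assumptions chosen for illustration and are not the contents of Figure 5.

```python
# Assumed sample data; attribute names follow the query display above.
presidents = [
    {"NAME": "Lincoln", "PARTY": "REPUBLICAN", "HOME-STATE": "ILLINOIS"},
    {"NAME": "Kennedy", "PARTY": "DEMOCRAT", "HOME-STATE": "MASSACHUSETTS"},
]
elections_won = [
    {"YEAR": 1860, "WINNER-NAME": "Lincoln"},
    {"YEAR": 1864, "WINNER-NAME": "Lincoln"},
    {"YEAR": 1960, "WINNER-NAME": "Kennedy"},
]


def q2(presidents, elections_won):
    """Return E.YEAR for each election row E such that there exists a
    presidents row P matching the winner, the party, and the home state."""
    return sorted(
        e["YEAR"]                          # target: E.YEAR
        for e in elections_won             # RANGE ELECTIONS-WON E
        if any(                            # "there exists" P in PRESIDENTS
            p["NAME"] == e["WINNER-NAME"]
            and p["PARTY"] == "REPUBLICAN"
            and p["HOME-STATE"] == "ILLINOIS"
            for p in presidents
        )
    )


workspace = q2(presidents, elections_won)
print(workspace)
```

The generator expression states which rows qualify without prescribing an access path, which is the nonprocedural flavor the calculus-based languages aim for.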
ment in SEQUEL [L10] has the effect of giving a 10% raise to all programmers:

    UPDATE EMP
    SET SALARY = SALARY*1.1
    WHERE JOB = 'PROGRAMMER'

All the languages we have discussed so far have been high level and nonprocedural in nature. Indeed, one of the advantages of the relational model is that it is readily compatible with high-level languages. But it should not be concluded that the relational model is incompatible with a lower-level, more procedural programming interface. In fact, several low-level, host-language relational interfaces have been proposed, including GAMMA-0 [L4], XRM [S6], and MINIZ [S8]. These interfaces are well suited for writing programs that are to be called repeatedly and which update the data base according to parameters furnished with the call.

We will illustrate how one low-level relational language, GAMMA-0, might be used to write a transaction which finds the employee-tuple having a given employee number and updates its salary component according to some computation. GAMMA-0 consists of a set of operators which may be called from a host language such as PL/I. GAMMA-0 is based on the concept of a "scan," which is like a cursor that moves through a relation testing tuples for some condition. Our first call to GAMMA-0 uses the operator CREATE-SCAN, which creates a scan on the EMP relation to search for tuples according to their EMPNO attribute. The system returns an identifier, called a SCANID, by which we may refer to the newly created scan in future calls. Next we call the operator SET-SCAN and furnish the value which is to be searched for (in this case the EMPNO, which is the parameter of our transaction). Our next call is to the operator NEXT-SUBTUPLE, which returns an actual tuple satisfying the criterion we established by the previous calls. (NEXT-SUBTUPLE could be called repeatedly if we expected many tuples to satisfy the criterion.) Having obtained the desired employee-tuple, we can compute a new salary-value in our host program and then call UPDATE SUBTUPLE, which puts the new salary-value into the data base. GAMMA-0 allows a program to have as many active scans as it wishes, and to control the position of each by explicit calls. When a program has no further use for a scan, it may drop it by calling the operator DROP-SCAN.

Although it is a low-level, procedural language, GAMMA-0 is considered a relational language because the means of access to tuples is not predetermined. A relation may be accessed associatively through any of its attributes--the attribute to be matched is declared when a scan is opened.
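The sequence of calls in the salary transaction can be sketched in Python as follows. This is a toy in-memory emulation under stated assumptions: the class name ToyGamma0, the method argument lists, the helper raise_salary, and the sample EMP tuples are all hypothetical, since the text names the operators but does not give their calling conventions.

```python
class ToyGamma0:
    """Toy in-memory stand-in for the operators named in the text.

    Method names mirror CREATE-SCAN, SET-SCAN, NEXT-SUBTUPLE,
    UPDATE SUBTUPLE and DROP-SCAN; the argument lists are assumptions.
    """

    def __init__(self, relations):
        self.relations = relations   # relation name -> list of dict tuples
        self.scans = {}              # SCANID -> state of one active scan
        self._next_scanid = 1

    def create_scan(self, relation, attribute):
        """Open a scan that matches tuples on one attribute; return a SCANID."""
        scanid = self._next_scanid
        self._next_scanid += 1
        self.scans[scanid] = {"rel": relation, "attr": attribute,
                              "value": None, "pos": -1}
        return scanid

    def set_scan(self, scanid, value):
        """Furnish the search value and reposition before the first tuple."""
        scan = self.scans[scanid]
        scan["value"] = value
        scan["pos"] = -1

    def next_subtuple(self, scanid):
        """Advance to the next matching tuple; return a copy, or None."""
        scan = self.scans[scanid]
        tuples = self.relations[scan["rel"]]
        for i in range(scan["pos"] + 1, len(tuples)):
            if tuples[i][scan["attr"]] == scan["value"]:
                scan["pos"] = i
                return dict(tuples[i])
        scan["pos"] = len(tuples)    # scan exhausted
        return None

    def update_subtuple(self, scanid, **new_values):
        """Store new values into the tuple the scan is positioned on."""
        scan = self.scans[scanid]
        tuples = self.relations[scan["rel"]]
        if not 0 <= scan["pos"] < len(tuples):
            raise RuntimeError("scan is not positioned on a tuple")
        tuples[scan["pos"]].update(new_values)

    def drop_scan(self, scanid):
        """Release a scan that the program no longer needs."""
        del self.scans[scanid]


def raise_salary(db, empno, factor):
    """Find the employee-tuple with the given EMPNO and update its salary."""
    scanid = db.create_scan("EMP", "EMPNO")
    try:
        db.set_scan(scanid, empno)
        emp = db.next_subtuple(scanid)
        if emp is None:
            return False                                # no such employee
        new_salary = round(emp["SALARY"] * factor, 2)   # host-program computation
        db.update_subtuple(scanid, SALARY=new_salary)
        return True
    finally:
        db.drop_scan(scanid)


db = ToyGamma0({"EMP": [
    {"EMPNO": 1, "JOB": "PROGRAMMER", "SALARY": 10000.0},
    {"EMPNO": 2, "JOB": "CLERK", "SALARY": 8000.0},
]})
raise_salary(db, 1, 1.1)
print(db.relations["EMP"][0]["SALARY"])
```

Because the matching attribute is chosen when the scan is created, the same operators could search EMP by JOB or any other attribute, which is the associative access the text describes.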
Data Definition and Control

In addition to query and data manipulation facilities, a complete data sublanguage needs facilities for data definition and data control. Data definition has two main aspects:

• specification of the characteristics of data to be stored, e.g., the column-names and data-types for each relation; and
• definition of alternative "views" which are derived from the stored data. In relational terminology, a view is a dynamic "window" on the data base. Updates made to stored relations are visible through the various views which are defined on these relations.

Data control also has two main aspects:

• control over authorization of various users to perform various operations on the data base; and
• ability to make integrity assertions that protect the validity of data and define the set of permitted transitions in the data base.

The relational model permits a language to take a consistent, unified approach to query, data manipulation, data definition, and data control. Several relational languages have gone to great lengths to provide such a unified approach; these languages include SEQUEL [L10, LS, C6, I5], QUEL [S15, C3, I4], and Query By Example [L21, L24].

An important observation to be made in data definition is that the definition of a view is simply a process of deriving a relation from the set of stored relations, and that this is similar to the process of stating a query. Therefore, the full power of a query language may be applied to the definition of views. This is possible because all the relational query languages we have discussed have the property of closure, i.e., they operate on relations to construct or define new relations. A view may be a selected subset of a stored relation, or it may span over more than one stored relation, as in the case of a join. Once the definition of a view has been made, queries may be directed to the view as though it were a stored relation. The supportability of updates to the data base made by means of derived views is a complicated question, one which requires more research [M14].

The issue of authorization is closely related to the issue of derived views. In fact, one approach to authorization is to grant to each user a particular restricted view [C6]. Another approach is to automatically add certain predicates to the queries and updates issued by a user in order to restrict their scope to the set of authorized tuples [C3].

This unified approach to language design can be extended into the area of assertions concerning data integrity. An assertion is a statement about the data base which the system automatically enforces by refusing any update which fails to satisfy the assertion. In language terms, an assertion is simply a predicate, which is syntactically a fragment of a query, and which may contain other queries nested inside it. For example, suppose we wish to assert that for any given election the number of votes received by the winner is greater than the number of votes received by any loser. This assertion may be made as follows in SEQUEL (the variable X represents a tuple of the ELECTIONS-WON relation):

    ASSERT ON ELECTIONS-WON X:
    WINNER-VOTES >
    (SELECT MAX (LOSER-VOTES)
    FROM ELECTIONS-LOST
    WHERE YEAR=X.YEAR)

Language Evaluation

The great variety of proposed relational languages leads us to the question: How can languages be evaluated and compared? There are at least three criteria involved in any objective attempt to evaluate a language: completeness, level, and learnability. Space constraints permit us to touch only briefly on each of these.

Codd [M3] was the first to establish a careful definition of completeness for data-base sublanguages. He defined a language to be relationally complete if it permits expression of any query expressible in the
[SLID which uses XRM inversions to limit the search space for a given query. The SEQUEL prototype has been extended by IBM at Cambridge and by the MIT Sloan School of Management to accommodate a multiple-user environment. The resulting system, called GMIS, is being used at MIT as an information system for modeling New England energy resources [A12, S19].

Another prototype system based on XRM is being developed at IBM Research in Yorktown Heights, to implement Query By Example. The system contains an optimizer which interprets Query By Example queries in terms of operations similar to those of the relational algebra (join, restriction, etc.). At present, the system supports only a single user and does not provide update facilities.

A large-scale prototype data-base management system, called System R, is presently under construction at IBM Research in San Jose [S20]. System R is the first attempt to apply the relational data model to an environment of many concurrent users and a high volume of requests. It will provide an operationally complete data-management capability, with facilities for authorization, logging and recovery, definition of alternative views, and enforcement of data consistency and integrity. System R will support the SEQUEL language as an external interface, as well as a set of procedural operators for host-language programming. Requests to the system will be executed by an optimizer which chooses among various physical access methods, including inversions maintained in the form of B-trees [T1], physical pointer-chains, and a sort-merge facility. A user is not constrained to protect himself against the updates of other concurrent users by explicit locking statements; the system automatically generates locks as needed at the level of individual tuples. Deadlocks are automatically detected and resolved. Some of the locking techniques developed as part of the System R project have been described in [C1, C4, C8]. System R is being implemented on an IBM 370, using a VM/370 operating system modified for the data-base environment [T13].

Another large-scale attempt at constructing a relational prototype is the INGRES (Interactive Graphics and Retrieval System), of the University of California at Berkeley [S7, S9, S15]. INGRES, which runs on a PDP-11/40 under the UNIX operating system, implements QUEL, a relationally complete query language based on the relational calculus. The INGRES system implements a variety of features by automatic modification of the QUEL statement submitted by the user. Alternative views are supported by substituting the view-definition into the user's statement [I4]. Authorization and integrity control are provided by adding extra predicates to the user's statement which limit its scope [C3]. Concurrent update requests are kept from interfering with each other by analyzing their respective scopes and allowing an update to proceed only when it is "safe" [I2]. Finally, the QUEL statement, which may contain many variables, is broken up by a "decomposition" algorithm into a series of one-variable statements which are executed one at a time. The physical data structures used by INGRES include hashed tables (including "order-preserving" hash functions which permit sequential scanning in key-value order) and "generalized directories," which employ a tree-structure to map a key into an address interval, and then use an order-preserving function to compute an address within the interval [S9].

Implementation of another relational system, called ZETA, is presently under way at the University of Toronto [S8, S14]. The ZETA system is constructed in three levels. The lowest level is a language called MINIZ, which provides such basic operations as scanning a relation and accumulating a list of identifiers of tuples which satisfy a given condition. The middle level implements views ("derived relations") and has an optimizer/interpreter which accepts queries spanning multiple relations. Three types of end-user interfaces are supported by ZETA:

• a host-language facility which provides features similar to SEQUEL;
• a query language generator system whereby a user may create his own self-contained query language using
60 • Donald D. Chamberlin
The author is also grateful to his colleagues at the IBM Research Laboratory in San Jose for their support and discussions.

CLASSIFICATION OF REFERENCES

Models and Theory
  M  1) General
  N  2) Normalization, Decomposition, and Synthesis
  Z  3) Relationships between CODASYL DDL/DBTG and the Relational Model
L  Languages and Human Factors
Implementations
  S  1) Software
  H  2) Hardware
T  Implementation Technology
C  Authorization, Views, and Concurrency
I  Integrity Control
A  Applications
D  Deductive Inference and Approximate Reasoning
E  Natural Language Support
Y  Sets and Relations (prior to 1969)

Certain references include asterisks with the following meaning:
*  Proceedings of ACM-SIGFIDET and ACM-SIGMOD Workshops are obtainable from ACM Headquarters, 1133 Avenue of the Americas, New York, N.Y. 10036
** Proceedings of the 1975 ACM Pacific Conference, San Francisco, April 17-18, 1975 are obtainable from: Mail Room, Boole & Babbage, 850 Stewart Drive, Sunnyvale, California 94086

Models and Theory
1) General
[M1] CODD, E. F. "Derivability, redundancy and consistency of relations stored in large data banks," IBM Research Report RJ599, August 1969.
[M2] CODD, E. F. "A relational model of data for large shared data banks," Comm. ACM 13, 6 (June 1970), pp. 377-397.
[M3] CODD, E. F. "Relational completeness of data-base sublanguages," Courant Computer Science Symposia 6, "Data Base Systems," New York, May 1971, Prentice-Hall, Englewood Cliffs, N.J., 1971, pp. 65-98.
[M4] STRNAD, A. L. "The relational approach to the management of data bases," Proc. IFIP Congress, August 1971, Vol. 2, North-Holland Publ. Co., Amsterdam, The Netherlands, 1971, pp. 901-904.
[M5] DURCHHOLZ, R. "Das Datenmodell bei Codd," Technical Report No. 69, Gesellschaft für Mathematik und Datenverarbeitung, Bonn, W. Germany, July 1972.
[M6] HAWRYSZKIEWYCZ, I. T.; AND DENNIS, J. B. "An approach to proving the correctness of data-base operations," Proc. ACM-SIGFIDET Workshop on Data Description, Access, and Control, Nov.-Dec. 1972,* ACM, New York, 1972, pp. 323-348.
[M7] DATE, C. J. "Relational data base systems: a tutorial," Proc. Fourth Internatl. Symposium on Computer and Information Sciences, Dec. 1972, Plenum Press, New York, 1972.
[M8] CODD, E. F. "Understanding relations," continuing series of articles published in FDT, the quarterly bulletin of ACM-SIGMOD, beginning with Vol. 5, 1 (June 1973),* ACM, New York, 1973.
[M9] HAWRYSZKIEWYCZ, I. T. "Semantics of data base systems," MIT Project MAC Report MAC TR-112, Cambridge, Mass., Dec. 1973.
[M10] BRACCHI, G.; FEDELI, A.; AND PAOLINI, P. "A multi-level relational model for data-base management systems," Data Base Management, Proc. IFIP TC-2 Working Conf. on Data-Base Management Systems, April 1974, North-Holland Publ. Co., Amsterdam, The Netherlands, 1974.
[M11] STONEBRAKER, M. "A functional view of data independence," Proc. ACM-SIGFIDET Workshop on Data Description, Access, and Control, May 1974,* ACM, New York, 1974, pp. 63-81.
[M12] MELTZER, H. S. "Relations and relational operations," IBM Report to GUIDE 38 Information Systems Division, Dallas, Texas, May 1974.
[M13] HITCHCOCK, P. "Fundamental operations on relations in a relational data base," IBM Scientific Centre Report UKSC 0051, Peterlee, England, May 1974.
[M14] CODD, E. F. "Recent investigations in relational data base systems," Information Processing 74, Proc. IFIP Congress, August 1974, Vol. 5, North-Holland Publ. Co., Amsterdam, The Netherlands, 1974, pp. 1017-1021.
[M15] WEDEKIND, H. "Datenbanksysteme 1," Reihe Informatik/16 (1974), Bibliographisches Institut, Mannheim, W. Germany.
[M16] HALL, P. A. V.; TODD, S. J. P.; AND HITCHCOCK, P. "An algebra of relations for machine computation," IBM Scientific Centre Report UKSC 0066, Peterlee, England, Jan. 1975.
[M17] SCHMID, H. A.; AND SWENSON, J. R. "On the semantics of the relational data model," Proc. ACM-SIGMOD Conf., May 1975,* ACM, New York, 1975, pp. 211-223.

Models and Theory
2) Normalization, Decomposition, and Synthesis
[N1] CODD, E. F. "Further normalization of the data base relational model," Courant Computer Science Symposia 6, "Data Base Systems," New York, May 1971, Prentice-Hall, New York, 1971, pp. 33-64.
[N2] CODD, E. F. "Normalized data base structure: a brief tutorial," Proc. 1971 ACM-SIGFIDET Workshop on Data Description, Access, and Control, Nov. 1971,* ACM, New York, 1971, pp. 1-17.
[N3] HEATH, I. J. "Unacceptable file operations in a relational data base," Proc. 1971 ACM-SIGFIDET Workshop on Data Description, Access, and Control, Nov. 1971, ACM, New York, 1971, pp. 19-33.
[N4] DELOBEL, C. "Aspects theoretiques sur la structure de l'information dans une base de données," Revue Francaise d'Informatique et de Recherche Operationelle, B-3 (Sept. 1971).
[N5] DELOBEL, C. "A theory about data in an information system," IBM Research Report RJ964, San Jose, Calif., Jan. 1972.
[N6] RISSANEN, J.; AND DELOBEL, C. "Decomposition of files, a basis for data storage and retrieval," IBM Research Report RJ1220, San Jose, Calif., May 1973.
[N7] DELOBEL, C.; AND CASEY, R. G. "Decomposition of a data base and the theory of Boolean switching functions," IBM J. R. & D. 17, 5 (Sept. 1973), pp. 374-387.
[N8] KENT, W. "A primer of normal forms," IBM Technical Report TR 02.600, San Jose, Calif., Dec. 1973.
[N9] ARMSTRONG, W. W. "Dependency structures of data base relationships," Information Processing 74, Proc. IFIP Congress, August 1974, Vol. 3, North-Holland Publ. Co., Amsterdam, The Netherlands, 1974, pp. 580-584.
[N10] DELOBEL, C.; AND LEONARD, M. "The decomposition process in a relational model," Technical Report, Laboratoire d'Informatique, Univ. of Grenoble, France, Sept. 1974.
[N11] WANG, C. P.; AND WEDEKIND, H. "Segment synthesis in logical data base design," IBM J. R. & D. 19, 1 (Jan. 1975), pp. 71-77.
[N12] BERNSTEIN, P. A.; SWENSON, J. R.; AND TSICHRITZIS, D. "A unified approach to functional dependencies and relations," Proc. ACM-SIGMOD Conf., May 1975,* ACM, New York, 1975, pp. 237-245.
[N13] FADOUS, R. Y.; AND FORSYTH, J. "Finding candidate keys for relational data bases," Proc. ACM-SIGMOD Conf., May 1975,* ACM, New York, 1975, pp. 203-210.
[N14] FADOUS, R. Y. "Mathematical foundations for relational data bases," PhD. Thesis, Michigan State Univ., Lansing, 1975.
[N15] SHARMAN, G. C. H. "A new model of relational data base and high level languages," Technical Report TR. 12.136, IBM Hursley Park Laboratory, England, Feb. 1975.

Models and Theory
3) Relationships between CODASYL DDL/DBTG and Relational Model
[Z1] CODASYL Data Base Task Group Report, April 1971, ACM, New York.
[Z2] CANNING, R. G. "Problem areas in data management," EDP Analyzer 12, 3 (March 1974).
[Z3] EARNEST, C. P. "A comparison of the network and relational data structure models," Technical Report, Computer Sciences Corp., El Segundo, Calif., April 1974.
[Z4] NIJSSEN, G. M. "Data structuring in DDL and relational data model," Proc. IFIP TC-2 Working Conf. on Data Base Management Systems, April 1974, North-Holland Publ. Co., Amsterdam, The Netherlands, 1974.
[Z5] CODD, E. F.; AND DATE, C. J. "Interactive support for non-programmers: the relational and network approaches," Proc. 1974 ACM-SIGMOD Debate "Data Models: Data Structure Set versus Relational," May 1974,* ACM, New York, 1974.
[Z6] DATE, C. J.; AND CODD, E. F. "The relational and network approaches: comparison of the application programming interfaces," Proc. 1974 ACM-SIGMOD Debate "Data Models: Data Structure Set versus Relational," May 1974,* ACM, New York, 1974.
[Z7] BACHMAN, C. W. "The data structure set model," Proc. 1974 ACM-SIGMOD Debate "Data Models: Data Structure Set versus Relational," May 1974,* ACM, New York, 1974.
[Z8] SIBLEY, E. H. "On the equivalences of data based systems," Proc. ACM-SIGMOD Debate "Data Models: Data Structure Set versus Relational," May 1974,* ACM, New York, 1974.
[Z9] EVEREST, G. C. "The futures of data-base management," Proc. ACM-SIGMOD Workshop on Data Description, Access, and Control, May 1974, ACM, New York, 1974, pp. 445-462.
[Z10] OLLE, T. W. "Current and future trends in data base management systems," Information Processing 74, Proc. IFIP Congress, August 1974, Vol. 5, North-Holland Publ. Co., Amsterdam, The Netherlands, 1974, pp. 998-1006.
[Z11] DATE, C. J. "An introduction to data base systems," Addison-Wesley, Reading, Mass., 1975.
[Z12] KAY, M. H. "An assessment of the CODASYL DDL for use with a relational schema," Data Base Description, B. C. M. Douque and G. M. Nijssen (Eds.), North-Holland Publ. Co., Amsterdam, The Netherlands, 1975, pp. 199-214.
[Z13] ROBINSON, K. A. "An analysis of the uses of the CODASYL set concept," Data Base Description, B. C. M. Douque and G. M. Nijssen (Eds.), North-Holland Publ. Co., Amsterdam, The Netherlands, 1975, pp. 169-182.
[Z14] TAYLOR, R. W. "Observations on the attributes of database sets," Data Base Description, B. C. M. Douque and G. M. Nijssen (Eds.), North-Holland Publ. Co., Amsterdam, The Netherlands, 1975, pp. 73-84.
[Z15] OLLE, T. W. "An analysis of shortcomings in the schema DDL with an outline of proposed improvements," Data Base Description, B. C. M. Douque and G. M. Nijssen (Eds.), North-Holland Publ. Co., Amsterdam, The Netherlands, 1975, pp. 283-298.
[Z16] HUITS, M. "Requirements for languages in data-base systems," Data Base Description, B. C. M. Douque and G. M. Nijssen (Eds.), North-Holland Publ. Co., Amsterdam, The Netherlands, 1975, pp. 85-110.
[D1] CHANG, C. L.; AND LEE, R. C. T. Symbolic logic and mechanical theorem proving, Academic Press, New York, 1973.
[D2] MINKER, J. "Performing inferences over relational data bases," Proc. ACM-SIGMOD Conf., May 1975,* ACM, New York, 1975, pp. 79-91.
[D3] ZADEH, L. A. "Calculus of fuzzy restrictions," Report ERL-M562, Electronics Research Lab., Univ. of Calif., Berkeley, Calif., Feb. 1975.

Natural Language Support
[E1] SIMMONS, R. F. "Natural language question-answering systems: 1969," Comm. ACM 13, 1 (Jan. 1970), pp. 15-30.
[E2] SCHANK, R. C.; AND COLBY, K. M. (Eds.), Computer models of thought and language, W. H. Freeman, San Francisco, 1973.
[E3] RUSTIN, RANDALL (Ed.), "Natural language processing," Courant Computer Science Symposia 8, New York, Dec. 1971, Prentice-Hall, Englewood Cliffs, N.J., 1971.
[E4] THOMPSON, F. P.; LOCKEMANN, P. C.; DOSTERT, B. H.; AND DEVERILL, R. "REL: a rapidly extensible language system," Proc. 24th ACM National Conf., New York, 1969, ACM, New York, 1969, pp. 399-417.
[E5] KELLOGG, C. H.; BURGER, J.; DILLER, T.; AND FOGT, K. "The CONVERSE natural language data management system: current status and plans," Proc. ACM Symposium on Information Storage and Retrieval, 1971, ACM, New York, 1971, pp. 33-46.
[E6] WINOGRAD, T. "Procedures as a representation for data in a computer program for understanding natural language," MIT Project MAC Report MAC TR-84, Cambridge, Mass., 1971.
[E7] MONTGOMERY, C. A. "Is natural language an unnatural query language?" Proc. ACM National Conf., New York, 1972, ACM, New York, 1972, pp. 1075-1078.
[E8] PETRICK, S. R. "Semantic interpretation in the REQUEST system," IBM Research Report RC4457, IBM Research Center, Yorktown Heights, New York, July 1973.
[E9] CODD, E. F. "Seven steps to RENDEZVOUS with the casual user," Proc. IFIP TC-2 Working Conf. on Data Base Management Systems, April 1974, North-Holland Publ. Co., Amsterdam, The Netherlands, 1974.

Sets and Relations (prior to 1969)
These references are included to enable the reader to trace work published prior to 1969 on computer support for (mathematical) sets and relations.
[Y1] CODASYL Development Committee. "An information algebra," Comm. ACM 5, 4 (April 1962), pp. 190-204.
[Y2] LEVIEN, R. E.; AND MARON, M. E. "A computer system for inference execution and data retrieval," Comm. ACM 10, 11 (Nov. 1967), pp. 715-721.
[Y3] CHILDS, D. L. "Feasibility of a set-theoretical data structure--a general structure based on a reconstituted definition of relation," Proc. IFIP Congress 1968, North-Holland Publ. Co., Amsterdam, The Netherlands, 1968, pp. 162-172.
[Y4] CHILDS, D. L. "Description of a set-theoretic structure," Proc. AFIPS 1968 Fall Jt. Computer Conf., Vol. 33, AFIPS Press, Montvale, N.J., 1968, pp. 557-564.
[Y5] ASH, W. L.; AND SIBLEY, E. H. "[...] capabilities," Proc. ACM 23rd National Conf., August 1968, Brandon/Systems Press, Princeton, N.J., 1968, pp. 143-156.
[Y6] FELDMAN, J. A.; AND ROVNER, P. D. "An ALGOL-based associative language," Comm. ACM 12, 8 (August 1969), pp. 439-449.
[Y7] KUHNS, J. L. "Logical aspects of ques[...]," [...] Dec. 1969, Academic Press, New York, 1969.
[Y8] KOCHEN, M. "Adaptive mechanisms in digital concept processing," in Discrete Adaptive Processes--Symposium and Panel Discussion, AIEE, New York, 1962, pp. 50-58.
D. C. TSICHRITZIS
and
F. H. LOCHOVSKY
Department of Computer Science, University of Toronto, Toronto, Ontario, Canada M5S 1A7
Copyright © 1976, Association for Computing Machinery, Inc. General permission to republish, but
not for profit, all or part of this material is granted provided that ACM's copyright notice is given
and that reference is made to the publication, to its date of issue, and to the fact that reprinting privi-
leges were granted by permission of the Association for Computing Machinery.
FIGURE 1. A general relationship graph. [Diagram; legible labels: PRESIDENT, STATE, ADMINISTRATION, ADMINISTRATIONS HEADED, ADMITTED DURING.]
FIGURE 2. A data structure diagram. [Diagram; legible labels: PRESIDENT, ELECTIONS WON, ELECTION, STATE.]
FIGURE 3. Hierarchical definition tree. [Diagram; legible labels: PRESIDENT (LEVEL 1), STATE (LEVEL 3).]
the root, as in Figure 3. Because there can be at most one arc between any two record types, the arcs do not need to be labeled. Such a restricted data structure diagram is called a hierarchical definition tree. A hierarchical definition tree specifies both what record types are allowed to be included in the data base and the permissible relationships between record types. Figure 3 represents a hierarchical definition tree--one that is a subset of the data structure diagram shown in Figure 2. The hierarchical data model is defined as the data model which organizes data logically according to the structural relationships of hierarchical definition trees.

The level of a record type in the hierarchical definition tree is a measure of its distance from the root of the tree. The root record type is the highest level record type in the tree (level 1). The PRESIDENT record type in Figure 3 is an example of a root record type. The other record types, called dependent record types, are at lower levels in the tree (levels 2, 3, etc.). In this instance the ELECTION and ADMINISTRATION record types are both at level 2.

A data base that corresponds to the hierarchical definition tree of Figure 3 is shown in Figure 4. The hierarchical data base is a collection or forest of trees called data-base trees whose record occurrences appear as nodes. There are, for instance, three data-base trees in Figure 4. All of the trees are constructed according to the relationships permitted explicitly in the hierarchical definition tree. In a hierarchical data base, parents and children, or ancestors (parents, parents of parents, etc.) and descendants (children, children of children, etc.) among the record occurrences can be identified in a natural way. A hierarchical path is a sequence of records in which the records, starting at a root record, follow alternately in a parent-child relationship. For example, the sequence: PRESIDENT record, ADMINISTRATION record, STATE record, defines a hierarchical path.

In a hierarchical data base that is structured like Figure 3, each PRESIDENT record occurrence can have many ADMINISTRATION record occurrences connected to it. Each ADMINISTRATION record occurrence may (in turn) have several STATE record occurrences connected to it, as exemplified in Figure 4. Each STATE record occurrence, however, has exactly one parent record occurrence: ADMINISTRATION (during which the state was admitted), and one grandparent record occurrence: PRESIDENT (who served when the state was admitted).

Two things should be noted in Figure 4. First, there can be a varying number of occurrences of each record type at each level. Second, each record occurrence (except for a root record occurrence, PRESIDENT) must be connected to an occurrence of an ancestor record type as constrained by the hierarchical definition tree. There can be no "independent" record occurrences of record
FIGURE 5. Two data-base hierarchies. [Panels (a) and (b); legible labels: PRESIDENT, ADMINISTRATION, STATE.]
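The notions of level and hierarchical path can be made concrete with a small sketch. It assumes the hierarchical definition tree of Figure 3 as the text describes it (PRESIDENT at the root, ELECTION and ADMINISTRATION at level 2, STATE under ADMINISTRATION); the dictionary PARENT_TYPE and the function names are illustrative and do not come from any hierarchical system.

```python
# Parent record type of each record type; None marks the root record type.
PARENT_TYPE = {
    "PRESIDENT": None,
    "ELECTION": "PRESIDENT",
    "ADMINISTRATION": "PRESIDENT",
    "STATE": "ADMINISTRATION",
}


def level(record_type):
    """Level 1 is the root; each step away from the root adds one."""
    depth = 1
    while PARENT_TYPE[record_type] is not None:
        record_type = PARENT_TYPE[record_type]
        depth += 1
    return depth


def hierarchical_path(record_type):
    """Record types from the root down to the given record type."""
    path = []
    while record_type is not None:
        path.append(record_type)
        record_type = PARENT_TYPE[record_type]
    return path[::-1]


def may_connect(child_type, parent_type):
    """True if the definition tree permits connecting child to parent."""
    return PARENT_TYPE.get(child_type) == parent_type


print(level("STATE"), hierarchical_path("STATE"))
```

Because every dependent record type has exactly one parent type, the path from the root to any record type is unique, which is why the arcs of the tree need no labels.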
types ELECTION, ADMINISTRATION, or STATE. If we regard the relationships between parent and child record types as a data structure set, then membership in the set is always mandatory in terms of the DBTG network systems, as Taylor and Frank have discussed in the paper already mentioned [page 67].

The reader can find in the literature a similar, but not identical, notation for representing a hierarchical data base. This notation allows the data to reside only at the terminal nodes of the tree [5]. All intermediate nodes of the data-base trees are present only to maintain the hierarchical relationships. Only the terminal nodes have record occurrences with data-item values. Figure 5(a) shows a hierarchical definition tree and a data-base tree of the type previously discussed; Figure 5(b) represents the same information in terms of a hierarchical definition tree where the record types are present only at the terminal nodes. The intermediate nodes (marked with a dot since they carry no data) serve only to maintain the structure and to qualify the data. Since, logically, the two notations are equivalent, for simplicity and uniformity only the former notation will be used.

Not all features of the example of a presidential data base shown in Figure 2 can be captured by one hierarchical definition tree. The data structure diagram is itself a network. However, we can represent the same information by using more than one hierarchical definition tree as shown in Figure 6.

The record types contain the following data items:

PRESIDENT--PRES NUMBER, PRES NAME, BIRTHDATE, DEATH DATE, PARTY, SPOUSE.
ELECTION--YEAR, PRES VOTES, LOSER VOTES, LOSER, LOSER PARTY.
ADMINISTRATION--ADMIN NUMBER, INAUG DATE, VICEPRESIDENT.
FIGURE 6. Presidential data-base hierarchical definition trees. [Panels (a), (b), and (c); legible labels: STATE ADMITTED, STATE, CONGRESS, NATIVE PRESIDENT, PRESIDENT SERVED.]
CONGRESS SERVED--CONGRESS NUMBER.
STATE ADMITTED--STATE NAME.
STATE--STATE NAME, POPULATION, STATE VOTES.
NATIVE PRESIDENT--PRES NUMBER.
CONGRESS--CONGRESS NUMBER, SENATE REP PERCENT, SENATE DEM PERCENT, HOUSE REP PERCENT, HOUSE DEM PERCENT.
PRESIDENT SERVED--PRES NUMBER.

HIERARCHICAL LANGUAGES

A hierarchical system is a DBMS which provides facilities for inserting, modifying, deleting, and retrieving record occurrences in a hierarchical data base. Because of the nature of the hierarchical data model, each new record that is inserted (except for a root record occurrence) has to be connected to an occurrence of a parent record type. Usually this action is effected by selecting a parent record occurrence and inserting a child. For example, in Figure 5, before one can insert an occurrence of a STATE record type, an occurrence of an ADMINISTRATION record type would first have to be selected.

When a record occurrence is deleted, all of its descendant record occurrences are also deleted. The hierarchical data model does not allow non-root records to exist without ancestors. In Figure 5, if an ADMINISTRATION record occurrence is deleted, all occurrences of STATE record types connected to the occurrence of the ADMINISTRATION record type also have to be deleted. To retain the descendant records, but not a parent record, as is sometimes necessary, some systems provide commands which delete the data-item values, but not the record itself [6, 9, 10, 17]. Essentially, this facility permits null (empty) records to exist in the data base. In this way, the data associated with the descendant record is retained.

Records retrieved in a hierarchical data base may be both selected and qualified according to the tree structure [12, 13, 18]. Records are usually selected by means of a qualification which expresses the criterion of selection. The qualification consists of conditions of the form

    (<data item name> <conditional operator> <value>)

connected by Boolean operators AND, OR, and NOT. The usual conditional operators are <, ≤, >, ≥, =, and ≠ or their mnemonic equivalent. The qualification may select records of several different record types. In our example, a qualification such as ((PRES NAME = EISENHOWER) AND (YEAR = 1956)) selects both PRESIDENT and ELECTION records. In general, the qualification can be associated with data items from any record type in the hierarchical definition tree. However, almost all systems permit the qualification to contain data items only from record types that lie in a hierarchical path. In this way, they avoid the ambiguity that arises when the Boolean negation operator (NOT) is specified in the qualification [12, 13].

After a record has been selected, other records may qualify for retrieval. Every record has, for example, a unique set of ancestors in the data base. All ancestors of the selected record may qualify for retrieval. Further, a record may have a set of descendants, all of which can also qualify for retrieval. For example, an ADMINISTRATION record may have several STATE records connected to it. Notice that each selected record always has no more than one ancestor record of each ancestor record type. That is, each record has, at most, one parent which (in turn) has, at most, one parent, and so on. However, in general, a selected record may have several descendant records of each descendant record type, since it may have several children which (in turn) may have several children, etc. When most systems qualify descendants, this qualification is performed along only one hierarchical [...] normalization, while the qualifying of descendants is called downward hierarchical normalization [16].

The retrieval operation in a hierarchical data base may be performed in one of two ways. According to the first method, a user explicitly uses the tree structure of the data base to traverse the data-base trees in a specified order. Traversal may be independent of record selection and qualification, in which case it resembles sequential processing. On the other hand, records may be selected and qualified, but only in a very restricted manner determined by the traversal order. According to the second retrieval method, the user selects and qualifies records based on the relationships between the data items of the record types. Although the user has to be aware of the tree structure of the data base, he does not explicitly use this structure to retrieve records. Instead, the system utilizes the hierarchical structure to determine which records qualify among those selected by the user. Each method will be discussed in detail in the following sections.

Tree Traversal

In systems that use a tree traversal language, record types and record occurrences are sometimes called segment types and segment occurrences or simply segments [14]. A single data-base tree, consisting of one root record occurrence and all its associated descendants, is sometimes called a data-base record. The root record type of the hierarchical definition tree is referred to as the root segment type. The other segment types are called dependent segment types. Each segment type is further divided into data items called fields. In this section, the original terminology of the Hierarchical Data Model section, page 105, will be used.

One of the DBMSs that uses a tree traversal language is the Information Management System (IMS) [14]. IMS uses a preorder tree traversal to traverse a data-base
path. This means that from a P R E S I D E N T tree. A preorder tree traversal is defined as
record, one can qualify ELECTION records follows [151:
or ADMINISTRATION records, but not
both. The qualifying of ancestors of a se- 1) Visit the record if it has not already
lected record is called upward hierarchical been visited.
  2) Else, visit the leftmost child not previously visited.
  3) Else, if no children, grandchildren, etc., remain to be visited, go
     back to the parent record.

These steps are applied to each record of the data-base tree when the record
is reached. It is assumed that the children of each parent are ordered
according to the appearance of the child record type in the hierarchical
definition tree, i.e., all ELECTION records under a PRESIDENT record come
before any ADMINISTRATION records. The traversal begins at the root record.
The traversal essentially visits all records in the tree going in
top-to-bottom, left-to-right order. Taking as an example the data-base tree
given in Figure 7, a preorder tree traversal would visit the records in the
order indicated. If one imagines individual occurrences of data-base trees
to be connected to an imaginary head record, a single data-base tree is
formed; this procedure makes it possible to visit all the records in the
data base. During the traversal, the selection of record occurrences can
take place. In addition, the records may be qualified.

[FIGURE 7. Preorder tree traversal. Legend: PRESIDENT, ELECTION,
ADMINISTRATION, STATE.]

Requests to IMS are specified through procedure calls to Data Language/One
(DL/I) from application programs written in PL/I, COBOL, or Assembler
Language. A position pointer marks the progress of an application program
through the data base according to a preorder traversal of the data-base
tree. The calls to DL/I require several parameters which are passed through
a subroutine call in the host language. These parameters identify
communication buffers, I/O buffers, the type of call, and qualifications.
However, only the last two parameters are relevant to the following
discussion.

During the operation of retrieving records, several records may be selected
by means of one or more qualifications. After selection, other records may
be qualified. No more than one record of each record type, located in the
hierarchical path, may be qualified. A qualification on a record is
specified by means of a segment search argument (SSA). An SSA specifies a
qualification that applies to only one record type. The form of an SSA is:

    <RECORD NAME> <COMMAND CODE> <QUALIFICATION>

The <RECORD NAME> is the name of a record type in the hierarchical
definition tree. A <QUALIFICATION> is optional in the SSA and is expressed
in the form specified previously. The only Boolean operators permitted are
AND and OR.

The <COMMAND CODE> is optional and specifies the various options of the
call. Some of the more important options permit:

  • retrieval or insertion of some or all of the records from the root to a
    specified record type in a single DL/I call (path call);
  • backing up to the first child under a record at any or all levels
    (except at the root record level);
  • retrieval of the last occurrence of a record that meets all specified
    conditions under a parent;
  • setting of the parentage to a specific record (see the GET NEXT WITHIN
    PARENT call below).

The set of SSAs in a DL/I call specifies a path that passes from a root
record down the data-base tree to a specified record. There may be no more
than one SSA for each level in the hierarchical path. If an SSA, located at
a particular level, does not uniquely identify a record, or if no SSA is
specified, then IMS selects the hierarchically next
(preorder) record, except when modified by command codes.

Every DL/I call results in the setting of a status code parameter in the
communication buffer used by IMS to communicate with an application program.
The status code may be checked after every DL/I call to determine whether
the result of the call is successful, unsuccessful, warning, or some other
condition. It is then up to the application program to determine the
appropriate action to be taken. The various status codes will not be
discussed here. The interested reader should refer to the appropriate
manuals [14].

Data-base calls are categorized into three general types: GET calls; INSERT
calls; and DELETE and REPLACE calls. All calls may optionally include some
form of SSAs.

The GET UNIQUE call retrieves a record, as selected by the SSAs,
independently of its current position in the data base. This call is used
for nonsequential processing or to establish a start position for sequential
processing of the data base.

The GET NEXT call processes in a forward direction only, starting from the
current position in the data base. Records (optionally of a particular type)
are retrieved in the order established by a preorder traversal of the
data-base trees. GET NEXT can be used without SSAs to retrieve the records
in the data base sequentially. It can also be used to search for a
particular record if an SSA is included in the call.

A GET NEXT WITHIN PARENT call obtains records (optionally of a particular
type) within the family of a parent record. The parent record is established
by the last GET NEXT or GET UNIQUE call, or by the <COMMAND CODE> option in
an SSA of a preceding GET NEXT or GET UNIQUE call. The only difference
between a GET NEXT and a GET NEXT WITHIN PARENT call is the result obtained
after the last child (optionally of a given type) under a given parent is
reached. A GET NEXT call continues to retrieve records until IMS sets the
status code to signal that the end of the data base has been reached. A GET
NEXT WITHIN PARENT call sets the status code to signal that there are no
more children under the parent.

A GET HOLD call is used to retrieve and hold a record for a DELETE or
REPLACE call. A GET HOLD call allows only one application at a time to alter
a record by serializing access to the record.

An INSERT call is used to load a data base or to add new records to a data
base. SSAs are used to select the position (in the data-base tree) where the
record is to be inserted. Notice that an INSERT call performs two
operations: it stores the new record, and connects it to its parent. This
dual operation is necessary, since every record, except a root record, must
have a parent.

A DELETE call deletes a record and all of its descendant records from the
data base. The DELETE call is a "triggered" delete; that is, the record
selected and all of its descendants are deleted.

A REPLACE call updates records in the data base. Any nonkey data item may be
changed in the record. Attempting to change a key data item results in an
error.

To demonstrate the use of the DL/I calls, the following query will be
implemented using PL/I: Print the names of all the states that were admitted
during a Democratic administration. The query is first presented in its
entirety, and then an explanation of its various parts is given. It is
assumed that there is a hierarchical data-base structure that conforms to
the hierarchical definition tree shown in Figure 6(a). Only the PRESIDENT
and STATE ADMITTED record types are required for this query. To conform to
the naming conventions of IMS, these two record types are renamed PRES and
SADMIT, respectively.
    DLITPLI: PROCEDURE (QUERY_PCB) OPTIONS (MAIN);
      DECLARE QUERY_PCB POINTER;

      /* Communication Buffer */
      DECLARE 1 PCB BASED (QUERY_PCB),
                2 DATA_BASE_NAME               CHAR (8),
                2 SEGMENT_LEVEL                CHAR (2),
                2 STATUS_CODE                  CHAR (2),
                2 PROCESSING_OPTIONS           CHAR (4),
                2 RESERVED_FOR_DLI             FIXED BINARY (31,0),
                2 SEGMENT_NAME_FEEDBACK        CHAR (8),
                2 LENGTH_OF_KEY_FEEDBACK_AREA  FIXED BINARY (31,0),
                2 NUMBER_OF_SENSITIVE_SEGMENTS FIXED BINARY (31,0),
                2 KEY_FEEDBACK_AREA            CHAR (28);

      /* I/O Buffers */
      DECLARE PRES_IO_AREA CHAR (65),
              1 PRESIDENT DEFINED PRES_IO_AREA,
                2 PRES_NUMBER CHAR (4),
                2 PRES_NAME   CHAR (20),
                2 BIRTHDATE   CHAR (8),
                2 DEATH_DATE  CHAR (8),
                2 PARTY       CHAR (10),
                2 SPOUSE      CHAR (15);
      DECLARE SADMIT_IO_AREA CHAR (20),
              1 STATE_ADMITTED DEFINED SADMIT_IO_AREA,
                2 STATE_NAME  CHAR (20);

      /* Segment Search Arguments */
      DECLARE 1 PRESIDENT_SSA STATIC UNALIGNED,
                2 SEGMENT_NAME         CHAR (8)  INIT ('PRES    '),
                2 LEFT_PARENTHESIS     CHAR (1)  INIT ('('),
                2 FIELD_NAME           CHAR (8)  INIT ('PARTY   '),
                2 CONDITIONAL_OPERATOR CHAR (2)  INIT (' ='),
                2 SEARCH_VALUE         CHAR (10) INIT ('DEMOCRAT  '),
                2 RIGHT_PARENTHESIS    CHAR (1)  INIT (')');
      DECLARE 1 STATE_ADMITTED_SSA STATIC UNALIGNED,
                2 SEGMENT_NAME         CHAR (8)  INIT ('SADMIT  ');

      /* Some necessary variables */
      DECLARE GU   CHAR (4) INIT ('GU  '),
              GN   CHAR (4) INIT ('GN  '),
              GNP  CHAR (4) INIT ('GNP '),
              FOUR FIXED BINARY (31) INIT (4),
              SUCCESSFUL       CHAR (2) INIT ('  '),
              RECORD_NOT_FOUND CHAR (2) INIT ('GE');

      /* This procedure handles IMS error conditions */
      ERROR: PROCEDURE (ERROR_CODE);
      END ERROR;

      /* Main Procedure */
      /* Step 1: first PRES record with PARTY = DEMOCRAT */
      CALL PLITDLI (FOUR, GU, QUERY_PCB, PRES_IO_AREA, PRESIDENT_SSA);
      DO WHILE (PCB.STATUS_CODE = SUCCESSFUL);
        /* Steps 2 and 3: every SADMIT record under this president */
        CALL PLITDLI (FOUR, GNP, QUERY_PCB, SADMIT_IO_AREA,
                      STATE_ADMITTED_SSA);
        DO WHILE (PCB.STATUS_CODE = SUCCESSFUL);
          PUT EDIT (STATE_NAME) (A);
          CALL PLITDLI (FOUR, GNP, QUERY_PCB, SADMIT_IO_AREA,
                        STATE_ADMITTED_SSA);
        END;
        IF PCB.STATUS_CODE ¬= RECORD_NOT_FOUND
          THEN DO;
            CALL ERROR (PCB.STATUS_CODE);
            RETURN;
          END;
        /* Step 4: next PRES record with PARTY = DEMOCRAT */
        CALL PLITDLI (FOUR, GN, QUERY_PCB, PRES_IO_AREA, PRESIDENT_SSA);
      END;
      IF PCB.STATUS_CODE ¬= RECORD_NOT_FOUND
        THEN DO;
          CALL ERROR (PCB.STATUS_CODE);
          RETURN;
        END;
    END DLITPLI;
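
The control flow of this program can be imitated without IMS on a small in-memory tree. The Python sketch below is only an illustration under stated assumptions: the `Rec` class, the helpers `preorder` and `states_for_party`, the record type name `ADMIN`, and the sample records are hypothetical and are not part of DL/I. A real program issues GU, GN, and GNP calls through PLITDLI and tests the status code after each call.

```python
from dataclasses import dataclass, field


@dataclass
class Rec:
    """One record occurrence: its type, its data items, its ordered children."""
    rtype: str
    items: dict
    children: list = field(default_factory=list)


def preorder(rec):
    """Yield rec first, then every child subtree from left to right."""
    yield rec
    for child in rec.children:
        yield from preorder(child)


def states_for_party(roots, party):
    """Imitate the GU / GNP / GN loop of the PL/I program."""
    names = []
    for pres in roots:                     # GU for the first match, GN afterwards
        if pres.items.get("PARTY") != party:
            continue                       # fails the qualification in the SSA
        for rec in preorder(pres):         # GNP: never leaves this parent
            if rec.rtype == "SADMIT":      # unqualified SSA naming SADMIT
                names.append(rec.items["STATE_NAME"])
    return names


# Small sample data base (two data-base trees), for illustration only.
polk = Rec("PRES", {"PRES_NAME": "POLK", "PARTY": "DEMOCRAT"}, [
    Rec("ELECTION", {"YEAR": 1844}),
    Rec("ADMIN", {}, [
        Rec("SADMIT", {"STATE_NAME": "TEXAS"}),
        Rec("SADMIT", {"STATE_NAME": "IOWA"}),
        Rec("SADMIT", {"STATE_NAME": "WISCONSIN"}),
    ]),
])
ike = Rec("PRES", {"PRES_NAME": "EISENHOWER", "PARTY": "REPUBLICAN"}, [
    Rec("ELECTION", {"YEAR": 1952}),
    Rec("ELECTION", {"YEAR": 1956}),
    Rec("ADMIN", {}),
    Rec("ADMIN", {}, [
        Rec("SADMIT", {"STATE_NAME": "ALASKA"}),
        Rec("SADMIT", {"STATE_NAME": "HAWAII"}),
    ]),
])

print([r.rtype for r in preorder(polk)])
print(states_for_party([polk, ike], "DEMOCRAT"))
```

As in the PL/I version, no argument is needed for the intermediate ADMINISTRATION level: the inner loop simply passes over those records while it stays within the current president's tree.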
To run this query, the user would invoke IMS. After IMS has performed the
necessary initializations, it passes control to the procedure called
DLITPLI. Within his program, a user must declare a mask for a communication
buffer, allocate I/O buffers, and set up the formats of the various SSAs
that will be needed.

The declaration of the PL/I structure named PCB is an outline of the
structure for the communication buffer, established by IMS, that is needed
for communication with the user's program. The buffer is allocated by IMS in
the initialization phase, and a pointer to it (QUERY_PCB) is passed as a
parameter to the application program. In general, there may be several
communication buffers, one for each data base accessed by the application.
The declaration of the PL/I structure, PCB, only serves to facilitate access
to the communication buffer via the structure entry names. The buffer is
used by DL/I to advise an application program of the results of its DL/I
calls. All entries in the buffer are read-only, and most are
self-explanatory. The only entry of interest in this example is the
STATUS_CODE. The STATUS_CODE indicates whether the result of a DL/I call is
successful, or an error, a warning, or other condition.

Next, the I/O buffers for the record types are declared. There is usually
one I/O buffer for each record type in the hierarchical definition tree. The
I/O buffers are used to hold the record, corresponding to the appropriate
record type, that is retrieved by IMS, and to make the record available in
the application program. The use of PL/I structures, applied, for example,
to PRESIDENT, facilitates access to the data items of the record type. All
records are assumed to be stored as character data. In this example, only
buffers for the PRESIDENT and STATE ADMITTED record types are required.

Finally, the SSAs are declared. There may be one SSA for each record type in
the hierarchical definition tree. In the example given, only the PRESIDENT
and STATE ADMITTED record types require SSAs. Although the SSAs in this
example do not change after they have been declared and initialized, it is
possible to change the search value and/or the conditional operator by means
of the host language's facilities.

The ERROR procedure will not be described in detail here. It is used to
print error messages and to take appropriate action when an error condition
arises.

As mentioned earlier, a program accesses an IMS data base by making a
subroutine call to DL/I. In PL/I, such DL/I calls are characterized by the
starting sequence CALL PLITDLI. These calls may have a varying number of
parameters. However, in the present example each of the calls uses five
parameters, which are in order of appearance within the call:

  • the number of parameters to follow (in this example, four);
  • the type of call, e.g., GET NEXT, INSERT, etc.;
  • the pointer to the communication buffer for the call;
  • the location of the I/O buffer; and
  • the SSA(s) for the call.

The processing of the query is performed in several steps as follows:

  1) Retrieve the first occurrence of a PRESIDENT record where the PARTY
     data item is equal to DEMOCRAT. This action is performed by means of a
     GET UNIQUE (GU) call using the PRESIDENT_SSA. If no record is selected,
     then the processing is complete.
  2) Get the first STATE ADMITTED record that is a descendant of the
     PRESIDENT record. This action is performed by means of a GET NEXT
     WITHIN PARENT (GNP) call using the STATE_ADMITTED_SSA. Notice that it
     is not necessary to specify an SSA for ADMINISTRATION records. We are
     not really interested under which administration the state was
     admitted, but only under which president. IMS, therefore, selects each
     ADMINISTRATION record in turn as required and processes its associated
     STATE ADMITTED records. If no states were admitted during a president's
     tenure, then we go to step 4.
  3) Get all the other STATE ADMITTED records, in turn, by means of a GET
     NEXT WITHIN PARENT (GNP) call until there are no more STATE ADMITTED
     records for this president. Print the name of each state.
  4) Get the next PRESIDENT record where the PARTY data item is equal to
     DEMOCRAT. This action is effected by a GET NEXT (GN) call using the
     same SSA as for the GET UNIQUE call in step 1. If no PRESIDENT record
     is selected, then the processing is complete. Otherwise go to step 2.

Tree traversal languages usually operate on one record at a time. They
essentially perform a linear tree traversal. As a result, although they
operate on a tree, their processing looks sequential. The nature of the
commands they use can also influence their implementation. It would be nice
if the record in the hierarchy that is logically next would also be the next
physical record. If performed in this way, sequential processing of the
data-base tree would be very efficient. In the Implementation section [see
page 118], we will examine some implementations of a hierarchical data base.

General Selection

The terminology employed in systems that use general selection languages
differs from that of systems using tree traversal languages [5, 6, 7, 9, 10,
11, 17, 20]. Among the general selection languages, a record type is
sometimes referred to as a repeating group. A record occurrence is called a
repeating group occurrence or data set. A data-base tree consisting of one
root record and all its descendants is called a logical entry. Finally, a
data item is called a data element. However, to avoid terminological
confusion, repeating groups will be called record types, and the other terms
will also be named as before. In the following discussion, a hierarchy
structured according to the hierarchical definition trees given in Figure 6
will be used.

General selection languages treat the record occurrences of a record type as
sets of data-item values. Records are selected and qualified according to
the relationships found among the data items of the record types of the
hierarchical definition tree. A qualification in the general selection
languages is usually specified by a WHERE clause [12, 13, 17]. A WHERE
clause consists of the keyword WHERE and a Boolean combination of the
conditions discussed earlier. The WHERE clause specifies the records to be
selected. After these specifications have been effected, upward and/or
downward hierarchical normalization can be performed. Downward hierarchical
normalization is usually restricted to one hierarchical path.

The common characteristic features of general selection languages will be
illustrated by examples using the "Natural Lan-
guage" feature of the SYSTEM 2000 [9, 17]. This feature is typified by an
English-like, interactive query language. The commands of this language
consist of two parts: an action part and a WHERE clause. The action part
specifies the operation to be performed, e.g., retrieval, update, etc., and
the data items on which to operate. The basic retrieval command is the PRINT
command. The following query illustrates upward hierarchical normalization.

    PRINT VICE-PRESIDENT WHERE STATE NAME EQ ALASKA:

This query specifies that we want VICE-PRESIDENT data-item values in
ADMINISTRATION records. In addition, the ADMINISTRATION records must have a
STATE ADMITTED descendant that satisfies the WHERE clause. To provide this,
we select all STATE ADMITTED records satisfying the WHERE clause and then
perform an upward hierarchical normalization to qualify ADMINISTRATION
records. In general, several ADMINISTRATION records may qualify. However,
according to the semantics of the example, only one ADMINISTRATION record
would qualify.

The next query illustrates downward hierarchical normalization.

    PRINT STATE NAME WHERE VICE-PRESIDENT EQ NIXON:

We examine the ADMINISTRATION records to select those records that satisfy
the WHERE clause. Although, in general, several records may be selected, we
are constrained in this example to select only one record. A downward
hierarchical normalization is then performed to qualify STATE ADMITTED
records. Many STATE ADMITTED records may qualify. Note that this and the
preceding query are conceptually symmetric queries. However, they are
answered in quite different ways.

    PRINT ADMIN NUMBER WHERE VICE-PRESIDENT EQ NIXON:

This query involves neither upward nor downward hierarchical normalization.
The query is answered by selecting and qualifying only ADMINISTRATION
records. In this case again, several records may be selected and qualified.

Now consider the query:

    PRINT PRES NAME WHERE VICE-PRESIDENT EQ AGNEW AND
        VICE-PRESIDENT EQ FORD:

The casual user would perhaps expect the response to this query to be Nixon
as the president who had both Agnew and Ford as vice-presidents. However,
the semantics of the WHERE clause are such that the answer to this query is
null, i.e., no PRESIDENT record qualifies. This problem arises because all
Boolean operations must be performed at the same level at which records are
selected, that is, on ADMINISTRATION records. The result is that the
intersection (AND) of two conditions which specify the same data item, but
different values, must be null. The same record cannot have two different
values for the same data item.

If the Boolean operations could be performed on qualified records at a
higher level, then the problem would be solved. The purpose of the HAS
clause is to raise the level at which the Boolean operations are performed
by carrying out an upward hierarchical normalization on the selected
records. For instance, the query:

    PRINT PRES NAME WHERE PRES NAME HAS VICE-PRESIDENT EQ AGNEW
        AND PRES NAME HAS VICE-PRESIDENT EQ FORD:

will produce the answer "Nixon," as expected. In the previous query, using
only the WHERE clause, the intersection (AND) was performed on
ADMINISTRATION records. In this query, the intersection is performed on
PRESIDENT records, since the level of Boolean operations is raised to the
PRESIDENT record by means of the HAS clause. (The level to which upward
hierarchical normalization is performed is specified by the data-item name
preceding the keyword HAS.) In the first query, no ADMINISTRATION record
satisfies both conditions of the WHERE clause. In the second
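
The difference between the plain WHERE clause and the HAS clause can be made concrete with a few lines of Python. This is a minimal sketch under stated assumptions, not SYSTEM 2000 syntax or its implementation: the names `ADMINS`, `vp_is`, `where`, and `has` are hypothetical, and each row stands for one ADMINISTRATION record tagged with the name of its parent PRESIDENT record.

```python
# Each dict stands for one ADMINISTRATION record together with the name of
# its parent PRESIDENT record. The rows are illustrative sample data.
ADMINS = [
    {"pres": "EISENHOWER", "vp": "NIXON"},
    {"pres": "NIXON", "vp": "AGNEW"},
    {"pres": "NIXON", "vp": "FORD"},
    {"pres": "FORD", "vp": "ROCKEFELLER"},
]


def vp_is(name):
    """Build the condition VICE-PRESIDENT EQ name."""
    return lambda rec: rec["vp"] == name


def where(records, *conds):
    """Plain WHERE: every condition is tested on the same record."""
    return [rec for rec in records if all(cond(rec) for cond in conds)]


def has(records, cond):
    """HAS: select records, then normalize upward to their parents."""
    return {rec["pres"] for rec in records if cond(rec)}


# WHERE VICE-PRESIDENT EQ AGNEW AND VICE-PRESIDENT EQ FORD
plain = {rec["pres"] for rec in where(ADMINS, vp_is("AGNEW"), vp_is("FORD"))}

# WHERE PRES NAME HAS ... EQ AGNEW AND PRES NAME HAS ... EQ FORD
raised = has(ADMINS, vp_is("AGNEW")) & has(ADMINS, vp_is("FORD"))

print(plain, raised)
```

In the sketch, `plain` intersects the two conditions on ADMINISTRATION records and is therefore empty, whereas `raised` intersects two sets of PRESIDENT names and contains only NIXON.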
archical sequence. Physical child/physical twin pointers relate all records
of a given type under a parent record to each other, and the parent to its
first child. Both pointer organizations may optionally have backward
pointers. It is possible to specify any combination of pointer organizations
within a data base and different organizations for different record types.

The pointers in the hierarchical direct organization are stored with each
record. The record consists of two parts in this case: a prefix and a data
part. The data part contains the record data as supplied by the user. The
prefix, which is system controlled and not available to an application,
contains system data and the pointers. The system data consists of a record
code, a delete flag, and a counter. The record code identifies the record
type, and the delete flag indicates whether the record has been deleted. The
counter is optional and is only present if the record type participates in a
logical relationship [14]. A simple prefix consisting of the record code and
the delete flag is stored with every record in a multiple record type HISAM
data base.

Within the hierarchical direct organization, the Hierarchical Direct Access
Method (HDAM) organization is used to access root records via a hash
algorithm. Records are hashed into a primary storage area called the root
segment addressable area. The root record type must contain a key data item.
The hash is performed on the key data item. A fixed portion of a data-base
tree, including the root record, is stored in the root segment addressable
area. Additional records of a data-base tree are stored in an overflow area.
A direct address pointer relates records of a data-base tree in the root
segment addressable area with its extension in the overflow area.

The Hierarchical Indexed Direct Access Method (HIDAM) organization provides
indexed direct access to records in a data base. The index in this
organization is a sequential file called INDEX. Each record in the INDEX
file contains the key data item value of a root record and a pointer to the
root record in the data base. Root records are accessed by searching for the
key data-item value in the INDEX and then following the pointer to the data
base. Records in the data base are related by pointers, as discussed
previously.

To summarize, a successful hierarchical implementation can utilize
contiguity and pointers with the following objectives:

  • it is important to have efficient retrieval by eliminating costly
    pointer chasing and resulting secondary storage accesses;
  • the space required both for the data and the pointers should be
    minimized; and
  • costly reorganizations should be avoided by providing a flexible
    environment that allows easy expansion of the data base.

There are other ways of implementing a hierarchical data base. For example,
a method can be used to assign a logical address to a record in a data-base
tree. This logical address can then be mapped into a physical one. A logical
address to a record in a data-base tree is called a trace [16]. An example
of assigning traces to records in a data-base tree will be outlined.

Each record type in a hierarchical definition tree can be identified by a
type-number as indicated in Figure 9(a). Any record occurrence in a
data-base tree can then be identified by its type-number and a generation
tuple. The generation tuple defines a path, in the data-base tree, which
leads to the record occurrence.

For example, suppose that Figure 9(b) corresponds to the third data-base
tree in a hierarchical data base. The root record type of this tree is
identified by the trace 1 (3). The number 1 is the type-number of the record
type. The number 3 indicates that this is the third record of type-1. The
first type-2 child of this root is assigned the trace 2 (3, 1); that is, it
is the first type-2 child under the third root record. The first type-4
descendant under the first type-3 child of the third root record has as its
trace 4 (3, 1, 1). The first number identifies the record type. The numbers
in parentheses define a path to a particular record occurrence. Using a
type-number and a generation tuple, any record occurrence in the data base
can be uniquely addressed by the path to it. In this rep-
resentation, all traces are valid provided the bounds of the hierarchical
definition tree are adhered to, i.e., type-numbers and levels are respected.
However, traces may correspond to records which have not yet been inserted.

[FIGURE 9. Assigning traces to records: (a) type-numbers of the record
types; (b) traces of record occurrences, e.g., 4 (3, 1, 1) and 4 (3, 1, 2).]

Given a trace, there is a straightforward algorithm to obtain the traces of
ancestors, descendants, and brothers. The rules that govern our example
representation of traces are:

  1) Ancestor trace: drop the trailing digits of the generation tuple that
     lie beyond the level of the ancestor, and change the type-number.
  2) Descendant traces: add (descendant level minus current level) digits in
     the next generation tuple places, and change the type-number.
  3) Next brother trace: add one to the last digit in the generation tuple.
  4) Previous brother trace: subtract one from the last digit in the
     generation tuple (if digit ≠ 1).

Sometimes, valid traces are restricted by specifying some bounds such as
maximum number of levels in the hierarchical definition tree, maximum number
of children at each level, etc.

Traces for record occurrences already in the data base are kept in a trace
table. A trace table gives a mapping between a trace and the location where
the record occurrence is stored. This mechanism captures all the structure
of a hierarchical data base. Given the trace of a record occurrence, one can
find its ancestors, the record occurrence itself, its brothers, and its
descendants. If we can efficiently implement the trace table, then we have a
nice implementation of the hierarchical data structure [4].

CONCLUDING REMARKS

To summarize the essential points made by this survey, a hierarchical system
is a DBMS which presents to the users of the system certain explicit views
of the data base that are characteristic of the hierarchical data model. The
hierarchical data model has the following characteristics:

  1) There is a set of record types {R1, R2, ..., Rn}.
  2) There is a set of relationships connecting all record types in one data
     structure diagram.
  3) There is no more than one relationship between any two record types Ri
     and Rj. Hence, relationships need not be labeled.
  4) The relationships expressed in the data structure diagram form a tree
     with all arcs pointing toward the leaves.
  5) Each relationship is 1:N, and it is total; that is, for every Rj record
     occurrence there is exactly one Ri record occurrence connected to it,
     if Ri is the parent of Rj in the definition tree.

Hierarchical systems deal with relationships among attributes of the same
entity in a manner similar to relational and network systems discussed in
the companion papers in this issue. All of the systems organize the
attributes as items in groups. Their main difference is found in the way
they treat re-
lationships between entities. Relational systems provide operations on
relations which construct new relations representing relationships among
entities. As a result, relational systems do not use any additional concepts
to represent the relationships among entities. The system is homogeneous in
that all relationships are represented by relations. Both hierarchical and
network systems use the idea of a link or connection between record types to
handle relationships between entities. In network systems, the data
structure sets serve as the logical links between record types in a network.
In the case of hierarchical systems, the parent-child relationships
represented by the hierarchical definition tree are the links.

Hierarchical systems have been available and well accepted for a long time
[5, 6, 7, 9, 10, 11, 14, 17, 20]. It is difficult to relate the success of a
particular system to its data model. There are many other parameters which
influence the quality of a commercial system. However, for some applications
a hierarchical data model seems very natural, e.g., a corporate management
structure is truly hierarchical. In addition, most applications can be
modeled by a hierarchical organization of the data, although some
applications produce more difficulty and redundancy than others.

A hierarchical data model provides no means for implementing direct N:M
relationships between record types. Such a relationship can only be effected
within record types. However, most hierarchical systems do provide the
ability to handle many hierarchical definition trees. By using data
duplication, one can represent an N:M relationship by two hierarchical
definition trees, each representing a 1:N relationship. For instance,
consider the record types STATE and COMPANY and the N:M relationship
"registration" between them. That is, a company may be registered in many
states and a state may have many companies registered in it. The N:M
relationship between STATE and COMPANY can be handled by using two
hierarchical definition trees, one with root record type STATE and child
record type REGISTRATION, the other with root record type COMPANY and child
record type REGISTRATION.

Note that the N:M relationship could be represented in a single hierarchical
definition tree with one record type as the root and the other as the child.
However, data duplication is again required: for example, if the root is
STATE, then the COMPANY record occurrences would have to be repeated under
each state in which the companies are registered. If the amount of data
associated with each company is quite large, then a great deal of storage
space is required for duplication. It should be noted, however, that some
existing hierarchical systems have facilities which essentially eliminate
this problem: they implement N:M relationships by using logical pointers
among hierarchies. They allow logical hierarchical views that are different
from the physically implemented hierarchical structures [9, 14, 17].

Hierarchical systems also handle conceptually symmetric queries in a very
different manner. For example, consider the relationship SERVED in Figure 1.
If PRESIDENT is the root, then a query such as: "Find all congresses in
which President P served," is simple to answer. For President P, all
CONGRESS descendants are found. However, the symmetric query: "Find all
presidents who served in Congress C," involves a data base search of
CONGRESS records. For every President, it is necessary to determine if he
served in Congress C. Sometimes, content addressability, e.g., inverted
files, may be used to speed the search. In addition, if two definition
trees, one with the root CONGRESS and one with the root PRESIDENT, are used,
then the problem disappears.

Some specific advantages are widely accepted for the hierarchical approach:

  • It is a simple data model which provides the user with relatively few,
    easy to master, commands.
  • Because of the constraints on the types of relationships allowed, it can
    allow an easier implementation than other, more complex structures.

Some specific disadvantages are also associated with hierarchical systems:
scendants if null records are not permitted. Consequently, users have to be careful when performing a delete operation.
• It is sometimes not possible to answer symmetrical queries easily in a hierarchical system. Therefore, the structure of the data base may tend to reflect the needs of the application.
Sometimes criticism is concentrated on the nature of the hierarchical commands, which are claimed to be too procedural. However, this criticism can be answered, as we have seen in the section General Selection (see page 116), by showing that higher level interfaces can be implemented. In this way, even a casual user interface can be easily accommodated.
REFERENCES
[1] ABRIAL, J. R., "Data semantics," Data base management, Klimbie, J. W., and Koffeman, K. L., [Eds.], North-Holland Publ. Co., Amsterdam, The Netherlands, 1974, pp. 1-59.
[2] BACHMAN, C. W., "Data structure diagrams," Data Base 1, 2 (1969), 4-10.
[3] BAYER, R.; AND MCCREIGHT, E., "Organization and maintenance of large ordered indexes," Acta Informatica 1, 3 (1972), 173-189.
[4] BERNSTEIN, P. A.; AND TSICHRITZIS, D. C., "Allocating storage in hierarchical data bases," Technical Report CSRG-34, Computer Systems Research Group, Univ. of Toronto, May 1974 (to appear in Information Systems Journal).
[5] BLEIER, R. E., "Treating hierarchical data structures in the SDC time-shared data management system (TDMS)," in Proc. ACM National Conf. 1967, ACM, New York, 1967, pp. 41-49.
[6] BLEIER, R. E.; AND VORHAUS, A. H., "File organization in the SDC time-shared data
system (RFMS) users manual," TRM-16, Computation Center, Univ. of Texas at Austin, Texas, August 1971.
[11] FRANKS, E. W., "A data management system for time-shared file processing using a cross-index file and self-defining entries," in Proc. AFIPS, Spring Jt. Computer Conf., 1966, Vol. 28, Spartan Books, New York, 1966, pp. 79-86.
[12] HARDGRAVE, W. T., "Theoretical aspects of Boolean operations on tree structures and implications for generalized data management," TSN-26, Computation Center, Univ. of Texas at Austin, Texas, August 1972.
[13] HARDGRAVE, W. T., "BOLTS: a retrieval language for tree-structured data base systems," Information Systems, COINS-IV, Plenum Press, New York, 1974.
[14] IBM, INFORMATION MANAGEMENT SYSTEM/VIRTUAL STORAGE (IMS/VS) PUBLICATIONS 1975:
General information manual, GH20-1260-3.
System/application design guide, SH20-9025-2.
Application programming reference manual, SH20-9026-2.
System programming reference manual, SH20-9027-2.
Operator's reference manual, SH20-9028-1.
Utilities reference manual, SH20-9029-2.
Messages and codes reference manual, SH20-9030-2.
[15] KNUTH, D. E., The art of computer programming, Vol. 1, Addison-Wesley Publ. Co., Reading, Mass., 1968, p. 316.
[16] LOWENTHAL, E. I., "A functional approach to the design of storage structures for generalized data management systems," PhD Thesis, Univ. of Texas at Austin, Texas, 1971.
[17] MRI SYSTEMS CORP., "SYSTEM 2000 general information manual," Austin, Texas, 1972.
[18] PARSONS, R. G.; DALE, A. G.; AND YURKANAN, C. V., "Data manipulation language requirements for database management systems," Computer J. 17, 2 (May 1974), 99-103.
[19] SCHMID, H. A.; AND SWENSON, J. R., "On the semantics of the relational data model," in Proc. ACM SIGMOD, Internatl. Conf. on Management of Data, 1975, ACM, New York, 1975, pp. 211-223.
[20] UNITED COMPUTING SYSTEMS, INC., UCS-VI UNIDATA data management system reference manual, Kansas City, Missouri, 1970.
EDGAR H. SIBLEY
Department of Information Systems Management, University of Maryland, College Park, Maryland 20742,
and National Bureau of Standards, Washington, D.C. 20285
This paper deals with the history and definitions common to data-base
technology. It delimits the objectives of data-base management systems,
discusses important concepts, and defines terminology for use by other papers in
this issue, traces the development of data-base systems methodology, gives a
uniform example, and presents some trends and issues.
Keywords and Phrases: data base, data-base management, data definition,
data manipulation, generalized processing, data model, data independence,
distributed data base, data-base machines, data dictionary
CR Categories: 3.51, 4.33, 4.34
8 * James P. Fry and Edgar H. Sibley
formance of the system. Although it is possible to add functional capabilities to an existing system, the cost of retrofitting is often prohibitive, and the post-design addition may adversely affect the system performance. Although quality, security, and control factors are given relatively scant treatment in other papers in this issue of SURVEYS, it should not be inferred that these are unimportant. In fact, the consequences of excellent or poor satisfaction of these needs may make or break a working system.
Data Availability
Everest [G12] states that the major objective of a DBMS is to make data sharing possible. This implies that the data base as well as programs, processes, and simulation models are available to a wide range of users, from the chief executive to the foreman (Everest and Sibley [GS]). Such sharing of data reduces its average cost because the community pays for the data, while individual users pay only for their share. However, under these circumstances the data cannot "belong" to any individual, program, or department; rather, it belongs to the organization as a whole.
What, then, is the overall cost of data? One way to answer this question is by observing data entry. Keypunching and verifying, or other types of data entry involving human keystroking, tend to cost about 50¢ per thousand characters input. Thus, if the average-sized file is two million characters (a figure representative of much of today's industry and government), it costs $1000 to input each average-sized file. Under certain conditions the cost of collecting data could be substantially higher, e.g., when the data must be collected by telemetry, or in long and complicated experiments.
Another expense is associated with the lack of data, the so-called "lost opportunity cost." If data is not available when an important decision is to be made, or if duplicate but irreconcilable data exists, an ad hoc and possibly wrong decision results. Nolan [A4] gives a scenario of a typical business where a manager knew that data existed, but some of it had been produced on a different machine and some had incompatible formats (different structures on different tapes). Moreover, none of the data definitions were easily available. The manager who needed the data for important predictions was unable to obtain answers in a reasonable amount of time.
There are two important mechanisms for making data available: the "data definition" and the "data dictionary." A data definition is a more sophisticated version of a DATA DIVISION in COBOL, or a FORMAT statement in FORTRAN; however, a data definition is supplied outside the user program or query and must be attached to it in some way. The data definition (as specified by a data administrator) generally consists of a statement of the names of elements, their properties (such as character or numerical type), and their relationship to other elements (including complex groupings) which make up the data base. The data definition of a specific data base is often called a schema.
When the data definition function is centralized (which is necessary to achieve the objectives of DBMS), control of the data-base schema is shifted from the programmer to the data administrator [A1]. The programmer or the ad hoc user of a query language is no longer able to control many of the physical and logical relationships. While this restricts the programmer to some extent, it means that all programs use the same definition; thus any new program can retrieve or update data as easily as any other. Furthermore, greater data definition capabilities are provided, the storage and retrieval mechanisms are hidden from the program, the formats cannot be lost, and the programmer's task is simpler.
Centralized data definition facilitates the control of data duplication, which generally entails some storage inefficiency. However, not all duplication of data is bad; a controlled duplication may be necessary to allow special classes of users to obtain especially fast responses without penalizing quality for other users.
The data definition facility is inherent to all DBMS. Without it, the data base is owned by its programs, difficult to share,
Computing Surveys, Vol. 8, No. 1, March 1976
and generally impossible to control. This, then, is the cornerstone of data-base management systems.
Whereas the data definition facility is the data administrator's control point, the data dictionary [D1] provides the means of broadcasting definitions to the user community. The data dictionary is the version of the data definition that is readable by humans. It provides a narrative explanation of the meaning of the data name, its format, etc., thus giving the user a precise definition of terms; e.g., the name TODAYS-DATE may be defined narratively and stated to be stored in ANSI standard format as Year: Month: Day.
Within the past five years a number of data dictionary packages have appeared on the market [D2]. Some of these are an integral part of the data definition function, while others provide an interface to multiple DBMS, and still others are stand-alone packages.
The dictionary will normally perform some, if not all, of the following functions: storage of the definition, response to interrogation, generation of data definition for the DBMS, maintenance of statistics on use, generation of procedures for data validation, and aid in security enforcement. Obviously, storage of the data definitions in the dictionary is obligatory.
The dictionary will normally be able to either provide formatted dictionaries (on request) or respond to a simple query for a data entry, or to do both. This facility allows ad hoc users to browse through the definitions (on- or off-line) to determine correct data names.
In some dictionary systems, especially those that augment a DBMS, the data administrator can invoke a data definition generator. This allows the administrator to pick names of elements from the dictionary, group them, and then produce a new data definition.
The dictionary may be both a collector for, and a repository of, statistics on DBMS usage. These statistics can be utilized to improve the efficiency of the DBMS by regrouping elements for better accessing.
The dictionary may contain information on techniques for validation of particular elements, and the data validation statements can be used to generate procedures for input editing or other quality checking.
The data dictionary is extremely important as part of the DBMS security mechanism. If an adversary knows you are gathering data, that adversary has already violated your security. For this reason, the data dictionary should be as secure as the DBMS. Furthermore, if security requirements are retained in the dictionary they can be automatically checked (and special procedures can be invoked) every time a data definition is produced for the DBMS. This would improve security monitoring.
Data Quality
Perhaps the most neglected objective of DBMS is the maintenance of quality. Problems relating to the quality of data and the integrity of systems and data go hand-in-hand. Data may have poor quality because it was:
• never any good (GIGO: garbage in, garbage out);
• altered by human error;
• altered by a program with a bug;
• altered by a machine error; or
• destroyed by a major catastrophe (e.g., a mechanical failure of a disk).
Maintenance of quality involves the detection of error, determination of how the error occurred (with preventive action to avoid repetition of the error), and correction of the erroneous data. These operations entail precautionary measures and additional software functions within the data-base management system. The prevention and correction of the five listed causes of error will now be briefly discussed.
In dealing with normal data-processing applications, the programmer is faced with a great deal of input validation. A survey by the authors showed that about 40% of the PROCEDURE divisions of present-day industrial COBOL programs consist of error-checking statements. If the validation requirements can be defined at data definition time, then error checks may be applied automatically by the system at input, update, manipulation, or output of data, depending on the needs specified by the data adminis-
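As a rough illustration of validation requirements stated at data definition time, the following Python sketch attaches a check to each element of a small data definition and applies the checks automatically on input. The element names TODAYS-DATE and PRES-NAME appear in this paper; everything else (the `ElementDefinition` class, the colon-separated date layout, and the upper-case rule for names) is a hypothetical assumption, not the facility of any particular DBMS.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ElementDefinition:
    """One element of a (hypothetical) data definition, with its validation rule."""
    name: str
    description: str              # narrative text, as a data dictionary would hold
    check: Callable[[str], bool]  # validation rule stated at definition time


def is_ymd_date(value: str) -> bool:
    """Assumed layout 'Year:Month:Day'; only coarse range checks are made."""
    parts = value.split(":")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        return False
    _, month, day = (int(p) for p in parts)
    return 1 <= month <= 12 and 1 <= day <= 31


# The centrally maintained definition: every program is checked against the same rules.
DEFINITION: Dict[str, ElementDefinition] = {
    "TODAYS-DATE": ElementDefinition(
        "TODAYS-DATE", "current date, stored as Year:Month:Day", is_ymd_date),
    "PRES-NAME": ElementDefinition(
        "PRES-NAME", "surname of a president, upper-case letters only",
        lambda v: v.isalpha() and v.isupper()),
}


def validate_record(record: Dict[str, str]) -> List[str]:
    """Apply the definition's checks to an input record; return error messages."""
    errors = []
    for name, value in record.items():
        element = DEFINITION.get(name)
        if element is None:
            errors.append(f"{name}: not in the data definition")
        elif not element.check(value):
            errors.append(f"{name}: invalid value {value!r}")
    return errors
```

Because the rules live with the definition rather than in each program, a new program gets the same input editing without writing its own error-checking statements; calling `validate_record` on a record with a month of 13 in TODAYS-DATE returns a one-item error list.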
reprocessing the logged update transactions. This procedure tends to be very slow.
The quality and integrity of data depend on input-validation techniques in the original data definition, logging of data-base changes, periodic snapshots of the entire machine status, and total or incremental data-base dumping. These operations require additional software in the data-base management system, both for initiation of the protective feature and for its utilization to reconstitute a good data base. However, they entail an overhead expense which adds to the normal running cost.
Privacy and Security
The third major objective of data-base management systems is privacy: the need to protect the data base from inadvertent access or unauthorized disclosure. Privacy is generally achieved through some security mechanism, such as passwords or privacy keys. However, problems worsen when control of the system is decentralized, e.g., in distributed data bases, where the flow of data may overstep local jurisdictions or cross state lines.
Who has the responsibility for the privacy of transmitted data? When data requested by someone with a "need to know" is put into a nonsecure data base and subsequently disseminated, privacy has been violated. One solution to this problem is to pass the privacy requirements along with the data, which is an expensive, but necessary addition. The receiving system must then retain and enforce the original privacy requirements.
Security audits, another application of the audit trail, are achieved by logging access (by people and programs) to any secure information. These mechanisms allow a security officer to determine who has been accessing what data under what conditions, thereby monitoring possible leakage and preventing any threat to privacy. Much of this technology is, however, still in its infancy.
Management Control
The need for management control is central to the objectives of data-base management. It includes the establishment of the data administration function and the design of effective data bases. Data administration currently uses primitive tools; a discussion of them would be beyond the scope of this paper (see [A1, 2, and 3]). However, it is important to note that data-base design involves tradeoffs, because users may have quite incompatible requirements. As an example, one group may require very rapid response to ad hoc requests, while another requires long and complicated updating with good security and quality control of the data. The implementation of a system responsive to the first need may suggest a storage technique quite different from that needed by the second. The only way to resolve such a conflict is to determine which user has the major need. If the requirements are equally important, a duplicate data base may be necessary, one for each class of user.
Although the installation of a data-base management system is an important step toward effective management control, today's data administrator faces a challenge: the available tools are simplistic and seldom highly effective. They involve simulation, data gathering, and selection techniques. Some new analytical methods appear promising [G3]. These methods select the "best" among several options of storage techniques, but they are usually associated with one particular DBMS rather than with several.
Data Independence
Many definitions have been offered for the term data independence, and the reader should be aware that it is often used ambiguously to define two different concepts. But first, we must define other terms. A physical structure² describes the way data values are stored within the system. Thus pointers, character representation, floating-point and integer representation, ones- or
² The terms data structure and storage structure, which were promulgated by the CODASYL Systems Committee [U2], can be attributed to D'Imperio [DL2]. However, in computer science, the term data structure is more closely associated with physical implementation techniques such as linked lists, stacks, ring structures, etc. To prevent ambiguity we opt for the more basic terms, logical and physical structure.
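To make the notion of physical structure concrete, the Python sketch below stores one logical value in several physical representations (integer, floating-point, and character), using the standard `struct` module. The function names and the choice of byte orders are assumptions made for illustration only; the point is that the logical value is unchanged while the stored bytes differ.

```python
import struct


def physical_forms(value: int) -> dict:
    """Return several physical (byte-level) representations of one logical integer."""
    return {
        "big_endian_int32": struct.pack(">i", value),     # 4-byte integer, high byte first
        "little_endian_int32": struct.pack("<i", value),  # 4-byte integer, low byte first
        "float64": struct.pack(">d", float(value)),       # 8-byte floating-point
        "characters": str(value).encode("ascii"),         # decimal digits as characters
    }


def logical_value(kind: str, raw: bytes) -> int:
    """Map a physical representation back to the logical value it stands for."""
    if kind == "big_endian_int32":
        return struct.unpack(">i", raw)[0]
    if kind == "little_endian_int32":
        return struct.unpack("<i", raw)[0]
    if kind == "float64":
        return int(struct.unpack(">d", raw)[0])
    if kind == "characters":
        return int(raw.decode("ascii"))
    raise ValueError(f"unknown physical representation: {kind}")
```

A program that works only through `logical_value` does not need to change when the stored form changes, which is one informal way to read the separation of logical and physical structure drawn in the text.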
Mapping from the Logical to the Physical Structure
The need to create and load a data base, i.e., to make the data definition and then populate it with data, leads to the physical structure, which is the representation of data in storage. The accessing process for the data base management system is shown in somewhat oversimplified form in Figure 5. The definition of the logical structure is stored within the DBMS and associated with the request so that any logical relations may be derived. As an example, the request:
PRINT SPOUSE WHERE PRES-NAME = "FORD"
does not mention that we are dealing with a PRESIDENT entity; it is left to the DBMS to discover this fact from the logical structure. The physical mapping must have some mechanism that will determine which data to retrieve (using the key PRES-NAME if possible), and then will call the relevant operating system access method and apply any deblocking that is necessary to return the required portion of the character stream.
[FIGURE 5. Logical and physical aspects of a DBMS. Labels recoverable from the figure: user or program request; stored definition; logical association of names in request with data definition (logical structure).]
The process of mapping from occurrences of data to their bit-string representation on disk or tape is generally system-dependent; therefore, these factors are discussed in the separate papers in this issue of SURVEYS. Most DBMS format (block and manage) the pages or records themselves, and most use the operating system access method to store and retrieve the data from secondary devices.
In fact, because most modern DBMS use the available operating system, they generally use many of its facilities. Therefore, communication management facilities, program library management, access methods, job scheduling, special program management (e.g., sorting and compiling), concurrent access prevention, checkpoint facility, etc. typically are all "adopted" by the DBMS, though some rewrite and additions may be necessary.
4. HISTORICAL PERSPECTIVES
The origin of DBMS can be traced to the data definition developments, the report generator packages, and the command-and-control systems of the fifties, a time when computers were first being used for business data processing. Many systems have been developed since the fifties (see the surveys by Minker [U1, 4]). MITRE [U3, 8] and CODASYL [U2, 7] show numerous system implementations that have generated wide interest among users.
In 1969 Fry and Gosden [U5] analyzed several DBMS and developed a three-category taxonomy: Own Language, Forms Controlled, and Procedural Language Embedded. Succinctly stated, these categories can be contrasted as follows: Own Language Systems (such as GIS [V16]) have a high-level, unconventional programming language; Forms Controlled Systems (such as MARK IV [V12]) use the "fill-in-the-blank" approach; and Procedural Language Systems (such as I-D-S [V9]) take advantage of existing higher-level programming languages.
In 1971 the CODASYL Systems Committee [I6] observed that the most significant difference among DBMSs was the method employed in providing capabilities to the user. The Committee developed a two-category classification scheme, Self Contained (which included the Forms Controlled category) and Host Language.
It is impossible to survey all systems, but it is possible to trace the evolution of the DBMS by tracing the evolution of two precursors of data base management: data definition languages and the development of generalized RPG systems.
Evolution of Data Definition Languages
One important factor in the evolution of DBMS is the development of data definition languages. They provide a facility for describing data bases that are accessed by multiple users and by diverse application programs.
Centralized Data Definition: Fifties and Sixties
Probably the first data definition facility was the COMPOOL [DL1] developed at the MIT Lincoln Laboratory for the SAGE Air Defense System in the early fifties. COMPOOL provided a mechanism for defining attributes of the SAGE data base for its hundreds of real-time programs. The COMPOOL concept was later carried over to JOVIAL [PL4] (a programming language), but some of the capability was lost when the language was implemented under a generalized operating system; the data definition became local to the language rather than global to the system.
About the same time, hardware vendors were developing programming languages for business applications: FACT [PL1] was developed by Honeywell, GECOM [PL3] by the General Electric Company, and Commercial Translator [PL2] by IBM; all provided some form of data-definition facility. GECOM and Commercial Translator provided the capability of defining intrarecord structures, and FACT offered the more advanced capability of providing inter-record hierarchical structures.
Under the aegis of CODASYL, these vendor efforts were merged into COBOL [PL5] in the late fifties. This language has a centralized DATA DIVISION which achieves the separation of the description of data from the procedures operating on it. While the DATA DIVISION initially mirrored the data as stored on tape or cards, implementors soon found themselves using different ways of physically storing data. This inherent incompatibility between physical data stored by different manufacturers becomes an important factor when data must be exchanged between two systems.
Approaches which attempt to mitigate the
data-transfer problem are the subject of recent research on the description of physical structures and the development of stored-data definition languages.
Stored-Data Definition Languages: the Seventies
One of the first efforts in this area was mounted by the CODASYL Stored-Data Definition and Translation Task Group [SL2] in 1969 with the goal of developing a language to describe stored data. At the 1970 ACM SIGFIDET (now SIGMOD) meeting, a preliminary report was made [SL3], and later reports were published in 1972 [SL5]. Notable basic research efforts in the development of these languages were reported by Smith [SL1] and Sibley and Taylor [SL4, 7] in 1971.
The Data Independent Accessing Model (DIAM) [DL3], developed by Senko and his colleagues at the IBM San Jose Research Laboratory, provides a multilevel data description capability. The description starts at the information level, structures this into a logical definition, adds encoding information, and ends with a physical description of the storage device and its logical-to-physical mapping structure. Each level provides augmentation of the description at the preceding level. Recent work by Senko [DL4, 5] extends the information level in a new language called FORAL.
Thus, the single-level data description facility of the fifties, made incompatible by storage developments in the sixties, led to the recent development of stored-data description facilities in the seventies.
Development of Report Generator Systems
The development of programming languages originally allowed the user (a programmer) to define reports by giving simple definitions of the format of the lines and then writing procedures to move data into buffers prior to printing each line. Therefore, the program written to produce a complete report could consist of large numbers of statements involving expensive programming. The development of report generators stems from a need to produce good reports without this large programming effort. In most cases, report generators can perform complex table transformations and produce sophisticated reports from a data base. These, then, allowed the user to examine and manipulate large volumes of data, and they may be said to be a precursor, or a particular type of modern DBMS.
The Hanford/RPG Family (Figure 6)
The patriarch of today's RPG system was developed at the Hanford (Washington) operations of the Atomic Energy Commission, which was then managed by the General Electric Company. In 1956 Poland, Thompson, and Wright developed a generalized report generator [G1] (MARK I) and a generalized sort routine for the IBM 702. The capability was extended in 1957 by the development of a report and file maintenance generator (MARK II). These routines provided the basis for a joint development by several users under the SHARE organization of the 709 Package (9PAC) [W1] for the IBM 709/90.
9PAC is the principal ancestor of most commercial report generators developed since 1960. Foremost among these is the Report Program Generator (RPG) developed for the IBM 1401 in 1961; this has evolved into the RPG for the IBM System/360 and an enhanced RPG II for the IBM System/3, System/360, and several other computers [W2, 3]. Other members of the Hanford family include the COGENT systems, developed by Computer Sciences Corporation for the IBM 709 and System/360 between 1965 and 1969 [Y5], and the SERIES system [Y9].
Another system, also based on MARK II ideas, was being defined during the late fifties in a SHARE 704 Project under Fletcher Jones. This IBM 704 system, called SURGE [W4], was the predecessor of GIRLS, the patriarch of the Postley/MARK IV family.
[Figure 6: the Hanford/RPG family. Only the labels "RPG II (IBM System/3)" and "SERIES" survive from the figure.]
5. DEVELOPMENT OF DBMS
The development of the data-base management systems may be divided into three somewhat overlapping periods: the early developments, prior to 1964; the establishment of families during the period 1964-1968; and the vendor/CODASYL developments from 1968 to the present. Since the characteristics of the data-base management systems differ considerably during these periods, we discuss them separately.
Early Developments: Prior to 1964
The impetus for DBMS development came originally from users in government, particularly from the military and intelligence areas, rather than from industry and computer manufacturers. Although these prototypes bear little resemblance to today's systems and were somewhat isolated, they provided some interesting "firsts" in the evolution of data-base technology. They also provided the beginnings of several significant DBMS families.
In 1961 Green [X2] and his colleagues developed a natural-language system called BASEBALL. Though not a data-base management system by current definition, it made a contribution to the technology by providing access to data through a subset of natural language (a limited vocabulary of baseball-related terms). At approximately the same time, the first implementation of a B-Tree was described by Collilla and Sams [X6].
Cheatham and Warshall were probably the first to discuss the translation of a query language. They designed a language, QUERY [X7], and developed techniques for analyzing its syntax and compiling statements into machine code.
One of the first identifiable data-base management systems to appear in the literature was an elegant generalized tape system developed by Climenson for the RCA 501 in 1962. This system, called Retrieval Command-Oriented Language [KS], provided five basic commands, with Boolean statements permitted within some of them. The user had to specify the data description with the query so that a program could be bound to its data.
Another early and ambitious development was ACSI-MATIC [X1], sponsored by the US Army in the late fifties. This system was designed by Minker to emphasize effective memory utilization and inferential processing. It could make inferences such as: if John is the son of Adam, and Mary is the sister of John, then Mary is the daughter of Adam. It contributed the first generalized data-retrieval accessing package for a disk-oriented system with batched requests, a dynamic storage algorithm for managing core storage, and the first assembler to use a dynamic storage allocation routine. Because disks were not reliable at that time, the ACSI-MATIC system was never fully imple-
[FIGURE 7. The MITRE/Auerbach Family. Labels recoverable from the figure: 1962 MITRE: ADAM (IBM 7030), COLINGO (IBM 1401); 1965 MITRE: C-10 (IBM 1410); 1967 SDC: LUCID (see Fig. 13); ACSI-MATIC; MITRE branch; 1969 AUERBACH: DM-1 (U 1218).]
[Figure residue, caption not recovered: 1962 INFORMATICS, MARK I (IBM 1401/60); 1964 INFORMATICS, MARK II (IBM 1401/60); 1966 INFORMATICS, MARK III (IBM 1401/60).]
[Figure residue, caption not recovered: SAGE; 1958 IRS (DTMB) (IBM 704); 1959 TUFF/TUG (DTMB) (IBM 704/9).]
support the needs of the Command-and-Control and the Intelligence communities. Perhaps the most prolific of these was the Formatted File family, which spans all three development periods. Its origins can be traced to a series of systems developed at the David Taylor Model Basin³ by Davis, Todd, and Vesper. One of the principal systems, Information Retrieval (IR) [X3, 9], was an experimental prototype developed in 1958 for the IBM 704. This was followed by two formatted file-processing packages: Tape Update for Formatted Files, TUFF [X16, 20], and Tape Updater and Generator, TUG [X5] (both developed to run on the IBM 704).

³ The David Taylor Model Basin is now called the David Taylor Naval Ship Research and Development Center.

Later this family split into two branches in the Air Force and Navy. The Air Force branch, SAC/AIDS Formatted File System [X14], was developed in 1961 for the Strategic Air Command 438L system. Its major contribution to data technology was the development of a file format table, i.e., a "self describing" data base. By storing a machine-readable data definition with the data, each data base was directly accessible by FFS.

The Navy branch, Information Processing System (IPS) [X11, 12, and Y10], was also developed in 1963 for the CDC 1604 by NAVCOSSACT. IPS also made contributions to data-base technology in the implementation of a multilevel hierarchically structured data base on sequential media, and in its implementation on several different hardware systems, such as the IBM 709/90 [X19] and the AN/FYK-1 [X32].

During the implementation of IPS in 1963, another branch of the family was developed for the Naval Fleet Intelligence Center in Europe (FICEUR) [X10]. This FFS was patterned after the SAC FFS and implemented on the IBM 1410. SAC also added an FFS on the IBM 1401 for the Pacific Air Force Headquarters. This system was later reprogrammed for the IBM System/360 and is still in use on smaller models.

About 1965 the SAC and FICEUR branches of the formatted-file family merged, resulting in the NMCS Information Processing System (NIPS) [X17]. NIPS added the concepts of logical file maintenance, improved query language, and on-line processing. In 1968 NIPS was converted from IBM 1410 to IBM System/360 and named NIPS-360 [Y12].

A cousin of NIPS was also developed for the intelligence community: the Intelligence Data-Handling Formatted File System [X26]. This emphasized efficient large-file processing and provoked interest in machine-independent implementation using COBOL. Prototype development of such a system began in 1968 by the Defense Intelligence Agency. The effort was first named the COBOL Data Management System (CDMS) [Y8]; later (1970) it was renamed the Machine Independent Data Management System (MIDMS) [Y11]. It was originally implemented on the IBM System/360 and was later coded (in 1973) for the H6000 series.

SAC FFS is considered to have inspired IBM's Generalized Information System (GIS) [V16, 17]. This was originally developed as a stand-alone program product for System/360 (1965), but has been extended and enhanced to act as either a stand-alone system or ad hoc interrogation interface for the IMS family.

Vendor/CODASYL Developments: 1968 to the Present

The trend in this period shifts from in-house family-oriented activities to proprietary vendor development. As a result, some advances made by commercially available DBMSs disappeared into a veil of secrecy. While few references have appeared recently on the internals of particular DBMSs, the technical literature abounds with articles on mathematical and theoretical aspects, especially of relational systems. Chamberlin's article (see page 43) provides an excellent bibliography of this development. Recent years also show the entry of CODASYL into the data-base field.

CODASYL/DBTG Family (Figure 11)

Based upon the pioneering ideas of I-D-S and APL, the CODASYL Programming
Language Committee started a new task group to work on a proposal for extending COBOL to handle data bases [PL6]. This group was originally called the List Processing Task Group, though its name was later changed to the Data Base Task Group (DBTG), its major acronym, which will be used here. The first semipublic recommendations of the DBTG were made in 1969 [S1]. These recommendations detailed the syntax and semantics of a Data Description Language (DDL) for describing network-structured data bases, and the definition of Data Manipulation Language (DML) statements to augment COBOL. The task group intended that the DDL specifications should be available to all programming languages, while extensions like the DML would be needed for every language.

The initial DBTG specification was reviewed by many user and implementation groups. Their recommendations were further considered, and a new report was issued in 1971 [S2]. The major change involved separation of the data description into two parts: a Schema DDL for defining the total data base, and a Sub-schema facility for defining various views of the data base consistent with different programming languages.

Based on the reviews of the 1971 report, CODASYL took two significant actions:
• a new standing committee was created to deal exclusively with the data description, the Data Description Language Committee (DDLC); and
• the DBTG was replaced by a new task group to deal only with COBOL extensions, the Data Base Language Task Group (DBLTG).
Since that time, a new subcommittee has also been formed to add DML statements to FORTRAN.

The DDLC was charged with taking the Schema DDL and developing a common data description language to serve the major programming languages. In January 1974 a first issue of the Data Description Language Committee's publication, the Journal of Development, was published [S3]. This report specifies only the syntax and semantics of the DDL.

The DBLTG was charged with making the 1971 report of the DBTG consistent with CODASYL COBOL specifications. In February 1973 the DBLTG submitted its report to the CODASYL Programming Language Committee. This report is very similar to the 1971 DBTG report, with nomenclature and relatively cosmetic changes. New items in the 1973 report included an extension to the facility for dealing with error returns.

Implementation of systems which conformed to the 1969, 1971, and 1973 DBTG specifications started in 1970 with the UNIVAC DMS 1100 [V22] for the 1108, and since then for the UNIVAC 1110 series computers. At about the same time, B. F. Goodrich implemented a system called Integrated Data Management System, IDMS [V7], for the IBM System/360. This has since been extended to IDMS-11 for the Digital Equipment Corporation PDP 11/45. The IDMS series is marketed by Cullinane Corporation. The Digital Equipment Corporation has implemented DBMS-10 [V8] for its PDP 10 computer system.

Some extensions to self-contained facilities for ad hoc interrogations have been implemented by Control Data Corporation, Query/Update [V6], and by Xerox Data Systems, EDMS [V23]. In the Netherlands, Philips implemented a family of systems termed PHOLAS [V19], and in Norway the SIBAS [V20] system has been developed by Shipping Research Services. Honeywell has updated I-D-S to conform to 1973 specifications; this is the I-D-S/II [V9].

IMS Family (Figure 12)

The IMS family of systems is an outgrowth of the Apollo moon-landing program. Its origins can be traced to two developments at the Space Division of North American Aviation (now Rockwell International) in 1965. One was the implementation of GUAM (Generalized Update Access Method), the forerunner of Data Language/One (DL/I). The other was the implementation of two teleprocessing applications, EDICT (Engineering Document Information Collection Task) and LIMS (Logistics Inventory Management System). The software package which supported EDICT and LIMS, the Remote-Access Terminal System (RATS), was jointly developed by Rockwell International and IBM during 1964-65. Both GUAM and RATS were originally implemented on the IBM 7010 with 1301 disk storage.

In 1966, IBM, Caterpillar Tractor Corporation, and Rockwell International agreed to a joint development effort to produce a DBMS, the Information Management System (IMS) for the IBM System/360. When the system had to be frozen in 1968 (to meet the Apollo commitment), Rockwell and IBM each continued with separate developments, while Caterpillar withdrew entirely from the effort. The development at Rockwell took the name of Information Control System/Data Language/I (ICS/DL/I).

Originally, DL/I [X35] was a data description facility which provided a means for describing and organizing a hierarchically structured data base. It also provided interfaces, which the programming user invoked to access and store data from the host language (originally COBOL). The on-line component, ICS/DL/I [X34], added in 1968, allowed multiple access by using the DL/I interface from COBOL or PL/I programs. In addition to running teleprocessing simul-
[FIGURE 12. The IMS Family. Diagram labels: IBM IMS-1 (IBM System/360); IBM IMS-2 (IBM System/360); IBM IMS-VS (IBM System/370).]
[FIGURE 13. The Inverted File Family. Diagram labels: 1972 MRI; S2000 (Level 2) (IBM System/360).]
cility. It has become one of the most widely used data-base management packages today.

The Data Manager-1 System (DM-1) [X31], designed by Sable at the Auerbach Corporation, stems from the Army ACSI-MATIC development and MITRE's ADAM. DM-1 consists of a series of service routines for returning and storing data; using these routines, both high-level ad hoc user functions and host-language application programs can be developed. DM-1 was implemented at the Air Force Rome Air Development Center on U1218 computer and the Honeywell H6000. Based on the design philosophy of DM-1, the Western Electric Company, initially assisted by Auerbach, developed System Control-1 [Y6] on the System/360.

Another development, by the Burroughs Corporation, is the Data Management System II [V2] for the B6700/B7700 computer. Basically a host-language type system using COBOL, its data definition language is formed in set-theoretic terms. It also offers a storage definition option.

6. THE PRESIDENTIAL DATA BASE EXAMPLE

The discussion of data-base models in other articles in this issue of COMPUTING SURVEYS will use a unified example which deals with some parts of the Executive branch of the US Government, with data about the President, his Administration, elections, Congress, etc. We use this example because it is almost self-explanatory; it was first enunciated in a paper by Willner, et al. [G9].

Because the example deals with the Executive branch, the most obvious entity is the PRESIDENT. The important items in the PRESIDENT entity will be assumed to be: the President's name (PRES-NAME), BIRTH-DATE and DEATH-DATE, the party affiliation (PRES-PARTY), and the name of his SPOUSE. It will also be considered necessary to know the STATE-NAME of which the President is a native son. However, since STATE will later be defined as an entity, we could alternatively define a relationship NATIVE-SON between PRESIDENT and STATE.

Using the notation presented in Section 3 under the discussion of the "Elements of Logical Structure" (page 13) we have Display 1 below. If, however, an explicit relationship were to be used for the native son, and STATE-NAME is the key of STATE, then the statement appears as in Display 2 below.

The next entity of interest is the President's ADMINISTRATION, which contains items such as the administration number (ADMIN-NUMBER) (e.g., George Washington was No. 1), the inauguration date (INAUG-DATE), and the Vice-President (VP). In order to identify the President of each Administration, it is also necessary to include the item PRES-NAME in the ADMINISTRATION entity.

At this point, it is worth asking why the PRESIDENT entity does not contain the ADMINISTRATION entity. This is a design decision, and the reader must assume it is based on consideration of usage and modeling. It should be noted, however, that a President can have had more than one Administration, and consequently, if ADMINISTRATION is contained, it would need to be a repeating group. As another alternative, we could assume that the two separate entities have a relationship HEADED between ADMINISTRATION and PRESIDENT. Thus, we have Display 3 below.
Display 1:
PRESIDENT = (PRES-NAME, BIRTH-DATE, DEATH-DATE, PRES-PARTY, SPOUSE,
             STATE-NAME)
Display 2:
PRESIDENT-1 = (PRES-NAME, BIRTH-DATE, DEATH-DATE, PRES-PARTY, SPOUSE)
and
NATIVE-SON = (PRES-NAME, STATE-NAME).
Display 3:
either
ADMINISTRATION = (ADMIN-NUMBER, PRES-NAME, INAUG-DATE, VP);
or:
PRESIDENT-2 = (PRES-NAME, BIRTH-DATE, DEATH-DATE, PRES-PARTY,
               SPOUSE, STATE-NAME, {(ADMIN-NUMBER, INAUG-DATE,
               VP)});
or:
ADMINISTRATION-1 = (ADMIN-NUMBER, INAUG-DATE, VP)
HEADED = (PRES-NAME, ADMIN-NUMBER).
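The three alternatives of Display 3 can be sketched in a modern language to show that they carry the same information. The following is a minimal illustration in Python, not part of the original notation: the class names, the helper functions `unnest` and `embed`, and the sample values are all assumptions introduced here for the purpose of the example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Administration:
    """Display 3, first form: PRES-NAME stored inside ADMINISTRATION."""
    admin_number: int
    pres_name: str
    inaug_date: str
    vp: str


@dataclass
class President2:
    """Display 3, second form: administrations held as a repeating group."""
    pres_name: str
    birth_date: str
    death_date: str
    pres_party: str
    spouse: str
    state_name: str
    # Repeating group {(ADMIN-NUMBER, INAUG-DATE, VP)}
    administrations: List[Tuple[int, str, str]] = field(default_factory=list)


@dataclass
class Administration1:
    """Display 3, third form: ADMINISTRATION-1, which omits PRES-NAME."""
    admin_number: int
    inaug_date: str
    vp: str


def unnest(president: President2):
    """Split the repeating group into ADMINISTRATION-1 records and HEADED pairs."""
    admins = [Administration1(n, d, vp) for n, d, vp in president.administrations]
    headed = [(president.pres_name, n) for n, _, _ in president.administrations]
    return admins, headed


def embed(admins: List[Administration1], headed: List[Tuple[str, int]]):
    """Rebuild the first form by joining ADMINISTRATION-1 with HEADED."""
    name_of = {number: name for name, number in headed}
    return [
        Administration(a.admin_number, name_of[a.admin_number], a.inaug_date, a.vp)
        for a in admins
    ]


# Hypothetical sample values, chosen for illustration only.
washington = President2(
    "Washington", "1732", "1799", "Federalist", "Martha", "Virginia",
    administrations=[(1, "1789", "Adams"), (2, "1793", "Adams")],
)
admins, headed = unnest(washington)
flat = embed(admins, headed)
```

In this sketch the repeating group keeps each President and his Administrations together, whereas the HEADED pairs keep the two entities independent at the price of a join (here, the dictionary lookup in `embed`) whenever the President of an Administration is needed.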
Computing Surveys, Vol. 8, No. 1, March 1976
30 • James P. Fry and Edgar H. Sibley
The next entity is that of the ELECTION. The interesting items in the election are: the year (ELECTION-YEAR), the presidential votes in the Electoral College (PRES-VOTES), the LOSER, the LOSER-PARTY, the year in which the party was first created as a political entity (PARTY-FIRST-YEAR), and the votes of the losing party (LOSER-VOTES). Once again, because elections are won by a President, the election entity may have to contain the PRES-NAME; otherwise there must be some relationship WON between the PRESIDENT and the ELECTION entities. Thus, the alternatives are:

ELECTION = (ELECTION-YEAR, PRES-NAME, PRES-VOTES, LOSER, LOSER-PARTY,
            PARTY-FIRST-YEAR, LOSER-VOTES), etc.

Another entity within the data base is the STATE. It has a name (STATE-NAME), a population (POP), and a number of votes in the Electoral College (STATE-VOTES). States are admitted to the Union during some Administration. This fact may be shown either implicitly, by having some relationship (ADMITTED-DURING) between the ADMINISTRATION and STATE entities, or explicitly, by including the ADMIN-NUMBER in the STATE entity. It might be noted that there is already a link between the PRESIDENT and STATE entities because the NATIVE-SON relation has been shown as an element (STATE-NAME) in the PRESIDENT entity.

We have now defined most of the data base, and need only incorporate the entity CONGRESS to complete it. This entry will contain items such as: CONGRESS-NUMBER, SENATE-REPUBLICAN-PERCENT, SENATE-DEMOCRAT-PERCENT, HOUSE-REPUBLICAN-PERCENT, and HOUSE-DEMOCRAT-PERCENT. Again, there is a relation between the PRESIDENT and CONGRESS, which may be found explicitly by incorporating PRES-NAME in the CONGRESS entity, or implicitly by arranging a relation CONGRESS-SERVED between the entities.

Figure 14 shows a sample of the presidential data base in tabular form. Unavailable information is shown by a φ, e.g., in the Death and Inauguration Date columns.

But there are some drawbacks to this example: one is the fact that it represents a relatively constant data base, for although a President may be replaced, the data about the Administration is still retained. Consequently there is little updating in our example, though there may be substantial addition to the data base in election years. Some business data bases, however, present a greater propensity to change. For example, a payroll data base regularly has changes to many items such as YEAR-TO-DATE-PAY (presumably after every payday) and SALARY (presumably after every increase). Thus, the presidential data base, while forming the major example, will not suffice alone. Other authors contributing to this issue of COMPUTING SURVEYS will introduce other examples to illustrate particular fine points.

7. TRENDS AND ISSUES

Historically, we have traced the development of DBMS from the early systems, which supported primarily the nonprogramming user for ad hoc requests, to the recent predominance of host-language systems which support the programming user. A current trend is, then, the establishment of a balance: a comprehensive set of DBMS functions for a full spectrum of users while maintaining the current DBMS objectives [F1, 2, and 3]. Some of the current research is developing bridges between various models of data so that a single DBMS can support a variety of data models.

Three major trends and one important issue will affect the future of DBMS: the emergence of conversational systems, the need for geographic distribution of the information system, the technological impacts on DBMS architecture, and the question of standardization of the DBMS interface. Each of these is now briefly discussed.

Ad Hoc versus Programming Systems

Artificial intelligence research has already improved our understanding of the difficulties involved in providing a natural language interface for computers. And though there
Evolution of Data-Base Management Systems • 31
has been little that is immediately applicable, the fall-out from this research includes a better understanding of the structure and use of higher-level and very-high-level (restricted natural) language interfaces. As a result, some DBMS already provide good languages for the nonprogrammer who is willing to learn a few rules, and there is growing interest in the development of the casual-user interface (e.g., see [F4]).
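[FIGURE 14. Sample of the presidential data base in tabular form. Tables shown: PRESIDENT, ELECTION, CONGRESS, STATE, ADMINISTRATION.
ELECTION columns: ELECTION-YEAR, PRES-NAME, PRES-VOTES, LOSER, LOSER-PARTY, PARTY-FIRST-YEAR, LOSER-VOTES.
STATE rows:
Texas    16   11196730   26
Mass.     4    5689170   14
Calif.   18   19953134   45
Mich.    12    8875083   19]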
A casual user is one who uses the system so seldom that all rules and techniques are likely to be forgotten between sessions, hence the need for special treatment. At the other end of the user spectrum are the adept computer programmers who have technical skills and a good knowledge of "system internals." In writing programs for nonprogrammers they presumably utilize all their skills to produce procedures that will run efficiently. The assumption is that programmers cost more (they must be paid while they understand the problem, write code, etc.), but their resulting programs are cheaper to run.

Thus, the case for ad hoc and host-language systems can be considered one of tradeoffs. The following is a partial list of the advantages and disadvantages of the use of higher-level interfaces:

1) Their use facilitates more rapid running of the problem: the user asks the question directly, and he has no need to call on a programmer as intermediary (a process that sometimes takes weeks for even a simple problem in a busy industrial environment). This advantage is offset by the relatively high cost of using what is essentially an interpretive system: the tradeoff is therefore between people and machine costs. The people costs are in programming and debugging, while the machine costs are in running. One presumes that the code produced from a high-level (query) interface costs more to run, therefore the question arises: how many times must the program be run before it pays for the cost of programming? And this is the classical question of compiling, but now in the realm of even higher-level languages and with potentially large data bases. There are, however, very few jobs today which warrant the cost of special (assembler or machine language) programming. This trend continues today in DBMS usage, and the self-contained ad hoc user system is becoming more accepted by the user community.

2) The use of a higher-level language simplifies the structure (removes DO-loops and GO-TO statements) and is generally
• Is it better to store multiple copies? How much extra will it cost to update a data base from a remote location? What parts of the data base should be stored (i.e., how does one distribute the data efficiently)? What are the best places to run a program (it may be cheaper for a user at A to transport data at B to the program at C and then just receive the answers at A)?

The old problems have already been discussed, but are now complicated by the extra complexity of the distributed system:
• What redundancy is necessary to ensure good reliability of both hardware and data? How much does this affect the user in terms of the response time for updates, and the excess processing cost?
• What problems are likely to occur in concurrent operation? The possibility that several users will all contend for the same resources, and consequently will need effective scheduling and control, is obviously more acute in a large, distributed, many-user system.
• How can privacy be retained? The potential for breaking the system rises as its complexity increases. The chance of message interception obviously increases also.

Thus, the trend to distributed data bases, with concepts of data machines as special resource nodes on the network, brings with it a new set of tradeoff decisions.

Data-Base Machines

Distributed data bases, in conjunction with emerging technology, will have a significant impact on DBMS architecture and on the DBMS functions. There already are computers dedicated to DBMS, e.g., the Datacomputer [F5, 6]. "Front-end" and "back-end" computers are in the prototype stage [F7, 8]. Also, new disk technologies and associative devices will have a great impact on DBMS architecture [F9, 10].

To Standardize or Not?

The computing profession has ambivalent feelings about standardization: everyone seems to admit it has merits, but finds excuses in order to stop it from happening too soon in his own field of interest. The arguments for and against standardization (in any area) are now given.

For standards, there is one major argument: The provision of a standard aids the user by making objects interchangeable; the nut, if of the same diameter, fits the bolt. Thus:
• the programming language is the same on all machines: so the programmer who knows COBOL, for example, can be transferred, or may get a new job and not need retraining;
• the company can change machines and run the same COBOL programs, after their recompilation, on the new machine;
• parts are interchangeable: magnetic tapes have standard densities; plug-to-plug compatibility of storage and input/output units is possible;
• data can be interchanged over the network;
• the network protocol is the same, so all users have to learn only one protocol; and
• the commands to enter (log-on) and leave (log-off) the system, and some other controls, are the same throughout the network.

Against standards, there is one major argument: if we do not know the correct technology, standardization may mean costly refitting later, or may even stifle development. This argument is reasonable, since a large-scale data-processing shop may have many thousands of programs representing millions of dollars of investment. Rewriting all these (probably COBOL) programs in some new language is beyond the wishes of most current DP managers, who hope that their programs are "here to stay." Such built-in conservatism will undoubtedly slow down any change from one well-developed standard to another, no matter how good the new standard may be. This stifles acceptance of new ideas.

Many groups are concerned about standardization and are actively working in this area. The DBTG report has been accepted by the Programming Language Committee
AFIPS Press, Montvale, N.J., 1975, pp. 569-576.
[F3] EVEREST, GORDON C., "The futures of data-base management," Proc. 1974 SIGMOD Conf., May 1974, pp. 445-462.
[F4] CODD, E. F., "Seven steps to rendezvous with the casual user," Proc. IFIP TC-2 Working Conf. on Data Base Management System Congress, April 1974, North-Holland Publ. Co., Amsterdam, The Netherlands, 1974.
[F5] MARILL, THOMAS; AND STERN, DALE, "The datacomputer: a network data utility," Proc. of AFIPS National Computer Conf., 1975, Vol. 44, AFIPS Press, Montvale, N.J., 1975, pp. 389-395.
[F6] MARILL, T.; AND STERN, D., DATACOMPUTER VERSION I USER MANUAL, Working paper no. 11, Computer Corp. of America, Cambridge, Mass., August 1975.
[F7] CANADAY, R. H.; HARRISON, R. D.; IVIE, E. L.; RYDER, J. L.; AND WEHR, L. A., "A back-end computer for data base management," Comm. ACM 17, 10 (Oct. 1974), 575-582.
[F8] HEACOX, H. C.; COSLOY, E. S.; AND COHEN, J. B., "An experiment in dedicated data management," in Proc. of Internatl. Conf. on Very Large Data Bases, Sept. 1975, ACM, New York, 1975, pp. 511-513.
[F9] SU, S. Y. W.; COPELAND, G. P.; AND LIPOVSKI, G. J., "Retrieval operations and data representations in a context-addressed disk system," Proc. ACM-SIGPLAN-SIGIR Interface Meeting on Programming Languages and Information Retrieval, Nov. 1973, pp. 144-160.
[F10] LIN, C. S.; AND SMITH, D. C. P., "The design of a rotating associative array memory for a relational data-base management application," ACM TODS 1, 1 (March 1976), 53-65.
[F11] ANSI/X3/SPARC/STUDY GROUP: DATA BASE SYSTEMS, "Interim report," ACM/SIGMOD Newsletter: fdt 7, 2 (Dec. 1975).

(G) General

[G1] McGEE, W. C., "Generalization: key to successful electronic data processing," J. ACM 6, 1 (Jan. 1959), 1-23.
[G2] McGEE, R. C.; AND TELLIER, H., "A re-evaluation of generalization," Datamation (July-August 1960), 25-38.
[G3] YAO, S. B.; AND MERTEN, A. G., "Selection of file organization using an analytic model," Proc. of the Internatl. Conf. on Very Large Data Bases, Sept. 1975, ACM, New York, 1975, pp. 255-267.
[G4] STEEL, T., "Beginnings of a theory of information handling," Comm. ACM 7, 2 (Feb. 1964), 87-103.
[G5] ROSEN, SAUL, "Programming systems and languages: a historical survey," Proc. of the Spring Jt. Computer Conf., 1964, Vol. 25, AFIPS Press, Montvale, N.J., 1964, pp. 1-25.
[G6] ROSIN, ROBERT F., "Supervisory and monitor systems," Computing Surveys 1, 1 (March 1969), 37-54.
[G7] DENNING, PETER J., "Third generation computer systems," Computing Surveys 3, 4 (Dec. 1971), 175-216.
[G8] EVEREST, GORDON C.; AND SIBLEY, EDGAR H., "A critique of the GUIDE-SHARE data-base management system requirements," Proc. of the 1971 ACM-SIGFIDET Annual Workshop on "Data Description, Access and Control," E. F. Codd and A. L. Dean, (Eds.), pp. 93-112; also MISRC-WP-71-2.
[G9] WILLNER, S. E.; BANDURSKI, A. E.; GORHAN, W. C.; AND WALLACE, M. A., "COMRADE data management system," Proc. AFIPS National Computer Conf., 1973, Vol. 42, AFIPS Press, Montvale, N.J., 1973, pp. 339-345.
[G10] BACHMAN, C. W., "The programmer as navigator," Comm. ACM 16, 11 (Nov. 1973), 653-658.
[G11] GARTH, W., "Design console technology at General Motors," Proc. SHARE 1974 Conf., August 1974.
[G12] EVEREST, GORDON C., "The objectives of data-base management," Information Systems COINS IV (Tou), Plenum Press, New York, 1974, pp. 1-3h; also MISRC-WP-71-64.
[G13] NAVATHE, S. B.; AND FRY, J. P., "Restructuring for large data bases: three levels of abstraction," ACM TODS, to appear in June 1976.
[G14] TEOREY, T. J.; AND DAS, K. S., "Application of an analytical to evaluate storage structure," Data Translation Technical Report No. 76DE 7.1, Univ. of Michigan Graduate School of Business Administration, Ann Arbor, 1976.

(I) Introductory

[I1] LYON, J. K., Introduction to data base design, Wiley Interscience, Div. of John Wiley and Sons, New York, 1971.
[I2] BACHMAN, C. W., "Data structure diagrams," SIGBDP: Data Base 1, 2 (1969).
[I3] BYRNES, C.; AND STEIG, D., "File management systems: a current summary," Datamation 15, 11 (Nov. 1969).
[I4] OLLE, T. W., "MIS: data bases," Datamation 16, 15 (Nov. 1970).
[I5] DIXON, PAUL, "The role of data management in management information systems," IAG Journal 3, 2 (August 1970).
[I6] CODASYL SYSTEMS COMMITTEE, "Introduction to 'feature analysis of generalized data-base management'," Comm. ACM 14, 5 (May 1971), 308-318.

(M) Data Models--Theory

[M1] CODASYL DEVELOPMENT COMMITTEE, "An information algebra, phase I report of the Language Structure Group," Comm. ACM 5, 4 (April 1962), 190-204.
[M2] CHILDS, D. L., "Feasibility of a set-theoretic data structure: a general structure based on a reconstituted definition of relation," Proc. IFIP Congress 1968, North-Holland Publ. Co., Amsterdam, The Netherlands, 1968, pp. 420-430.
Vol. 4. Information system design and utilization
Vol. 5. Information retrieval.
[X11] NAVAL COMMAND SYSTEMS SUPPORT ACTIVITY, "User's manual for NAVCOSSACT information processing system phase I library maintenance system," NAVCOSSACT Document No. 88M008, CM-52, August 1963.
[X12] NAVAL COMMAND SYSTEMS SUPPORT ACTIVITY, "User's manual for NAVCOSSACT information processing system phase I," NAVCOSSACT Document No. 90S003A, CM-51, July 1963. Supplement I published Jan. 1964.
[X13] SYSTEM DEVELOPMENT CORP., "System design specifications for LUCID phase I," Tech. Memo No. TM-1749/000/00, Santa Monica, Calif., Jan. 1964.
    Vol. 1. Lucid control system design
        Part 1. The Master Tape, Tech Memo No. TM-1749/101/00.
        Part 2. Parameter Load, Tech Memo No. TM-1749/102/00.
        Part 3. Operational Control, Tech Memo No. TM-1749/103/00.
        Part 4. Test Set-Up, Tech Memo No. TM-1749/104/00.
    Vol. 2. "GENDARME data processing facilities," Tech. Memo No. TM-1749/201/00.
    Vol. 3. "Lucid program design: the grammar of OPAQUE," Tech. Memo No. TM-1749/301/00.
[X14] BRYANT, J. H., "AIDS experience in managing data-base operation," Proc. of the Symposium on Development and Management of a Computer-Centered Data Base, A. Walker, (Ed.), System Development Corp., Santa Monica, Calif., 1964, pp. 36-42.
[X15] BACHMAN, C. W.; AND WILLIAMS, S. B., "A general purpose programming system for random access memories," Proc. AFIPS Fall Jt. Computer Conf., 1964, Vol. 26, Spartan Books, New York, 1964, pp. 411-422.
[X16] NAVAL COMMAND SYSTEMS SUPPORT ACTIVITY, "User's manual 1401 TUFF tape updater for formatted files," NAVCOSSACT Document No. 90S012W, CM-108, May 1964.
[X17] NMCS INFORMATION PROCESSING SYSTEM (NIPS), IBM 1410, NMCS Support Center, Washington, D.C., 1964.
[X18] INTEGRATED DATA STORE: A NEW CONCEPT IN DATA MANAGEMENT, Publication CPB-483 (5C10-16), General Electric Co.
[X19] NAVAL COMMAND SYSTEMS SUPPORT ACTIVITY, "7090 information processing system revised," NAVCOSSACT Document No. 90M002, CM-01, Oct. 1965.
[X20] NAVAL COMMAND SYSTEMS SUPPORT ACTIVITY, "User's manual for 704/7090 TUFF MOP III tape updater for formatted files," NAVCOSSACT Document No. 10S001, CM-74, Nov. 1963. Change 1 published Feb. 1964. Change 2 published August 1965.
[X21] GRANT, E., LUCID User's Manual, Tech Memo No. TM-2354/001, System Development Corp., Santa Monica, Calif., June 1965.
[X22] SPITZER, J. F., et al., "The COLINGO system design philosophy," in Information System Sciences, Proc. of the Second Congress, 1965, Spartan Books, New York, 1965, pp. 36-39.
[X23] SDS MANAGE REFERENCE MANUAL, Publication 90-10-46A, Scientific Data Systems, May 1966.
[X24] CONNORS, T. L., "ADAM: a generalized data management system," Proc. AFIPS Spring Jt. Computer Conf., 1966, Vol. 28, Spartan Books, New York, 1966, pp. 193-203.
[X25] A USER'S GUIDE TO THE ADAM SYSTEM, MTR-268, MITRE Corp., (AD 664 332), August 1966.
[X26] IDHS 1410 FORMATTED FILE SYSTEM: FILE MAINTENANCE AND FILE GENERATION MANUAL, Defense Intelligence Agency, DIAM-65-9-1, August 1966. Also, IDHS 1410 FORMATTED FILE SYSTEM: RETRIEVAL AND OUTPUT MANUAL, DIAM-69-9-2.
[X27] A DESCRIPTION OF THE INTERNAL OPERATIONS OF THE ADAM SYSTEM, MTR-216, MITRE Corp., (AD 660 581), August 1966.
[X28] DODD, G. G., "APL: a language for associative data handling in PL/1," Proc. AFIPS Fall Jt. Computer Conf., 1966, Vol. 29, Spartan Books, New York, 1966, pp. 667-684.
[X29] RHAUS, A.; AND MILLS, R., The Time-Shared Data Management System: A New Approach to Data Management, Tech Memo SP-2747, System Development Corp., Santa Monica, Calif., 1967.
    WILLIAMS, W. D.; AND BARTRAM, ]~. C., COMPOSE/PRODUCE: A User-Oriented Report Generator Capability Within the SDC Time-Shared Data Management System, Tech Memo SP-2634, System Development Corp., Santa Monica, Calif., 1967.
[X30] STEIL, G. P., "File management on a small computer," Proc. 1967 AFIPS Spring Jt. Computer Conf., Spartan Books, New York, 1967, pp. 199-203.
[X31] DIXON, PAUL J.; AND SABLE, J., "DM-1: a generalized data management system," Proc. AFIPS Spring Jt. Computer Conf. (30), 1967, 185-198.
[X32] NAVAL COMMAND SYSTEMS SUPPORT ACTIVITY, "User's manual for information processing system phase 3A for the AN/FYK-1 (V) data processing set," NAVCOSSACT Document No. 88M001A, CM-123, Revision 5, August 1967.
[X33] AFLC/ESD/MITRE, Advanced Data Management (ADAM) Experiments, Final Report, (AD 648 226), Feb. 1967.
[X34] BROWN, R.; AND NORDYKE, G. P., "ICS an information control system," Proc. IFIPS Conf. Mechanized Information Storage, Retrieval and Dissemination, 1967, North-Holland Publ. Co., Amsterdam, The Netherlands, 1967.
[X35] Data Language No. 1 (DL-1) Encyclopedia Pub. $SM-F, North American Aviation,
Addresses                                         Publications

EDP Analyzer                                      EDP Analyzer
Canning Publications, Inc.
925 Anza Avenue
Vista, Calif. 92083

ACM Association for Computing Machinery           SIGBDP
1133 Avenue of the Americas                       DBTG Specifications
New York, N.Y. 10036                              CODASYL Systems Committee Report
(212) 265-6300                                    SIGMOD Proceedings
                                                  SIGFIDET Proceedings
                                                  Comm. ACM
                                                  J. ACM
                                                  TODS
                                                  Very Large Data Base Proceedings

Management Information Systems Research Center    MISRC Publications
Graduate School of Business Administration
University of Minnesota
Minneapolis, Minn. 55455

IFIP Administrative Data Processing Group         CODASYL System Committee
6 Stadhouderskade                                 DBTG Specification
Amsterdam 1013, The Netherlands                   IAG Journal

Technical Services Branch                         CODASYL COBOL Specification
Department of Supply and Services
88 Metcalfe Street
Fifth Floor
Ottawa, Ont., Canada K1A 0S5

British Computer Society                          CODASYL System Committee
29 Portland Place                                 DBTG Specifications
London W1, England

National Technical Information Service            Documents with AD or PB numbers
5285 Port Royal Road
Springfield, Va. 22151

SHARE Inc.                                        SHARE Proceedings
One Illinois Center
111 E. Wacker Drive
Suite 600
Chicago, Ill. 60601

GUIDE Int.                                        GUIDE Proceedings
Mr. Sandy Hill
Smith, Bucklin, and Associates
111 E. Wacker Drive
Chicago, Ill. 60601

System Development Corp.                          SDC Technical Reports,
2500 Colorado Boulevard                           Memorandums
Santa Monica, Calif.

The MITRE Corp.                                   MITRE Technical Reports
Bedford Operations
Box 207
Bedford, Mass.
Washington Operations
Westgate Research Park
McLean, Va. 22101