SQL & NoSQL Databases
Second Edition

Michael Kaufmann · Andreas Meier
Michael Kaufmann, Informatik, Hochschule Luzern, Rotkreuz, Switzerland
Andreas Meier, Institute of Informatics, Universität Fribourg, Fribourg, Switzerland
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland
AG 2023
The first edition of this book was published by Springer Vieweg in 2019
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission
or information storage and retrieval, electronic adaptation, computer software, or by similar or
dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Foreword
The term database has long since become part of people’s everyday vocabulary, for
managers and clerks as well as students of most subjects. They use it to describe a
logically organized collection of electronically stored data that can be directly
searched and viewed. However, they are generally more than happy to leave the
whys and hows of its inner workings to the experts.
Users of databases are rarely aware of the intangible and tangible business
value contained in any individual database. This applies as much to a car importer's
spare parts inventory as to the IT solution containing all customer custody accounts at a bank or
the patient information system of a hospital. Yet failure of these systems, or even
cumulative errors, can threaten the very existence of the respective company or
institution. For that reason, it is important for a much larger audience than just the
“database specialists” to be well-informed about what is going on. Anyone involved
with databases should understand what these tools are effectively able to do and
which conditions must be created and maintained for them to do so.
Probably the most important aspect concerning databases involves (a) the dis-
tinction between their administration and the data stored in them (user data) and
(b) the economic magnitude of these two areas. Database administration consists of
various technical and administrative factors, from computers, database systems, and
additional storage to the experts setting up and maintaining all these components—
the aforementioned database specialists. It is crucial to keep in mind that the
administration is by far the smaller part of standard database operation, constituting
only about a quarter of the entire efforts.
Most of the work and expenses concerning databases lie in gathering,
maintaining, and utilizing the user data. This includes the labor costs for all
employees who enter data into the database, revise it, retrieve information from
the database, or create files using this information. In the above examples, this means
warehouse employees, bank tellers, or hospital personnel in a wide variety of
fields—usually for several years.
In order to be able to properly evaluate the importance of the tasks connected with
data maintenance and utilization on the one hand and database administration on the
other hand, it is vital to understand and internalize this difference in the effort
required for each of them. Database administration starts with the design of the
database, which already touches on many specialized topics such as determining the
consistency checks for data manipulation or regulating data redundancies, which are
as undesirable on the logical level as they are essential on the storage level. The
development of database solutions is always targeted on their later use, so
ill-considered decisions in the development process may have a permanent impact
on everyday operations. Finding ideal solutions, such as the golden mean between
too strict and too flexible when determining consistency conditions, may require
some experience. Unduly strict conditions will interfere with regular operations,
while excessively lax rules will entail a need for repeated expensive data repairs.
To avoid such issues, it is invaluable for anyone concerned with database
development and operation, whether in management or as a database specialist, to
gain systematic insight into this field of computer science. The table of contents
gives an overview of the wide variety of topics covered in this book. The title already
shows that, in addition to an in-depth explanation of the field of conventional
databases (relational model, SQL), the book also provides highly educational infor-
mation about current advancements and related fields, the keywords being NoSQL
and Big Data. I am confident that the newest edition of this book will once again be
well-received by both students and professionals—its authors are quite familiar with
both groups.
Preface

It is remarkable how stable some concepts are in the field of databases. Information
technology is generally known to be subject to rapid development, bringing forth
new technologies at an unbelievable pace. However, this is only superficially the
case. Many aspects of computer science do not essentially change. This includes not
only the basics, such as the functional principles of universal computing machines,
processors, compilers, operating systems, databases and information systems, and
distributed systems, but also computer language technologies such as C, TCP/IP, or
HTML that are decades old but in many ways provide a stable foundation for the
global information system known as the World Wide Web. Like-
wise, the SQL language (Structured Query Language) has been in use for almost five
decades and will remain so in the foreseeable future. The theory of relational
database systems was initiated in the 1970s by Codd (relational model) and
Chamberlin and Boyce (SEQUEL). However, these technologies have a major
impact on the practice of data management today. Especially, with the Big Data
revolution and the widespread use of data science methods for decision support,
relational databases and the use of SQL for data analysis are actually becoming more
important. Even though sophisticated statistics and machine learning are enhancing
the possibilities for knowledge extraction from data, many if not most data analyses
for decision support rely on descriptive statistics using SQL for grouped aggrega-
tion. SQL is also used in the field of Big Data with MapReduce technology. In this
sense, although SQL database technology is quite mature, it is more relevant today
than ever.
Nevertheless, the developments in the Big Data ecosystem have brought new
technologies into the world of databases, to which this book also pays due attention.
Non-relational database technologies, which find more and more fields of applica-
tion under the generic term NoSQL, differ not only superficially from the classical
relational databases but also in the underlying principles. Relational databases were
developed in the twentieth century with the purpose of tightly organized, operational
forms of data management, which provided stability but limited flexibility. In
contrast, the NoSQL database movement emerged in the beginning of the new
century, focusing on horizontal partitioning, schema flexibility, and index-free
adjacency with the goal of solving the Big Data problems of volume, variety,
and velocity, especially in Web-scale data systems. This has far-reaching consequences.

Contents
1 Database Management
   1.1 Information Systems and Databases
   1.2 SQL Databases
       1.2.1 Relational Model
       1.2.2 Structured Query Language SQL
       1.2.3 Relational Database Management System
   1.3 Big Data and NoSQL Databases
       1.3.1 Big Data
       1.3.2 NoSQL Database Management System
   1.4 Graph Databases
       1.4.1 Graph-Based Model
       1.4.2 Graph Query Language Cypher
   1.5 Document Databases
       1.5.1 Document Model
       1.5.2 Document-Oriented Database Language MQL
   1.6 Organization of Data Management
   Bibliography
2 Database Modeling
   2.1 From Requirements Analysis to Database
   2.2 The Entity-Relationship Model
       2.2.1 Entities and Relationships
       2.2.2 Associations and Association Types
       2.2.3 Generalization and Aggregation
   2.3 Implementation in the Relational Model
       2.3.1 Dependencies and Normal Forms
       2.3.2 Mapping Rules for Relational Databases
   2.4 Implementation in the Graph Model
       2.4.1 Graph Properties
       2.4.2 Mapping Rules for Graph Databases
   2.5 Implementation in the Document Model
       2.5.1 Document-Oriented Database Modeling
Glossary
Index
1 Database Management

1.1 Information Systems and Databases
The evolution from the industrial society via the service society to the information
and knowledge society is reflected in the recognition of information as a factor of
production. The following characteristics distinguish information from material
goods:
These properties clearly show that digital goods (information, software, multime-
dia, etc.), i.e., data, are vastly different from material goods in both handling and
economic or legal evaluation. A good example is the loss in value that physical
products often experience when they are used—the shared use of information, on the
other hand, may increase its value. Another difference lies in the potentially high
production costs for material goods, while information can be multiplied easily and
at significantly lower costs (only computing power and storage medium). This
causes difficulties in determining property rights and ownership, even though digital
watermarks and other privacy and security measures are available.
Fig. 1.1 Architecture and components of an information system: users access the system via a communication network or the WWW; the application software covers user guidance, dialog design, business logic, data querying, and data manipulation as well as access permissions and data protection; the database system consists of database management and database storage and answers requests with responses

1.2 SQL Databases

1.2.1 Relational Model
One of the simplest and most intuitive ways to collect and present data is in a table.
Most tabular data sets can be read and understood without additional explanations.
To collect information about employees, a table structure as shown in Fig. 1.2 can
be used. The all-capitalized table name EMPLOYEE refers to the entire table, while
the individual columns are given the desired attribute names as headers, for example,
the employee number “E#,” the employee’s name “Name,” and their city of resi-
dence “City.”
An attribute assigns a specific data value from a predefined value range called
domain as a property to each entry in the table. In the EMPLOYEE table, the
attribute E# allows to uniquely identify individual employees, making it the key of
the table. To mark key attributes more clearly, they will be written in italics in the
table headers throughout this book.¹

¹ Some major works of database literature mark key attributes by underlining.

Fig. 1.2 Table structure: the table name EMPLOYEE, the attribute names E#, Name, and City in the header, and E# as the key attribute

Fig. 1.3 EMPLOYEE table with sample content:

EMPLOYEE
E#   Name     City
E19  Stewart  Stow
E4   Bell     Kent
E1   Murphy   Kent
E7   Howard   Cleveland

The attribute City is used to label the respective
places of residence and the attribute Name for the names of the respective employees
(Fig. 1.3).
The required information of the employees can now easily be entered row by row.
In the columns, values may appear more than once. In our example, Kent is listed as
the place of residence of two employees. This is an important fact, telling us that both
employee Murphy and employee Bell are living in Kent. In our EMPLOYEE table,
not only cities but also employee names may exist multiple times. For that reason,
the aforementioned key attribute E# is required to uniquely identify each employee
in the table.
Identification Key
An identification key or just key of a table is one attribute or a minimal combination
of attributes whose values uniquely identify the records (called rows or tuples)
within the table. If there are multiple keys, one of them can be chosen as the primary
key. This short definition lets us infer two important properties of keys:
• Uniqueness: Each key value uniquely identifies one record within the table, i.e.,
different tuples must not have identical keys.
• Minimality: If the key is a combination of attributes, this combination must be
minimal, i.e., no attribute can be removed from the combination without
eliminating the unique identification.
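To illustrate, here is a minimal SQL sketch (not taken from the book; data types and the double-quoting of the "#" character in attribute names are assumptions) of how the EMPLOYEE table with its identification key could be declared:

CREATE TABLE EMPLOYEE (
  "E#"  CHAR(6)      NOT NULL,   -- unique employee number (identification key)
  Name  VARCHAR(20),
  City  VARCHAR(30),
  PRIMARY KEY ("E#")             -- the DBMS enforces uniqueness of the key
);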
Table Definition
To summarize, a table is a set of rows presented in tabular form. The data records
stored in the table rows, also called tuples, establish a relation between singular data
values. According to this definition, the relational model considers each table as a set
of unordered tuples. Tables in this sense meet the following requirements:
EMPLOYEE
E# Name City
E19 Stewart Stow
E4 Bell Kent
E1 Murphy Kent
E7 Howard Cleveland
Example query:
“Select the names of the employees living in Kent.”
1.2.2 Structured Query Language SQL

As explained, the relational model presents information in tabular form, where each
table is a set of tuples (or records) of the same type. Seeing all the data as sets makes
it possible to offer query and manipulation options based on sets.
The result of a selection operation, for example, is a set, i.e., each search result is
returned by the database management system as a table. If no tuples of the scanned
table show the respective properties, the user gets a blank result table. Manipulation
operations similarly target sets and affect an entire table or individual table sections.
The primary query and data manipulation language for tables is called Structured
Query Language, usually shortened to SQL (see Fig. 1.4). It was standardized by
ANSI in 1986 and by ISO in 1987.²

² ANSI is the national standards organization of the USA. The national standardization organizations are part of ISO.

Formulating a query in natural, descriptive, and procedural language; the descriptive (SQL) formulation of the example query reads:

SELECT  Name
FROM    EMPLOYEE
WHERE   City = 'Kent'
In fact, there are modern relational database management systems that can be accessed
with natural language.
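As a brief illustration of the set orientation described above (a sketch, not from the book; the new city value is purely illustrative), manipulation statements in SQL always affect the whole set of qualifying tuples:

UPDATE EMPLOYEE
SET    City = 'Akron'
WHERE  City = 'Kent';       -- changes every employee currently living in Kent

DELETE FROM EMPLOYEE
WHERE  City = 'Cleveland';  -- removes the entire set of matching tuples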
1.2.3 Relational Database Management System

Databases are used in the development and operation of information systems in order
to store data centrally, permanently, and in a structured manner.
As shown in Fig. 1.6, relational database management systems are integrated
systems for the consistent management of tables. They offer service functionalities
and the descriptive language SQL for data description, selection, and manipulation.
Every relational database management system consists of a storage and a man-
agement component. The storage component stores both data and the relationships
between pieces of information in tables. In addition to tables with user data from
various applications, it contains predefined system tables necessary for database
operation. These contain descriptive information and can be queried, but not
manipulated, by users.
The management component’s most important part is the language SQL for
relational data definition, selection, and manipulation. This component also contains
service functions for data restoration after errors, for data protection, and for backup.
Relational database management systems (RDBMS) have the following properties:
• Model: The database model follows the relational model, i.e., all data and data
relations are represented in tables. Dependencies between attribute values of
tuples or multiple instances of data can be discovered (cf. normal forms in Sect.
2.3.1).
• Schema: The definitions of tables and attributes are stored in the relational
database schema. The schema further contains the definition of the identification
keys and rules for integrity assurance.
• Language: The database system includes SQL for data definition, selection, and
manipulation. The language component is descriptive and facilitates analyses and
programming tasks for users.
• Architecture: The system ensures extensive data independence, i.e., data and
applications are mostly segregated. This independence is reached by separating
the actual storage component from the user side using the management compo-
nent. Ideally, physical changes to relational databases are possible without having
to adjust related applications.
• Multi-user operation: The system supports multi-user operation (cf. Sect. 4.1),
i.e., several users can query or manipulate the same database at the same time.
The RDBMS ensures that parallel transactions in one database do not interfere
with each other or worse, with the correctness of data (Sect. 4.2).
• Consistency assurance: The database management system provides tools for
ensuring data integrity, i.e., the correct and uncompromised storage of data.
• Data security and data protection: The database management system provides
mechanisms to protect data from destruction, loss, or unauthorized access.
NoSQL database management systems meet these criteria only partially (see
Chaps. 4 and 7). For that reason, most corporations, organizations, and especially
SMEs (small and medium enterprises) rely heavily on relational database manage-
ment systems. However, for spread-out Web applications or applications handling
Big Data, relational database technology must be augmented with NoSQL technology
in order to ensure uninterrupted global access to these services.

1.3 Big Data and NoSQL Databases

1.3.1 Big Data
The term Big Data is used to label large volumes of data that push the limits of
conventional software. This data can be unstructured (see Sect. 5.1) and may
originate from a wide variety of sources: social media postings; e-mails; electronic
archives with multimedia content; search engine queries; document repositories of
content management systems; sensor data of various kinds; rate developments at
stock exchanges; traffic flow data and satellite images; smart meters in household
appliances; order, purchase, and payment processes in online stores; e-health
applications; monitoring systems; etc.
There is no binding definition for Big Data yet, but most data specialists will
agree on three Vs: volume (extensive amounts of data), variety (multiple formats:
structured, semi-structured, and unstructured data; see Fig. 1.7), and velocity (high-
speed and real-time processing). Gartner Group’s IT glossary offers the following
definition:
Big Data
“Big data is high-volume, high-velocity and high-variety information assets that
demand cost-effective, innovative forms of information processing for enhanced
insight and decision making.”
Fig. 1.7 Variety of Big Data: structured, semi-structured, and unstructured multimedia data (text, graphics, images, audio, video)
With this definition, Big Data are information assets for companies. It is indeed
vital for companies and organizations to generate decision-relevant knowledge in
order to survive. In addition to internal information systems, they increasingly utilize
the numerous resources available online to better anticipate economic, ecologic, and
social developments on the markets.
Big Data is a challenge faced by not only for-profit-oriented companies in digital
markets but also governments, public authorities, NGOs (non-governmental
organizations), and NPOs (nonprofit organizations).
A good example are programs to create smart or ubiquitous cities, i.e., by using
Big Data technologies in cities and urban agglomerations for sustainable develop-
ment of social and ecologic aspects of human living spaces. They include projects
facilitating mobility, the use of intelligent systems for water and energy supply, the
promotion of social networks, expansion of political participation, encouragement of
entrepreneurship, protection of the environment, and an increase of security and
quality of life.
All use of Big Data applications requires successful management of the three Vs
mentioned above:
• Volume: There are massive amounts of data involved, ranging from giga- to
zettabytes (megabyte, 10⁶ bytes; gigabyte, 10⁹ bytes; terabyte, 10¹² bytes;
petabyte, 10¹⁵ bytes; exabyte, 10¹⁸ bytes; zettabyte, 10²¹ bytes).
• Variety: Big Data involves storing structured, semi-structured, and unstructured
multimedia data (text, graphics, images, audio, and video; cf. Fig. 1.7).
• Velocity: Applications must be able to process and analyze data streams in real
time as the data is gathered.
• Value: Big Data applications are meant to increase enterprise value, so
investments in personnel and technical infrastructure are made where they
provide leverage or where added value can be generated.
To complete our discussion of the concept of Big Data, we will look at another V:
Veracity is an important factor in Big Data, where the available data is of variable
quality, which must be taken into consideration in analyses. Aside from statistical
methods, there are fuzzy methods of soft computing which assign a truth value
between 0 (false) and 1 (true) to any result or statement.
1.3.2 NoSQL Database Management System

NoSQL
The term NoSQL is used for any non-relational database management approach
meeting at least one of two criteria:
Fig. 1.9 Three different NoSQL databases: a key-value store for a web shop, in which session IDs serve as keys and order numbers as values and a shopping cart collects the items; a document store, in which orders are kept as self-contained documents under their order number; and a graph database on movies and actors (MOVIE, ACTOR, ACTED_IN, DIRECTED_BY); the figure also notes parallel execution and weak to strong consistency as properties of such systems
Figure 1.9 shows three different NoSQL database management systems. Key-
value stores (see also Sect. 7.2) are the simplest version. Data is stored as an
identification key <key = "key"> and a list of values <value = "value 1", "value
2", . . .>. A good example is an online store with session management and shopping
basket. The session ID is the identification key; the order number is the value stored
in the cache. In document stores, records are managed as documents within the
NoSQL database. These documents are structured files which describe an entire
subject matter in a self-contained manner. For instance, together with an order
number, the individual items from the basket are stored as values in addition to the
customer profile. The third example shows a graph database on movies and actors
discussed in the next section.
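For the key-value example above, a minimal sketch in the notation just introduced (the concrete key and value strings are illustrative):

<key = "Session-ID 3">  <value = "Order-Nr 3">
<key = "Order-Nr 3">    <value = "Item 1", "Item 2", "Item 3">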
1.4 Graph Databases

1.4.1 Graph-Based Model

NoSQL databases support various database models (see Fig. 1.9). As a first example, we
discuss graph databases and look at their characteristics.
Property Graph
Property graphs consist of nodes (concepts, objects) and directed edges
(relationships) connecting the nodes. Both nodes and edges are given a label and
can have properties. Properties are given as attribute-value pairs with the names of
attributes and the respective values.
Fig. 1.10 Graph model of a movie collection: nodes MOVIE (Title, Year), GENRE (Type), ACTOR (Name, Birthyear), and DIRECTOR (Name, Nationality); edges ACTED_IN (with property Role) from ACTOR to MOVIE, HAS from MOVIE to GENRE, and DIRECTED_BY from MOVIE to DIRECTOR
A graph abstractly presents the nodes and edges with their properties. Figure 1.10
shows part of a movie collection as an example. It contains the nodes MOVIE with
attributes Title and Year (of release), GENRE with the respective Type (e.g., crime,
mystery, comedy, drama, thriller, western, science fiction, documentary, etc.),
ACTOR with Name and Year of Birth, and DIRECTOR with Name and Nationality.
The example uses three directed edges: The edge ACTED_IN shows which artist
from the ACTOR node starred in which film from the MOVIE node. This edge also
has a property, the Role of the actor in the movie. The other two edges, HAS and
DIRECTED_BY, go from the MOVIE node to the GENRE and DIRECTOR node,
respectively.
In the manifestation level, i.e., the graph database, the property graph contains the
concrete values (Fig. 1.11). For each node and for each edge, a separate record is
stored. Thus, in contrast to relational databases, the connections between the data are
not stored and indexed as key references, but as separate records. This leads to
efficient processing of network analyses.
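As an illustration (a sketch, not taken from the book), the segment of Fig. 1.11 could be created in Cypher roughly as follows; node labels are written as in the queries of this section (Actor, Movie), and property names follow the figure:

CREATE (a:Actor {Name: 'Keanu Reeves', Birthyear: 1964})
CREATE (m1:Movie {Title: 'The Matrix', Year: 1999})
CREATE (m2:Movie {Title: 'Man of Tai Chi', Year: 2013})
CREATE (a)-[:ACTED_IN {Role: 'Neo'}]->(m1)
CREATE (a)-[:ACTED_IN {Role: 'Donaka Mark'}]->(m2)
CREATE (m2)-[:DIRECTED_BY]->(a)

Each node and each edge is stored as a record of its own, which is what makes the traversals described next efficient.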
1.4.2 Graph Query Language Cypher

Cypher is a declarative query language for extracting patterns from graph databases.
ISO plans to extend Cypher to become the international standard for graph-based
database languages as Graph Query Language (GQL) by 2023.
Fig. 1.11 Section of the graph database on movies and actors: an ACTOR node (Name: Keanu Reeves, Birthyear: 1964) and two MOVIE nodes (Title: The Matrix, Year: 1999; Title: Man of Tai Chi, Year: 2013), connected by ACTED_IN edges with the Role properties Neo and Donaka Mark and by a DIRECTED_BY edge linking "Man of Tai Chi" to Keanu Reeves

Users define their query by specifying nodes and edges. The database management
system then calculates all patterns meeting the criteria by analyzing the
possible paths (connections between nodes via edges). The user declares the struc-
ture of the desired pattern, and the database management system’s algorithms
traverse all necessary connections (paths) and assemble the results.
As described in Sect. 1.4.1, the data model of a graph database consists of nodes
(concepts, objects) and directed edges (relationships between nodes). In addition to
their name, both nodes and edges can have a set of properties (see Property Graph in
Sect. 1.4.1). These properties are represented by attribute-value pairs.
Figure 1.11 shows a segment of a graph database on movies and actors. To keep
things simple, only two types of nodes are shown: ACTOR and MOVIE. ACTOR
nodes contain two attribute-value pairs, specifically (Name: FirstName LastName)
and (YearOfBirth: Year).
The segment in Fig. 1.11 includes different types of edges: The ACTED_IN
relationship represents which actors starred in which movies. Edges can also have
properties if attribute-value pairs are added to them. For the ACTED_IN relation-
ship, the respective roles of the actors in the movies are listed. For example, Keanu
Reeves is the hacker Neo in “The Matrix.”
Nodes can be connected by multiple relationship edges. The movie “Man of Tai
Chi” and actor Keanu Reeves are linked not only by the actor’s role (ACTED_IN)
but also by the director position (DIRECTED_BY). The diagram therefore shows
that Keanu Reeves both directed the movie “Man of Tai Chi” and starred in it as
Donaka Mark.
If we want to analyze this graph database on movies, we can use Cypher. It uses
the following basic query elements:
For instance, the Cypher query for the year the movie “The Matrix” was released
would be:
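MATCH (m: Movie {Title: "The Matrix"})
RETURN m.Year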
The query binds the variable m to the movie "The Matrix" and returns the
movie's year of release via m.Year. In Cypher, parentheses always indicate nodes,
i.e., (m: Movie) declares the control variable m for the MOVIE node. In addition to
control variables, individual attribute-value pairs can be included in curly brackets.
Since we are specifically interested in the movie “The Matrix,” we can add {Title:
“The Matrix”} to the node (m: Movie).
Queries regarding the relationships within the graph database are a bit more
complicated. Relationships between two arbitrary nodes (a) and (b) are expressed
in Cypher by the arrow symbol "->", i.e., the path from (a) to (b) is declared as
"(a) -> (b)." If the specific relationship between (a) and (b) is of importance, the edge
[r] can be inserted in the middle of the arrow. The square brackets represent edges,
and r is our variable for relationships.
Now, if we want to find out who played Neo in “The Matrix,” we use the
following query to analyze the ACTED_IN path between ACTOR and MOVIE:
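MATCH (a: Actor)-[: ACTED_IN {Role: "Neo"}]->(: Movie {Title: "The Matrix"})
RETURN a.Name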
Cypher will return the result Keanu Reeves. For a list of movie titles (m), actor
names (a), and respective roles (r), the query would have to be:
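MATCH (a: Actor)-[r: ACTED_IN]->(m: Movie)
RETURN m.Title, a.Name, r.Role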
Since our example graph database only contains one actor and two movies, the
result would be the movie “Man of Tai Chi” with actor Keanu Reeves in the role of
Donaka Mark and the movie “The Matrix” with Keanu Reeves as Neo.
In real life, however, such a graph database of actors, movies, and roles has
countless entries. A manageable query would therefore have to remain limited, e.g.,
to actor Keanu Reeves, and would then look like this:
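MATCH (a: Actor {Name: "Keanu Reeves"})-[r: ACTED_IN]->(m: Movie)
RETURN m.Title, r.Role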
Similar to SQL, Cypher uses declarative queries where the user specifies the
desired properties of the result pattern (Cypher) or results table (SQL) and the
respective database management system then calculates the results. However,
analyzing relationship networks, using recursive search strategies, or analyzing
graph properties is hardly possible with SQL.
Graph databases are even more relationship-oriented than relational databases.
Both nodes and edges of the graph are independent data sets. This allows efficient
traversal of the graph for network-like information. However, there are applications
that focus on structured objects as a unit. Document databases, which are described
in the next section, are suitable for this purpose.
1.5 Document Databases

1.5.1 Document Model

Digital Document
A digital document is a set of information that describes a subject matter as a closed
unit and is stored as a file in a computer system.
In contrast, as shown in the previous section, a graph database would use different
node and edge types. A separate data set would be stored for each node and for each
edge. The data would be divided in a network-like manner (cf. Fig. 1.12 to the right).
Data records in document databases have a structuring that divides the content
into recognizable subunits. Lists of field values can be nested in a tree-like manner.
Fig. 1.12 Invoice data is stored in a self-contained manner in the document model

For example, the invoice document in Fig. 1.12 contains an "Item" field. This
contains a list of items, each of which again has fields such as "Name" and
"Price" with corresponding values. More often than lists or arrays, the complex
object structure is used to organize documents. The JSON (JavaScript Object
Notation) format is a syntax for describing complex objects that is particularly
suitable for Web development in JavaScript (see Sect. 2.5.1).
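A JSON sketch of what such an invoice document might look like (the vendor name comes from the text; all other field values are purely illustrative):

{
  "Vendor":   { "Company": "Miller Elektro" },
  "Customer": { "Company": "Smith Construction" },
  "Item": [
    { "Name": "Cable", "Price": 12.50 },
    { "Name": "Plug",  "Price": 3.20 }
  ]
}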
For example, if we want to display the invoices of the company “Miller Elektro”
in Fig. 1.12, we can use the following MQL query:
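db.INVOICES.find(
  { "Vendor.Company": "Miller Elektro" }
)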
This will make the database system return a list of invoices that match the filter
criterion. Each document is output in a complex object structure with a unique
identification key. This way, we get all the complete data for each invoice with
self-contained records.
In this code example, the constant “db” is an object that provides the functionality
of the database system. Collections of the database are accessible as child objects in
fields of the “db” object, e.g., db.INVOICES, providing methods such as find,
insertOne, updateOne, and deleteOne.
The query language MQL is structured with JSON. For example, the filter in the
find() method is passed as a parameter in JSON notation, which lists the filter criteria
as a field-value pair.
If we want to output a list of customers to whom the company “Miller Elektro”
has written an invoice, this is accomplished with a second argument:
db.INVOICES.find(
  { "Vendor.Company": "Miller Elektro" },
  { "Customer.Company": 1, _id: 0 }
)
The second list defines the projection with fields that are either included (value 1)
or excluded (value 0). Here, the field “Company” of the subobject “Customer” is
included in the result as an inclusion projection; the field _id is excluded. Thus, we
get a list of JSON documents containing only the values of the included fields:
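{ "Customer": { "Company": "..." } }
{ "Customer": { "Company": "..." } }

(The concrete company names depend on the invoice data in Fig. 1.12.)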
Unlike SQL, MQL evolved in practice and is based on the JSON format, whose
creator says he did not invent it but “discovered” it because it already “existed in
nature.” Because of this organic development, many concepts of MQL appear
somewhat different from those of SQL, which have been theorized based on
mathematical principles.
1.6 Organization of Data Management

Many companies and organizations view their data as a vital resource, increasingly
joining in public information gathering (open data) in addition to maintaining their
own data. The continuous global increase of data volume and the growth of
information providers and their 24/7 services reinforce the importance of
Web-based data pools.
The necessity for current information that reflects the real world has a direct impact
on how the field of IT is organized. In many places, specific positions for data
management have been created for a more targeted approach to data-related tasks
and obligations. Proactive data management deals both strategically with informa-
tion gathering and utilization and operatively with the efficient provision and
analysis of current and consistent data.
Development and operation of data management incur high costs, while the
return is initially hard to measure. Flexible data architecture, non-contradictory
and easy-to-understand data description, clean and consistent databases, effective
security concepts, current information readiness, and other factors involved are hard
to assess and include in profitability considerations. Only the realization of the data’s
importance and longevity makes the necessary investments worthwhile for the
company.
For better comprehension of the term data management, we will look at the four
subfields: data architecture, data governance, data technology, and data utilization.
Figure 1.13 illustrates the objectives and tools of these four fields within data
management.
Data utilization enables the actual, profitable application of business data. A
specialized team of data scientists conducts business analytics, providing and
reporting on data analyses to management. They also support individual
departments, e.g., marketing, sales, customer service, etc., in generating specific
relevant insights from Big Data. Questions that arise in connection with data use are
the following:
• What are the components, interfaces, and data flows of the database and informa-
tion systems?
• Which entities, relationships, and attributes are mapped for the use case?
• Which data structures and data types are used by the DBMS to organize the data?
• Who plans, develops, and operates the database and information systems using
what methods?
• Who has what access to the data?
• How are security, confidentiality, integrity, and availability requirements met?
Data technology specialists install, monitor, and reorganize databases and are in
charge of their multilayer security. Their field further includes technology manage-
ment and the need for the integration of new extensions and constant updates and
improvements of existing tools and methods. The data flows from and to the
database systems, and the user interfaces are also provided technologically. For
Big Data, it is of central importance that the speed of data processing is also
optimized for large data volumes. Thus, data engineering deals with the following
questions:
• Which SQL or NoSQL database software is used and for what reasons?
• How is the database system implemented and integrated?
• How is the data entered or migrated into the database?
• How is the data queried, manipulated, and transformed?
• How can the database system and queries be optimized in terms of volume and
speed?
Data Management
Data management includes all operational, organizational, and technical aspects of
data usage, data architecture, data administration, and data technology that optimize
the deployment of data as a resource.
Bibliography
Celko, J.: Joe Celko’s Complete Guide to NoSQL – What every SQL professional needs to know
about nonrelational databases. Morgan Kaufmann (2014)
Connolly, T., Begg, C.: Database Systems – A Practical Approach to Design, Implementation, and
Management. Pearson (2015)
Coronel, C., Morris, S.: Database Systems – Design, Implementation, & Management. Cengage
Learning (2018)
Edlich, S., Friedland, A., Hampe, J., Brauer, B., Brückner, M.: NoSQL – Einstieg in die Welt
nichtrelationaler Web 2.0 Datenbanken. Carl Hanser Verlag (2011)
Elmasri, R., Navathe, S.: Fundamentals of Database Systems. Addison-Wesley (2022)
Fasel, D., Meier, A. (eds.): Big Data – Grundlagen, Systeme und Nutzungspotenziale. Edition
HMD, Springer (2016)
Hoffer, J., Venkataraman, R.: Modern Database Management. Pearson (2019)
Kemper, A., Eickler, A.: Datenbanksysteme – Eine Einführung. DeGruyter (2015)
MongoDB, Inc.: MongoDB Documentation (2022)
Perkins, L., Redmond, E., Wilson, J.R.: Seven Databases in Seven Weeks: A Guide to Modern
Databases and the Nosql Movement, 2nd edn. O’Reilly UK Ltd., Raleigh, NC (2018)
Ploetz, A., Kandhare, D., Kadambi, S., Wu, X.: Seven NoSQL Database in a Week – Get Up and
Running with the Fundamentals and Functionalities of Seven of the Most Popular NoSQL
Databases. Packt Publishing (2018)
Saake, G., Sattler, K.-U., Heuer, A.: Datenbanken – Konzepte und Sprachen. mitp (2018)
Silberschatz, A., Korth, H., Sudarshan, S.: Database Systems Concepts. McGraw Hill (2019)
Steiner, R.: Grundkurs Relationale Datenbanken – Einführung in die Praxis der
Datenbankentwicklung für Ausbildung, Studium und IT-Beruf. Springer Vieweg (2021)
Ullman, J., Garcia-Molina, H., Widom, J.: Database Systems – The Complete Book. Pearson (2013)
2 Database Modeling

2.1 From Requirements Analysis to Database
Data models provide a structured and formal description of the data and data
relationships required for an information system. Based on this, a database model
or schema defines the corresponding structuring of the database. When data is
needed for IT projects, such as the information about employees, departments, and
projects in Fig. 2.1, the necessary data categories and their relationships with each
other can be defined. The definition of those data categories, called entity sets, and
the determination of relationship sets are at this point done without considering the
kind of database management system (SQL or NoSQL) to be used for entering,
storing, and maintaining the data later. This is to ensure that the data and data
relationships will remain stable from the users’ perspective throughout the develop-
ment and expansion of information systems.
It takes three steps to set up a database structure: requirement analysis, conceptual
data modeling, and implementing database schemas by mapping the entity relation-
ship model to SQL or NoSQL databases.
The goal of requirement analysis (see point 1 in Fig. 2.1) is to find, in cooperation
with the user, the data required for the information system and their relationships to
each other including the quantity structure. This is vital for an early determination of
the system boundaries. The requirements catalog is prepared in an iterative process,
based on interviews, demand analyses, questionnaires, form compilations, etc. It
contains at least a verbal task description with clearly formulated objectives and a list
of relevant pieces of information (see the example in Fig. 2.1). The written descrip-
tion of data connections can be complemented by graphical illustrations or a
summarizing example. It is imperative that the requirement analysis puts the facts
necessary for the later development of a database in the language of the users.
Step 2 in Fig. 2.1 shows the conception of the entity-relationship model, which
contains both the required entity sets and the relevant relationship sets. Our model
depicts the entity sets as rectangles and the relationship sets as rhombi. Based on the
requirement catalog from step 1, the main entity sets are DEPARTMENT,
EMPLOYEE, and PROJECT.¹

¹ The names of entity and relationship sets are spelled in capital letters, analogous to table, node, and edge names.

Fig. 2.1 From requirements analysis to the database: (1) requirements analysis, (2) entity-relationship model with the entity sets DEPARTMENT, EMPLOYEE, and PROJECT and the relationship sets for membership (IS_MEMBER/MEMBERSHIP) and project involvement (IS_INVOLVED/INVOLVED), (3) implementation in a relational, graph, or document model; in the document model (3c), an EMPLOYEE document nests DEPARTMENT and PROJECTS with Name and Workload fields

Fig. 2.2 EMPLOYEE entity set with the attributes E#, Name, Street, and City
an entity-relationship model. This allows for the gathering and discussion of data
modeling factors with the users, independent from any specific database system.
Only in the next design step is the most suitable database schema determined and
mapped out. For relational, graph-oriented, and document-oriented databases, there
are clearly defined mapping rules.
2.2 The Entity-Relationship Model

2.2.1 Entities and Relationships

An entity is a specific object in the real world or our imagination that is distinct from
all others. This can be an individual, an item, an abstract concept, or an event.
Entities of the same type are combined into entity sets and further characterized by
attributes. These attributes are property categories of the entity and/or the entity set,
such as size, name, weight, etc.
For each entity set, an identification key, i.e., one attribute or a specific combina-
tion of attributes, is set as unique. In addition to uniqueness, it also has to meet the
criterion of the minimal combination of attributes for identification keys as described
in Sect. 1.2.1.
In Fig. 2.2, an individual employee is characterized as an entity by their concrete
attributes. If, in the course of internal project monitoring, all employees are to be
listed with their names and address data, an entity set EMPLOYEE is created. An
artificial employee number in addition to the attributes Name, Street, and City allows
for the unique identification of the individual employees (entities) within the staff
(entity set).
Besides the entity sets themselves, the relationships between them are of interest
and can form sets of their own. Similar to entity sets, relationship sets can be
characterized by attributes.
Figure 2.3 presents the statement “Employee Murphy does 70 % of their work on
project P17” as a concrete example of an employee-project relationship. The respec-
tive relationship set INVOLVED is to list all project participations of the employees.
It contains a concatenated key constructed from the foreign keys employee number
and project number. This combination of attributes ensures the unique identification
of each project participation by an employee. Along with the concatenated key, the
relationship set receives its own attribute named “Percentage” specifying the per-
centage of working hours that employees allot to each project they are involved in.
In general, relationships can be understood as associations in two directions: The
relationship set INVOLVED can be interpreted from the perspective of the
EMPLOYEE entity set as “one employee can participate in multiple projects” and
from the entity set PROJECT as “one project is handled by multiple employees.”
The association of an entity set ES_1 to another entity set ES_2, also called role, is
the meaning of the relationship in that direction.

Fig. 2.4 Entity sets DEPARTMENT and EMPLOYEE with the relationship sets DEPARTMENT_HEAD (association types c and 1) and MEMBERSHIP (association types 1 and m); association types: Type 1 "exactly one", Type c "none or one", Type m "one or multiple", Type mc "none, one, or multiple"

As an example, the relationship
DEPARTMENT_HEAD in Fig. 2.4 has two associations: On the one hand, each
department has one employee in the role of department head; on the other hand,
some employees could fill the role of department head for a specific department.
Associations are sometimes also labeled. This is important when multiple
relationships are possible between two identical entity sets.
Each association from an entity set ES_1 to an entity set ES_2 can be weighted by
an association type. The association type from ES_1 to ES_2 indicates how many
entities of the associated entity set ES_2 can be assigned to a specific entity from
ES_1.2 The main distinction is between single, conditional, multiple, and multiple-
conditional association types.
² It is common in database literature to note the association type from ES_1 to ES_2 next to the associated entity set, i.e., ES_2.
The association types provide information about the cardinality of the relation-
ship. As we have seen, each relationship contains two association types. The
cardinality of a relationship between the entity sets ES_1 and ES_2 is therefore a
pair of association types and can be noted in the form (A1, A2).
For example, the pair (mc,m) of association types between EMPLOYEE and
PROJECT indicates that the INVOLVED relationship is (multiple-conditional,
multiple).
Figure 2.5 shows all 16 possible combinations of association types. The first
quadrant contains four options of unique-unique relationships (case B1 in Fig. 2.5).
They are characterized by the cardinalities (1,1), (1,c), (c,1), and (c,c). For case B2,
the unique-complex relationships, also called hierarchical relationships, there are
eight possible combinations. The complex-complex or network-like relationships
(case B3) comprise the four cases (m,m), (m,mc), (mc,m), and (mc,mc).
Instead of the association types, minimum and maximum thresholds can be set if
deemed more practical. For instance, instead of the multiple association type from
projects to employees, a range of (MIN,MAX) := (3,8)³ could be set. The lower
threshold defines that at least three employees must be involved in a project, while
the maximum threshold limits the number of participating employees to eight.
³ The character combination ":=" stands for "is defined by."

Fig. 2.5 Overview of the 16 possible cardinalities, arranged by the association types A1 and A2 (1, c, m, mc)

2.2.3 Generalization and Aggregation
• Overlapping entity subsets: The specialized entity sets overlap with each other.
As an example, if the entity set EMPLOYEE has two subsets PHOTO_CLUB
and SPORTS_CLUB, the club members are consequently considered employees.
However, employees can be active in both the company’s photography and sports
club, i.e., the entity subsets PHOTO_CLUB and SPORTS_CLUB overlap.
• Overlapping complete entity subsets: The specialization entity sets overlap with
each other and completely cover the generalized entity set. If we add a
Fig. 2.6 Disjoint and complete generalization of the entity set EMPLOYEE into the specializations MANAGEMENT_POSITION, SPECIALIST, and TRAINEE (association type c for each subset)

Fig. 2.7 Network-like aggregation CORPORATION_STRUCTURE on the entity set COMPANY, with the readings "Company consists of..." and "Subsidiary is dependent on..." (association types mc and mc)

Fig. 2.8 Hierarchical aggregation ITEM_LIST on the entity set ITEM (association types c and mc)
Each item can be composed of multiple sub-items, while on the other hand, each
sub-item points to exactly one superordinate item.
The entity-relationship model is very important for computer-based data
modeling tools, as it is supported by many CASE (computer-aided software engi-
neering) tools to some extent. Depending on the quality of these tools, both general-
ization and aggregation can be described in separate design steps, on top of entity
and relationship sets. Only then can the entity-relationship model be converted, in
part automatically, into a database schema. Since this is not always a one-to-one
mapping, it is up to the data architect to make the appropriate decisions. The
following sections provide some simple mapping rules to help in converting an
entity-relationship model into a relational, graph, or document database.
2.3 Implementation in the Relational Model

2.3.1 Dependencies and Normal Forms

The study of the relational model has spawned a new database theory that precisely
describes formal aspects.
Relational Model
The relational model represents both data and relationships between data as tables.
Mathematically speaking, any relation R is simply a set of n-tuples. Such a relation is
always a subset of a Cartesian product of n attribute domains, R ⊆ D1 × D2 × . . . ×
Dn, with Di as the domain of the i-th attribute/property. A tuple is an ordered set of
specific data values or manifestations, r = (d1, d2, . . ., dn). Please note that this
definition means that any tuple may only exist once within any table, i.e., a relation R
is a tuple set R = {r1, r2, . . ., rm}.
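As a small worked instance (using the EMPLOYEE example from Chap. 1), the relation and its tuples can be written as

R_EMPLOYEE ⊆ D_E# × D_Name × D_City
r1 = (E19, Stewart, Stow), r2 = (E4, Bell, Kent), r3 = (E1, Murphy, Kent), r4 = (E7, Howard, Cleveland)
EMPLOYEE = {r1, r2, r3, r4}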
The relational model is based on the works of Edgar Frank Codd from the early
1970s. They were the foundation for the first relational database systems, created in
research facilities and supporting SQL or similar database languages. Today, their
sophisticated successors are firmly established in many practical uses.
One of the major fields within this theory are the normal forms, which are used to
discover and study dependencies within tables in order to avoid redundant informa-
tion and resulting anomalies.
Attribute Redundancy
An attribute in a table is redundant if individual values of this attribute can be
omitted without a loss of information.
To give an example, the following table DEPARTMENT_EMPLOYEE contains
employee number, name, street, and city for each employee, plus their department
number and department name.
For every employee of department D6, the table in Fig. 2.9 lists the department
name Accounting. If we assume that each department consists of multiple
employees, similar repetitions would occur for all departments. We can say that
the DepartmentName attribute is redundant, since the same value is listed in the table
multiple times. It would be preferable to store the name going with each department
number in a separate table for future reference instead of redundantly carrying it
along for each employee.
Tables with redundant information can lead to database anomalies, which can
take one of three forms: If, for organizational reasons, a new department D9, labeled
marketing, is to be defined in the DEPARTMENT_EMPLOYEE table from Fig. 2.9,
but there are not yet any employees assigned to that department, there is no way of
adding it. This is an insertion anomaly—no new table rows can be inserted without a
unique employee number.
Deletion anomalies occur if the removal of some data results in the inadvertent
loss of other data. For example, if we were to delete all employees from the
DEPARTMENT_EMPLOYEE table, we would also lose the department numbers
and names.
The last kind are update anomalies (or modification anomalies): If the name of
department D3 were to be changed from IT to Data Processing, each of the
department’s employees would have to be edited individually, meaning that
although only one detail is changed, the DEPARTMENT_EMPLOYEE table has
to be adjusted in multiple places. This inconvenient situation is what we call an
update anomaly.
The following paragraphs discuss normal forms, which help to avoid
redundancies and anomalies. Figure 2.10 gives an overview over the various normal
forms and their definition. Below, we will take a closer look at different kinds of
dependencies and give some practical examples.
As seen in Fig. 2.10, the normal forms progressively limit acceptable tables. For
instance, a table or entire database schema in the third normal form must meet all
requirements of the first and second normal form, plus there must be no transitive
dependencies between non-key attributes.
In the following, the first, second, and third normal forms are treated and
discussed with examples. Because they are of little practical relevance, even more
restrictive normal forms are not discussed further; we refer readers with a theoretical
interest to the relevant literature.⁴
Understanding the normal forms helps to make sense of the mapping rules from
an entity-relationship model to a relational model (see Sect. 2.3.2). In fact, we will
see that with a properly defined entity-relationship model and consistent application
of the relevant mapping rules, the normal forms will always be met. Simply put, by
creating an entity-relationship model and using mapping rules to map it onto a
relational database schema, we can mostly forget checking the normal forms for
each individual design step.
Functional Dependencies
The first normal form is the basis for all other normal forms and is defined as follows:
⁴ For example, author Graeme C. Simsion presents a simplified hybrid of fourth and fifth normal forms he calls "Business Fifth Normal Form" in "Data Modeling Essentials," which is easy for newcomers to data modeling to understand.
PROJECT_EMPLOYEE (unnormalized)
E#   P#   Name     City
E7   P1   Howard   Cleveland
E7   P9   Howard   Cleveland
E1   P7   Murphy   Kent
E1   P11  Murphy   Kent
E1   P9   Murphy   Kent
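A hedged SQL sketch of the usual split into second normal form (table names and data types are assumptions): the non-key attributes Name and City then depend on the whole key of their table, while the project participation keeps the concatenated key.

CREATE TABLE EMPLOYEE (
  "E#"  CHAR(6) PRIMARY KEY,
  Name  VARCHAR(20),
  City  VARCHAR(30)
);

CREATE TABLE INVOLVED (
  "E#"  CHAR(6) REFERENCES EMPLOYEE,
  "P#"  CHAR(6),
  PRIMARY KEY ("E#", "P#")   -- concatenated key
);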
Transitive Dependencies
In Fig. 2.12, we return to the DEPARTMENT_EMPLOYEE table from earlier,
which contains department information in addition to the employee details. We can
immediately tell that the table is in both first and second normal form—since there is
no concatenated key, we do not even have to check for full functional dependency.
However, the DepartmentName attribute is still redundant. This can be fixed using
the third normal form.
Fig. 2.12 Transitive dependency in the DEPARTMENT_EMPLOYEE table: E# → D# and D# → DepartmentName hold, while E# is not functionally dependent on D#; DepartmentName is therefore transitively dependent on E#
2.3.2 Mapping Rules for Relational Databases

This section discusses how to map the entity-relationship model onto a relational
database schema, i.e., how entity sets and relationship sets can be represented in
tables.
Database Schema
A database schema is the description of a database, i.e., the specification of the
database structures and the associated integrity constraints. A relational database
schema contains definitions of the tables, the attributes, and the primary keys.
Integrity constraints set limits for the domains, the dependencies between tables,
and the actual data.
By definition, a table requires a unique primary key (see Sect. 1.2.1). It is possible
that there are multiple candidate keys in a table, all of which meet the requirement of
uniqueness and minimality. In such cases, it is up to the data architects which
candidate key they would like to use as the primary key.
Fig. 2.13 Mapping entity and relationship sets onto tables according to rules R1 and R2: the entity sets DEPARTMENT, EMPLOYEE, and PROJECT and the relationship sets DEPARTMENT_HEAD (c,1), MEMBERSHIP (1,m), and INVOLVED (mc,m) each become a table of their own
The term foreign key describes an attribute within a table that is used as an
identification key in at least one other table (possibly also within this one). Identifi-
cation keys can be reused in other tables to create the desired relationships between
tables.
Figure 2.13 shows how rules R1 and R2 are applied to a concrete example: Each
of the entity sets DEPARTMENT, EMPLOYEE, and PROJECT is mapped onto a
corresponding table DEPARTMENT, EMPLOYEE, and PROJECT. Similarly,
44 2 Database Modeling
tables are defined for each of the relationship sets DEPARTMENT_HEAD, MEM-
BERSHIP, and INVOLVED.
The DEPARTMENT_HEAD table uses the department number D# as primary key.
Since each department has exactly one department head, the department number D#
suffices as identification key for the DEPARTMENT_HEAD table.
The MEMBERSHIP table uses the employee number E# as primary key. Like-
wise, E# can be the identification key of the MEMBERSHIP table because each
employee belongs to exactly one department.
In contrast, the INVOLVED table requires the foreign keys employee number E#
and project number P# to be used as a concatenated key, since one employee can
work on multiple projects and each project can involve multiple employees. In
addition, the INVOLVED table also lists the Percentage attribute as another characteristic of the relationship.
The use of rules R1 and R2 alone does not necessarily result in an ideal relational
database schema as this approach may lead to a high number of individual tables. For
instance, it seems doubtful whether it is really necessary to define a separate table for
the role of department head in our example from Fig. 2.13. As shown in the next
section, the DEPARTMENT_HEAD table is indeed not required under mapping
rule R5. The department head role would instead be integrated as an additional
attribute in the DEPARTMENT table, listing the employee number of the respective
head for each department.
Rule R3 requires that the relationship set INVOLVED from Fig. 2.14 be defined as a separate table with its own primary key, which in our case is the concatenated key expressing the foreign key relationships to the tables EMPLOYEE and PROJECT.
The Percentage attribute describes the share of the project involvement in the
employee’s workload.
Under rule R2, we could define a separate table for the MEMBERSHIP relation-
ship set with the two foreign keys department number and employee number. This
would be useful if we were supporting matrix management and planning to get rid of
unique subordination with the association type 1, since this would result in a
complex-complex relationship between DEPARTMENT and EMPLOYEE.
Following rule R4, we forgo a separate MEMBERSHIP table in Fig. 2.15. Instead
of the additional relationship set table, we add the foreign key D#_Sub to the
EMPLOYEE table to list the appropriate department number for each employee.
The foreign key relationship is defined by an attribute created from the carried-over
identification key D# and the role name Subordination.
For unique-complex relationships, including the foreign key can uniquely iden-
tify the relationship. In Fig. 2.15, the department number is taken over into the
EMPLOYEE table as a foreign key according to rule R4. If, reversely, the employee
numbers were listed in the DEPARTMENT table, we would have to repeat the
department name for each employee of a department. Such unnecessary and redun-
dant information is unwanted and goes against the theory of the normal forms (in this
case, conflict with the second normal form; see Sect. 2.3.1).
Rule R4 (Fig. 2.15): DEPARTMENT (D#, DepartmentName) and EMPLOYEE (E#, Name, City, D#_Sub).
Here, too, it matters from which of the two tables the foreign key is taken: type 1 associations are preferable so that the foreign key, supplemented with its role name, can be included in each tuple of the referencing table (avoidance of null values; see also Sect. 3.3.4).
In Fig. 2.16, the employee numbers of the department heads are added to the
DEPARTMENT table, i.e., the DEPARTMENT_HEAD relationship set is
represented by the E#_DepHead attribute. Each entry in this referencing attribute
with the role “DepHead” shows who leads the respective department.
If we included the department numbers in the EMPLOYEE table instead, we
would have to list null values for most employees and could only enter the respective
department number for the few employees actually leading a department. Since null
values often cause problems in practice, they should be avoided whenever possible,
so it is better to have the “DepartmentHead” role in the DEPARTMENT table. For
(1,c) and (c,1) relationships, we can therefore completely prevent null values in the
foreign keys, while for (c,c) relationships, we should choose the option resulting in
the fewest null values.
Rule R5 (Fig. 2.16): DEPARTMENT (D#, DepartmentName, E#_DepHead) and EMPLOYEE (E#, Name, City).
Graph theory is a complex subject matter vital to many fields of use where it is
necessary to analyze or optimize network-like structures. Use cases range from computer networks, transport systems, work robots, power distribution grids, and electronic relays, through social networks, to economic areas such as corporate structures, workflows, customer management, logistics, and process management.
In graph theory, a graph is defined by the sets of its nodes (or vertices) and edges plus
assignments between these sets.
Undirected Graph
An undirected graph G = (V,E) consists of a vertex set V and an edge set E, with
each edge being assigned two (potentially identical) vertices.
Graph databases are often founded on the model of directed weighted graphs.
However, we are not yet concerned with the type and characteristics of the vertices
and edges, but rather the general abstract model of an undirected graph. This level of
abstraction is sufficient to examine various properties of network structures, such as:
• How many edges have to be passed over to get from one node to another one?
• Is there a path between two nodes?
• Is it possible to traverse the edges of a graph visiting each vertex once?
• Can the graph be drawn two-dimensionally without any edges crossing each
other?
These fundamental questions can be answered using graph theory and have
practical applications in a wide variety of fields.
Connected Graph
A graph is connected if there are paths between any two vertices.
One of the oldest graph problems illustrates how powerful graph theory can be: In 1736, Leonhard Euler asked whether there is a round trip through the city of Königsberg that crosses each of its bridges exactly once. A closed path that traverses every edge of a graph exactly once is called an Eulerian cycle.
Degree of a Vertex
The degree of a vertex is the number of edges incident to it, i.e., originating from it.
The decision problem for an Eulerian cycle is therefore easily answered: A graph G is Eulerian if it is connected and every vertex has an even degree.
Figure 2.17 shows a street map with 13 bridges. The nodes represent districts, and the edges represent the bridges connecting them. Every vertex in this example has an even degree, which means that there must be an Eulerian cycle.
(Figure: the weighted graph annotated by Dijkstra's algorithm. Each vertex carries its predecessor pre_v and its distance dist from the initial node v0, e.g., S7 = {v0, v3, v2, v5, v1, v6, v4, v7}; the destination node v7 is reached via v6 with distance 7.)
Weighted Graph
Weighted graphs are graphs whose vertices or edges have weights assigned to them.
Weight of a Graph
The weight of a graph is the sum of all weights within the graph, i.e., all node or edge
weights.
Dijkstra’s Algorithm
• (1) Initialization: Set the distance of the initial node to 0 and of all other nodes to infinity. Define the set S0 := {pre_v: initial node, dist: 0}.
• (2) Iterate Sk while there are still unvisited vertices and expand the set Sk in each
step as described below:
– (2a) Calculate the sum of the respective edge weights for each neighboring
vertex of the current node.
– (2b) Select the neighboring vertex with the smallest sum.
– (2c) If the sum of the edge weights for that node is smaller than the distance
value stored for it, set the current node as the previous vertex (pre_v) for it and
enter the new distance in Sk.
It becomes obvious that with this algorithm, the edges traversed are always
those with the shortest distance from the current node. Other edges and nodes are
considered only when all shorter paths have already been included. This method
ensures that when a specific vertex is reached, there can be no shorter path (greedy
algorithm5). The iterative procedure is repeated until either the distance from
initial to destination node has been determined or all distances from the initial
node to all other vertices have been calculated.
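For illustration, the shortest route between two vertices of such a weighted graph can also be computed declaratively with the graph query language Cypher introduced in Sect. 3.4. The following sketch assumes vertices labeled Vertex with an id property and weighted EDGE relationships (names chosen for illustration); unlike Dijkstra's greedy strategy, it simply enumerates candidate paths and keeps the cheapest one:

MATCH path = (a:Vertex {id: 'v0'})-[:EDGE*]-(b:Vertex {id: 'v7'})
RETURN path,
       reduce(total = 0, r IN relationships(path) | total + r.weight) AS dist
ORDER BY dist ASC
LIMIT 1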
Property Graph
Graph databases have a structuring scheme, the property graph, which was
introduced in Sect. 1.4.1. Formally, a property graph can be defined by a set of nodes and a set of edges, together with labels and properties (attribute-value pairs) assigned to them.
5
In each step, greedy algorithms select the locally optimal subsequent conditions according to the
relevant metric.
In a graph database, data is stored as nodes and edges, which contain as properties
node and edge types and further data, e.g., in the form of attribute-value pairs. Unlike
conventional graphs, property graphs are multigraphs, i.e., they allow multiple edges
between two nodes. To do this, edges are given their own identity and are no longer
defined by pairs of nodes, but by two indicators that define the beginning and end of
the edge. This edge identity and the mapping of edges as their own data sets lead to
the constant performance in graph analyses, independent of data volume (see Sect. 5.2.7 on "Index-Free Adjacency").
The center of Fig. 2.19 shows how the entity sets DEPARTMENT, EMPLOYEE,
and PROJECT are mapped onto corresponding nodes of the graph database, with the
attributes attached to the nodes (attributed vertices).
(Figures for mapping rules G1 to G4: the entity sets DEPARTMENT, EMPLOYEE, and PROJECT become attributed nodes, and the relationship sets DEPARTMENT_HEAD, MEMBERSHIP, and INVOLVED become edge types such as IS_MEMBER and IS_INVOLVED, the latter carrying the edge property Percentage.)
For instance, Fig. 2.22 illustrates the definition of department heads: The rela-
tionship set DEPARTMENT_HEAD becomes the directed edge
HAS_DEPARTMENT_HEAD leading from the DEPARTMENT node (D) to the
EMPLOYEE node (E). The arrowhead is associated with “1,” since each department
has exactly one department head.
The graph-based model is highly flexible and offers lots of options, since it is not
limited by normal forms. However, users can use this freedom too lavishly, which
may result in overly complex, potentially redundant graph constellations. The
presented mapping rules for entity sets (G1) and relationship sets (G2, G3, G4, and
G5) are guidelines that may be ignored based on the individual use case.
Complex Objects
Complex objects allow the description of structural relationships between semanti-
cally related data in their entirety. The holistic approach is intended to make
references and foreign keys unnecessary, which enables the scaling mentioned
above. Precisely for this purpose, complex objects represent a powerful structuring
tool that is easy to understand.
Complex objects are built from simpler objects by applying constructors to them.
The simplest objects include values such as numbers, text, and Boolean values.
These are called atomic objects. Based on this, composite structures can be created
by combining and structuring objects with so-called constructors. There are different
object constructors like tuples, sets, lists, and arrays. These constructors can be
applied to all objects: to atomic as well as to compound, constructed, complex
objects. Thus, by repeated application of the constructors, starting from atomic
objects, complex objects can be built, which can represent various relationships
between entities together with their attributes through their structuring.
A well-known example of a syntax for mapping complex objects is JSON, which
we will look at in detail below, since it forms the basis for the data structure of
common document databases.
The constructors OBJECT { } and LIST [ ] are orthogonal in the sense of complex
objects, because they can be applied to all values, i.e., to basic data types as well as to
more complex structures.
Document Structure
EMPLOYEE:
    Name:
    Place:
    DEPARTMENT:
        Designation:
    PROJECTS:
        Title:
        Workload:
        Title:
        Workload:
JSON-Syntax
{
"EMPLOYEE": {
"Name": "Murphy",
"Place": "Kent",
"DEPARTMENT": {
"Designation": "IT" },
"PROJECTS": [ {
"Title": "WebX",
"Workload": 0.55 }, {
"Title": "ITorg",
"Workload": 0.45 } ]
}}
Fig. 2.23 JSON representation of a fact about the use case in Fig. 2.1
As an example, Fig. 2.23 shows a JSON structure that meets the requirements of
the use case in Fig. 2.1. We see in a collection of JSON documents the description of
JSON-Schema
{
"type": "object",
"properties": {
"EMPLOYEE": { "type": "object",
"properties": {
"Name": { "type": "string" },
"Place": { "type": "string" },
"DEPARTMENT": { "type": "object",
"properties": {
"Designation": { "type": "string" } } },
"PROJECTS": { "type": "array",
"items": [ { "type": "object",
"properties": {
"Title": { "type": "string" },
"Workload": { "type": "number" } } }
]}}}}}
Fig. 2.24 Specification of the structure from Fig. 2.23 with JSON Schema
the case for the IT employee Murphy from Kent. He works on the WebX project
55% of the time and on the ITorg project 45% of the time.
To represent this situation, we need an object EMPLOYEE with fields Name and Place, a subobject DEPARTMENT with field Designation, and a list of subobjects PROJECT with fields Title and Workload.
JSON does not provide a schema definition as a standard. Since the validation of
data exchanges is relevant in practice, another standard has developed for this
purpose in the form of JSON Schema (cf. Fig. 2.24).
JSON Schema
A JSON Schema can be used to specify the structure of JSON data for validation and
documentation. JSON Schemas are descriptive JSON documents (metadata) that
specify an intersection pattern for JSON data necessary for an application. JSON
Schema is written in the same syntax as the documents to be validated. Therefore,
the same tools can be used for both schemas and data. JSON Schema was specified
in a draft by the Internet Engineering Task Force. There are several validators for
different programming languages. These can check a JSON document for structural
conformance with a JSON Schema document.
JSON Prototype
{
"EMPLOYEE": {
"Name": "",
"Place": "",
"DEPARTMENT": {
"Designation": "" },
"PROJECTS": [ {
"Title": "",
"Workload": 0 } ]
}}
For complex facts, JSON Schemas become unwieldy. They are well-suited for machine
validation, but are not easy for humans to read.
Therefore, we propose the prototype method for conceptual JSON data modeling.
A prototype (from Greek: πρωτότυπος, original image) is an exemplar that
represents an entire category. Thus, a JSON prototype is a JSON document
representing a class of JSON documents with the same structural elements
(OBJECT, PROPERTY, LIST). A JSON prototype defines the structure not as a
description by metadata, but by demonstration. For example, the document in
Fig. 2.23 can be viewed as a blueprint for documents having the same objects,
properties, and lists, where JSON data corresponding to this prototype can have
arbitrary data values (FIELD, VALUE) of the same data type in the same fields.
However, to distinguish JSON prototypes from concrete data, we propose to
represent the values with zero values instead of dummy values. These are the empty
string ("") for text, zero (0) for numbers, and true for truth values. For lists, we
assume that specified values represent repeatable patterns of the same structure.
In Fig. 2.25, we see a human-readable JSON prototype that can be used for modeling instead of the JSON Schema in Fig. 2.24. For conceptual design, this human-oriented approach of the JSON prototype is recommended. Moreover, for machine validation,
a JSON Schema can be generated from a prototype using appropriate tools.
For these reasons, we will use the JSON prototype method in the following for
modeling JSON structures and for mapping entity-relationship models in JSON
models.
Very similar to the mapping rules for the design of tables and the structuring of
graphs, we now look at how we can map entity and relationship sets in JSON
documents as objects, properties, and lists. As an illustrative example, Fig. 2.26
RULE D1
{
"EMPLOYEE": {
"Name": "",
"Place": ""
}}
{
"PROJECT": {
"Title": ""
}}
{
"DEPARTMENT": {
"Designation": ""
}}
Fig. 2.26 Mapping of selected entity sets and attributes to objects and properties
For example, in Fig. 2.26, the entity sets DEPARTMENT with attribute Designation, EMPLOYEE with attributes Name and Place, and PROJECT with attribute Title are mapped onto corresponding objects, each with a root element.
Now we consider the mapping of relationship sets to documents. Here we notice
that a complete document mapping a set of facts with multiple entities and
relationships implies an ordering of the entities and relationships. In the document
model, the relationships are no longer symmetric, but aggregated.
Rule D2 (Aggregation)
For each symmetric set of relationships mapped in a document type, an asymmetric
aggregation must be specified. It is decided which of the related entity sets will be
superordinately associated in the present use case and which entity set will be
subordinately associated.
RULE D2
RULE D3
{
"EMPLOYEE": {
"Name": "",
"Place": "",
"DEPARTMENT": {
"Designation": "" }
}}
For multivalued associations, list fields are needed. As another example, we see in
Fig. 2.28 that the association of EMPLOYEE to PROJECT is multiple (mc). Again,
EMPLOYEE is marked root element with an extra rectangle. Also, in Fig. 2.28,
INVOLVED is marked with a wider frame on the side of PROJECT to indicate that
projects will be aggregated into the EMPLOYEE object in the document. However,
this time, the association of EMPLOYEE to PROJECT is multivalued. Employees can work on different projects with different workloads (Percentage). Therefore, in Fig. 2.28, the projects of an employee are stored in a list field PROJECTS containing subobjects of type PROJECT.
The relationship attribute Percentage is stored as the property Workload in the
subobject PROJECT in Fig. 2.28. Here we see that from the entity-relationship
model, the attributes of relationships with composite keys (here, e.g., Percentage;
cf. Fig. 2.3) can be mapped in JSON as fields of the subobjects (in this case,
projects), since these take on the context of the parent objects (such as the employees
in this case).
A document type stores a data structure with respect to a particular use case in an
application. For example, this can be a view, a download, an editor, or a data
interface. Different use cases may involve the same entity and relationship sets in
RULE D4
{
"EMPLOYEE": {
"Name": "",
"Place": "",
"PROJECTS": [ {
"Title": "",
"Workload": 0 } ]
}}
Fig. 2.28 Aggregation of an ambiguous association as a field with a list of subobjects, including
relationship attributes
This means that one and the same entity can be subordinate at one time and
superordinate at another time in different document types. For example, for the project management data model in Fig. 2.1, there are two use cases:
First, the employee data is entered in an input mask according to the structure in
Fig. 2.23 on the level of individual employees. For this write access, it is more
efficient to store individual employee records as independent documents.
Second, all employees are reported per department with project workloads
including calculated financial expenditure in one application view. For this read
access, the transmission of a single document per department is more economical.
Therefore, for the sake of performance, a deliberate redundancy can be inserted by
serving both use cases with different document structures that use the same entity
and relationship sets. Therefore, in Fig. 2.29, another document type has been
RULE D6
RULE D7
{
"DEPARTMENT": {
"Designation": "",
"EMPLOYEES": {
"DEPARTMENT_HEAD": {
"Name": "",
"Place": "" },
"MEMBERS": [{
"Name": "",
"Place": "",
"PROJECTS": [ {
"Title": "",
"Workload": 0,
"Expense": 0 } ]
}]}
}}
Fig. 2.29 Document type DEPARTMENT with aggregation of the same entity set EMPLOYEE in
two different associations DEPARTMENT_HEAD and MEMBERS
defined for this purpose. This time, DEPARTMENT is the root entity, and
employees are aggregated. Now, a second, unique association of department to
employee has been added: the department head. Thus, a field named EMPLOYEE
would not be unique. This has been solved here by adding another structural level
under EMPLOYEES with two properties named after the DEPARTMENT_HEAD
and MEMBERSHIP relationship sets.
This section condenses our knowledge of data modeling into a formulaic action plan.
The design steps can be characterized as follows: First, during the requirements
analysis, the relevant information facts must be recorded in writing in a catalog. In
the further design steps, this list can be supplemented and refined in consultation
with the future users, since the design procedure is an iterative process. In the second
step, the entity and relationship sets are determined as well as their identification
keys and feature categories.
Then, generalization hierarchies and aggregation structures6 can be examined in
the third step. In the fourth step, the entity-relationship model is aligned with the
existing application portfolio so that the further development of the information
systems can be coordinated and driven forward in line with the longer-term corporate
goals. In addition, this step serves to avoid legacy systems as far as possible and to
preserve the enterprise value with regard to the data architecture.
The fifth step maps the entity-relationship model to an SQL and/or NoSQL
database. In this process, the explained mapping rules for entity sets and relationship
sets are used (cf. corresponding mapping rules for the relational, graph, and docu-
ment models, respectively). In the sixth step, the integrity and privacy rules are
defined. In the seventh step, the database design is checked for completeness by
6
Two important abstraction principles in data modeling are aggregation and generalization. Under
the title “Database Abstractions: Aggregation and Generalization,” the two database specialists
J.M. Smith and D.C.P. Smith already pointed this out in 1977 in Transactions on Database Systems.
Aggregation means the combination of entity sets to a whole; generalization means the generaliza-
tion of entity sets to a superordinate entity set.
developing important use cases (cf. Unified Modeling Language7) and prototyping
them with descriptive query languages.
The determination of an actual quantity structure as well as the definition of the
physical data structure takes place in the eighth step. This is followed by the physical
distribution of the data sets and the selection of possible replication options in the
ninth step. When using NoSQL databases, it must be weighed up here, among other
things, whether or not availability and failure tolerance should be given preference
over strict consistency (cf. CAP theorem in Sect. 4.5.1). Finally, performance testing
and optimization of data and access structures must be performed in the tenth step to
guarantee users from different stakeholder groups reasonable response times for their
application processes or data searches.
The recipe shown in Fig. 2.30 is essentially limited to the data aspects. In addition
to data, functions naturally play a major role in the design of information systems.
Thus, CASE tools (CASE = computer-aided software engineering) have emerged in
recent years to support not only database design but also function design.
7
The Unified Modeling Language or UML is an ISO-standardized modeling language for the
specification, construction, and documentation of software. An entity-relationship model can be
easily transformed into a class diagram and vice versa.
Bibliography
Atkinson, M., Bancilhon, F., DeWitt, D., Dittrich, K., Maier, D., Zdonik, S.: The object-oriented
database system manifesto. In: Deductive and Object-Oriented Databases. North-Holland,
Amsterdam (1990)
Bray, T.: The JavaScript Object Notation (JSON) Data Interchange Format. Internet Engineering
Task Force, Request for Comments RFC 8259 (2017)
Chen, P.P.-S.: The entity-relationship model – Towards a unified view of data. ACM Trans.
Database Syst. 1(1), 9–36 (1976)
Codd, E.F.: A relational model of data for large shared data banks. Commun. ACM. 13(6), 377–387
(1970)
Dutka, A.F., Hanson, H.H.: Fundamentals of Data Normalization. Addison-Wesley (1989)
Kemper, A., Eickler, A.: Datenbanksysteme – Eine Einführung. DeGruyter (2015)
Knauer, U.: Algebraic Graph Theory – Morphisms, Monoids and Matrices. De Gruyter, Berlin
(2019)
Marcus, D.A.: Graph Theory – A Problem Oriented Approach. The Mathematical Association of
America (2008)
Pezoa, F., Reutter, J.L., Suarez, F., Ugarte, M., Vrgoč, D.: Foundations of JSON Schema. In:
Proceedings of the 25th International Conference on World Wide Web, Republic and Canton of
Geneva, pp. 263–273 (2016)
Smith, J.M., Smith, D.C.P.: Database abstractions: aggregation and generalization. ACM Trans.
Database Syst. 2(2), 105–133 (1977)
3 Database Languages
In the previous chapter, we have seen how to model databases. To operate the
database, different stakeholders interact with it, as shown in Fig. 3.1.
Data architects define the database schema. They design an architecture to run
the database system and embed it into the existing landscape with all necessary
components. They also describe and document the data and structures. It makes
sense for them to be supported by a data dictionary system (see “Glossary”).
Database specialists, often called database administrators, install the database
server. For schema-oriented database systems (e.g., relational), they create the
database schema. For schema-free databases (e.g., document model), this step is
not necessary, because the schema is created implicitly by inserting appropriate
database objects. Based on this, large amounts of data can be imported into the
database. To do this, there are extract-transform-load (ETL) tools or powerful import
functionalities of the database software. To protect the data, administrators define
users, roles, and access rights and ensure regular backup of the database. For large
amounts of data, they ensure the performance and efficiency of the database system
by, for example, creating indexes, optimizing queries syntactically, or distributing
the database server across multiple computers.
Application programmers develop applications that allow users to insert, modify,
and delete data in the database. They also implement interfaces through which data
can be automatically exchanged with other databases.
Data analysts, who are also called data scientists if they are very highly
specialized, analyze databases in order to support data-based decisions. To do this,
they query data, evaluate it using statistical and/or soft computing-based methods
(see, e.g., fuzzy databases in Chap. 6), and visualize the results.
To successfully operate a database, a database language is necessary that can
cover the different requirements of the users. Query and manipulation languages for
databases have the advantage that one and the same language can be used to create
databases, assign user rights, or modify and evaluate data. In addition, a descriptive
language allows precise, reproducible interaction with the database without having
to program routines and procedures. Therefore, we will look at different database
languages in the following.
We start with a theoretical model for database languages. The relational algebra
provides a formal framework for database query languages. It defines a number of
algebraic operators that always apply to relations. Although most modern database
languages do not use those operators directly, they provide analogous
functionalities. However, they are only considered relationally complete languages
in terms of the relational model if the original potential of relational algebra is
retained.
Below, we will give an overview of the operators used in relational algebra,
divided into set operators and relational operators, on two sample relations R and
S. Operators work on either one or two tables and always output a new relation.
Fig. 3.2 Set union, set intersection, set difference, and Cartesian product of relations
Fig. 3.3 Relation operators: projection, selection, join, and division of a relation R by a sub-relation S (R ÷ S)
The following two sections provide a more detailed explanation of set and
relation operators of relational algebra with illustrative examples.
Since every relation is a set of records (tuples), multiple relations can be correlated
using set theory. However, it is only possible to form a set union, set intersection, or
set difference of two relations if they are union-compatible.
Union Compatibility
Two relations are union-compatible if they meet both of the following criteria: Both
relations have the same number of attributes and the data formats of the
corresponding attribute categories are identical.
Figure 3.4 shows an example: For each of two company clubs, a table has been
defined from an employee file, containing employee numbers, last names, and street
names. The two tables SPORTS_CLUB and PHOTO_CLUB are union-compatible:
They have the same number of attributes, with values from the same employee file
and therefore defined from the same range.
In general, two union-compatible relations R and S are combined by a set union R∪S, where all entries from R and all entries from S are entered into the resulting table. Identical records are automatically unified, since a distinction between tuples with identical attribute values in the resulting set R∪S is not possible.
The CLUB_MEMBERS table (Fig. 3.5) is a set union of the tables
SPORTS_CLUB and PHOTO_CLUB. Each result tuple exists in the
SPORTS_CLUB table, the PHOTO_CLUB table, or both of them. Club member
Howard is only listed once in the result table, since duplicate results are not
permitted in the unified set.
The other set operators are defined similarly: The set intersection R∩S of two union-compatible relations R and S holds only those entries found in both R and S. In our table excerpt, only employee Howard is an active member of both the SPORTS_CLUB and the PHOTO_CLUB.
The resulting set SPORTS_CLUB ∩ PHOTO_CLUB is a singleton, since exactly one person has both memberships.
Fig. 3.5 Set union of the two tables SPORTS_CLUB and PHOTO_CLUB
Union-compatible relations can also be subtracted from each other: The set
difference R\S is calculated by removing all entries from R that also exist in S. In
our example, a subtraction SPORTS_CLUB\PHOTO_CLUB would result in a
relation containing only the members Murphy and Stewart. Howard would be
eliminated, since he is also a member of the PHOTO_CLUB. The set difference
operator therefore allows us to find all members of the sports club who are not also part of the photo club.
The basic relationship between the set intersection operator and the set difference
operator can be expressed as a formula:
R ∩ S = R \ (R \ S)
The relation-based operators complement the set operators. A projection πa(R) with
the project operator π forms a subrelation of the relation R based on the attribute
names defined by a. For instance, given a relation R with the attributes (A,B,C,D),
the expression πA,C(R) reduces R to the attributes A and C. The attribute names in a
projection do not have to be in order; e.g., R′ := πC,A(R) means a projection of R =
(A,B,C,D) onto R′ = (C,A).
The first example in Fig. 3.7, πCity(EMPLOYEE), lists all places of residence
from the EMPLOYEE table in a single-column table without any repetitions. The
second example, πSub,Name(EMPLOYEE), results in a subtable with all department
numbers and names of the respective employees.
The select operator σ in an expression σF(R) extracts a selection of tuples from the
relation R based on the formula F. F consists of a number of attribute names and/or
value constants connected by comparison operators, such as <, >, or =, or by
logical operators, e.g., AND, OR, or NOT. σF(R) therefore includes all tuples from R
that meet the selection condition F.
This is illustrated by the examples for selection of tuples from the EMPLOYEE
table in Fig. 3.8: In the first example, all employees meeting the condition
City=Kent, i.e., living in Kent, are selected. The second example with the condition
“Sub=D6” picks out only those employees working in department D6. The third and
last example combines the two previous selection conditions with a logical connec-
tive, using the formula “City=Kent AND Sub=D6.” This results in a singleton
relation, since only employee Bell lives in Kent and works in department D6.
Of course, the operators of relational algebra as described above can also be
combined with each other. For instance, if we first do a selection for employees of
department D6 by σSub=D6(EMPLOYEE) and then project on the City attribute
using the operator πCity(σSub=D6(EMPLOYEE)), we get a result table with the two
towns of Stow and Kent.
The two projections from Fig. 3.7 yield:
πCity(EMPLOYEE): Stow, Kent, Cleveland
πSub,Name(EMPLOYEE): (D6, Stewart), (D3, Murphy), (D5, Howard), (D6, Bell)
Next is the join operator, which merges two relations into a single one. The join R|×|P S of the two relations R and S over the predicate P is a combination of all tuples from R with all tuples from S that meet the join predicate P. The join operator combines a Cartesian product with a selection over the predicate P, hence the symbol.
The join predicate contains one attribute from R and one from S. Those two
attributes are correlated by a comparison operator (<, >, or =) so that the relations
R and S can be combined. If the join predicate P uses the relational operator =, the
result is called an equi-join.
The join operator often causes misunderstandings which may lead to wrong or
unwanted results. This is mostly due to the predicate for the combination of the two
tables being left out or ill-defined.
For example, Fig. 3.9 shows two join operations with and without a defined join
predicate. By specifying EMPLOYEE |×|Sub=D#DEPARTMENT, we join the
EMPLOYEE and DEPARTMENT tables by expanding the employee information
with their respective departments.
Should we forget to define a join predicate in the example from Fig. 3.9 and
simply specify EMPLOYEE × DEPARTMENT, we get the Cartesian product of the
two tables EMPLOYEE and DEPARTMENT. This is a rather meaningless combi-
nation of the two tables, since all employees are juxtaposed with all departments,
resulting in combinations of employees with departments they are not actually part
of (see also the COMPETITION table in Fig. 3.6).
As shown by the examples in Fig. 3.9, the join operator |×| with the join predicate
P is merely a limited Cartesian product.
In fact, a join of two tables R and S without a defined join predicate P expresses the Cartesian product of R and S, i.e., for an empty predicate P = {}:
R |×|P={} S = R × S
This general formula demonstrates that each join can be expressed as a Cartesian product followed by a selection: R |×|P S = σP(R × S).
Referring to the example from Fig. 3.9, we can calculate the intended join EMPLOYEE |×|Sub=D# DEPARTMENT in two steps: First, we generate the Cartesian product of the two tables EMPLOYEE and DEPARTMENT. Then all entries of the preliminary result table meeting the join predicate Sub=D# are determined using the selection σSub=D#(EMPLOYEE × DEPARTMENT). This gives us the same tuples as calculating the join EMPLOYEE |×|Sub=D# DEPARTMENT directly (see the tuples marked in yellow in Fig. 3.9).
Fig. 3.9 Join of two relations with and without a join predicate
R (table of employees and the projects they are assigned to): (E1, P1), (E1, P2), (E1, P4), (E2, P1), (E2, P2), (E4, P2), (E4, P4)
S (project combination): P2, P4
R' := R ÷ S: E1, E4, i.e., all employees working on both projects P2 and P4
Completeness Criterion
A database language is considered relationally complete if it enables at least the set
operators set union, set difference, and Cartesian product as well as the relation
operators projection and selection.
This is the most important criterion for assessing a language’s suitability for
relational contexts. Not every language working with tables is relationally complete.
If it is not possible to combine multiple tables via their shared attributes, the
language is not equivalent to relational algebra and can therefore not be considered
relationally complete.
Relational algebra is the foundation for the query part of relational database
languages. Of course, it is also necessary to be able to not only analyze but also
manipulate tables or individual parts. Manipulation operations include, among
others, insertion, deletion, or changes to tuple sets. Database languages therefore
need the following functions in order to be practically useful:
The definition of relational algebra has given us the formal framework for
relational database languages. However, this formal language is not per se used in
practice; rather, it has been a long-standing approach to try and make relational
database languages as user-friendly as possible. Since the algebraic operators in
their pure form are beyond most database users, they are represented by more
accessible language elements. The following sections will give examples from
SQL in order to illustrate this.
In the 1970s, the language SEQUEL (Structured English QUEry Language) was
created for IBM’s System R, one of the first working relational database systems.
The concept behind SEQUEL was to create a relationally complete query language
based on English words, such as “select,” “from,” “where,” “count,” “group by,”
etc., rather than mathematical symbols. A derivative of that language named SQL
(Structured Query Language) was later standardized first by ANSI and then interna-
tionally by ISO. For years, SQL has been the leading language for database queries
and interactions.
A tutorial for SQL can be found on the website accompanying this book, www.sql-nosql.org. The short introduction given here covers only a small part of the
existing standards; modern SQL offers many extensions, e.g., for programming,
security, object orientation, and analysis.
Before we can query data, we must be able to enter data into the database.
Therefore, we will first start with how to create a database schema starting from
the table structure and fill it with data.
SQL provides the CREATE TABLE command for defining a new table. The
EMPLOYEE table would be specified as follows. The first column, which consists
of six characters, is called E# and cannot be empty; The second column called Name
can contain up to 20 characters, and so on.1
1
The column name E# without double quotes is used here for illustrative purposes. In standard
SQL, unquoted object names may contain only letters (A–Z and a–z), numbers (0–9), and the
underscore character. Thus, a column name such as E# is not a legal name unless you enclose it in
double quotes (called a quoted identifier in the SQL standard), but if you do that, you have to
reference it as “E#” everywhere it is referenced. The names can also be modified to something like
E_ID if readers would like to run the examples throughout the book.
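A sketch of such a table definition could look as follows; the data types for the first two columns follow the description above, while the remaining lengths are illustrative, and the quoted identifier "E#" follows the remark in footnote 1:

CREATE TABLE EMPLOYEE (
  "E#"   CHARACTER(6) NOT NULL,  -- employee number, six characters, must not be empty
  Name   VARCHAR(20),            -- up to 20 characters
  Street VARCHAR(30),
  City   VARCHAR(40),
  PRIMARY KEY ("E#") )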
The SQL language allows the definition of attributes (characteristics) and tables (data definition language, DDL). The SQL standard specifies various data types, such as CHARACTER(n) or VARCHAR(n) for character strings, NUMERIC or INTEGER for numbers, and DATE or TIMESTAMP for temporal data.
In practice, the INSERT command of SQL is rather suitable for modest data
volumes. For larger data volumes (cf. big data), SQL database systems often offer
special NoSQL language extensions that support efficient loading of large data
volumes (so-called bulk loads).2 In addition, extract-transform-load (ETL) tools
and application programming interfaces (API) exist for this purpose.
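A single record can be added with an INSERT statement, sketched here with illustrative values:

INSERT INTO EMPLOYEE ("E#", Name, Street, City)
VALUES ('E20', 'Kelly', 'Market Ave S', 'Canton')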
Existing tables can be manipulated using UPDATE statements:
2
Examples are the LOAD command in MySQL and the COPY command in PostgreSQL.
UPDATE EMPLOYEE
SET City = 'Cleveland'
WHERE City = 'Cuyahoga Heights'
This example replaces the value Cuyahoga Heights for the City attribute with the
new name Cleveland in all matching tuples of the EMPLOYEE table. The UPDATE
manipulation operation is set-based and can edit a multi-element set of tuples.
The content of entire tables or parts of tables can be removed with the help of
DELETE statements:
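A sketch of such a statement, removing all employees living in a particular city:

DELETE FROM EMPLOYEE
WHERE City = 'Cleveland'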
DELETE statements usually affect sets of tuples if the selection predicate applies
to multiple entries in the table. Where referential integrity is concerned, deletions can
also impact dependent tables.
As described in Sect. 1.2.2, the basic structure of SQL looks like this:
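In outline, a query names the attributes to be output, the tables involved, and an optional selection condition:

SELECT    selected attributes
FROM      tables to be searched
WHERE     selection condition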
In the following, we will look at the individual relational operators and their
implementation in SQL.
Projection
The SELECT clause corresponds to the project operator of relational algebra, in that
it defines a list of attributes. In SQL, the equivalent of the project operator πCity(EMPLOYEE) as shown in Fig. 3.7 is simply

SELECT City
FROM EMPLOYEE;
The result table is a single-column table with the localities Stow, Kent, Cleveland,
and Kent, as shown in Fig. 3.11 to the right.
Correctly, it must be added here that the result table of this query is not a relation at all in the sense of the relational model, since every relation is a set by definition and hence does not allow duplicates. Since SQL, unlike relational algebra, does not eliminate duplicates, the keyword DISTINCT must be added to the SELECT clause (cf. Fig. 3.11).
Cartesian Product
The FROM clause lists all tables to be used. For instance, the Cartesian product of
EMPLOYEE and DEPARTMENT is expressed in SQL as
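A minimal sketch of such a formulation simply lists both tables in the FROM clause:

SELECT *
FROM EMPLOYEE, DEPARTMENT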
This command generates the cross-product table from Fig. 3.9, similar to the
equivalent operators
EMPLOYEE |×|P={} DEPARTMENT
and
EMPLOYEE × DEPARTMENT:
Join
By setting the join predicate “Sub=D#” in the WHERE clause, we get the equi-join
of the EMPLOYEE and DEPARTMENT tables in SQL notation:
SELECT E#,Name,Street,City,Sub,D#,Department_Name
FROM EMPLOYEE, DEPARTMENT
WHERE Sub=D#
An equivalent formulation uses the explicit join syntax:

SELECT *
FROM EMPLOYEE
JOIN DEPARTMENT
ON Sub=D#
An asterisk (*) in the SELECT clause means that all attributes of the tables involved are selected, i.e., the result table contains the attributes E#, Name, Street, City, and Sub (Subordinate) of EMPLOYEE as well as D# and Department_Name of DEPARTMENT.
Selection
Qualified selections can be expressed by separate statements in the WHERE clause
being connected by the logical operators AND or OR. The SQL command for the
selection of employees σCity=Kent AND Sub=D6(EMPLOYEE) as shown in Fig. 3.8
would be
SELECT *
FROM EMPLOYEE
WHERE City='Kent' AND Sub='D6'
The WHERE clause contains the desired selection predicate. Executing the above
query would therefore give us all information of the employee Bell from Kent
working in department D6.
Union
The set-oriented operators of relational algebra find their equivalent in the SQL standard. For example, if one wants to unite the union-compatible tables SPORTS_CLUB and PHOTO_CLUB, this is done in SQL with the keyword UNION:

SELECT *
FROM SPORTS_CLUB
UNION
SELECT *
FROM PHOTO_CLUB;
Since the two tables are union-compatible, the results table contains all sports and
photo club members, eliminating duplicates.
Difference
If you want to find all sports club members who are not in the photo club at the same time, the query is done with the difference operator EXCEPT:

SELECT *
FROM SPORTS_CLUB
EXCEPT
SELECT *
FROM PHOTO_CLUB;
Intersection
For union-compatible tables, intersections can be formed. If you are interested in
members who participate in both the sports club and the photo club, the INTER-
SECT keyword comes into play:
SELECT *
FROM SPORTS_CLUB
INTERSECT
SELECT *
FROM PHOTO_CLUB;
In addition to the common operators of relational algebra, SQL also contains built-in
functions that can be used in the SELECT clause.
Aggregate Functions
These include the aggregate functions which calculate a scalar value based on a set,
namely, COUNT for counting, SUM for totaling, AVG for calculating the average,
MAX for determining the maximum, and MIN for finding the minimum value.
For example, all employees working in department D6 can be counted. In SQL,
this request is as follows:
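A possible formulation is sketched below; the department number of an employee is held in the foreign key attribute Sub:

SELECT COUNT(*)
FROM EMPLOYEE
WHERE Sub = 'D6'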
The result is a one-element table with a single value 2, which according to the
table excerpt in Fig. 3.7 stands for the two persons Stewart and Bell.
Grouping
The results of aggregations can also be grouped by values of variables. For example,
all employees working in each department can be counted. In SQL, this prompt is as
follows:
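A sketch of this grouping query; the final ORDER BY line produces the descending sort described below:

SELECT   Sub, COUNT(Sub)
FROM     EMPLOYEE
GROUP BY Sub
ORDER BY COUNT(Sub) DESC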
The result is a table with one row per department number together with the
corresponding number of employees.3 With the last line of the statement, the result
is sorted by the number of employees in descending order.
Nested Queries
It is allowed and sometimes necessary to formulate another SQL call within an SQL
statement. In this context, one speaks of nested queries. Such queries are useful, for
example, when searching for the employee with the highest salary:
3
COUNT(*) can also be used, the difference being COUNT(*) counts all rows that pass any filters
while COUNT(column_name) only counts rows where the column name specified is not NULL,
cf. Sect. 3.3.4.
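A sketch of such a nested query; the Salary attribute is assumed here for illustration, as it does not appear in the table excerpts shown so far:

SELECT E#, Name
FROM   EMPLOYEE
WHERE  Salary >= ALL
       (SELECT Salary
        FROM   EMPLOYEE)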
This statement contains another SQL statement within the WHERE clause to select the salaries of all employees. This is called an inner SQL expression or subquery. In the outer SQL statement, the EMPLOYEE table is consulted again to get the employee number and name of the employee who earns the highest salary. The keyword ALL means that the condition must be valid for all results of the subquery.
The existential quantifier of predicate logic is expressed in the SQL standard by the keyword EXISTS. This keyword evaluates to "true" in an SQL query if the subsequent subquery selects at least one element or row.
As an example of a query with an EXISTS keyword, we can refer to the project
affiliation INVOLVED, which shows which employees work on which projects. If
we are interested in the employees who are not doing project work, the SQL
statement is as follows:
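A sketch of this query, using the table aliases e and i mentioned below:

SELECT Name, Street, City
FROM   EMPLOYEE e
WHERE  NOT EXISTS
       (SELECT *
        FROM   INVOLVED i
        WHERE  i.E# = e.E#)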
In the outer statement, the names and addresses of the employees who do not belong to a project are selected from the EMPLOYEE table. For this purpose, a
subquery is formulated to get all employees’ project affiliations (relations). In the
exclusion procedure (NOT EXISTS), we obtain the desired employees who do not
perform project work.
In this query, we can see once again how useful substitute names (aliases) are
when formulating SQL statements.
The work with databases regularly entails situations where individual data values for
tables are not (yet) known. For instance, it may be necessary to enter a new employee
in the EMPLOYEE table before their full address is available. In such cases, instead
of entering meaningless or maybe even wrong filler values, it is advisable to use null
values as placeholders.
A null value represents an as yet unknown data value within a table column.
Null values, illustrated in Fig. 3.12 as “?”, must not be confused with the number
0 (zero) or the value "" (blank). These two values express specific situations in relational databases.
SELECT *
FROM EMPLOYEE
WHERE City = 'Kent'
UNION
SELECT *
FROM EMPLOYEE
WHERE NOT City = 'Kent'
Truth tables of three-valued logic (1 = TRUE, ? = UNKNOWN, 0 = FALSE):

OR  | 1  ?  0        AND | 1  ?  0        NOT |
1   | 1  1  1        1   | 1  ?  0        1   | 0
?   | 1  ?  ?        ?   | ?  ?  0        ?   | ?
0   | 1  ?  0        0   | 0  0  0        0   | 1
The query in Fig. 3.12, which selects all employees from the EMPLOYEE table
who live either in Kent or not in Kent, returns a result table containing only a subset
of the employees in the original table. The reason is that some places of residence of
employees are unknown. Therefore, the truth of both comparisons, City='Kent' and
NOT City='Kent', is unknown and therefore not true.
This clearly goes against the conventional logical assumption that a union of the
subset “employees living in Kent” with its complement “employees NOT living in
Kent” should result in the total set of all employees.
Sentential logic with the values TRUE, FALSE, and UNKNOWN is commonly
called three-valued logic for the three truth values a statement can take. This logic is
less known and poses a special challenge for users of relational databases, since
analyses of tables with null values are hard to interpret. In practice, null values are
therefore largely avoided. Sometimes, DEFAULT values are used instead. For
instance, the company address could be used to replace the yet unknown private
addresses in the EMPLOYEE table from our example. The function COALESCE
(X, Y) replaces each null value in X with the value Y. If null values have
to be allowed, attributes can be checked for unknown values with specific compari-
son operators, IS NULL or IS NOT NULL, in order to avoid unexpected side effects.
Foreign keys are usually not supposed to take null values; however, there is an
exception for foreign keys under a certain rule of referential integrity. For instance,
the deletion rule for the referenced table DEPARTMENT can specify whether
existing foreign key references should be set to NULL or not. The referential
integrity constraint “set NULL” declares that foreign key values are set to NULL
if their referenced tuple is deleted. For example, deleting the tuple (D6, Accounting)
from the DEPARTMENT table in Fig. 3.12 with the integrity constraint rule “set
NULL” results in null values for the foreign keys of employees Stewart and Bell in
the EMPLOYEE table. For more information, see also Sect. 4.3.1.
Null values also exist in graph-based languages. As we will see in the following
section, handling null values with IS NULL and COALESCE is done in the Cypher
language as well, which we will cover in detail in the next section.
Graph-based database languages were first developed toward the end of the 1980s.
The interest in high-performance graph query languages has grown with the rise of
the Internet and social media, which produce more and more graph-structured data.
Graph databases store data in graph structures and provide options for data
manipulation on a graph transformation level. As described in Sect. 1.4.1, graph
databases consist of property graphs with nodes and edges, with each graph storing a
set of key-value pairs as properties. Graph-based database languages build on that
principle and enable the use of a computer language to interact with graph structures
in databases and program the processing of those structures.
Like relational languages, graph-based languages are set-based. They work with
graphs, which can be defined as sets of vertices and edges or paths. Graph-based
languages allow for filtering data by predicates, similar to relational languages; this
filtering is called a conjunctive query. Filtering a graph returns a subset of nodes
and/or edges of the graph, which form a partial graph. The underlying principle is
called subgraph matching, the task of finding a partial graph matching certain
specifications within a graph. Graph-based languages also offer features for
aggregating sets of nodes in the graph into scalar values, e.g., counts, sums, or
minimums.
In summary, the advantage of graph-based languages is that the language
constructs directly target graphs, and thus the language definition of processing
graph-structured data is much more direct. As a language for graph databases, we
focus on the graph-based language Cypher in this work.
Cypher is a declarative query language for graph databases. It provides pattern
matching on property graphs. It was developed by Andrés Taylor in 2011 at Neo4J,
Inc. With openCypher, the language was made available to the general public as an
open-source project in 2015. It is now used in more than ten commercial database
systems. In 2019, the International Organization for Standardization (ISO) decided
to further develop openCypher into an international standard under the name GQL
by 2023.
The graph database Neo4J4 (see also Cypher tutorial and Travelblitz case study
with Neo4J on www.sql-nosql.org) uses the language Cypher to support a language
interface for the scripting of database interactions.
Cypher is based on a pattern matching mechanism. Cypher has language
commands for data queries and data manipulation (data manipulation language,
DML); however, the schema definition in Cypher is done implicitly, i.e., node and
edge types are defined by inserting instances of them into the database as actual
specific nodes and edges.
Cypher also includes direct linguistic elements for security mechanisms, similar
to relational languages, with statements such as GRANT and REVOKE (see Sect.
4.2). Below, we will take a closer look at the Cypher language.
4
http://neo4j.com
Schema definition in Cypher is done implicitly, i.e., abstract data classes (metadata)
such as node and edge types or attributes are created by using them in the insertion of
concrete data values. The following example inserts new data into the database:
CREATE
(p:Product {
productName:'Alice’s Adventures in Wonderland'})
-[:PUBLISHER]->
(o:Organization {
name:'Macmillan'})
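// Update: locate an existing node by a property value and set a new property on it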
MATCH (p:Product)
WHERE p.productName = 'Alice’s Adventures in Wonderland'
SET p.unitPrice = 13.75
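// Delete: remove the product node together with all of its incoming and outgoing edges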
MATCH
()-[r1]->(p:Product),
(p)-[r2]->()
WHERE p.productName = 'Alice’s Adventures in Wonderland'
DELETE r1, r2, p
Even though Cypher operates on graphs, property graphs can be mapped con-
gruently to relations. Therefore, it is possible to analyze the relational operators of
Cypher.
MATCH (p:Product)
WHERE p.productName = 'Alice’s Adventures in Wonderland'
RETURN p
The RETURN clause can output either vertices or property tables. The return of
entire nodes is similar to “SELECT *” in SQL. Cypher can also return properties as
attribute values of nodes and edges in the form of tables:
MATCH (p:Product)
WHERE p.unitPrice > 55
RETURN p.productName, p.unitPrice
ORDER BY p.unitPrice
This query includes a selection, a projection, and a sorting. The MATCH clause
defines a pattern matching filtering the graph for the node of the “Product” type; the
WHERE clause selects all products with a price greater than 55; and the RETURN
clause projects those nodes on the product name and price, with the ORDER BY
clause sorting the products by price.
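The Cartesian product of two node types can also be formed. A minimal sketch using the node labels Product and Category (the Category label and its property name are assumed here by analogy with the join example that follows):

MATCH (p:Product), (c:Category)
RETURN p.productName, c.categoryName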
This command lists all possible combinations of product names and category
names. A join of nodes, i.e., a selection on the Cartesian product, is executed graph-
based by matching path patterns by edge types:
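A sketch of such a pattern, restricting the combinations to pairs connected by a PART_OF edge:

MATCH (p:Product)-[:PART_OF]->(c:Category)
RETURN p.productName, c.categoryName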
For each product, this query lists the category it belongs to, by only considering
those product and category nodes connected by edges of the PART_OF type. This
equals the inner join of the “Product” node type with the “Category” node type via
the edge type PART_OF.
In Cypher, there are built-in functions which can be applied to properties and data
sets. These functions, as a supplement to selection, projection, and join, are central
for the usability in practice. An important category for data analysis are the aggregate
functions.
Aggregate Functions
An important category of built-in functions for data analysis are the aggregating
functions like COUNT, SUM, MIN, MAX, and AVG, which Cypher supports.
Suppose we want to generate a list of all employees, together with the number of
subordinates. To do this, we match the pattern MATCH (e:Employee)<-[:
REPORTS_TO]-(sub) and get a list of employees where the number of subordinates
is greater than zero:
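A sketch of this aggregation (the property name employeeID follows the OPTIONAL MATCH example below):

MATCH (e:Employee)<-[:REPORTS_TO]-(sub)
RETURN e.employeeID, COUNT(sub.employeeID)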
There are node types where only a subset of the nodes has an edge of a specific
edge type. For instance, not every employee has subordinates, i.e., only a subset of
the nodes of the “Employee” type has an incoming REPORTS_TO type edge.
An OPTIONAL MATCH clause allows to list all employees including those
without subordinates:
MATCH (e:Employee)
OPTIONAL MATCH (e)<-[:REPORTS_TO]-(sub)
RETURN e.employeeID, COUNT (sub.employeeID)
With OPTIONAL MATCH, attributes of pattern elements that could not be matched remain empty (NULL). Cypher is based on three-valued logic. Handling null values with IS
NULL and COALESCE is analogous to SQL (see Sect. 3.3.4). To filter records with
null values, the additional code WHERE sub.employeeID IS NULL can be used.
With the function COALESCE(sub.employeeID, "not available"), the null values
can be replaced.
Other aggregates are sum (SUM), minimum (MIN), and maximum (MAX). An
interesting non-atomic aggregate is COLLECT, which generates an array from the
available data values. If COLLECT is used instead of COUNT in the example below, the query lists all
employees by first name and abbreviated last name, along with a list of the employee
numbers of their subordinates.
Data Operators
Cypher supports functions on data values. The following query returns the full first
name and the last name initial for each employee, along with the number of
subordinates:
MATCH (e:Employee)
OPTIONAL MATCH (e)<-[:REPORTS_TO]-(sub)
RETURN
e.firstName + " "
+ LEFT(e.lastName, 1) + "." as name,
COUNT(sub.employeeID)
The + operator can be applied to "text" type data values to concatenate them. The
LEFT function returns the first n characters of a text.
Fig. 3.14 Recursive relationship as entity-relationship model and as graph with node and edge types
with recursive
r_path (partID, hasPartId, length) -- CTE definition
as (
select partID, hasPartId, 1 -- Initialization
from part
union all
select r.partID, p.hasPartId, r.length+1
from part p
join r_path r -- Recursive join of CTE
on (r.hasPartId = p.partID)
)
select
distinct r_path.partID, r_path.hasPartId, r_path.length
from r_path -- Selection via recursively defined CTE
This query returns a list of all subparts for a part, plus the degree of nesting, i.e.,
the length of the path within the tree from the part to any (potentially indirect)
subpart.
A regular path query in a graph-based language allows for simplified filtering of
path patterns with regular expressions. For instance, the regular expression HAS*
using a Kleene star (*) defines the set of all possible concatenations of connections
with the edge type HAS (called the Kleene hull). This makes defining a query for all
indirectly connected vertices in a graph-based language much easier. The example
below uses the graph-based language Cypher to declare the same query for all direct
and indirect subparts as the SQL example above, but in only two lines:
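A sketch of this query, assuming the node type Part and the edge type HAS from Fig. 3.14:
MATCH (p:Part)-[:HAS*]->(sub:Part)
RETURN p.partID, sub.partID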
In addition to the data manipulation we know from SQL, Cypher also supports
operations on paths within the graph. In the following example, an edge of the type
BASKET is generated for all product pairs that have been ordered together. This
edge shows that those two products have been included in at least one order together.
Once that is done, the shortest connection between any two products through shared
orders can be determined with a shortestPath function:
MATCH
(p1:Product)<--(o:Order)-->(p2:Product)
CREATE
(p1)-[:BASKET{order:o.orderID}]->(p2),
(p2)-[:BASKET{order:o.orderID}]->(p1);
MATCH path =
shortestPath(
(p1:Product)-[b:BASKET*]->(p2:Product))
RETURN
p1.productName, p2.productName, LENGTH(path),
EXTRACT(r in RELATIONSHIPS(path)| r.order)
In addition to the names of the two products, the RETURN clause also contains
the length of the shortest path between them and a list of the order numbers indirectly
connecting them.
It should be noted here that, while Cypher offers some functions for analyzing
paths within graphs (including the Kleene hull for edge types), it does not support the
full range of Kleene algebra for paths in graphs, as required in the theory of graph-
based languages. Nevertheless, Cypher is a language well-suited for practical use.
3.5 Document-Oriented Language MQL
Document databases like MongoDB5 are schema-free. This does not mean that their
records do not follow a schema. A schema is always necessary to structure records.
Schema freedom simply means that database users are free to use any schema they
want for structuring without first reporting it to the database system and without
requiring that the schemas of records within a collection be uniform. It is thus a
positive freedom to use any schema within a collection. The database schema in a
document database is an implicit schema.
Therefore, all that is needed to populate a document database schema is a JSON
document. We propose that an entity-relationship model be used to structure the
JSON records, as described in Sect. 2.5. For example, to insert a document about an
employee into the database according to the structure in Fig. 2.25, we use the
insertOne() method on the EMPLOYEE collection as follows:
db.EMPLOYEES.insertOne(
{ "EMPLOYEE":
{ "Name": "Steward",
"City": "Stow",
"DEPARTMENT": { "Designation": "Finance" },
"PROJECTS":
[ { "Title": "DWH", "Workload": 0.3 },
{ "Title": "Strat", "Workload": 0.5 } ] } }
)
If the used collection does not exist yet, it will be created implicitly. To insert
multiple documents, the method insertMany() can be applied.
To adapt an existing document, the method updateOne() can be applied. In the
following example for employee Steward, the department is changed to “IT”:
5 https://www.mongodb.com
db.EMPLOYEES.updateOne(
{ "EMPLOYEE.Name": "Steward" },
{ $set: {
"EMPLOYEE.DEPARTMENT.Designation": "IT" }})
The updateOne() method can use several update operators. The $set operator sets
a new value for a field or adds the field if it does not already exist. With $unset, a
field can be removed; with $rename, it is renamed. Other operators are available,
such as $inc, which increments the field value by the specified value.
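For instance, $inc could be used to raise the workload of employee Steward's first project; the field path is assumed to follow the document structure shown above:
db.EMPLOYEES.updateOne(
{ "EMPLOYEE.Name": "Steward" },
{ $inc: { "EMPLOYEE.PROJECTS.0.Workload": 0.1 } } )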
The updateOne() method changes the first document that matches the filter criterion. Similar to
insertion, multiple documents can be changed at once with the updateMany()
method.
The deleteOne() method is used to delete a document that matches a filter
criterion. If there are several documents that match the filter, the first one is deleted.
To delete several documents at once, the deleteMany() method can be used. In the
following example, we delete all documents related to employees named “Smith.”
db.EMPLOYEES.deleteMany(
{ "EMPLOYEE.Name": "Smith" } )
Once we have inserted data into the database, we can query that data. The
relational operators, which exist in a similar form for MQL, are used for this purpose.
Selection
To select employees, the find() method is applied to the EMPLOYEES collection. In
the following example, the filter "City = Kent" is given as a parameter in JSON
syntax.
db.EMPLOYEES.find({
"EMPLOYEE.City": "Kent"})
Different filter criteria can be combined with the Boolean operators $and, $or, and
$not, even over multiple attributes, as shown in the following example. Here,
employees are selected who live in Kent and work in IT.
db.EMPLOYEES.find(
{$and: [
{"EMPLOYEE.city": "Kent"},
{"EMPLOYEE.DEPARTMENT.Designation": "IT"
} ] } )
Projection
Document sets (collections) can be projected to attributes. For this purpose, a second
parameter can be given to the find() method, specifying a list of properties of the
document to be returned by the database. In the following example, the fields name
and city are shown for the employees from Kent. This is called an inclusion
projection.
db.EMPLOYEES.find({
"EMPLOYEE.City": "Kent"},
{_id:0,
"EMPLOYEE.Name": 1,
"EMPLOYEE.City": 1})
Join
By definition, documents are complete with respect to a subject. For certain
evaluations, it can nevertheless be meaningful to join document sets. This is basi-
cally possible in MQL with the $lookup aggregation. The operation performs a kind
of left outer join because all documents of the parent (left) collection are returned,
even if they do not match the filter criteria. However, the operation is not performant
and should be used with caution.
The following example associates the employees (according to Fig. 2.25) with the
departments (according to Fig. 2.29) using the “Designation” field, thus adding the
name of the department head. We are looking for the departments whose name
matches the department of the corresponding employee. The statements in the
pipeline field are used here to modify the connected documents with the $project
operator. The department documents are projected to a single “Name” field, which
stores the name of the department head. The $ operator before the property name in
the $project stage eliminates the nested field properties and reduces the JSON
structure to the value of the corresponding field.
db.EMPLOYEES.aggregate([{
$lookup: {
from: "DEPARTMENTS",
localField: "EMPLOYEE.DEPARTMENT.Designation",
foreignField: "DEPARTMENT.Designation",
pipeline: [
{ $project: { _id:0, "Name":
"$DEPARTMENT.EMPLOYEES.DEPARTMENT_HEAD.Name"
}}],
as: "EMPLOYEE.DEPARTMENT.Head"
}}]);
{ _id: ObjectId("62aa3c16c1f35d9cedb164eb"),
EMPLOYEE:
{ Name: 'Murphy',
Place: 'Kent',
DEPARTMENT: { Designation: 'IT',
'Head': [ { Name: 'Miller' } ] },
PROJECTS:
[ { Title: 'WebX', Workload: 0.55 },
{ Title: 'ITorg', Workload: 0.45 } ] } }
The $lookup operator always returns the associated documents and values as
an array, even if there is only one corresponding document. Therefore, this value is
enclosed in square brackets. With further operations, this single value could be
unpacked.
Cartesian Product
Similarly, we can use the $lookup operation to derive a kind of Cartesian product by
omitting the join predicates localField and foreignField. This cross join is only listed
here for the sake of completeness. The operation is inefficient even for small
data sets.
For example, the names of all department heads could be stored in a new field
“Bosses”:
db.EMPLOYEES.aggregate([{
$lookup: {
from: "DEPARTMENTS",
pipeline: [
{ $project: { _id:0, "Name":
"$DEPARTMENT.EMPLOYEE.DEPARTMENT.Designation"
}}],
as: "EMPLOYEES.Bosses"
}}]);
Union
To unify collections as sets of documents, the aggregation operator $unionWith is
available. In the following example, all documents of the collection
SPORTS_CLUB are united with all documents of the collection PHOTO_CLUB.
However, it is not a true set operator because $unionWith does not remove the
duplicates. It is more like the SQL command UNION ALL.
db.SPORTS_CLUB.aggregate([
{ $unionWith: { coll: "PHOTO_CLUB"} }
])
Similar operators for intersections or difference sets at the collection level do not
exist. We see that MQL is relationally incomplete, since basic set operators are
missing. However, MQL provides a rich set of built-in functions, some of which we
look at below.
In MQL terminology, the term aggregation is used in a more general sense. There-
fore, aggregation functions such as $count or $sum are called aggregation
accumulators in MQL to distinguish them from other aggregations such as $lookup
or $unionWith.
Accumulation Aggregations
With the $count accumulator, we can count documents in a collection. In the
following, we count the number of employees:
db.EMPLOYEES.aggregate([ {
$count: "Result"
} ] )
Grouping
An accumulator aggregation such as sum, count, or average can be grouped using
a variable. For each value of this variable, a corresponding partial result is calculated.
In the following example, we ask for the number of employees per location:
db.EMPLOYEES.aggregate( [ {
$group: {
_id: "$employee.location",
Number_of_employees: { $count: { } }
} } ] )
The output of this query is one JSON object per location, with the name of the
location in the "_id" field and the number of employees in the
"Number_of_employees" field, provided that records for employees are stored in the
database analogous to Fig. 1.3.
The $group aggregation can be used with all the above aggregation accumulators
like $sum, $avg, $count, $max, and $min. To output the result as a compact JSON
string in the Mongo Shell (mongosh), the aggregation can be wrapped in
JSON.stringify() and toArray():
JSON.stringify(
db.EMPLOYEES.aggregate( [ {
$group: {
_id: "$EMPLOYEE.City",
count: { $count: { } }
} } ] )
.toArray())
For the sum of the workloads stored within an array of projects for the employees,
in addition to the grouping, the unwinding of the array structure with $unwind is
necessary:
db.EMPLOYEES.aggregate([
{ $unwind: "$EMPLOYEE.PROJECTS" },
{ $group: {
_id:"$EMPLOYEE.DEPARTMENT.Designation",
"s": {"$sum":"$EMPLOYEE.PROJECTS.Workload"}
}}
])
In MQL, field values can remain explicitly unknown. For this purpose, the keyword
"null" is used (the lower case is relevant). A special feature of the schema-free
document model is that even the omission of an object field can logically represent a null
value.
In a three-valued first-order logic, it is the same whether one specifies a field with
a property whose value is “null,” which means explicitly unknown, or whether the
field is omitted altogether and is thus implicitly unknown. However, in three-valued
second-order logic, the former is a known unknown, but the latter is an unknown
unknown. MQL treats both variants as equivalent.
As an example, let’s look at the documents for employees Smith, Smyth, and
Smythe in Fig. 3.15. While we know the Place is Basel for Smith, it is unknown for
the other two. For Smyth, we explicitly mark this as a null value in the object field
with property “Place”; for Smythe, we omit the field with property “Place.” This now
leads to different ways of filtering with null values.
• In query (1) in Fig. 3.15, we use “null” directly as a filter criterion. This means
that the criterion itself is unknown, and therefore it is not applied at all—so all
employees are returned.
• In query (2) in Fig. 3.15, we filter on whether the “Place” property is equal to
“null.” In this case, documents are returned that either have such a field with
explicit value “null”—as with Smyth—or do not have such a field, as with
Smythe.
• In query (3) in Fig. 3.15, we explicitly select for values of data type 10 (BSON
Type Null). Thus, only documents are returned that have a field with property
“Place” and an explicit value “null,” like Smyth.
• In query (4) in Fig. 3.15, we ask for documents in the collection EMPLOYEES
for which no field with property “Place” exists. This is only true for Smythe.
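In MQL, the four queries could look roughly as follows; the field path EMPLOYEE.Place is assumed as in Fig. 3.15:
// (1) null as the entire filter criterion
db.EMPLOYEES.find(null)
// (2) equality comparison with null
db.EMPLOYEES.find({ "EMPLOYEE.Place": null })
// (3) selection of BSON type 10 (Null)
db.EMPLOYEES.find({ "EMPLOYEE.Place": { $type: 10 } })
// (4) selection of documents without the field
db.EMPLOYEES.find({ "EMPLOYEE.Place": { $exists: false } })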
3.6 Database Programming with Cursors
The query and manipulation languages for databases can be used not only interac-
tively as stand-alone languages but also embedded in a procedural
programming language (host language). For embedding in a programming environ-
ment, however, some precautions have to be taken, which we will discuss in more
detail here.
Cursor Concept
A cursor is a pointer that can traverse a set of records in a sequence specified by the
database system. Since a sequential program cannot process an entire set of records
in one fell swoop, the cursor concept allows a record-by-record, iterative approach.
In the following, we take a closer look at the embedding of SQL, Cypher, and
MQL in procedural languages with cursors.
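In embedded SQL, a cursor is declared on a SELECT statement. A sketch of such a declaration, with illustrative names:
DECLARE c_employees CURSOR FOR
SELECT Name, City
FROM EMPLOYEE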
This allows the individual records of a table to be processed one tuple at a time. If
necessary, it is also possible to modify some or all data values of the current tuple. If
the table has to be processed in a specific sequence, the above declaration must be
amended by an ORDER BY clause.
Multiple cursors can be used within one program for navigation reasons. They
have to be declared and then activated and deactivated by OPEN and CLOSE
commands. The actual access to a table and the transmission of data values into
the corresponding program variables happen via a FETCH command. The types of
the variables addressed in the programming language must match the formats of the
respective table fields. The FETCH command is phrased as
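follows; a generic form, with placeholders for the cursor name and the host variables, is:
FETCH cursor-name INTO host-variable-1, host-variable-2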
Each FETCH statement moves the CURSOR forward by one tuple. If no further
tuples are found, a corresponding status code is returned to the program.
Cursor concepts allow the embedding of set-oriented query and manipulation
languages into a procedural host language. For instance, the same linguistic
constructs in SQL can be either used interactively or embedded. This has additional
advantages for testing embedded programming sections, since the test tables can be
analyzed and checked with interactive SQL at any point.
6 Quartiles of ranked data sets are the points between the quarters of the set.
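A sketch of such a stored function in SQL/PSM-style syntax; the exact syntax varies by product, and the table and column names (EMPLOYEE, Salary) are those of the running example:
CREATE FUNCTION SalaryQuartile() RETURNS INTEGER
BEGIN
  DECLARE pos INTEGER DEFAULT 0;      -- loop counter
  DECLARE quartile INTEGER;           -- COUNT(*)/4
  DECLARE sal INTEGER;                -- current salary value
  DECLARE c_salaries CURSOR FOR
    SELECT Salary FROM EMPLOYEE ORDER BY Salary ASC;
  SELECT COUNT(*)/4 INTO quartile FROM EMPLOYEE;
  OPEN c_salaries;
  WHILE pos < quartile DO             -- fetch tuple by tuple
    FETCH c_salaries INTO sal;
    SET pos = pos + 1;
  END WHILE;
  CLOSE c_salaries;
  RETURN sal;                         -- value of the first quartile
END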
This function opens a cursor on the employee table sorted by salary (low to high),
loops through each row, and returns the value of the Salary column from the row
where COUNT(*)/4 iterations of the loop have been run. This value is the first
quartile, i.e., the value separating the lowest 25% of values in the set. The result of
the function can then be selected with the statement
Select SalaryQuartile();
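The cursor concept applies in the same way when SQL is embedded in an external programming language such as Python. A minimal sketch, in which the sqlite3 module and the connection details merely stand in for a product-specific library:
import sqlite3                          # product-specific program library

db = sqlite3.connect("company.db")      # connection with access information
cursor = db.execute("SELECT * FROM EMPLOYEE")
for record in cursor:                   # sequential processing in a FOR loop
    print(record)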
First, the program library for the database is imported, which is product-specific.
Then a connection to the database is opened with appropriate access information (see
Sect. 4.2) and stored in the variable db. Finally, a cursor is opened on a
SQL-SELECT query, which is run sequentially in a FOR loop. In this simple
example, only the record is printed with print(); any processing logic could now
be inserted here.
Graph-based languages, since they are also set-oriented, can be embedded in host
languages using the same principle with the cursor concept. When an embedded
Cypher statement is executed, a result set is returned, which can be
processed in a loop.
Python is also used more and more in the area of graph databases. For this reason,
it is also possible to embed Cypher in Python scripts, as the following
example shows:
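A sketch with the Neo4j Python driver; the URI and the credentials are placeholders:
from neo4j import GraphDatabase              # program library

driver = GraphDatabase.driver(               # driver with access information
    "bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:            # open a database session
    result = session.run(                    # run a Cypher query on the server
        "MATCH (e:Employee) RETURN e.employeeID")
    for record in result:                    # process the cursor in a FOR loop
        print(record)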
First, the program library is imported. Then a driver is instantiated that contains
the access information. With this, a database session can be opened. With the run
command, a Cypher query can be executed on the database server. The processing of
the CURSOR is done in a FOR loop.
We have now seen the embedding of the set-oriented database languages SQL and
Cypher. However, MQL is actually a program library that is controlled with JSON
parameters. Therefore, MQL is not embedded as a separate language in procedural
host languages. The find, insert, update, and delete commands are applied directly as
routines of the corresponding APIs and parameterized accordingly. Below we see an
example of using MQL in Python:
import pymongo
uri = "mongodb://localhost:27017"
client = pymongo.MongoClient(uri)
database = client['Company']
collection = database['EMPLOYEES']
result = collection.find({},{"Name" : 1})
for r in result: print(r)
4 Database Security
Database security is based on the basic security of the data center, the computers,
and the network in which the database server is operated. Computers must use the
latest version of all software components to close security gaps. The network must
be protected with firewall rules and geo-IP filters. And the data center must physi-
cally protect hardware from access. We do not go into these basics of cybersecurity
here, but refer to relevant literature. We focus on the technical functions that a
database system can use to achieve the security goals.
Figure 4.1 lists necessary measures for databases for each of the three CIA goals.
The security measures in Fig. 4.1 build on each other. To achieve integrity, confi-
dentiality must also be ensured, and the measures for integrity are also necessary but
not sufficient for availability. In the following, we provide a brief overview of the
general security measures before we discuss the special features in the database
environment in the following sections.
To ensure confidentiality, privacy protection is central. Authentication uses
accounts and passwords to verify that users are who they say they are. With
appropriate password policies, users are encouraged to choose passwords with a
sufficiently large number of characters (e.g., 9) with upper and lower case, including
numbers and special characters, and to change them regularly.

Fig. 4.1 Database security measures for confidentiality, integrity, and availability

Authorization rules
restrict access rights and give users exactly the access they need. However, even with
the best access protection, injection attacks can inject executable code into user
interfaces to read or modify the database without permission. This must be prevented
with appropriate measures (e.g., prepared statements). Encryption of the database
and communication with the database server prevents unauthorized reading. The
certification of the server ensures that information is entrusted to one’s own system
and not unintentionally revealed to a “person in the middle.”
Many of the measures mentioned above serve to ensure the integrity of the data as
well, i.e., to protect it from unintentional modification. In the case of authorization,
additional care must be taken to be particularly restrictive in terms of write
permissions. Furthermore, database auditing can be used to record who has
performed which action on the database and when, which allows errors to be traced
and corrected. A special feature of database management systems is that they can
partially check the integrity of the data automatically. For this purpose, conditions
under which the data is correct, so-called integrity constraints, can be formulated.
Inconsistencies can also occur in multi-user operation if data records are modified
simultaneously. Transaction management provides mechanisms to protect the integ-
rity of data from side effects and version conflicts.
To ensure the availability of the services of a database system, further measures
are needed. A system of multiple servers with load balancing ensures availability
even with a large number of requests. The geographical distribution of redundant,
identical database systems protects against interruptions caused by natural disasters
and other major events. Regular backups of the database ensure that all data remains
available in the event of damage. A transaction log records all changes to the
database and ensures that all transactions are completed consistently in the event of a
crash. To do this, the log files must be copied and backed up regularly.
In the following sections, we will deal with security measures for which the
database management system provides specific mechanisms: access control, integ-
rity conditions, and transaction management.
Similar to other database objects, user accounts can be modified and deleted with
ALTER USER and DROP USER.
The GRANT command is used to authorize users for actions on tables. The
following command authorizes user Murphy for all actions on the STAFF table (see
Fig. 4.2).
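A possible formulation of this authorization:
GRANT ALL ON STAFF TO murphy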
Fig. 4.2 Two views based on the STAFF table:

CREATE VIEW EMPLOYEE AS
SELECT E#, Name, City, Sub
FROM STAFF

CREATE VIEW GROUP_A AS
SELECT E#, Name, Salary, Sub
FROM STAFF
WHERE Salary BETWEEN 80,000 AND 100,000

EMPLOYEE (excerpt):
E#  Name    City       Sub
E7  Howard  Cleveland  D5
E4  Bell    Kent       D6
To simplify the assignment of rights for several users, reusable roles can be
defined. In the following example, a role “hr” is created. This role will be authorized
to perform read and write actions on the table STAFF.
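A sketch of such a role definition and its assignment; the exact privilege list is illustrative:
CREATE ROLE hr;
GRANT SELECT, INSERT, UPDATE, DELETE ON STAFF TO hr;
GRANT hr TO murphy;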
However, the GRANT command only allows access control at the level of entire
database objects, such as tables. In many cases, we may want to further restrict
access to columns and rows of a table. This is done with table views, each of which is
based on either one or multiple physical tables and is defined using a SELECT
statement:
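The general form is roughly:
CREATE VIEW view-name AS SELECT-statement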
However, view security is only effective when users are granted privileges on the
views rather than on the base tables.
Figure 4.2 shows two example views based on the STAFF table. The
EMPLOYEE view shows all attributes except for the salary information. The view
GROUP_A shows only those employees with their respective salaries who earn
between USD 80,000 and 100,000 annually. Other views can be defined similarly,
e.g., to allow HR to access confidential data per salary group.
The two examples in Fig. 4.2 demonstrate important protection methods: On the
one hand, tables can be limited for specific user groups by projection; on the other
hand, access control can also be value-based, e.g., for salary ranges, via
corresponding view definitions in the WHERE clause.
As on tables, it is possible to formulate queries on views; however, manipulation
operations cannot always be defined uniquely. If a view is defined as a join of
multiple tables, change operations may be denied by the database system under
certain circumstances.
Updateable views allow for insert, delete, and update operations. The following
criteria determine whether a view is updateable:
• The view contains content from only one table (no joins allowed).
• That base table has a primary key.
• The defining SQL expression contains no operations that affect the number of
rows in the result set (e.g., aggregate, group by, distinct, etc.).
It is important to note that the data for different views of a single table are not stored
redundantly; rather, merely the definitions of the views are stored, while the data is
managed once in the base table. Only when the view is queried with a SELECT statement are the
corresponding result tables generated from the view's base tables with the permitted
data values.
Using views, it is now possible to grant only reading privileges for a subset of
columns of the STAFF table with the EMPLOYEE view from Fig. 4.2:
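A possible formulation:
GRANT SELECT ON EMPLOYEE TO PUBLIC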
Instead of listing specific users, this example uses PUBLIC to assign reading
privileges to all users so they can look at the limited EMPLOYEE view of the base
table.
For a more selective assignment of permissions, for instance, it is possible to
authorize only a certain HR employee with the user ID ID37289 to make changes to
a subset of rows in the GROUP_A view from Fig. 4.2:
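A possible formulation:
GRANT SELECT, UPDATE ON GROUP_A TO ID37289 WITH GRANT OPTION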
User ID37289 can now modify the GROUP_A view and, thanks to the GRANT
OPTION, even pass this authorization or a limited reading privilege on to others and
revoke it again. This concept makes it possible to define and manage dependencies between
privileges.
The complexity of managing the assignment and removal of permissions when
giving end users access to a relational query and manipulation language is not to be
underestimated, even if the data administrators can use GRANT and REVOKE
commands. In reality, daily changes and the monitoring of user authorizations
require additional management tools (e.g., auditing). Internal or external controlling
instances and authorities may also demand special measures to constantly ensure the
proper handling of especially sensitive data (see also the legal data protection
obligations for your jurisdiction).
SQL Injection
One security aspect that plays an increasingly important role in the age of the Web in
the area of databases is the prevention of so-called SQL injections. When Web pages
are programmed on the server side and connected to an SQL database, server scripts
sometimes generate SQL code to interface with the database (see Sect. 3.6). If the
code contains parameters that are entered by users (e.g., in forms or as part of the
URL), additional SQL code can be injected there, the execution of which exposes or
modifies sensitive information in the database.
As an explanatory example, let’s assume that after logging into the user account
of a Web store, the payment methods are displayed. The Web page that displays the
user’s saved payment methods has the following URL:
http://example.net/payment?uid=117
Let’s assume that in the background, there is a program in the Java programming
language that fetches the credit card data (name and number) from the database via
Java Database Connectivity (JDBC). The Java servlet uses embedded SQL and a
cursor (see Sect. 3.6.1). Then the data is displayed on the Web page using HTML:
ResultSet cursor =
connection.createStatement().executeQuery(
"SELECT credit_card_number, name "
+ "FROM PAYMENT "
+ "WHERE uid = "
+ request.getParameter("uid"));
while (cursor.next()) {
out.println(
cursor.getString("credit_card_number")
+ "<br/>"
+ cursor.getString("name"));
}
For this purpose, an SQL query on the PAYMENT table is generated in the Java
code above. It is parameterized with user input from the URL via a GET request
(request.getParameter). This type of code generation is
vulnerable to SQL injection. If the parameter uid is extended in the URL as follows, all
credit card data of all users will be displayed on the Web page:
http://example.net/payment?uid=117%20OR%201=1
The reason for this is that the servlet shown above generates the following SQL
code based on the GET parameter:
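With uid=117%20OR%201=1, this amounts to approximately:
SELECT credit_card_number, name
FROM PAYMENT
WHERE uid = 117 OR 1=1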
The injected SQL code "OR 1=1" renders the filter on the user identification in the
generated query ineffective, since 1=1 is always true and an OR expression is true as
soon as one of its operands is true. In this simple example, the website therefore
exposes data that should be protected. A common countermeasure is the use of
prepared statements, in which the SQL code is precompiled and user input is passed
only as parameter values:
PreparedStatement ps = connection.prepareStatement(
"SELECT credit_card_number, name FROM PAYMENT"
+ " WHERE uid = ?");
ps.setString(1, request.getParameter("uid"));
ResultSet resultset = ps.executeQuery();
With SHOW USERS, all existing user accounts can be displayed. To rename a
user account, use the command RENAME USER:
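A sketch in Neo4j's administration syntax, with illustrative account names:
RENAME USER murphy TO murphy.kent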
An account can be changed with the command ALTER USER, e.g., to reset the
password. With the addition CHANGE NOT REQUIRED, the user is not forced to
change the specified password at the next login.
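A sketch of such a statement, with illustrative account name and password:
ALTER USER murphy.kent SET PASSWORD 'jw8s0F4' CHANGE NOT REQUIRED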
For authorization, Cypher offers the GRANT command. There are predefined
roles for the role-based access control (RBAC):
• PUBLIC can access its own HOME database and perform all functions there. All
user accounts have this role.
• reader can read data from all databases.
• editor can read databases and modify contents.
• publisher can read and edit data and additionally add new node and edge types and property
names.
• architect has in addition to publisher the ability to manage indexes and integrity
constraints.
• admin can manage databases, users, roles, and permissions in addition to the
architect role.
The following command assigns the role architect to user account murphy.kent:
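A possible formulation:
GRANT ROLE architect TO murphy.kent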
Privileges can also be set in a fine-grained way. The following allows
reading all node types (*) of the "company" graph for the user account murphy.kent.
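In Neo4j, such privileges are attached to roles, which are in turn assigned to accounts; a sketch with an illustrative role name:
CREATE ROLE company.reader;
GRANT MATCH {*} ON GRAPH company NODES * TO company.reader;
GRANT ROLE company.reader TO murphy.kent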
Authorization works similarly for relationship types. The following Cypher code
creates a new role project.admin, gives it the ability to create relationships of type
Team, and grants permissions to this role to murphy.kent.
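A sketch in Neo4j's role-based syntax:
CREATE ROLE project.admin;
GRANT CREATE ON GRAPH company RELATIONSHIPS Team TO project.admin;
GRANT ROLE project.admin TO murphy.kent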
Cypher supports the division of access rights down to the level of individual properties.
The following ensures that the role project.admin can read all nodes and edges but
cannot see the property wage of the node type personnel.
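A sketch, again assuming the graph is named company:
GRANT MATCH {*} ON GRAPH company TO project.admin;
DENY READ {wage} ON GRAPH company NODES personnel TO project.admin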
Cypher Injection
Cypher injection is an attack using specifically formatted input from users to perform
unexpected operations on the database, read, or modify data without permission.
Let's assume that a Web application allows users to insert new records about projects into
the database. To do this, a Web server executes the following Java code:
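A sketch of such vulnerable code, using the Neo4j Java driver; the parameter name projectName is an assumption:
String name = request.getParameter("projectName");   // user input from the form
session.run("CREATE (p:project) SET p.name = '" + name + "'");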
This code is vulnerable to Cypher injection. Users could write the following input
to the Web form:
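For instance, the following text could be entered into the name field:
Anything' WITH true as x MATCH (p:Project) DETACH DELETE p //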
Cypher code has been injected in this user input. If this string is interpreted in the
Java code above, the following Cypher command is generated:
CREATE (p:project)
SET p.name = 'Anything'
WITH true as x
MATCH (p:Project) DETACH DELETE p //'
It doesn't matter what other code comes directly before the quotation mark that
terminates the string. The WITH clause after the CREATE command makes it possible in
Cypher to append a MATCH and DELETE statement. This deletes all project data
here. Any remaining characters are turned into a comment by the double slash // and are
therefore not executed as code.
To work around this problem, user input can be passed as parameters in Cypher.
This way the query is precompiled, and the parameter inputs are not interpreted by
the DBMS.
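A sketch of the parameterized variant with the Java driver:
session.run(
"CREATE (p:project) SET p.name = $name",
Map.of("name", request.getParameter("projectName")));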
Sophisticated concepts of access control are present in MQL. The basis is to create a
user account for authentication with the createUser() method:
use Company
db.createUser(
{
user: "murphy",
pwd: passwordPrompt(),
roles: [
{ role: "read", db: "Company" },
]
}
)
This request creates a user account "murphy" in the database "Company". With
the passwordPrompt() specification, the password is not passed as plain text but
entered at a command prompt. This has security advantages: The password is not visible on the
screen, is not saved in a file, does not appear in the command history of the
command line, and is invisible to other processes of the operating system. However,
the createUser() function can be passed the password in plain text if necessary.
The password of an existing account can later be changed with the changeUserPassword() method:
db.changeUserPassword("murphy",
"KJDdfgSD$_3")
Predefined roles such as read each apply to all collections in a database. In order to set user
rights at the level of individual collections, user-defined roles can be created. For example, to
give a user account write access to the "Staff" collection, a "staffAdmin" role can be
created using the createRole() method:
use admin
db.createRole(
{
role: "staffAdmin",
privileges: [
{
actions: [ "insert", "remove", "update" ],
resource: { db: "Company",
collection: "Staff" }
}
],
roles: [] } )
This command gives the “staffAdmin” role the privileges to perform the insert,
remove, and update actions on the “Staff” collection. This role can now be assigned
to individual user accounts using the “grantRolesToUser()” method.
use Company
db.grantRolesToUser(
"murphy",
[
{ role: "staffAdmin", db: "Company" } ] )
Conversely, roles can be revoked from a user account with the revokeRolesFromUser() method:
use Company
db.revokeRolesFromUser(
"murphy",
[
{ role: "read", db: "Company" } ] )
MQL does not offer the possibility to grant privileges to user accounts individu-
ally. All access rights are distributed via roles. Roles allow access rights to be assigned
at the level of collections. To restrict read access to individual fields and to subsets of
documents, MQL can define views, as in the following example:
use Company
db.createView(
"vStaff",
"Staff",
[
{ $match: { Salary:
{ $gte : 80000, $lte : 160000} } },
{ $project: { Salary: 0 } }
] )
This view shows only employees with salaries between 80,000 and 160,000, but
without the exact salary information, because the field was removed in the view
definition with an exclusion projection.
Subsequently, a user-defined role can be authorized to read this view instead of
the original collection:
use Company
db.grantPrivilegesToRole(
"staffAdmin",
[
{
resource: {
db: "Company",
collection: "vStaff" },
actions: [ "find" ]
}
] )
JavaScript Injection
Although MQL is not interpreted as a language, but is parameterized with JSON
objects, NoSQL injection attacks are certainly possible with MongoDB. Let’s
assume, for example, that the user name and password for authentication in a Web
application with MongoDB are passed as Get parameters via the URL:
https://example.org/login?user=u526&pw=123456
Let’s further assume that in the background, this URL request to the Web server is
forwarded to the MongoDB database in a Python program to check if the combina-
tion of user and password is present in the database:
from urllib.parse import urlparse, parse_qs   # parse the GET parameters

result = collection.find({"$where":
"this.user == '"
+ parse_qs(urlparse(url).query)['user'][0]
+ "' && this.pw == '"
+ parse_qs(urlparse(url).query)['pw'][0]
+ "'" })
Given the input parameters, this Python program generates and executes the
following MQL query:
db.users.find({'$where':
"this.user == 'u526'
&& this.pw == '123456'" } )
The $where operator in MQL, in the current version 5.0 of MongoDB, allows a
JavaScript expression to be checked for document selection. The operator is vulner-
able to JavaScript injection.
If an attacker injects JavaScript code into the URL, it could look like this:
https://example.org/login?user=u526&pw=%27%3B%20return%20true%2B%27
In this URL, special characters like single quotes (%27), a semicolon (%3B),
spaces (%20), and a plus sign (%2B) have been injected together with JavaScript
code like “return true.” The server generates the following query to MongoDB
from this:
db.users.find({'$where':
"this.user == 'u526'
&& this.pw == ''; return true+''" } )
By injecting the statement “return true,” the filter predicate for checking users and
passwords becomes a tautology, so it is always true. Thus, we can bypass the
password in this example.
A real authentication mechanism is certainly not implemented this way. We simply want to
show here that injection is possible in principle in MQL. This simple example
should suffice for that.
4.3 Integrity Constraints
• Uniqueness constraint: An attribute value can exist at most once within a given
class of records.
• Existence constraint: An attribute has to exist at least once within a given class
of records and must not be empty.
• Key constraint: An attribute value has to exist exactly once within a given set of
records, and the corresponding attribute has to be present for each record of a
class. The key condition combines uniqueness and existence constraints. Keys
can also be defined on attribute combinations.
• Primary key constraint: If multiple attributes exist that satisfy the key condition,
at most one of them can be defined as primary.
• Domain constraint: The set of possible values of an attribute can be restricted,
for example, using data types, enumerations, and checking rules.
• Referential integrity constraint: Attributes within a data set that refer to other
data sets must not point to nothing, i.e., the referenced data sets must exist.
Integrity or consistency of data means that stored data does not contradict itself. A
database has integrity/consistency if the stored data is free of errors and accurately
represents the anticipated informational value. Data integrity is impaired if there are
ambiguities or conflicting records. For example, a consistent EMPLOYEE table
requires that the names of employees, streets, and cities really exist and are correctly
assigned.
Declarative integrity constraints are defined during the generation of a new table
in the CREATE TABLE statement using the data definition language. Constraints
can be added to, changed, and removed from existing tables using the ALTER
TABLE statement. In the example in Fig. 4.3, the primary key for the DEPART-
MENT table is specified as an integrity constraint with PRIMARY KEY. Primary
and foreign key of the EMPLOYEE table are defined similarly.
The various types of declarative integrity constraints are:
• Primary key definition: PRIMARY KEY defines a unique primary key for a
table. Primary keys must, by definition, not contain any NULL values.
• Foreign key definition: FOREIGN KEY can be used to specify a foreign key,
which relates to another table in the REFERENCES clause.
• Uniqueness: The uniqueness of an attribute can be determined by the UNIQUE
constraint. Unlike primary keys, unique attributes may contain NULL values.
• Existence: The NOT NULL constraint dictates that the respective attribute must
not contain any NULL values. For instance, the attribute Name in the
EMPLOYEE table in Fig. 4.3 is set to NOT NULL, because there must be a
name for every employee.
• Check constraint: Such rules can be declared with the CHECK command and
apply to every tuple in the table. For example, the CHECK Salary >30,000
statement in the STAFF table in Fig. 4.3 ensures that the annual salary of each
employee is at least USD 30,000.
• Set to NULL for changes or deletions: ON UPDATE SET NULL or ON
DELETE SET NULL declares for dependent tables that the foreign key value
of a dependent tuple is set to NULL when the corresponding tuple in the
referenced table is modified or removed.
• Restricted changes or deletion: If ON UPDATE RESTRICT or ON DELETE
RESTRICT is set, tuples cannot be manipulated or deleted while there are still
dependent tuples referencing them.
• Cascading changes or deletion: ON UPDATE CASCADE or ON DELETE
CASCADE defines that the modification or removal of a reference tuple is
extended to all dependent tuples.
Fig. 4.3 Declarative integrity constraints on the DEPARTMENT and EMPLOYEE tables (excerpt):

CREATE TABLE DEPARTMENT(
  D# CHAR(2),
  DepartmentName VARCHAR(20),
  PRIMARY KEY (D#)
)

D#  DepartmentName
D3  IT
D5  HR
D6  Accounting

The EMPLOYEE table references DEPARTMENT via a foreign key on D#.
In Fig. 4.3, a restrictive deletion rule has been specified for the two tables
DEPARTMENT and EMPLOYEE. This ensures that individual departments can
only be removed if they have no dependent employee tuples left. An attempt to delete
the accounting department would therefore return an error message, since the employees
Stewart and Bell are listed under the accounting department.
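Such a rejected deletion could look as follows, with D6 being the accounting department of Fig. 4.3:
DELETE FROM DEPARTMENT WHERE D# = 'D6'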
Aside from delete operations, declarative integrity constraints can also affect
insert and update operations. For instance, an insert operation that assigns a new
employee to department D7 will also return an error message: Department D7 is not yet listed in the referenced
table DEPARTMENT, but due to the foreign key constraint, the DBMS checks
whether the key D7 exists in the referenced table before the insertion.
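A sketch of such an insertion, with illustrative attribute values and column names:
INSERT INTO EMPLOYEE (E#, Name, City, Sub) VALUES ('E20', 'Kelly', 'Boston', 'D7')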
Declarative, or static, integrity constraints can be defined during table generation
(CREATE TABLE statement). On the other hand, procedural, or dynamic, integrity
constraints compare database states before and after a change, i.e., they can only be
checked at runtime. Triggers are an alternative to declarative integrity
constraints because they initiate a sequence of procedural actions when a given
database operation occurs. Triggers are mostly defined by a trigger name, a database operation, and a list of
subsequent actions:
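A sketch of such a trigger; the syntax shown roughly follows MySQL, and the names are illustrative:
CREATE TRIGGER NoSalaryCut        -- trigger name
BEFORE UPDATE ON EMPLOYEE         -- database operation
FOR EACH ROW
BEGIN                             -- subsequent actions
  IF NEW.Salary < OLD.Salary THEN
    SET NEW.Salary = OLD.Salary;
  END IF;
END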
The example above shows a situation where employees’ salaries must not be cut,
so before updating the EMPLOYEE table, the trigger checks whether the new salary
is lower than the old one. If that is the case, the integrity constraint is violated, and
the new salary is reset to the original value from before the update. This is a very
basic example meant to illustrate the core concept. In a production environment, the
user would also be notified.
Working with triggers can be tricky, since individual triggers may prompt other
triggers, which raises the issue of terminating all subsequent actions. In most
commercial database systems, the simultaneous activation of multiple triggers is
prohibited to ensure a clear action sequence and the proper termination of triggers.
In graph databases, implicit and explicit integrity constraints exist. The graph
database model implicitly checks referential integrity by ensuring that all edges are
connected to existing nodes. Cypher also explicitly supports four types of consistency
conditions, including uniqueness, existence, and key constraints on properties.
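For example, a uniqueness constraint on the Ssn property of EMPLOYEE nodes could be declared as follows; the sketch uses the syntax of recent Neo4j versions:
CREATE CONSTRAINT employee_ssn_unique
FOR (e:EMPLOYEE) REQUIRE e.Ssn IS UNIQUE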
With the opposite command DROP CONSTRAINT, the integrity condition can
be deleted again:
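For the sketch above, this would be:
DROP CONSTRAINT employee_ssn_unique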
There can be more than one key for a node type. The key condition is simply a
combination of the existence and uniqueness conditions.
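Two insertions of this kind might look as follows; the values are illustrative:
CREATE (:EMPLOYEE { Ssn: 12345678 })
CREATE (:EMPLOYEE { Ssn: "12345678" })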
These statements insert two EMPLOYEE nodes with the Ssn property, first with
type Integer and then with type String. In terms of schema freedom, this is possible.
However, schema freedom does not mean that Cypher has no datatypes. The apoc.
meta.type() function can be used to output the list of data types it stores for an input:
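A sketch of such a call, assuming the APOC library is installed:
MATCH (e:EMPLOYEE)
RETURN e.Ssn, apoc.meta.type(e.Ssn)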
Referential Integrity
As a graph database language, Cypher implicitly checks all edges for referential
integrity, i.e., neither primary nor foreign keys need to be explicitly declared for
linking nodes to directed edges. The database management system ensures that
edges refer to existing nodes in all cases. Nodes can therefore only be deleted if
there are no edges associated with them. In addition, it is not possible to insert edges
without nodes. Edges must always be created as triples together with source and
target nodes. If necessary, the connected nodes can also be created directly during
the insertion of the edge.
The MongoDB database system, with its concept of schema freedom, offers a
flexible insertion of new data. This means that any data structure can be inserted
into all collections by default. This is especially useful for dealing with heteroge-
neous data (Big Data variety). Nevertheless, it is possible to ensure data integrity in
MQL in various ways.
Uniqueness Constraints
It is possible in MQL to create an index for a property that ensures the uniqueness of
the property values. For example, the following statement prevents records from
being inserted into the EMPLOYEE collection if the value of the EMPLOYEE.
EMailAddress field already exists.
db.EMPLOYEES.createIndex(
{ "EMPLOYEE.EMailAddress": 1},
{ unique: true } )
Existence Constraints
MQL supports validation of input documents with JSON Schema (see Sect. 2.5.1).
This is the variant of schema validation recommended by the manufacturer. For
example, JSON Schema can be used to specify which properties must be present for
a document. The following example creates a collection EMPLOYEE with validator
that sets an existence condition for the fields EMPLOYEE.Name and EMPLOYEE.
Status. Thus, only documents can be inserted which have at least these two fields.
db.createCollection("EMPLOYEE", {
validator: {
$jsonSchema: {
required: [ "EMPLOYEE.Name", "EMPLOYEE.Status" ]
}
}
})
Domain Constraints
In addition to JSON Schema, MQL supports validation rules which allow filters with
all existing filter operators, with few exceptions. This allows sophisticated value
range conditions to be defined. For example, the following statement creates a new
collection PROJECTS with a validator that checks whether titles are of type string and
restricts the status field to three possible values.
db.createCollection( "PROJECTS",
{ validator: { $and:[
{ "PROJECTS.title": { $type: "string" } },
{ "PROJECTS.status": { $in:
[ "Requested", "Active", "Performed"]
} } ] } } )
The terms consistency and integrity of a database describe a state in which the stored
data does not contradict itself. Integrity constraints are to ensure that data consis-
tency is maintained for all insert and update operations.
4.4.2 ACID
Ensuring the integrity of data is a major requirement for many database applications.
The transaction management of a database system allows conflict-free simultaneous
work by multiple users. Changes to the database are only applied and become visible
if all integrity constraints as defined by the users are fulfilled.
The term transaction describes database operations bound by integrity rules,
which update database states while maintaining consistency. More specifically, a
transaction is a sequence of operations that has to be atomic, consistent, isolated, and
durable.
• Atomicity (A): Transactions are either applied in full or not at all, leaving no trace
of their effects in the database. The intermediate states created by the individual
operations within a transaction are not visible to other concurrent transactions. A
transaction can therefore be seen as a unit for the resettability of incomplete
transactions.
• Consistency (C): During the transaction, integrity constraints may be temporarily
violated; however, at the end of the transaction, all of them must be met again. A
transaction therefore always results in moving the database from one consistent
state into another and ensures the integrity of data. It is considered a unit for
maintaining consistency.
• Isolation (I): The concept of isolation requires that parallel transactions generate
the same results as transactions in single-user environments. Isolating individual
transactions from transactions executed simultaneously protects them from
unwanted side effects. This makes transactions a unit for serializability.
• Durability (D): Database states must remain valid and be maintained until they
are changed by a transaction. In case of software errors, system crashes, or errors
on external storage media, durability retains the effects of a correctly completed
transaction. In relation to the reboot and recovery procedures of databases,
transactions can be considered a unit for recovery.
These four principles, Atomicity (A), Consistency (C), Isolation (I), and Durabil-
ity (D), describe the ACID concept of transactions, which is the basis of several
database systems and guarantees that all users can only make changes that lead from
one consistent database state to another. Inconsistent interim states remain invisible
externally and are rolled back in case of errors.
4.4.3 Serializability
Concept of Serializability
A system of simultaneous transactions is synchronized correctly if there is a serial
execution creating the same database state.
The principle of serializability ensures that the results in the database are identi-
cal, whether the transactions are executed one after the other or in parallel. The focus
in defining conditions for serializability is on the READ and WRITE operations
within each transaction, i.e., the operations which read and write records in the
database.
Banking provides typical examples of concurrent transactions. The basic integrity
constraint for posting transactions is that debit and credit have to be balanced.
Figure 4.4 shows two simultaneously running posting transactions with their
READ and WRITE operations in chronological order. Neither transaction on its
own changes the total amount of the accounts a, b, and c. The transaction TRX_1
credits account a with 100 units of currency and, at the same time, debits account b
with 100 units of currency. The posting transaction TRX_2 similarly credits account
b and debits account c for 200 currency units each. Both transactions therefore fulfill
the integrity constraint of bookkeeping, since the ledgers are balanced.
However, if both transactions are executed simultaneously, a conflict arises: The
transaction TRX_1 misses the credit b := b+200¹ done by TRX_2, since this change
is not immediately written back, and reads a “wrong” value for account b. After both
¹ The notation b := b+200 means that the current balance of account b is increased by 200 currency units.
Fig. 4.4 The two posting transactions TRX_1 and TRX_2 with their READ and WRITE operations in chronological order (time runs from top to bottom):

TRX_1                    TRX_2
BEGIN_OF_TRX_1           BEGIN_OF_TRX_2
READ(a)
a := a + 100             READ(b)
WRITE(a)
                         b := b + 200
READ(b)
                         WRITE(b)
b := b - 100
                         READ(c)
WRITE(b)                 c := c - 200
                         WRITE(c)
END_OF_TRX_1             END_OF_TRX_2
transactions are finished, account a holds the original amount + 100 units (a+100),
the amount in account b is reduced by 100 units (b-100), and c holds 200 units less
(c-200). Due to transaction TRX_1 missing the b+200 step for account b and
not calculating the amount accordingly, the total credits and debits are not balanced,
and the integrity constraint is violated.
Potential conflicts can be discovered beforehand. To do so, those READ and
WRITE operations affecting a certain object, i.e., a single data value, a record, a
table, or sometimes even an entire database, are filtered from all transactions. The
granularity (relative size) of the object decides how well the picked transactions can
be synchronized. The larger the granularity, the smaller the degree of transaction
synchronization and vice versa. All READ and WRITE operations from different
transactions that apply to a specific object are therefore listed in the log of the object
x, short LOG(x). The LOG(x) of object x contains, in chronological order, all READ
and WRITE operations accessing the object.
In our example of the concurrent posting transactions TRX_1 and TRX_2, the
objects in question are the accounts a, b, and c. As shown in Fig. 4.5, the log for
object b, for instance, contains four entries (see also Fig. 4.4). First, TRX_2 reads the
value of b, and then TRX_1 reads the same value, before TRX_2 gets to write back
the modified value of b. The last log entry is caused by TRX_1 when it overwrites
the value from TRX_2 with its own modified value for b. Assessing the logs is an
easy way to analyze conflicts between concurring transactions. A precedence graph
represents the transactions as nodes and possible READ_WRITE or
WRITE_WRITE conflicts as directed edges (arched arrows). For any one object,
WRITE operations following READs or WRITEs can lead to conflicts, while
multiple READ operations are generally not a conflict risk. The precedence graph
does therefore not include any READ_READ edges.
Figure 4.5 shows not only the log of object b for the posting transactions TRX_1
and TRX_2 but also the corresponding precedence graph. Starting from the TRX_1
node, a READ on object b is followed by a WRITE on it by TRX_2, visualized as a
directed edge from the TRX_1 node to the TRX_2 node. According to the log, a
WRITE_WRITE edge goes from the TRX_2 node to the TRX_1 node, since the
WRITE operation by TRX_2 is succeeded by another WRITE on the same object by
TRX_1. The precedence graph is therefore cyclical, in that there is a directed path
from a node that leads back to the same node. This cyclical dependency between the
transactions TRX_1 and TRX_2 shows that they are not serializable.
Serializability Condition
A set of transactions is serializable if the corresponding precedence graphs contain
no cycles.
The serializability condition states that multiple transactions have to yield the
same results in a multi-user environment as in a single-user environment. In order to
BEGIN_OF_TRX_1
LOCK(a)
READ(a)
a := a + 100
WRITE(a)
LOCK(b)
READ(b)
UNLOCK(a)
b := b - 100
WRITE(b)
UNLOCK(b)
END_OF_TRX_1

The figure additionally plots the number of locks held over time, which grows during the expanding phase up to LOCK(b) and shrinks again from UNLOCK(a) onward.

Fig. 4.6 Sample two-phase locking protocol for the transaction TRX_1
If all locks were only lifted at the end of the transaction, concurring transactions would have to wait
the entire processing time of TRX_1 for the release of objects a and b.
Overall, two-phase locking ensures the serializability of simultaneous
transactions.
Fig. 4.7 The two posting transactions TRX_1 and TRX_2 under two-phase locking, with the resulting log LOG(b); TRX_1 has to wait for its LOCK(b) until TRX_2 releases b:

TRX_1:
BEGIN_OF_TRX_1
LOCK(a)
READ(a)
a := a + 100
WRITE(a)
LOCK(b)
READ(b)
UNLOCK(a)
b := b - 100
WRITE(b)
UNLOCK(b)
END_OF_TRX_1

TRX_2:
BEGIN_OF_TRX_2
LOCK(b)
READ(b)
b := b + 200
WRITE(b)
LOCK(c)
READ(c)
UNLOCK(b)
c := c - 200
WRITE(c)
UNLOCK(c)
END_OF_TRX_2

LOG(b):
TRX_2:READ
TRX_2:WRITE
TRX_1:READ
TRX_1:WRITE
2PL causes a slight delay in the transaction TRX_1, but after both transactions are
finished, integrity is retained. The value of account a has increased by 100 units (a
+100), as has the value of account b (b+100), while the value of account c has been
reduced by 200 units (c-200). The total amount across all three accounts has
therefore remained the same.
A comparison between the LOG(b) from Fig. 4.7 and the previously discussed
log from Fig. 4.5 shows a major difference: It is now strictly one read (TRX_2:
READ) and one write (TRX_2: WRITE) by TRX_2 before TRX_1 gets access to
account b and can also read (TRX_1: READ) and write (TRX_1:WRITE) on it. The
corresponding precedence graph contains neither READ_WRITE nor
WRITE_WRITE edges between the nodes TRX_1 and TRX_2, i.e., it is free of
cycles. The two posting transactions therefore fulfill the integrity constraint.
In many database applications, the demand for high serializability prohibits the
use of entire databases or tables as locking units. Consequently, it is common to
define smaller locking units, such as database excerpts, parts of tables, tuples, or
even individual data values. Ideally, locking units are defined in a way that allows for
hierarchical dependencies in lock management. For instance, if a set of tuples is
locked by a specific transaction, the superordinate locking units such as the
containing table or database must not be completely blocked by other transactions
during the lock’s validity. When an object is put under an exclusive lock, locking
hierarchies can be used to automatically evaluate and mark superordinate objects
accordingly.
Various locking modes are also important: The most basic classification of locks
is the dichotomy of read and write locks. Read locks (or shared locks) grant read-
only access for the object to a transaction, while write locks (or exclusive locks)
permit read and write access to the object.
Another pessimistic method for ensuring serializability is the use of timestamps, which allow
strictly ordered object access according to the age of the transactions. Such time-
tracking methods preserve the chronological order of the individual operations
within the transactions and therefore avoid conflicts.
Optimistic methods are based on the assumption that conflicts between concurrent
transactions will be rare occurrences. No locks are set initially in order to increase the
degree of concurrency and reduce wait times. Before transactions can conclude
successfully, they are validated retroactively.
Transactions with optimistic concurrency control have three parts: read phase,
validation phase, and write phase. During the read phase, all required objects are
read, saved to a separate transaction workspace, and processed there, without any
preventative locks being placed. After processing, the validation phase is used to
check whether the applied changes conflict with any other transactions. The goal is
to check currently active transactions for compatibility and absence of conflicts. If
two transactions conflict with each other, the transaction currently in the validation phase is
deferred. In case of successful validation, all changes from the workspace are entered
into the database during the write phase.
The use of transaction-specific workspaces increases concurrency in optimistic
methods, since reading transactions do not impede each other. Checks are only
necessary before writing back changes. This means that the read phases of multiple
transactions can run simultaneously without any objects being locked. Instead, the
validity of the objects in the workspace, i.e., whether they still match the current state
of the database, must be confirmed in the validation phase.
For the sake of simplicity, we will assume that validation phases of different
transactions do not overlap. To ensure this, the time the transaction enters the
validation phase is marked. This allows for both the start times of validation phases
and the transactions themselves to be sorted chronologically. Once a transaction
enters the validation phase, it is checked for serializability.
The procedure to do so in optimistic concurrency control is as follows: Let TRX_t
be the transaction to be validated and TRX_1 to TRX_k be all concurrent
transactions that have already been validated during the read phase of TRX_t. All
other transactions can be ignored since they are handled in strict chronological order.
All objects read by TRX_t must be validated, since they could have been modified
by any of the critical transactions TRX_1 to TRX_k in the meantime. The set of
objects read by TRX_t is labeled READ_SET(TRX_t), and the set of objects written
by the critical transactions is labeled WRITE_SET(TRX_1,. . .,TRX_k). This gives
us the following serializability condition: the transaction TRX_t can only be validated successfully if the sets READ_SET(TRX_t) and WRITE_SET(TRX_1,...,TRX_k) are disjoint, i.e., if none of the objects read by TRX_t has in the meantime been written by one of the critical transactions.
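Expressed as a small Python sketch (the function and variable names are our own), the validation step reduces to a set intersection test:

def validate(read_set_trx_t, write_sets_of_critical_trx):
    # TRX_t passes validation only if none of the objects it has read
    # was written by a transaction validated during its read phase
    written = set().union(*write_sets_of_critical_trx) if write_sets_of_critical_trx else set()
    return read_set_trx_t.isdisjoint(written)

# READ_SET(TRX_t) = {a, b}; TRX_1 has written {b}, TRX_2 has written {c}
print(validate({"a", "b"}, [{"b"}, {"c"}]))   # False: TRX_t must be rolled back and restarted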
4.4.6 Recovery
Various errors can occur during database operation and will normally be mitigated or
corrected by the database system itself. Some error cases, such as integrity violations
or deadlocks, have already been mentioned in the sections on concurrency control.
Other issues may be caused by operating systems or hardware, for instance, when
data remains unreadable after a save error on an external medium.
The restoration of a correct database state after an error is called recovery. It is
essential for recovery to know where an error occurred: in an application, in the
database software, or in the hardware. In case of integrity violations or after an
application program “crashes,” it is sufficient to roll back and then repeat one or
several transactions. With severe errors, it may be necessary to retrieve earlier saves
from backup archives and restore the database state by partial transaction re-runs.
In order to roll back transactions, the database system requires certain informa-
tion. Usually, a copy of an object (called before image) is written to a log file2 before
the object is modified. In addition to the object’s old values, the file also contains
markers signaling the beginning and end of the transaction. In order for the log file to
be used efficiently in case of errors, checkpoints are set either based on commands in
the application program or for certain system events. A system-wide checkpoint
contains a list of the transactions active up until that time. If a restart is needed, the
database system merely has to find the latest checkpoint and reset the unfinished
transaction.
This procedure is illustrated in Fig. 4.9: After system failure, the log file must be
read backward until the last checkpoint. Of special interest are those transactions that
had not been able to indicate their correct conclusion with an EOT (end of transac-
tion) marker, such as the transactions TRX_2 and TRX_5 in our example. For them,
the previous database state has to be restored with the help of the log file (undo). For
TRX_5, the file has to be read back until the BOT (beginning of transaction) marker
in order to retrieve the transaction’s before image. Regardless of the type of
checkpoint, the newest state (after image) must be restored for at least TRX_4
(redo).
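The restart logic can be outlined in Python as follows (a heavily simplified sketch with an invented log format; among other things, it ignores writes that unfinished transactions performed before the checkpoint):

def restart(log, active_at_checkpoint, database):
    # log: records after the last checkpoint, either ("BOT", trx), ("EOT", trx)
    # or ("WRITE", trx, obj, before_image, after_image)
    finished = {rec[1] for rec in log if rec[0] == "EOT"}
    started = {rec[1] for rec in log if rec[0] == "BOT"} | active_at_checkpoint
    unfinished = started - finished
    # undo: read the log backward and restore the before images of unfinished transactions
    for rec in reversed(log):
        if rec[0] == "WRITE" and rec[1] in unfinished:
            database[rec[2]] = rec[3]
    # redo: read the log forward and reapply the after images of finished transactions
    for rec in log:
        if rec[0] == "WRITE" and rec[1] in finished:
            database[rec[2]] = rec[4]
    return database

In the scenario of Fig. 4.9, TRX_2 and TRX_5 would end up in the set of unfinished transactions (undo), while the changes of TRX_4 would be reapplied (redo).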
The recovery of a database after a defect in an external storage medium requires a
backup of the database and an inventory of all updates since the creation of the
backup copy. Backups are usually made before and after the end-of-day processing,
since they are quite time-consuming. During the day, changes are recorded in the log
file, with the most up-to-date state for each object being listed.
Securing databases requires a clear-cut disaster prevention procedure on the part
of the responsible database specialists. Backup copies are usually stored in
generations, physically separate, and sometimes redundant. The creation of backup
files and the removal of old versions have to be fully documented. In case of errors or
for disaster drills, the task is to restore current data from backup files and logged
changes within a reasonable timeframe.
2 This log file is not to be confused with the log from Sect. 4.2.2.
Fig. 4.9 Transactions relative to the latest checkpoint and the system crash: TRX_2 and TRX_5 have started (BOT) but show no EOT marker when the crash occurs

4.5 Soft Consistency in Massive Distributed Data
It has become clear in practice that for large and distributed data storage systems,
consistency cannot always be the primary goal; sometimes, availability and partition
tolerance take priority.
In relational database systems, transactions at the highest isolation level are
always atomic, consistent, isolated, and durable (see ACID, Sect. 4.4.2).
Web-based applications, on the other hand, are geared toward high availability and
the ability to continue working if a computer node or a network connection fails.
Such partition-tolerant systems use replicated computer nodes and a softer consis-
tency requirement called BASE (Basically Available, Soft state, Eventually consis-
tent): This allows replicated computer nodes to temporarily hold diverging data
versions and only be updated with a delay.
During a symposium in 2000, Eric Brewer of the University of California,
Berkeley, presented the hypothesis that the three properties of consistency, avail-
ability, and partition tolerance cannot exist simultaneously in a massive distributed
computer system.
Fig. 4.10 The three possible combinations under the CAP theorem
This hypothesis was later proven by researchers at MIT in Boston and established
as the CAP theorem.
CAP Theorem
The CAP theorem states that in any massive distributed data management system,
only two of the three properties consistency, availability, and partition tolerance can
be ensured. In short, massive distributed systems can have a combination of either
consistency and availability (CA), consistency and partition tolerance (CP), or
availability and partition tolerance (AP); but it is impossible to have all three at
once (see Fig. 4.10). Use cases of the CAP theorem may include:
• Stock exchange systems requiring consistency and availability (CA), which are
achieved by using relational database systems following the ACID principle.
• Country-wide networks of ATMs, which still require consistency, but also parti-
tion tolerance, while somewhat long response times are acceptable (CP);
distributed and replicated relational or NoSQL systems supporting CP are best
suited for this scenario.
• The Internet service Domain Name System (DNS) is used to resolve URLs into
numerical IP addresses in TCP/IP (Transmission Control Protocol/Internet Proto-
col) communication and must therefore be always available and partition tolerant
(AP), a task that requires NoSQL data management systems, since a relational
database system cannot provide global availability and partition tolerance.
order to determine which is the newest. This is done with the help of vector clocks
(see Sect. 4.5.3).
The fourth and final case is “consistency by quorum” with the formula W+R>N
(Fig. 4.11, bottom right). In our example, both parameters W and R are set to two,
i.e., W=2 and R=2. This requires two nodes to be written and two nodes to be read
successfully. The read operation once again definitely returns both versions A and B
so that the chronological order has to be determined using vector clocks.
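The quorum condition itself is easy to state in code (a trivial Python sketch; the parameter names are our own):

def quorum_consistent(n, w, r):
    # N replicated nodes; a write must be acknowledged by W nodes, a read by R nodes.
    # If W + R > N, every read quorum overlaps every write quorum, so at least one
    # node in each read returns the most recently written version.
    return w + r > n

print(quorum_consistent(3, 2, 2))   # True:  consistency by quorum, as in the case W=2, R=2
print(quorum_consistent(3, 1, 1))   # False: reads may return stale versions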
In distributed systems, various events may occur at different times due to concurrent
processes. Vector clocks can be used to bring some order into these events. They are
not time-keeping tools, but counting algorithms that allow a partial chronological
ordering of events in concurrent processes.
Below, we will look at concurrent processes in a distributed system. A vector
clock is a vector with k components or counters Ci with i=1,...,k, where k equals the
number of processes. Each process Pi therefore has a vector clock Vi=[C1,...,Ck]
with k counters.
Fig. 4.12 Vector clocks for three concurrent processes: P1 with the events A [1,0,0] and G [3,2,3], P2 with the events C [1,1,0] and F [1,2,3], and P3 with the events B [0,0,1], D [0,0,2], and E [0,0,3]; the messages M1 (with [1,0,0]), M2 (with [0,0,3]), and M3 (with [1,2,3]) carry the senders' vector clocks
• Initially, all vector clocks are set to zero, i.e., Vi=[0,0,...,0], for all processes Pi
and all counters.
• Each time an event occurs in a process Pi, the process increments its own counter
Ci in its vector clock by one, i.e., Ci=Ci+1.
• In each interprocess message, the sender includes its own vector clock for the
recipient.
• When a process receives a message, it increments its own counter Ci in its vector
by one, i.e., Ci=Ci+1. It also merges its updated vector Vi with the received
vector W component by component, keeping the higher of the two corresponding
counter values, i.e., Vi[j]=max(Vi[j],W[j]) for all j=1,...,k.
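These rules translate directly into a small Python class (an illustrative sketch; the class, method, and variable names are our own). The comparison function at the end corresponds to the causality test discussed below, and the example reproduces the vector clocks of the events A, C, and F:

class VectorClock:
    def __init__(self, k, i):
        self.v = [0] * k                 # one counter per process
        self.i = i                       # index of the owning process

    def local_event(self):
        self.v[self.i] += 1

    def send(self):
        # the sender includes a copy of its current vector clock in the message
        return list(self.v)

    def receive(self, w):
        # increment the own counter, then merge component by component
        self.v[self.i] += 1
        self.v = [max(a, b) for a, b in zip(self.v, w)]

def happened_before(x, y):
    # the event with clock x causally precedes the event with clock y
    return all(a <= b for a, b in zip(x, y)) and x != y

p1, p2, p3 = VectorClock(3, 0), VectorClock(3, 1), VectorClock(3, 2)
p1.local_event()                                        # event A -> [1, 0, 0]
p2.receive(p1.send())                                   # event C -> [1, 1, 0]
p3.local_event(); p3.local_event(); p3.local_event()    # events B, D, E -> [0, 0, 3]
p2.receive(p3.send())                                   # event F -> [1, 2, 3]
print(p2.v, happened_before([1, 0, 0], p2.v))           # [1, 2, 3] True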
Figure 4.12 shows a possible scenario with three concurrent processes P1, P2, and
P3. Process P3 includes the three events B, D, and E in chronological order. It
increments its own counter C3 in its vector clock by one for each event, resulting in
the vector clocks [0,0,1] for event B, [0,0,2] for event D, and [0,0,3] for event E.
In process P1, event A occurs first, and the process’ counter C1 is raised by one in
its vector clock V1, which is then [1,0,0]. Next, P1 sends a message M1 to process P2,
including its current vector clock V1. Event C in process P2 first updates the process’
own vector clock V2 to [0,1,0] before merging it with the newly received vector
clock V1=[1,0,0] into [1,1,0].
Similar mergers are executed for the messages M2 and M3: First, the processes’
vector clocks V2/V1 are incremented by one in the process’ own counter, and then
the maximum of the two vector clocks to be merged is determined and included. This
results in the vector clocks V2=[1,2,3] (since [1,2,3]=max([1,2,0],[0,0,3])) for event
F and V1=[3,2,3] for event G.
Causality can be established between two events in a distributed system: Event X
happened before event Y if the vector clock V(X)=[X1,X2,...,Xk] of X is less than the
vector clock V(Y)=[Y1,Y2,...,Yk] of Y. In other words, Xi≤Yi must hold for all
i=1,...,k, and Xj<Yj for at least one j.
There are some major differences between the ACID (Atomicity, Consistency,
Isolation, Durability) and BASE (Basically Available, Soft state, Eventually consis-
tent) approaches, as summarized in Fig. 4.13.
Most SQL and NoSQL database systems are strictly based on ACID, meaning
that consistency is ensured at any time in both centralized and distributed systems.
Distributed database systems require a coordinating program that implements all
changes to table contents in full and creates a consistent database state. In case of
errors, the coordinating program makes sure that the distributed database is not
affected in any way and the transaction can be restarted.
Fig. 4.13 Comparison of ACID and BASE

Figure: Interleaved execution of two SQL transactions A and B on the Account table. Step 1: A inserts the accounts (1, 300) and (2, 200) with INSERT INTO Account (Id, Balance) VALUES (1,300),(2,200) and commits. Step 2: B starts a transaction and runs UPDATE Account SET Balance = Balance - 100 WHERE Id = 1. Step 3: A executes SET TRANSACTION ISOLATION LEVEL REPEATABLE READ and starts a new transaction; further steps up to step 9 follow.
transaction is complete, process A reads this balance again in step 9. Which balances
does process A read at times 4, 6, 8, and 9?
To protect data integrity, the Neo4j database system supports transactions that
basically comply with the ACID principle. All database operations that access
graphs, indexes, or the schema must be performed in a transaction. Deadlock
detection is also integrated into the central transaction management. However, data
retrieved by graph traversal is not protected from modification by other transactions.
Individual Cypher queries are executed within one transaction each. Changes
made by updating queries are held in memory by the transaction until committed, at
which time the changes are saved to disk and visible to other transactions. If an error
occurs, either during query evaluation (e.g., division by zero) or during commit, the
transaction is automatically rolled back, and no changes are saved. Each updating
query is always either completely successful or not successful at all. Thus, a query
that makes many updates consumes large amounts of memory because the transac-
tion holds changes in memory.
Isolation Levels
Transactions in the Neo4j database system use the READ COMMITTED isolation
level. Transactions see data changes once they have been committed, and they do not
see data changes that have not yet been committed. This type of isolation is weaker
than serializability but offers significant performance advantages. It is sufficient for
most cases. However, non-repeatable reads may occur because locks are only
maintained until the end of a transaction.
If this is not sufficient, the Neo4j Java API allows explicit locking of nodes and
relationships. One can manually set up write locks on nodes and relationships to
achieve the higher isolation level of serializability by explicitly requesting and
releasing locks. For example, if a write lock is placed on a shared node or relationship,
all transactions that need to access it are serialized for as long as the lock is held.
Transactions in Cypher
Cypher does not support explicit language elements for transaction management. By
default, each individual Cypher statement runs as a separate transaction. This means
that, for example, updating statements run atomically according to the
ACID principle, even if they modify many nodes and edges simultaneously. How-
ever, it is currently not directly possible with Cypher to run multiple separate
statements as a single transaction.
There are several other ways to run multiple statements as one transaction in the
Neo4j database management system. The easiest way to do this is usually via an
API, e.g., via HTTP or Java. Transaction control is also possible in the command-line
program Cypher Shell with the commands :begin, :commit, and :rollback;
however, these commands are not part of the Cypher language. The following example
will illustrate this. Suppose the following sequence of commands is executed as a
batch script via Cypher shell. What result will be returned by the last line?
:begin
MATCH (k:Account {Id:1}) SET k.balance = 5;
MATCH (k:Account {Id:2}) SET k.balance = 6/0;
:commit
:rollback
MATCH(k:Account) RETURN k;
The following is returned as the result: Account with Id 1 has balance 3, and
Account with Id 2 has balance 2. Why? The first two statements create two nodes of
type Account with Id 1 and 2 and set the balance to 1 and 2, respectively. The third
In the MongoDB database system, changes to a single document are always atomic.
Single document transactions are very efficient to process. Since all relevant entities
for a subject can be aggregated into a single document type (see Sect. 2.5), the need
for multiple document transactions is eliminated in many use cases. If atomic reads
and writes are needed across multiple documents, in different collections, or across
different machines, MongoDB supports distributed transactions. However, this is
associated with performance penalties.
Atomicity of Transactions
When a transaction is committed, all data changes made in the transaction are stored
and visible outside the transaction. As long as a transaction is not committed, the
data changes made in the transaction are not visible outside the transaction. When a
transaction is rolled back, all changes made in the transaction are discarded without ever
becoming visible. If a single action in the transaction fails, the entire transaction is
rolled back.
Let’s look at this with an example.3 If we assume that on the database named “db”
the collection named “ACCOUNT” is empty, what will be the return of the com-
mand on the last line, and why?
s = db.getMongo().startSession()
c = s.getDatabase('db').getCollection('ACCOUNT')
c.createIndex( { "Key": 1 }, { "unique": true } )
c.insertMany([{"Key":1, "Val":1},{" Key":1, "Val":2}])
s.startTransaction( )
c.insertMany([{"Key3, "Val":3},{" Key":3, "Val":4}])
s.commitTransaction( )
c.find({},{_id:0})
• The first line starts a new session. Transactions are bound to sessions in
MongoDB.
• The second line instantiates the collection ACCOUNT within the session so that
the following transactions are linked to it. Since the collection does not yet exist,
it is newly created.
• The third line sets a uniqueness constraint for the “Key” field to test transaction
behavior in terms of integrity.
• The fourth line tries to insert two documents with the same “Key” which is not
possible because of the uniqueness condition. Since this happens outside of a
distributed transaction, the statement over two documents is not executed atom-
ically. Therefore, the first document with "Key"=1 and "Val"=1 is successfully
committed. The second document generates an error due to the duplicate “Key”
and is discarded.
• The fifth line starts a transaction with s.startTransaction().
3 Transactions work in MongoDB only within replica sets. So the database server (mongod) must be started first with the corresponding option --replSet <name>. In addition, the command rs.initiate() must then be executed in the Mongo Shell (mongo).
• The sixth line again wants to insert two documents with the same “Key,” this time
within the transaction started above. The statement is now executed atomically
according to the all-or-nothing principle. An error occurs, because duplicates in
the “Key” field are not accepted due to the unique index.
• The seventh line terminates the transaction with s.commitTransaction().
• On the eighth line, we ask for all documents in the collection, excluding the object
id. We then see the following output:
rs0:PRIMARY> c.find({},{_id:0})
{ "Key" : 1, "Val" : 1 }
Within the distributed transaction, the insertion of both documents on the sixth
line has been rolled back, even though only the one document with “Key”=3 and
“Val”=4 generated an error due to the duplicate. The example shows how
atomicity is ensured within a transaction across multiple documents.
For distributed transactions, MongoDB additionally lets the client specify the desired consistency guarantees per transaction: in the following snippet, both the read concern and the write concern are set to "majority", so that reads and writes are acknowledged by a majority of the replica set nodes.
session = db.getMongo().startSession()
session.startTransaction({
"readConcern": { "level": "majority" },
"writeConcern": { "w": "majority" }
})
Bibliography
Basta, A., Zgola, M.: Database Security. Cengage Learning (2011)
Bowman, A.: Protecting Against Cypher injection. Neo4j Knowledge Base. https://neo4j.com/
developer/kb/protecting-against-cypher-injection/ (n.d.). Accessed 4 July 2022
Brewer, E.: Keynote – Towards Robust Distributed Systems. In: 19th ACM Symposium on
Principles of Distributed Computing, Portland, Oregon, 16–19 July 2000
Dindoliwala, V.J., Morena, R.D.: Comparative study of integrity constraints, storage and profile
management of relational and non-relational database using MongoDB and Oracle.
Int. J. Comp. Sci. Eng. 6(7), 831–837 (2018) https://www.ijcseonline.org/pdf_paper_view.
php?paper_id=2520&134-IJCSE-04376.pdf
Eswaran, K.P., Gray, J., Lorie, R.A., Traiger, I.L.: The notion of consistency and predicate locks in
a data base system. Commun. ACM. 19(11), 624–633 (1976)
Gilbert, S., Lynch, N.: Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-
Tolerant Web Services. Massachusetts Institute of Technology, Cambridge (2002)
Gray, J., Reuter, A.: Transaction Processing – Concepts and Techniques. Morgan Kaufmann (1992)
Härder, T., Reuter, A.: Principles of transaction-oriented database recovery. ACM Comput. Surv.
15(4), 287–317 (1983)
MongoDB, Inc.: MongoDB Documentation. https://www.mongodb.com/docs/ (2022)
Neo4J, Inc.: Neo4j Documentation. Neo4j Graph Data Platform. https://neo4j.com/docs/ (2022)
Onyancha, B.H.: Securing MongoDB from External Injection Attacks. Severalnines. https://web.
archive.org/web/20210618085021/https://severalnines.com/database-blog/securing-mongodb-
external-injection-attacks (2019)
Papiernik, M.: How To Use Transactions in MongoDB. DigitalOcean. https://www.digitalocean.
com/community/tutorials/how-to-use-transactions-in-mongodb (2019)
Riak: Open Source Distributed Database. http://basho.com/riak/
Redmond, E., Wilson, J.R.: Seven Databases in Seven Weeks – A Guide to Modern Databases and
the NoSQL Movement. The Pragmatic Bookshelf (2012)
Spiegel, P.: NoSQL Injection – Fun with Objects and Arrays. German OWASP-Day, Darmstadt.
https://owasp.org/www-pdf-archive/GOD16-NOSQL.pdf (2016)
Vogels, W.: Eventually consistent. Commun. ACM. 52(1), 40–44 (2009)
Weikum, G., Vossen, G.: Transactional Information Systems – Theory, Algorithms, and the
Practice of Concurrency Control and Recovery. Morgan Kaufmann (2002)
5 System Architecture
Throughout the 1950s and 1960s, file systems were kept on secondary storage media
(tape, drum memory, magnetic disk), before database systems became available on
the market in the 1970s. Those file systems allowed for random, or direct, access to
the external storage, i.e., specific records could be selected by using an address,
without the entirety of records needing to be checked first. The access address was
determined via an index or a hash function (see Sect. 5.2).
The mainframe computers running these file systems were largely used for
technical and scientific applications (computing numbers). With the emergence of
database systems, computers also took over in business contexts (computing num-
bers and text) and became the backbone of administrative and commercial
applications, since database systems support consistency in multi-user operation
(see ACID, Sect. 4.4.2). Today, many information systems are based on the rela-
tional database technology which replaced most of the previously used hierarchic or
network-like database systems. More and more NoSQL database systems such as
graph databases or document databases are being used for Big Data applications.
This applies not only to large data volumes but also to a large variety of different
structured and unstructured data (variety) and fast data streams (velocity).
Relational database systems use only tables to store and handle data. A table is a
set of records that can flexibly process structured data. Structured data strictly
adheres to a well-defined data structure with a focus on the following properties:
• Schema: The structure of the data must be communicated to the database system
by specifying a schema (see the SQL command CREATE TABLE in Chap. 3). In
addition to table formalization, integrity constraints are also stored in the schema
(cf., e.g., the definition of referential integrity and the establishment of appropri-
ate processing rules).
• Data types: The relational database schema guarantees that for each use of the
database, the data manifestations always have the set data types (e.g.,
CHARACTER, INTEGER, DATE, TIMESTAMP, etc.; see also the SQL tutorial
at www.sql-nosql.org). To do so, the database system consults the system tables
(schema information) at every SQL invocation. Special focus is on authorization
and data protection rules, which are checked via the system catalog (see VIEW
concept and privilege assignment via GRANT and REVOKE in Sect. 4.2.1).
Semi-structured data, in contrast, have the following properties:
• They consist of a set of data objects whose structure and content are subject to
continuous changes.
• Data objects are either atomic or composed of other data objects (complex
objects).
• Atomic data objects contain data values of a specified data type.
Data management systems for semi-structured data work without a fixed database
schema, since structure and content change constantly. A possible use case are
content management systems for websites which can flexibly store and process
Web pages and multimedia objects. Such systems require extended relational data-
base technology (see Chap. 6), XML databases, or NoSQL databases (see Chap. 7).
A data stream is a continuous flow of digital data with a variable data rate
(records per unit of time). Data within a data stream is sorted chronologically and
often given a timestamp. Besides audio and video data streams, this can also be a
series of measurements which are analyzed with the help of analysis languages or
specific algorithms (language analysis, text analysis, pattern recognition, etc.).
Unlike structured and semi-structured data, data streams can only be analyzed
sequentially.
Figure 5.1 shows a simple use case for data streams. The setting is a multi-item
auction via an electronic bidding platform. In this auction, bidding starts at a set
minimum. Participants can make multiple bids that have to be higher than the
previous highest bid. Since electronic auctions have no physical location, time and
duration of the auction are set in advance. The bidder who makes the highest bid
during the set time wins the auction.
Any AUCTION can be seen as a relationship set between the two entity sets
OBJECT and BIDDER. The corresponding foreign keys O# and B# are
complemented by a timestamp and the offered sum (e.g., in USD) per bid. The
data stream is used to show bidders the current standing bids during the auction.
After the auction is over, the highest bids are made public, and the winners of the
Fig. 5.1 Data stream for an electronic auction with the attributes Timestamp, O#, B#, and Bid
individual items are notified. The data stream can then be used for additional
purposes, for instance, bidding behavior analyses or disclosure in case of legal
contestation.
Unstructured data are digital data without any fixed structure. This includes
multimedia data such as continuous text, music files, satellite imagery, or audio/
video recordings. Unstructured data is often transmitted to a computer via digital
sensors, for example, in the data streams explained above, which can sequentially
transport structured and/or unstructured data.
The processing of unstructured data or data streams calls for special adapted
software packages. NoSQL databases or specific data stream management systems
are used to fulfill the requirements of Big Data processing.
The next sections discuss several architectural aspects of SQL and NoSQL
databases.
5.2 Storage and Access Structures
Storage and access structures for relational and non-relational database systems
should be designed to manage data in secondary storage as efficiently as possible.
For large amounts of data, the structures used in the main storage cannot simply be
reproduced on the background memory. It is necessary to instead optimize the
storage and access structures in order to enable reading and writing contents on
external storage media with as few accesses as possible.
5.2.1 Indexes
For each name in the EMPLOYEE table, sorted alphabetically, either the identi-
fication key E# or the internal address of the employee tuple is recorded. The
database system uses this index of employee names for increasing access speed of
corresponding queries or when executing a join. In this case, the Name attribute is
the access key.
A hash index (see Sect. 5.2.3), created for example with the clause USING HASH in the
CREATE INDEX statement, is optimized for equality queries.
Another possibility is to use balanced trees (B-trees; see Sect. 5.2.2) for indexes.
These are suitable for range queries such as “greater than” or “less than.”
Tree structures can be used to store records or access keys and to index attributes in
order to increase efficiency. For large amounts of data, the root, internal, and leaf
nodes of the tree are not assigned individual keys and records, but rather entire data
pages. In order to find a specific record, the tree then has to be searched.
For managing data in main memory, database systems usually use binary trees,
in which the root node and each internal node have two subtrees.
Such trees cannot be used without limitation for storing access keys or records in
extensive databases: as the amount of data grows, the trees become too tall, and tall
trees are impractical for searching and reading data content on
external storage media, since they require too many page accesses.
The height of a tree, i.e., the distance between the root node and the leaves, is an
indicator for the number of accesses required on external storage media. To keep the
number of external accesses as low as possible, it is common to make the storage tree
structures for database systems grow in width instead of height. One of the most
important of those tree structures is the B-tree (see Fig. 5.2).
A B-tree is a tree whose root node and internal nodes generally have more than
two subtrees. The data pages represented by the individual internal and leaf nodes
should not be empty, but ideally filled with key values or entire records. They are
therefore usually required to be filled at least halfway with records or keys (except
for the page associated with the root node).
B-Tree
A tree is a B-tree of the nth order if:
• It is fully balanced (the paths from the root to each leaf have the same length)
• Each node (except for the root node) has at least n and at the most 2*n entries in
its data page
That second condition also means that, since every node except the root node has
at least n entries, each internal node has at least n+1 subtrees. On the other hand, each node has a
maximum of 2*n entries, i.e., no node of a B-tree can have more than 2*n+1 subtrees.
Assume, for instance, that the key E# from the EMPLOYEE table is to be stored
in a B-tree of the order n=2 as an access structure, which results in the tree shown in
Fig. 5.2.
Nodes and leaves of the tree cannot contain more than four entries due to the order
2. Apart from the keys, we will assume that the pages for the nodes and leaves hold
not only key values but also pointers to the data pages containing the actual records.
This means that the tree in Fig. 5.2 represents an access tree, not the data manage-
ment for the records in the EMPLOYEE table.
In our example, the root node of the B-tree contains the four keys E1, E4, E7, and
E19 in numerical order. When the new key E3 is added, the root node must be split
because it cannot hold any more entries. The split is done in a way that produces a
balanced tree. The key E4 is declared the new root node, since it is in between two
equal halves of the remaining key set. The left subtree is formed of key values that
meet the condition “E# lower than E4” (in this case E1 and E3); the right subtree
consists of key values where “E# higher than E4” (i.e., E7 and E19). Additional keys
can be inserted in the same way, while the tree retains a fixed height.
The database system searches for individual keys top-down, e.g., if the candidate
key E15 is requested from the B-tree B4 in Fig. 5.2, it checks against the entries in
the root node. Since E15 lies between the keys E4 and E18, it selects the
corresponding subtree (in this case, only one leaf node) and continues the search
Fig. 5.2 B-tree of order n=2 for the key E#: trees B1 to B4 after successively inserting E3; E9, E18, E2, and E24 (with a split into subtrees); and E26 and E15. The root of B4 contains E4 and E18 and partitions the keys into the ranges E# < E4, E4 < E# < E18, and E# > E18
until it finds the entry in the leaf node. In this simple example, the search for E15
requires only two page accesses, one for the root node and one for the leaf node.
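The top-down search can be sketched in Python as follows (a simplified illustration with our own node layout; the keys correspond to tree B4 in Fig. 5.2, with E# reduced to its numeric part):

class Page:
    def __init__(self, keys, children=None):
        self.keys = keys                   # sorted key values of the data page
        self.children = children or []     # empty for leaf nodes

def search(page, key, accesses=1):
    # every visited page corresponds to one access to external storage
    if key in page.keys:
        return True, accesses
    if not page.children:
        return False, accesses
    subtree = sum(1 for k in page.keys if key > k)   # choose the matching key range
    return search(page.children[subtree], key, accesses + 1)

b4 = Page([4, 18], [Page([1, 2, 3]), Page([7, 9, 15]), Page([19, 24, 26])])
print(search(b4, 15))   # (True, 2): key E15 is found with two page accesses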
The height of a B-tree determines the access times for keys as well as the data
associated with a (search) key. The access times can be reduced by increasing the
branching factor of the B-tree.
Another option is a leaf-oriented B-tree (commonly called a B*-tree), where the
actual records are stored only in the leaf nodes and never in internal nodes. The internal
nodes contain only key entries in order to keep the height of the tree as low as possible.
• It must be possible to follow the transformation rule with simple calculations and
little resources.
• The assigned addresses must be distributed evenly across the address space.
• The probability of assignment collisions, i.e., the use of identical addresses for
multiple keys, must be the same for all key values.
There is a wide variety of hash functions, each of which has its pros and cons. One
of the simplest and best-known algorithms is the division method.
The integer “k mod p”—the remainder from the division of the key value k by the
prime number p—is used as a relative address or page number. In the division
method, the choice of the prime number p determines the memory use and the
uniformity of distribution.
Figure 5.3 shows the EMPLOYEE table and how it can be mapped to different
pages with the division method of hashing.
In this example, each page can hold four key values. The prime number chosen
for p is 5. Each key value is now divided by 5, with the remaining integer
determining the page number.
Inserting the key E14 causes problems, since the corresponding page is already
full. The key E14 is placed in an overflow area. A link from page 4 to the overflow
area maintains the affiliation of the key with the co-set on page 4.
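The division method of Fig. 5.3 can be imitated with a few lines of Python (a sketch; the helper names are our own, and employee numbers are reduced to their numeric part):

def insert_keys(keys, p=5, page_capacity=4):
    # assign each key to page (k mod p); full pages spill into a shared overflow area
    pages = {page: [] for page in range(p)}
    overflow = []
    for k in keys:
        page = k % p
        if len(pages[page]) < page_capacity:
            pages[page].append(k)
        else:
            overflow.append(k)   # a real system links the overflow entry to page k mod p
    return pages, overflow

keys = [19, 1, 7, 4, 3, 2, 18, 9, 24, 14]   # E19, E1, E7, E4, E3, E2, E18, E9, E24, E14
pages, overflow = insert_keys(keys)
print(pages[4], overflow)                   # [19, 4, 9, 24] [14] -> E14 lands in the overflow area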
There are multiple methods for handling overflows. Instead of an overflow area,
additional hash functions can be applied to the extra keys. Quickly growing key
ranges or complex delete operations often cause difficulties in overflow handling. In
order to mitigate these issues, dynamic hashing methods have been developed.
Such dynamic hash functions are designed to keep memory use independent from
the growth of keys. Overflow areas or comprehensive redistribution of addresses is
mostly rendered unnecessary. The existing address space for a dynamic hash
function can be extended either by a specific choice of hashing algorithm or by the
Fig. 5.3 Mapping the EMPLOYEE keys to pages with the division method k mod 5: pages 0 to 4 hold at most four key values each; when page 4 (E19, E4, E9, E24) is full, the key E14 is placed in the overflow area linked to page 4
use of a page assignment table kept in the main memory, without the need to reload
all keys or records already stored.
In Big Data applications, the key-value pairs are assigned to different nodes in the
computer network. Based on the keys (e.g., term or day), their values (e.g.,
frequencies) are stored in the corresponding node. The important part is that with
consistent hashing, address calculation is used for both the node addresses and the
storage addresses of the objects (key-value).
Figure 5.4 provides a schematic representation of consistent hashing. The address
space of 0 to 2^x key values is arranged in a circle; a hash function is then used to map
both the network addresses of the nodes and the keys of the objects onto positions in this address space.
The key-value pairs are stored on their respective storage nodes according to a
simple assignment rule: The objects are assigned to the next node (clockwise) and
managed there.
Figure 5.4 shows an address space with three nodes and eight objects (key-value
pairs). The positioning of the nodes and objects results from the calculated
addresses. According to the assignment rule, objects O58, O1, and O7 are stored on
node K1; objects O15 and O18 on node K2; and the remaining three objects on node
K 3.
The strengths of consistent hashing best come out in flexible computer structures,
where nodes may be added or removed at any time. Such changes only affect objects
directly next to the respective nodes on the ring, making it unnecessary to recalculate
Fig. 5.5 Two changes in the computer network under consistent hashing: node K2 is removed and node K4 is added; the ring over the address space 0 ... 2^x holds the objects O1, O7, O15, O18, O37, O39, O45, and O58
and reassign the addresses for a large number of key-value pairs with each change in
the computer network.
Figure 5.5 illustrates two changes: Node K2 is removed, and a new node K4 is
added. After the local adjustments, object O18, which was originally stored in node
K2, is now stored in node K3. The remaining object O15 is transferred to the newly
added node K4 according to the assignment rule.
Consistent hashing can also be used for replicated computer networks. The
desired copies of the objects are simply given a version number and entered on the
ring. This increases partition tolerance and the availability of the overall system.
Another option is the introduction of virtual nodes in order to spread the objects
across nodes more evenly. In this method, the nodes’ network addresses are also
assigned version numbers in order to be represented on the ring.
Consistent hashing functions are used in many NoSQL systems, especially in
implementations of key-value store systems.
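A minimal Python sketch of such a ring (illustrative only: the class and method names are our own, MD5 is an arbitrary choice of hash function, and the resulting placement of the objects will not exactly reproduce Fig. 5.4):

import bisect, hashlib

class ConsistentHashRing:
    def __init__(self, bits=32):
        self.size = 2 ** bits                 # address space 0 ... 2^x arranged as a ring
        self.ring = []                        # sorted list of (position, node) pairs

    def _position(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16) % self.size

    def add_node(self, node):
        bisect.insort(self.ring, (self._position(node), node))

    def remove_node(self, node):
        self.ring = [(pos, n) for pos, n in self.ring if n != node]

    def node_for(self, obj_key):
        # objects are assigned to the next node in clockwise direction
        pos = self._position(obj_key)
        index = bisect.bisect_right(self.ring, (pos, chr(0x10FFFF)))
        return self.ring[index % len(self.ring)][1]

ring = ConsistentHashRing()
for node in ("K1", "K2", "K3"):
    ring.add_node(node)
objects = ("O1", "O7", "O15", "O18", "O37", "O39", "O45", "O58")
before = {obj: ring.node_for(obj) for obj in objects}
ring.remove_node("K2"); ring.add_node("K4")
after = {obj: ring.node_for(obj) for obj in objects}
print([obj for obj in objects if before[obj] != after[obj]])   # only objects near K2/K4 change nodes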
Multi-dimensional data structures support access to records with multiple access key
values. The combination of all those access keys is called multi-dimensional key. A
multi-dimensional key is always unique, but does not have to be minimal.
A data structure that supports such multi-dimensional keys is called a multi-
dimensional data structure. For instance, an EMPLOYEE table with the two key
parts Employee Number and Year of Birth can be seen as a two-dimensional data
structure. The employee number forms one part of the two-dimensional key, but
remains unique in itself. The Year attribute is the second part and serves as an
additional access key, without having to be unique.
Unlike tree structures, multi-dimensional data structures are designed so that no
one key part controls the storage order of the physical records. A multi-dimensional
data structure is called symmetrical if it permits access with multiple access keys
without favoring a certain key or key combination. For the sample EMPLOYEE
table, both key parts, Employee Number and Year of Birth, should be equally
efficient in supporting access for a specific query.
One of the most important multi-dimensional data structures is the grid file or
bucket grid.
Grid File
A grid file is a multi-dimensional data structure with the following properties:
• It supports symmetrical access via multiple access keys, i.e., no key or key combination is favored.
• Any record can be found with at most two accesses to external storage (two-disk-access maximum).
A grid file consists of a grid index and a file containing the data pages. The grid
index is a multi-dimensional space with each dimension representing a part of the
multi-dimensional access key. When records are inserted, the index is partitioned
into cells, alternating between the dimensions. Accordingly, the example in Fig. 5.6
alternates between Employee Number and Year of Birth for the two-dimensional
access key. The resulting division limits are called the scales of the grid index.
One cell of the grid index corresponds to one data page and contains at least n and
at most 2*n entries, where n is the number of dimensions of the grid file.
Empty cells must be combined with other
cells so that the associated data pages can have the minimum number of entries. In
our example, data pages can hold no more than four entries (n=2).
Since the grid index is generally large, it has to be stored in secondary memory
along with the records. The set of scales, however, is small and can be held in the
main memory. The procedure for accessing a specific record is therefore as follows:
The system searches the scales with the k key values of the k-dimensional grid file
and determines the interval in which each individual part of the search key is located.
These intervals describe a cell of the grid index which can then be accessed directly.
Each index cell contains the number of the data page with the associated records, so
that one more access, to the data page of the previously identified cell, is sufficient to
find whether there is a record matching the search key or not.
The two-disk-access maximum, i.e., no more than two accesses to secondary
memory, is guaranteed for the search for any record. The first access is to the
appropriate cell of the grid index and the second to the associated data page. As an
example, the employee with number E18, born in 1969, is searched in the grid file
G4 from Fig. 5.6: The employee number E18 is located in the scale interval E15 to
E30, i.e., in the right half of the grid file. The year 1969 can be found between the
Fig. 5.6 Dynamic partitioning of the grid index: grid files G2 to G4 after successively inserting employee tuples (pairs of employee number and year of birth), with splits of the scales at the year 1960, at the employee number E15, and at the year 1955
scales 1960 and 1970 or in the top half. With those scales, the database system finds
the address of the data page in the grid index with its first access. The second access,
to the respective data page, leads to the requested records with the access keys (E18,
1969) and (E24, 1964).
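A point query against such a grid index can be sketched in Python as follows (a simplified illustration: the scales and the cell-to-page mapping are modeled as plain data structures, the names are our own, and employee numbers are reduced to their numeric part):

from bisect import bisect_right

scales = ([15], [1955, 1960])        # scales of G4: E# split at E15, year at 1955 and 1960
pages = {                            # each grid cell refers to one data page
    (1, 1): [(26, 1958)],
    (1, 2): [(18, 1969), (24, 1964)],
}                                    # remaining cells omitted

def cell_of(key):
    # determine the scale interval for every part of the multi-dimensional key
    return tuple(bisect_right(scale, part) for part, scale in zip(key, scales))

def point_query(key):
    cell = cell_of(key)              # the scales are small and held in main memory
    # first external access: the grid index cell; second access: the referenced data page
    page = pages.get(cell, [])
    return key in page

print(point_query((18, 1969)))       # True: found with at most two external accesses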
A k-dimensional grid file supports queries for individual records or record areas.
Point queries can be used to find a specific record with k access keys. It is also
possible to formulate partial queries specifying only a part of the key. With a range
query, on the other hand, users can examine a range for each of the k key parts. All
records whose key parts are in the defined range are returned. Again, it is possible to
only specify and analyze a range for part of the keys (partial range query).
The search for the record (E18, 1969) described above is a typical example of a
point query. If only the employee’s year of birth is known, the key part 1969 is
specified for a partial point query. A search for all employees born between 1960 and
1969, for instance, would be a (partial) range query. In the example from Fig. 5.6,
this query targets the upper half of grid index G4, so only those two data pages have
to be searched. This indexing method allows for the results of range and partial range
queries in grid files to be found without the need to sift through the entire file.
In recent years, various multi-dimensional data structures efficiently supporting
multiple access keys symmetrically have been researched and described. The market
range of multi-dimensional data structures for SQL and NoSQL databases is still
very limited, but Web-based searches are increasing the demand for such storage
structures. Especially geographic information systems must be able to handle both
topological and geometrical queries (also called location-based queries) efficiently.
JSON documents are text files (cf. Sect. 2.5.1). They contain spaces and line breaks
and are not compact enough for database storage on disk. BSON, or Binary JSON, is
a more storage-efficient solution for storing structured documents in database
systems. BSON is a binary serialization of JSON-structured documents. Like
JSON, BSON supports the mapping of complex objects (cf. Sect. 2.5.1). However,
BSON is stored in bytecode. Additionally, BSON provides data types that are not
part of the JSON specification, such as date and time values. BSON was first used in
2009 in the MongoDB document database system for physical storage of documents.
Today, there are over 50 implementations in 30 programming languages.
To illustrate the BSON format, let’s start with a comparison. In Fig. 5.7 above, we
see a JSON document of an employee with name Murphy and city Kent. In Fig. 5.7
below, we see the same structure in BSON format. The readable strings (UTF-8)
are shown in bold. Two-digit hexadecimal values for encoding bytes start with the
letter x.
BSON is a binary format in which data can be stored in units named documents.
These documents can be nested recursively. A document consists of an element list,
which is embedded between a length specification and a so-called null byte. There-
fore, on the first line in the BSON document in Fig. 5.7, we see the length of the
document, and on the last line, the document is terminated with a null byte. In
between, there is an element list.
According to line 1 in the BSON document in Fig. 5.7, the length of the document
is 58. The length is an integer, which is stored in BSON in a total of four bytes. For
readability, all integer values in this example are represented as decimal numbers.
JSON-Document
{
"EMPLOYEE": {
"Name": "Murphy",
"City": "Kent"
}}
BSON-Document
 1  58         Length of document
 2  x03        Type code: embedded document
 3  EMPLOYEE   Key string (followed by a null byte)
 4  38         Length of embedded document
 5  x02        Type code: string
 6  Name       Key string (followed by a null byte)
 7  6          Length of string
 8  Murphy     String value (followed by a null byte)
 9  x02        Type code: string
10  City       Key string (followed by a null byte)
11  4          Length of string
12  Kent       String value (followed by a null byte)
13  x00        Null byte: end of embedded document
14  x00        Null byte: end of document
On line 14, the null byte x00 indicates the end of the entire document. A null byte
is a sequence of eight bits, each of which stores the value 0. In hexadecimal notation,
this is represented by x00.
In Fig. 5.7, the element list of the overall BSON document consists of lines
2 through 13. An element list consists of one element, optionally followed by
another element list.
An element starts with a type code in the first byte. For example, on line 2 in the
BSON document in Fig. 5.7, we see the type code x03, which announces an
embedded document as an element. This is followed by a key string. A key string
is a sequence of non-empty bytes followed by a null byte. For example, the key
EMPLOYEE is given on line 3. This is followed by the value of the element
corresponding to the type code.
In BSON, there are different element types, e.g., embedded document (type code
x03), array (type code x04), or string (type code x02). In the BSON example in
Fig. 5.7, the first element value stored is an embedded document which, like the
parent document, again starts with the length specification (line 4), has an element
list, and ends with the null byte (line 13).
A string element starts with the type code x02. In the example in Fig. 5.7, we see
on line 5 the beginning of the string element for the property “Name,” whose key
name is specified on line 6. The value of this element starts on line 7 with length (6),
followed by the actual string on line 8 (Murphy). A string in BSON is a sequence of
UTF-8 characters followed by a null byte. Then follows analogously another
element of type String with key City and value Kent.
This is the way BSON allows to store JSON data in binary form on disk in a
space-saving and efficient way. BSON is used by document databases to write
documents to disk.
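For illustration, a minimal encoder for documents that contain only strings and embedded documents can be written in a few lines of Python (a sketch following the public BSON specification; note that the specification counts the terminating null byte in the int32 length of a string):

import struct

def encode_bson(doc):
    # serialize a dict of strings and nested dicts into BSON bytes
    elements = b""
    for key, value in doc.items():
        e_name = key.encode("utf-8") + b"\x00"                  # key string with null byte
        if isinstance(value, dict):
            elements += b"\x03" + e_name + encode_bson(value)   # type code x03: embedded document
        else:
            data = value.encode("utf-8") + b"\x00"              # UTF-8 characters plus null byte
            elements += b"\x02" + e_name + struct.pack("<i", len(data)) + data   # type code x02: string
    body = elements + b"\x00"                                   # null byte terminates the document
    return struct.pack("<i", len(body) + 4) + body              # int32 length includes itself

print(encode_bson({"EMPLOYEE": {"Name": "Murphy", "City": "Kent"}}).hex())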
As we saw in Sects. 5.2.1 through 5.2.3, relational databases do not explicitly store
links between records. A key value is resolved with time-consuming searching in the
referenced table. An index allows this search process to be sped up; however, even
indexed queries take longer as more data needs to be searched.
To solve this problem, graph database systems provide resolution of a record
reference in constant time. This is of great importance for traversing networks. To do
this, they make use of the principle of pointers and addresses on the level of binary
files stored on the disk and manipulated in memory by the operating system. Pointers
are used to build doubly linked lists that enable the traversal of the graph. Central to
this is the fact that the edges of the network are stored as separate records (cf. theory
of multigraphs in Sect. 2.4.1). In the following, we will look at a concrete example.
In Fig. 5.8, a simple property graph is shown above. An employee named Murphy
is part of the team of the project with title ITsec with a workload of 50%. In addition,
another previously unnamed node is linked for illustrative purposes. The elements of
the graph are numbered and labeled. The example contains three nodes N1 to N3,
two arrows A1 to A2, and seven properties P1 to P7. For simplicity, node and edge
types are represented here as properties.
Property graph of Fig. 5.8 (top): node N1 (Type: PROJECT, Title: ITsec), node N2 (Type: EMPLOYEE, Name: Murphy), and node N3, connected by arrow A1 (Type: TEAM, Workload: 0.5) from N1 to N2 and arrow A2 (Type: TEAM) from N1 to N3.

Nodes
@    FirstArrow   FirstProperty
N1   *A1          *P1
N2   *A1          *P5
N3   *A2          -

Arrows
@    Node1   NextArrow1   Node2   NextArrow2   FirstProperty
A1   *N1     *A2          *N2     -            *P3
A2   *N1     *A1          *N3     -            *P7

Properties
@    Key        Value      NextProperty
P1   Type       PROJECT    *P2
P2   Title      ITsec      -
P3   Type       TEAM       *P4
P4   Workload   0.5        -
P5   Type       EMPLOYEE   *P6
P6   Name       Murphy     -
P7   Type       TEAM       -
Fig. 5.8 Index-free neighborhood using doubly linked lists with pointers
In Fig. 5.8 below, we see illustrations of three store files, one for nodes, one for
arrows, and one for properties. The first field (@) shows the respective file location
address. In the other fields, the effective data is shown.
In the node store file, the second field (FirstArrow) contains pointers to the
first arrow of the node. The third field (FirstProperty) contains pointers to the first
property of the node. For example, the first arrow of node N1 is A1, and the first
property is P1.
If we follow the pointer *P1, we find in the property store file, shown in Fig. 5.8
below, the property with address P1: the node has the type PROJECT. The store file
shows in the second and third field the key (Key) and the value (Value) of each
property. In the fourth field, we find a pointer (NextProperty) to possible further
properties. In the example of node N1, we find the pointer to property P2 one line
further down: the project has the title ITsec. In addition, there are no further
properties for this node, so that this storage field remains empty.
In the node store file, the entry for node N1 also points to the first edge that
connects it to the network, A1. The edge store file shown on the right illustrates that
two nodes are connected by edge A1. The first node (Node1) is N1, and the second
node (Node2) is N2. Moreover, we find a pointer to another arrow: the next arrow
from the perspective of the first node (NextArrow1) is arrow A2 in this case. The
next edge from the perspective of the second node (NextArrow2) is empty in this
example, because node N2 has no further connections. Furthermore, the edge store file
contains a pointer to the edge's properties in the FirstProperty field, exactly like the
node store file (see above).
This pointer structure results in a doubly linked list of nodes and arrows. This
makes traversal in the graph efficient and linearly scalable in the number of arrows.
In fact, there is no need to even store the actual address, since an integer number as
an offset can simply be multiplied by the size of the memory file entries to get to the
correct location in the binary store file. For the operating system, the cost of calling
to a file address based on a pointer is always constant, or O(1),1 no matter how much
data the database contains. Using this mechanism, native graph database systems
provide index-free adjacency: graph edges can be traversed efficiently without the
need for building explicit index structures.
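The effect of this pointer structure can be imitated in Python with record arrays and offsets in place of disk addresses (a conceptual sketch; the record layout and the helper names are our own, and the data corresponds to Fig. 5.8):

# store files as arrays; a "pointer" such as *A1 is simply the offset of a record
nodes  = [("A1", "P1"),                       # N1: FirstArrow, FirstProperty
          ("A1", "P5"),                       # N2
          ("A2", None)]                       # N3
arrows = [("N1", "A2", "N2", None, "P3"),     # A1: Node1, NextArrow1, Node2, NextArrow2, FirstProperty
          ("N1", "A1", "N3", None, "P7")]     # A2

def offset(ref):                              # "N2" -> 1, "A1" -> 0, "P5" -> 4
    return int(ref[1:]) - 1

def neighbours(node_ref):
    # follow the linked arrow list of a node; each step is a constant-time array access
    arrow_ref, seen, result = nodes[offset(node_ref)][0], set(), []
    while arrow_ref is not None and arrow_ref not in seen:
        seen.add(arrow_ref)
        node1, next1, node2, next2, _ = arrows[offset(arrow_ref)]
        if node1 == node_ref:
            result.append(node2); arrow_ref = next1
        else:
            result.append(node1); arrow_ref = next2
    return result

print(neighbours("N1"))   # ['N2', 'N3']
print(neighbours("N2"))   # ['N1']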
5.3 Translation and Optimization of Relational Queries
The user interfaces of relational database systems are set-oriented, since entire tables
or views are provided for the users. When a relational query and data manipulation
language are used, the database system has to translate and optimize the respective
commands. It is vital that neither the calculation nor the optimization of the query
tree requires user actions.
1 The Landau symbol O(f(x)), also called Big O notation, is used in computer science when
analyzing the cost or complexity of algorithms. It gives a measure of the growth f(x) of the number
of computational steps or memory units as a function of the size x of a given problem. For example,
the runtime of an algorithm with computation complexity O(n2) grows quadratically as a function of
a parameter n, e.g., the number of data records.
Fig. 5.9 Query tree of a qualified query: the tables EMPLOYEE and DEPARTMENT form the leaf nodes, the join and the selection Department_Name = 'IT' form the internal nodes, and the projection on City forms the root
Query Tree
Query trees graphically visualize relational queries with the equivalent expressions
of relational algebra. The leaves of a query tree are the tables used in the query; root
and internal nodes contain the algebraic operators.
Figure 5.9 illustrates a query tree using SQL and the previously introduced
EMPLOYEE and DEPARTMENT tables. Those tables are queried for a list of the
cities where the IT department members live:
SELECT City
FROM EMPLOYEE, DEPARTMENT
WHERE Sub=D# AND Department_Name='IT'
This expression first calculates a join of the EMPLOYEE and the DEPART-
MENT tables via the shared department number. Next, those employees working in
the department with the name IT are selected for an intermediate result; and finally,
the requested cities are returned with the help of a projection. Figure 5.10 shows this
expression of algebraic operators represented in the corresponding query tree.
Fig. 5.10 Algebraically optimized query tree: the selection σ Department_Name=IT is applied directly to DEPARTMENT, the projections π Sub,City and π D# are applied before the join ⋈ Sub=D#, and the root is the projection π City
This query tree can be interpreted as follows: The leaf nodes are the two tables
EMPLOYEE and DEPARTMENT used in the query. They are first combined in one
internal node (join operator) and then reduced to those entries with the department
name IT in a second internal node (select operator). The root node represents the
projection generating the results table with the requested cities.
Root and internal nodes of query trees refer to either one or two subtrees. If the
operator forming a node works with one table, it is called a unary operator; if it
affects two tables, it is a binary operator. Unary operators, which can only manipu-
late one table, are the project and select operators.
Binary operators involving two tables as operands are the set union, set intersec-
tion, set difference, Cartesian product, join, and divide operators.
Creating a query tree is the first step in translating and executing a relational
database query. The tables and attributes specified by the user must be available in
the system tables before any further processing takes place. The query tree is
therefore used to check both the query syntax and the user’s access permissions.
Additional security measures, such as value-dependent data protection, can only be
assessed during the runtime.
The second step after this access and integrity control is the selection and
optimization of access paths; the actual code generation or interpretative execution
of the query is the third step. With code generation, an access module is stored in a
module library for later use; alternatively, an interpreter can take over dynamic
control to execute the command.
TABLE := π City ( π Sub,City (EMPLOYEE) ⋈ Sub=D# π D# ( σ Department_Name=IT (DEPARTMENT) ) )
• Multiple selections on one table can be merged into one so the selection predicate
only has to be validated once.
• Selections should be done as early as possible to keep intermediate results small.
To this end, the selection operators should be placed as close to the leaves (i.e.,
the source tables) as possible.
• Projections should also be run as early as possible, but never before selections.
Projection operations reduce the number of columns and often also the tuples.
• Join operators should be calculated near the root node of the query tree, since they
require a lot of computational expense.
The cost of executing a relational query depends on the access times for external storage media, caches, and main memories, as well as on the internal processing power.
A relational database system must provide various algorithms that can execute the
operations of relational algebra and relational calculus. The selection of tuples from
multiple tables is significantly more expensive than a selection from one table. The
following section will therefore discuss the different join strategies, even though
casual users will hardly be able to influence the calculation options.
Implementing a join operation on two tables aims to compare each tuple of one
table with all tuples of the other table concerning the join predicate and, when there
is a match, insert the two tuples into the results table as a combined tuple. Regarding
the calculation of equi-joins, there are two basic join strategies: nested join and sort-
merge join.
Nested Join
For a nested join between a table R with an attribute A and a table S with an
attribute B, each tuple in R is compared to each tuple in S to check whether the join
predicate R.A=S.B is fulfilled. If R has n tuples and S has m tuples, this requires n
times m comparisons.
The algorithm for a nested join calculates the Cartesian product and simulta-
neously checks whether the join predicate is met. Since we compare all tuples of R in
an outer loop with all tuples of S from an inner loop, the expense is quadratic. It can
be reduced if an index (see Sect. 5.2.1) exists for attribute A and/or attribute B.
Figure 5.11 illustrates a heavily simplified algorithm for a nested join of employee
and department information from the established example tables. OUTER_LOOP
and INNER_LOOP are clearly visible and show how the algorithm compares all
tuples of the EMPLOYEE table to all tuples of the DEPARTMENT table.
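In the pseudocode style of Fig. 5.12 (shown further below), the nested join can be sketched roughly as follows; the operator names are illustrative:

NESTED_JOIN (Sub, D#):
OUTER_LOOP (EMPLOYEE)
  INNER_LOOP (DEPARTMENT)
    IF (Sub = D#) THEN
      OUTPUT (combined EMPLOYEE and DEPARTMENT tuple)
    END IF
  END INNER_LOOP
END OUTER_LOOP
END NESTED_JOIN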
For the join operation in Fig. 5.11, there is an index for the D# attribute, since it is
the primary key² of the DEPARTMENT table. The database system uses the index
structure for the department number by not going through the entire DEPART-
MENT table tuple by tuple for each iteration of the inner loop, but rather accessing
tuples directly via the index. Ideally, there is also an index for the Sub (subordinate)
attribute of the EMPLOYEE table for the database system to use for optimization.
This example illustrates the importance of the selection of suitable index structures
by database administrators.
A more efficient algorithm than a nested join is available if the tuples of tables R
and S are already sorted physically in ascending or descending order by the attributes
A and B of the join predicate, respectively. This may require an internal sort before
² The database system automatically generates index structures for each primary key; advanced index structures are used for concatenated keys.
the actual join operation in order to bring both of the tables into matching order. The
computation of the join then merely requires going through the tables for ascending
or descending attribute values of the join predicate and simultaneously compares the
values of A and B. This strategy is characterized as follows:
Sort-Merge Join
A sort-merge join requires the tables R and S with the join predicate R.A=S.B to be
sorted by the attribute values for A of R and B of S, respectively. The algorithm
computes the join by making comparisons in the sorting order. If the attributes A
and B are uniquely defined (e.g., as primary and foreign key), the computational
expense is linear.
Figure 5.12 shows a basic algorithm for a sort-merge join. First, both tables are
sorted by the attributes used in the join predicate and made available as cursors i and
j. Then the cursor i is passed in the sort order, and the comparisons are executed.
If the compound predicate i==j is true, that is, if the values at both cursors’
current positions are equal, both data sets are merged at this point and output. To do
this, a Cartesian product of the two subsets of records with the same key, i and j, is
output. The function GET_SUBSET(x) fetches all records in the cursor x where the
key x is equal and sets the pointer of the cursor to the immediately following record
with the next larger key value.
If either key is less than the other, the GET_SUBSET function is also run for the
cursor with smaller value, not for output, but to set the cursor’s pointer to the next
larger key value. This is looped until there are no more records for the first cursor.
With this algorithm, both tables only need to be traversed once. The cross-product is only executed locally for small subsets of records, which increases the execution speed significantly. The algorithm of Fig. 5.12, slightly simplified:

SORT_MERGE_JOIN (Sub, D#):
SORT (EMPLOYEE) ON (Sub) AS i
SORT (DEPARTMENT) ON (D#) AS j
WHILE (HAS_ROWS(i)) DO
  IF (i == j) THEN
    OUTPUT CARD_PROD (
      GET_SUBSET(i),
      GET_SUBSET(j) )
  END IF
  IF (i < j) THEN GET_SUBSET(i) END IF
  IF (i > j) THEN GET_SUBSET(j) END IF
END WHILE
END SORT_MERGE_JOIN

In the query of the EMPLOYEE and DEPARTMENT tables, the runtime of the sort-merge join grows only linearly with the number of tuples, since D# is a
key attribute. The algorithm only has to go through both tables once to compute
the join.
Database systems are generally unable to select a suitable join strategy—or any
other access strategy—a priori. Unlike algebraic optimization, this decision hinges
on the current content state of the database. It is therefore vital that the statistical
information contained in the system tables is regularly updated, either automatically
at set intervals or manually by database specialists. This enables cost-based
optimization.
Even with cost-based optimization, inefficient queries can still occur. The SQL keyword EXPLAIN is often used by database specialists to
examine queries for their performance and to improve them manually.
For example, there are situations where an index exists on a search column, but it
is not used by the optimizer. This is the case, among others, when functions are
applied to the search column. Let’s assume that there is an index IX1 on the column
Date of birth in the table Employees:
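-- a possible definition of this index (the exact statement is not shown in the text)
CREATE INDEX IX1 ON EMPLOYEES (date_of_birth);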
Now we want to output the list of employees born before 1960. We can try this
with the following SQL query:
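SELECT * FROM EMPLOYEES
WHERE YEAR(date_of_birth) < 1960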
When we do this, we will notice that even though the index exists, the query is
slow for large amounts of data.
With the EXPLAIN keyword, we can observe the optimizer’s execution plan,
which defines which queries are run in which order and which indexes are used to
access the data.
EXPLAIN
SELECT * FROM EMPLOYEES
WHERE YEAR(date_of_birth) < 1960
When we do this, we will notice from the database system’s response that the
optimizer did not recognize index IX1 as a possible access path (POSSIBLE_KEY)
and that the query is of type “ALL,” meaning that all records in the table must be
searched. This is called full table scan. This is because the optimizer cannot use the
index when functions are applied to the search columns. The solution to this problem
is to remove this function call:
EXPLAIN
SELECT * FROM EMPLOYEES
WHERE date_of_birth < '1960-01-01'
A renewed call of the execution plan with EXPLAIN now shows that due to this
change, the query is now of type “RANGE,” i.e., a range query, and that for this, the
index IX1 can be used as an efficient access path. Thus, the SQL query will now run
much more efficiently. This is an example of how, in principle, analyzing the
optimizer’s execution plan works.
For analyzing large, distributed amounts of data, the MapReduce procedure splits the work into two phases:

• Map phase: Subtasks are distributed between various nodes of the computer
network to use parallelism. On the individual nodes, simple key-value pairs are
extracted based on a query and then sorted (e.g., via hashing) and output as
intermediate results.
• Reduce phase: In this phase, the abovementioned intermediate results are
consolidated for each key or key range and output as the final result, which
consists of a list of keys with the associated aggregated value instances.
(Fig. 5.13: MapReduce example for counting terms such as SQL, NoSQL, Database, Sharding, and Algorithm. Two map nodes M1 and M2 emit key-value pairs per term; the pairs are distributed to the reduce nodes R1 and R2 by hashing the keys (terms starting with A to N and O to Z, respectively); the reduce phase outputs the aggregated counts, e.g., NoSQL 4, Database 3, Sharding 2, SQL 3.)
The map() function computes a key-value pair for all elements of an original list as an intermediate result. The reduce() function aggregates the individual results and reduces them to one output value.
MapReduce has been improved and patented by Google developers for huge
amounts of semi- and unstructured data. However, the function is also available in
many open-source tools. The procedure plays an important role in NoSQL databases
(see Chap. 7), where various manufacturers use the approach for retrieving database
entries. Due to its use of parallelism, the MapReduce method is useful not only for
data analysis but also for load distribution, data transfer, distributed searches,
categorizations, and monitoring.
It is considered a vital rule for the system architecture of database systems that future
changes or expansions must be locally limitable. Similar to the implementation of
operating systems or other software components, fully independent system layers
that communicate via defined interfaces are introduced into relational and
non-relational database systems.
Figure 5.14 gives an overview of the five layers of system architecture based on relational database technology. The section below further shows how those layers correspond to the major features described in Chap. 4 and the previous sections of Chap. 5.

(Fig. 5.14, schematic: layer 1, query translation and access optimization, operating on tables and tuples with search and navigation; layer 2, transaction and cursor management, operating on internal records; record insertion and structure changes using B-trees, hashing, and grid files; layer 5, file management, issuing channel commands for the tracks and cylinders of the storage medium.)
Many Web-based applications use different data storage systems to fit their various
services. Using just one database technology, e.g., relational databases, is no longer
enough: The wide range of requirements regarding consistency, availability, and
partition tolerance demand a mix of storage systems, especially due to the CAP
theorem.
Figure 5.15 shows a schematic representation of an online store. In order to
guarantee high availability and partition tolerance, session management and shop-
ping carts utilize key-value stores (see Chap. 6). Orders are saved to a document
store (see Chap. 7), and customers and accounts are managed in a relational database
system.
(Fig. 5.15: Web store with polyglot persistence, using a key-value store for session management and shopping carts, a document store for orders, and a relational database for customers and accounts.)
Cloud Database
A cloud database is a database that is operated as a cloud service. Access to the
database is provided by cloud providers as an Internet application. The database
service is obtained over the Internet and does not require active installation or
maintenance by the user. A cloud database system is available immediately
after the online order is placed. Thus, the installation, operation, backup, security,
and availability of the database service are automated. This is also called database as
a service (DBaaS). This automation covers several levels:
• Computing: The hardware, i.e., the physical computers with processor and main
and fixed storage of cloud services, is built in highly automated data centers.
Robots are used for this purpose, which can install and replace individual parts.
• Network: The computers are integrated into a network with high-performance
data cables so that all components can communicate with each other and with the
outside world.
• Operating system: Virtual machines are operated on this basis, providing the
operating system on which the database system runs.
• Database software: The database system is automatically installed, configured,
operated, and maintained by appropriate software on the virtual machines.
• Security: The database system is configured for security. This includes securing
all layers, from hardware, including geographic redundancy, to securing the
network and firewall, to securing the operating system and database software.
• Big Data: One advantage of automation is that new resources such as memory
and processors are allocated autonomously by the cloud service at short notice
and at any time, providing scalability in Big Data applications.
These benefits of cloud database systems create a clearly noticeable added value,
which is also reflected in the price of the services. Whether this added value justifies the price must be considered and calculated for each use case.
6 Post-relational Databases
In this chapter and the next, we present a selection of problem cases and possible
solutions. Some demands not covered by classical relational databases can be met by
individual enhancements of relational database systems; others have to be
approached with fundamentally new concepts and methods. Both of these trends
are summarized under post-relational database systems. We also consider NoSQL databases post-relational, but we cover them separately in Chap. 7.
(Fig. 6.1: horizontal fragmentation of the EMPLOYEE (E#, Name, City, Sub) and DEPARTMENT (D#, DepartmentName) tables; the fragments F1 and F2 are stored in Cleveland, the fragments F3 and F4 in Cincinnati. Sample tuples include employee Murphy (E1, Kent, D3) and the departments IT (D3) and HR (D5).)
¹ Periodically extracted parts of tables (called snapshots) improve local autonomy.
(Fig. 6.2: calculation of the distributed query

SELECT Name, DepartmentName
FROM EMPLOYEE, DEPARTMENT
WHERE Sub = D#

as the set union CLEVELAND ∪ CINCINNATI, where each location computes π Name, DepartmentName over the join of its local EMPLOYEE and DEPARTMENT fragments (F1 and F2 in Cleveland, F3 and F4 in Cincinnati). The result lists, e.g., Stewart (Accounting), Murphy (IT), Howard (HR), and Bell (Accounting).)
A distributed database system must allow its users to transparently read and update tables. In order to achieve this, it has to provide a distributed
transaction and recovery concept. These concepts demand special protection
mechanisms for distributed databases.
The internal processing strategy for distributed database queries is vital here, as
the example of querying for employees and department names in Fig. 6.2 illustrates.
The query can be formulated in normal SQL without specifying the fragment. The
task of the database system is to determine the optimal calculation strategy for this
non-centralized query. Both the EMPLOYEE and the DEPARTMENT table are
fragmented between Cleveland and Cincinnati. Therefore, certain calculations are
executed locally and synchronously. Each node organizes the join between the
EMPLOYEE and DEPARTMENT fragments independently from the other. After
these partial calculations, the final result is formed by a set union of the partial
results.
For further optimization, the single nodes make projections on the requested
attributes Name and Department Name. Then, the join operations on the reduced
table fragments are calculated separately in Cleveland and Cincinnati. Finally, the
preliminary results are reduced once more by projecting them on the requested
names and department names before a set union is formed.
In calculating non-centralized queries, union and join operations are typically
evaluated late in the process. This supports high parallelism in processing and
improves performance on non-centralized queries. The maxim of optimization is
to put the join operations in the query tree close to the root node, while selections and
projections should be placed near the leaves of the query tree.
The first prototypes of distributed database systems were developed in the early
1980s. Today, relational databases are available that fulfill the aforementioned demands only partially. Moreover, the conflict between partition tolerance and
schema integration remains, so that many distributed databases, especially NoSQL
databases (cf. Chap. 7), either offer no schema federation, like key-value stores,
column family stores, or document stores, or do not support the fragmentation of
their data content, like graph databases.
² In distributed SQL expressions, the two-phase commit protocol guarantees consistency.
The four tuples can be interpreted as follows: Employee Murphy used to live in
Cleveland from July 1, 2014, to September 12, 2016, and then in Kent until March
31, 2019, and has lived in Cleveland again since April 1, 2019. From the day they
started working for the company until May 3, 2017, they worked as a programmer
and between May 4, 2017, and March 31, 2019, as a programmer analyst, and since
April 1, 2019, they have been working as an analyst. The table TEMP_EMPLOYEE
is indeed temporal, as it shows not only current states but also information about data
values related to the past. Specifically, it can answer queries that do not only concern
current instants or periods.
For instance, it is possible in Fig. 6.4 to determine the role employee Murphy had
on January 1, 2018. Instead of the original SQL expression of a nested query with the
ALL function (see Sect. 3.3), a language directly supporting temporal queries is
conceivable. The keyword VALID_AT determines the time for which all valid
entries are to be queried.
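Such a temporal query could, for example, look as follows (illustrative syntax; the attributes Name and Role of TEMP_EMPLOYEE are assumed):

SELECT Role
FROM TEMP_EMPLOYEE
VALID_AT DATE '2018-01-01'
WHERE Name = 'Murphy'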
A temporal database, in summary:

• Supports the time axis as valid time by ordering attribute values or tuples by time
• Contains temporal language elements for queries into future, present, and past
In the field of temporal databases, there are several language models facilitating
work with time-related information. Especially the operators of relational algebra
and relational calculus have to be expanded in order to enable a join of temporal
tables. The rules of referential integrity also need to be adapted and interpreted as
relating to time. Even though these kinds of methods and corresponding language
extensions have already proven themselves in research and development, very few
database systems today support temporal concepts. The SQL standard has also included support for temporal tables since SQL:2011.
(Fig. 6.5: data cube with the three analysis dimensions Product (Mouse, Keyboard, Screen, Hard drive), Region (West, East, North, South), and Time (Q1/2023 to Q4/2023); a sales indicator stores, e.g., 30 pieces for keyboard, East, Q1/2023.)
The database systems discussed so far were designed primarily for day-to-day business, not for analysis and evaluation.
Recent years have therefore seen the development of specialized databases and
applications for data analysis and decision support, in addition to transaction-
oriented databases. This process is termed online analytical processing or OLAP.
At the core of OLAP is a multi-dimensional database, where all decision-relevant
information can be stored according to various analysis dimensions (data cube).
Such databases can become rather large, as they contain decision-making factors
from multiple points in time. Sales figures, for instance, can be stored and analyzed
in a multi-dimensional database by quarter, region, or product.
This is demonstrated in Fig. 6.5, which also illustrates the concept of a multi-
dimensional database. It shows the three analysis dimensions product, region, and
time. The term dimension describes the axes of the data cube. The design of the
dimensions is important, since analyses are executed along these axes. The order of
the dimensions does not matter; every user can and should analyze the data from
their own perspective. Product managers, for instance, prioritize the product dimen-
sion; salespeople prefer sales figures to be sorted by region.
The dimensions themselves can be structured further: The product dimension can contain product groups; the time dimension could cover not only quarters but also days, weeks, and months. A dimension therefore also describes the desired aggregation levels valid for the data cube.

(Fig. 6.6: star schema with the indicator table Sales at the center, surrounded by the dimension tables Product, Region, and Time.)
From a logical point of view, in a multi-dimensional database or a data cube, it is
necessary to specify not only the dimensions but also the indicators.³ An indicator is
a key figure or parameter needed for decision support. These key figures are
aggregated by analysis and grouped according to the dimension values. Indicators
can relate to quantitative as well as qualitative characteristics of the business. Apart
from financial key figures, meaningful indicators concern the market and sales,
customer base and customer fluctuation, business processes, innovation potential,
and know-how of the employees. Indicators, in addition to dimensions, are the basis
for the management’s decision support, internal and external reporting, and a
computer-based performance measurement system.
The main characteristic of a star schema is the classification of data as either
indicator data or dimension data. Both groups are shown as tables in Fig. 6.6. The
indicator table is at the center, and the descriptive dimension tables are placed around
it, one table per dimension. The dimension tables are attached to the indicator table
forming a star-like structure.
Should one or more dimensions be structured, the respective dimension table could have other dimension tables attached to it. The resulting structure is called a snowflake schema, showing the aggregation levels of the individual dimensions. In Fig. 6.6, for instance, the time dimension table for the first quarter of 2023 could have another dimension table attached, listing the calendar days from January to March 2023. Should the dimension month be necessary for analysis, a month dimension table would be defined and connected to the day dimension table.

³ Indicators are often also called facts, e.g., by Ralph Kimball. See also Sect. 6.7 on facts and rules of knowledge databases.

Fig. 6.7: Find the Apple sales for the first quarter of 2023 by sales lead Mahoney:

SELECT SUM(Revenue)
FROM D_PRODUCT D1, D_REGION D2, D_TIME D3, F_SALES F
WHERE D1.P# = F.P# AND
      D2.R# = F.R# AND
      D3.T# = F.T# AND
      D1.Supplier = 'Apple' AND
      D2.SalesLead = 'Mahoney' AND
      D3.Year = 2023 AND
      D3.Quarter = 1

(Star schema tables in Fig. 6.7: F_SALES (P#, R#, T#, Quantity, Revenue) with, e.g., the row (P2, R2, T1, 30, 160,000); D_PRODUCT (P#, ProductName, Supplier); D_REGION (R#, Name, SalesLead); D_TIME (T#, Year, Quarter) with, e.g., the row (T1, 2023, 1).)
The classic relational model can be used for the implementation of a multi-
dimensional database. Figure 6.6 shows how indicator and dimension tables of a
star schema are implemented. The indicator table is represented by the relation
F_SALES, which has a multi-dimensional key. This concatenated key needs to
contain the keys for the dimension tables D_PRODUCT, D_REGION, and
D_TIME. In order to determine sales lead Mahoney’s revenue on Apple devices in
the first quarter of 2023, it is necessary to formulate a complicated join of all
involved indicator and dimension tables (see SQL statement in Fig. 6.7).
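The star schema of Fig. 6.7 could be declared roughly as follows; the data types are assumptions, and the # character in the key names would have to be quoted in most SQL dialects:

CREATE TABLE D_PRODUCT (
  P#          INT PRIMARY KEY,
  ProductName VARCHAR(40),
  Supplier    VARCHAR(40) );

CREATE TABLE D_REGION (
  R#        INT PRIMARY KEY,
  Name      VARCHAR(40),
  SalesLead VARCHAR(40) );

CREATE TABLE D_TIME (
  T#      INT PRIMARY KEY,
  Year    INT,
  Quarter INT );

CREATE TABLE F_SALES (
  P#       INT REFERENCES D_PRODUCT,
  R#       INT REFERENCES D_REGION,
  T#       INT REFERENCES D_TIME,
  Quantity INT,
  Revenue  DECIMAL(12,2),
  -- the concatenated key contains the keys of all dimension tables
  PRIMARY KEY (P#, R#, T#) );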
A relational database system reaches its limits when faced with extensive multi-
dimensional databases. Formulating queries with a star schema is also complicated
and prone to error. There are multiple other disadvantages when working with the
classic relational model and conventional SQL: In order to aggregate several levels,
a star schema has to be expanded into a snowflake schema, and the resulting physical
tables further impair the response time behavior. If users of a multi-dimensional
database want to query more details for deeper analysis (drill-down) or analyze
further aggregation levels (roll-up), conventional SQL will be of no use. Moreover,
extracting or rotating parts of the data cube, as commonly done for analysis, requires
specific soft- or even hardware. Because of these shortcomings, some providers of
database products have decided to add appropriate tools for these purposes to their
software range. In addition, the SQL standard has been extended on the language
level in order to simplify the formulation of cube operations, including aggregations.
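On the language level, the SQL standard has offered cube operators since SQL:1999. The following sketch, based on the star schema of Fig. 6.7, aggregates the revenue for every combination of region and quarter, including all subtotals and the grand total:

SELECT D2.Name AS Region, D3.Quarter, SUM(F.Revenue) AS Revenue
FROM F_SALES F, D_REGION D2, D_TIME D3
WHERE F.R# = D2.R#
  AND F.T# = D3.T#
GROUP BY CUBE (D2.Name, D3.Quarter)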
A multi-dimensional database system supports the following features:

• For the design, several dimension tables with arbitrary aggregation levels can be
defined, especially for the time dimension.
• The analysis language offers functions for drill-down and roll-up.
• Slicing, dicing, and rotation of data cubes are supported.
Multi-dimensional databases are often the core of data warehouses. Unlike multi-
dimensional databases alone, a data warehouse is a distributed database system that
combines aspects of federated, temporal, and multi-dimensional databases. It
provides mechanisms for integration, historization, and analysis of data across
several applications of a company, along with processes for decision support and
the management and development of data flows within the organization.
The more and easier digital data is available, the greater the need to analyze this
data for decision support. The management of a company is supposed to base their
decisions on facts that can be gathered from the analysis of the available data. The
process of data preparation and analysis for decision support is called business
intelligence. Due to heterogeneity, volatility, and fragmentation of the data, cross-
application data analysis is often complex: Data is stored heterogeneously in several
databases in an organization. Additionally, often only the current version is avail-
able. In the source systems, data from one larger subject area, like customers or
contracts, is rarely available in one place, but has to be gathered, or integrated, via
various interfaces. Furthermore, this data distributed among many databases needs to
be sorted into timelines for various subject areas, each spanning several years.
Business intelligence therefore makes three demands on the data to be analyzed:
Data Warehouse
A data warehouse or DWH is a distributed information system that integrates data from heterogeneous sources, historizes it along the time axis, and makes it available for subject-oriented analysis.
⁴ For more information, look up the KDD (knowledge discovery in databases) process.
⁵ See the ETL (extract, transform, and load) process below.
(Fig. 6.8: the data warehousing process; data from source systems such as Web platforms, ERP, and CRM databases is integrated into the data warehouse and made available for analysis.)
Data warehouses can integrate various internal and external data sets (data
sources). The aim is to be able to store and analyze, for various business purposes,
a consistent and historicized set of data on the information scattered across the
company. To this end, data from many sources is integrated into the data warehouse
via interfaces and stored there, often for years. Building on this, data analyses can be
carried out to be presented to decision-makers and used in business processes.
Furthermore, business intelligence as a process has to be controlled by management.
The individual steps of data warehousing are summarized in the following
paragraphs (see Fig. 6.8).
The data of an organization is distributed across several source systems, for
instance, Web platforms, accounting (enterprise resource planning, ERP), and cus-
tomer databases (customer relationship management, CRM). In order to analyze and
relate this data, it needs to be integrated.
For this integration of the data, an ETL (extract, transform, load) process is
necessary. The corresponding interfaces usually transfer data in the evening or on
weekends, when the IT system is not needed by the users. High-performance
systems today feature continuous loading processes, feeding data 24/7 (trickle
feed). When updating a data warehouse, periodicity is taken into account, so users
can see how up to date their evaluation data is. The more frequently the interfaces
load data into the data warehouse, the more up to date is the evaluation data. The aim
of this integration is historization, i.e., the creation of a timeline in one logically
central storage location. The core of a data warehouse (Core DWH) is often modeled
in second or third normal form. Historization is achieved using validity statements
(valid_from, valid_to) in additional columns of the tables, as described in Sect. 6.3
on temporal databases. In order to make the evaluation data sorted by subject
available for OLAP analysis, individual subject areas are loaded into data marts,
which are often realized multi-dimensionally with star schemas.
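A minimal sketch of such a historization with valid_from and valid_to columns; the customer table, its attributes, and the values are illustrative assumptions:

CREATE TABLE CUSTOMER_HIST (
  C#         INT,
  Name       VARCHAR(40),
  Segment    VARCHAR(20),
  valid_from DATE,
  valid_to   DATE,
  PRIMARY KEY (C#, valid_from) );

-- when an attribute changes, the currently valid version is closed ...
UPDATE CUSTOMER_HIST
SET valid_to = DATE '2023-06-30'
WHERE C# = 17 AND valid_to = DATE '9999-12-31';

-- ... and the new state is inserted with an open validity interval
INSERT INTO CUSTOMER_HIST
VALUES (17, 'Bell', 'Premium', DATE '2023-07-01', DATE '9999-12-31');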
The data warehouse exclusively serves for the analysis of data. The dimension of
time is an important part of such data collections, allowing for more meaningful
statistical results. Periodic reporting produces lists of key performance indicators.
Data mining tools like classification, selection, prognosis, and knowledge acquisi-
tion use data from the data warehouse in order to, for instance, analyze customer or
purchasing behavior and utilize the results for optimization purposes. In order for the
data to generate value, the insights including the results of the analysis need to be
communicated to the decision-makers and stakeholders. The respective analyses or
corresponding graphics are made available using a range of interfaces of business
intelligence tools (BI tools) or graphical user interfaces (GUI) for office automation
and customer relationship management. Decision-makers can utilize the analysis
results from data warehousing in business processes as well as in strategy, market-
ing, and sales.
The data warehouse is designed to process and integrate structured data. Since
unstructured and semi-structured data are analyzed more frequently today in the
context of Big Data (see Sect. 5.1), a new concept of the data lake has become
established for this purpose. This offers an alternative extract-load-transform (ELT)
approach for the federation, historization, and analysis of large amounts of unstruc-
tured and semi-structured data. The data lake periodically extracts and loads data
from different source systems as it is, thus eliminating the need for time-consuming
integration. Only when the data is eventually used by data scientists is the data
transformed for the desired analysis.
Data Lake
A data lake is a distributed information system with the following characteristics:
• Data fusion: Unstructured and semi-structured data from different data sources
and applications (source systems) are extracted in the given structure and
archived centrally.
• Schema-on-read: Federated data is integrated into a unified schema only when it
is needed for an evaluation.
• Snapshots: Data can be evaluated according to different points in time, thanks to
timestamps.
• Data-based value creation: The data in the data lake unfolds its value through data
science analyses, which generate added value by optimizing decisions.
(Fig. 6.9 Query of a structured object with and without implicit join operator: the table BOOK (B#, Title, Publisher) with the dependent tables AUTHOR (A#, B#, Name) and KEYWORD (K#, B#, Term, Weighting) declared as PART_OF(BOOK). Sample entries include the author Miller (A1, B1) and the keyword Database with weighting 80 (K1, B1); the results table contains the title "Relational Databases".)
The attribute Name is not fully functionally dependent on the combined key of
the Author and Book Number, which is why the table is neither in the second nor in
any higher normal form. The same holds true for the KEYWORD table, because
there is a complex-complex relationship between books and their keywords.
Weighting is a typical relationship attribute; the attribute Term, however, is not fully functionally
dependent on the Keyword Number and Book Number key. For proper normaliza-
tion, the management of books would therefore require several tables, since in
addition to the relationship tables AUTHOR and KEYWORD, separate tables for
the attributes Author and Keyword would be necessary. A relational database would
certainly also include information on the publisher in a separate PUBLISHER table,
ideally complemented by a table for the relationship between BOOK and
PUBLISHER.
Splitting the information about a book between different tables has its
disadvantages and is hardly understandable from the point of view of the users,
who want to find the attributes of a certain book well-structured in a single table. The
relational query and data manipulation language should serve to manage the book
information using simple operators. There are also performance disadvantages if the
database system has to search various tables and calculate time-consuming join
operators in order to find a certain book. To mitigate these problems, extensions to
the relational model have been suggested.
A first extension of the relational database technology is to explicitly declare
structural properties to the database system, for instance, by assigning surrogates. A
surrogate is a permanent, invariant key value defined by the system, which uniquely
identifies every tuple in the database. Surrogates, as invariant values, can be used to
define system-controlled relationships even in different places within a relational
database. They support referential integrity as well as generalization and aggregation
structures.
In the BOOK table in Fig. 6.9, the book number B# is defined as a surrogate. This
number is used again in the dependent tables AUTHOR and KEYWORD under the
indication PART_OF(BOOK). Because of this reference, the database system
explicitly recognizes the structural properties of the book, author, and keyword
information and is able to use them in database queries, given that the query and
manipulation language is extended accordingly. An example for this is the implicit
hierarchical join operator in the FROM clause that connects the partial tables
AUTHOR and KEYWORD belonging to the BOOK table. It is not necessary to
state the join predicates in the WHERE clause, as those are already known to the
database system through the explicit definition of the PART_OF structure.
Storage structures can be implemented more efficiently by introducing to the
database system a PART_OF or analogously an IS_A structure. This means that the
logical view of the three tables BOOK, AUTHOR, and KEYWORD is kept, while
the book information is physically stored as structured objects⁶ so that a single
database access makes it possible to find a book. The regular view of the tables is
kept, and the individual tables of the aggregation can be queried as before.
Another possibility for the management of structured information is giving up the
first normal form⁷ and allowing tables as attributes. Figure 6.10 illustrates this with
an example presenting information on books, authors, and keywords in a table. This
⁶ Research literature also calls them "complex objects."
⁷ The NF2 model (NF2 = non-first normal form) supports nested tables.
(Fig. 6.10: the nested table BOOK_OBJECT with the attributes B#, Title, and Publisher and the subtables Author (A#, Name) and Keyword (K#, Term, Weighting); e.g., the keyword "Relational model" with weighting 20.)

An object-relational database system has the following characteristics:
• It allows the definition of object types (often called classes in reference to object-
oriented programming), which themselves can consist of other object types.
• Every database object can be structured and identified through surrogates.
• It supports generic operators (methods) affecting objects or parts of objects,
while their internal representation remains invisible from the outside (data
encapsulation).
• Properties of objects can be inherited. This property inheritance includes the
structure and the related operators.
(Fig. 6.11: object-relational mapping; the object-oriented classes Author and Book are mapped by the ORM to the relational tables AUTHOR, BOOK, and the relationship table AUTHORED in the RDBMS.)
The SQL standard has for some years been supporting certain object-relational
enhancements: object identifications (surrogates); predefined data types for set, list,
and field; general abstract data types with the possibility of encapsulation;
parametrizable types; type and table hierarchies with multiple inheritance; and
user-defined functions (methods).
Object-Relational Mapping
Most modern programming languages are object-oriented; at the same time, the
majority of the database systems used are relational. Instead of migrating to object-
relational or even object-oriented databases, which would be rather costly, objects
and relations can be mapped to each other during software development if relational
data is accessed with object-oriented languages. This concept of object-relational
mapping (ORM) is illustrated in Fig. 6.11. In this example, there is a relational
database management system (RDBMS) with a table AUTHOR, a table BOOK, and
a relationship table AUTHORED, since there is a complex-complex relationship
(see Sect. 2.2.2) between books and authors. The data in those tables is to be used
directly as classes in software development in a project with object-oriented pro-
gramming (OOP).
An ORM software can automatically map classes to tables, so for the developers,
it seems as if they were working with object-oriented classes even though the data is
saved in database tables in the background. The programming objects in the main
memory are thus persistently written, i.e., saved to permanent memory.
In Fig. 6.11, the ORM software provides the two classes Author and Book for the
tables AUTHOR and BOOK. For each line in the table, there is one object as an
instance of the respective class. The relationship table AUTHORED is not shown as
a class: object orientation allows for the use of non-atomic object references; thus,
the set of books the author has written is saved in a vector field books[] in the Author
object, and the group of authors responsible for a book are shown in the field authors
[] in the Book object.
The use of ORM is simple. The ORM software automatically derives the
corresponding classes based on existing database tables. Records from these tables
can then be used as objects in software development. ORM is therefore one possible
way toward object orientation with which the underlying relational database tech-
nology can be retained.
Knowledge databases or deductive databases cannot only manage the actual data—
called facts—but also rules, which are used to deduct new table contents or facts.
The EMPLOYEE table in Fig. 6.12 is limited to the names of the employees for
simplicity. It is possible to define facts or statements on the information in the table,
in this case on the employees. Generally, facts are statements that unconditionally
take the truth value TRUE. For instance, it is true that Howard is an employee. This
is expressed by the fact “is_employee (Howard).” For the employees’ direct
supervisors, a new SUPERVISOR table can be created, showing the names of the
direct supervisors and the employees reporting to them as a pair per tuple. Accord-
ingly, facts “is_supervisor_of (A,B)” are formulated to express that “A is a direct
supervisor of B.”
The job hierarchy is illustrated in a tree in Fig. 6.13. Looking for the direct
supervisor of employee Murphy, the SQL query analyzes the SUPERVISOR table
and finds supervisor Howard. Using a logic query language (inspired by Prolog)
yields the same result.
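Sketched in a Prolog-like notation, this query and its answer could look as follows (illustrative):

?- is_supervisor_of (X, Murphy).
X = Howard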
Besides actual facts, it is possible to define rules for the deduction of unknown
table contents. In the relational model, this is called a derived relation or deduced
relation. Simple examples of a derived relation and the corresponding derivation rule
are given in Fig. 6.14. It shows how the supervisor’s supervisor for every employee
can be found. This may, for instance, come in useful for large companies or similar organizations with several management levels. To determine these indirect supervisors, the view SUPERIOR defined in Fig. 6.14 is used. The SQL query of this
view results in a table with the information that there is only one relationship with a
superior supervisor, specifically employee Stewart and their superior supervisor
Howard. Applying the corresponding derivation rule “is_superior_of” yields the
same result.
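Assuming the SUPERVISOR table has the attributes Supervisor and Employee, the view and the derivation rule of Fig. 6.14 can be sketched as follows (the attribute names are assumptions):

CREATE VIEW SUPERIOR AS
SELECT S1.Supervisor, S2.Employee
FROM SUPERVISOR S1, SUPERVISOR S2
WHERE S1.Employee = S2.Supervisor

and, as the corresponding derivation rule:

is_superior_of (X, Z) :- is_supervisor_of (X, Y) AND is_supervisor_of (Y, Z)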
A deductive database as a vessel for facts and rules also supports the principle of
recursion, making it possible to draw an unlimited amount of correct conclusions
due to the rules included in the deductive database. Any true statement always leads
to new statements.
The principle of recursion can refer to either the objects in the database or the
derivation rules. Objects defined as recursive are structures that themselves consist
of structures and, similar to the abstraction concepts of generalization and aggrega-
tion, can be understood as hierarchical or network-like object structures. Further-
more, statements can be determined recursively; in the company hierarchy example,
all direct and indirect supervisor relationships can be derived from the facts
“is_employee” and “is_supervisor_of.”
The calculation process which derives all transitively dependent tuples from a
table forms the transitive closure of the table. This operator does not belong to the
original operators of relational algebra; rather, the transitive closure is a natural
extension of the relational operators. It cannot be formed with a fixed number of
calculation steps, but only by several relational join, projection, and union operators,
whose number depends on the content of the table in question.
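In today's SQL, the transitive closure can be computed with a recursive query; a sketch for the supervisor hierarchy, again with the assumed attributes Supervisor and Employee:

WITH RECURSIVE SUPERIOR_ALL (Supervisor, Employee) AS (
  SELECT Supervisor, Employee
  FROM SUPERVISOR
  UNION
  SELECT S.Supervisor, A.Employee
  FROM SUPERVISOR S, SUPERIOR_ALL A
  WHERE S.Employee = A.Supervisor )
SELECT * FROM SUPERIOR_ALL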
These explanations can be condensed into the following definition: a knowledge database or deductive database manages facts and rules, supports derived relations, and allows recursive derivations such as the transitive closure.

Classic relational databases, in contrast, rest on three assumptions about their data and queries:
• The attribute values in the databases are precise, i.e., they are unambiguous. The
first normal form demands attribute values to be atomic and come from a well-
defined domain. Vague attribute values, such as “2 or 3 or 4 days” or “roughly
3 days” for the delivery delay of supplier, are not permitted.
• The attribute values saved in a relational database are certain, i.e., the individual
values are known and therefore true. An exception are NULL values, i.e., attribute
values that are not known or not yet known. Apart from that, database systems do
not offer modeling components for existing uncertainties. Probability
distributions for attribute values are therefore impossible; expressing whether
an attribute value corresponds to the true value or not remains difficult.
• Queries to the database are crisp. They always have a binary character, i.e., a
query value specified in the query must either be identical or not identical with the
attribute values. Querying a database with a query value “more or less” identical
with the stored attribute values is not allowed.
In recent years, discoveries from the field of fuzzy logic have been applied to data
modeling and databases. Permitting incomplete or vague information opens a wider
field of application. Most of these works are theoretical; however, some research
groups are trying to demonstrate the usefulness of fuzzy database models and
database systems with implementations.
The approach shown here is based on the context model to define classes of data
sets in the relational database schema. There are crisp and fuzzy classification
methods. For a crisp classification, database objects are binarily assigned to a
class, i.e., the membership function of an object to a class is 0 for “not included”
or 1 for “included.” A conventional process would therefore group a customer either
into the class “Customers with revenue problems” or into the class “Customers to
expand business with.” A fuzzy process, however, allows for membership function
values between 0 and 1. A customer can belong in the “Customers with revenue
problems” class with a value of 0.3 and at the same time in the “Customers to expand
business with” class with a value of 0.7. A fuzzy classification therefore allows for a
more differentiated interpretation of class membership: Database objects can be
distinguished between border and core objects; additionally, database objects can
belong to two or more different classes at the same time.
In the fuzzy-relational database model with contexts, context model for short,
every attribute Aj defined on a domain D(Aj) has a context assigned. A context K
(Aj) is a partition of D(Aj) into equivalence classes. A relational database schema
with contexts therefore consists of a set of attributes A = (A1, ..., An) and a set of associated contexts K = (K1(A1), ..., Kn(An)).
(Fig. 6.15 Classification matrix with the attributes Revenue and Loyalty: the Revenue axis (0 to 1000) is split between 499 and 500, and the Loyalty axis (bad, weak, good, great) is split between weak and good, yielding the classes C1 (top right), C2 (top left), C3 (bottom right), and C4 (bottom left). The customers Stewart, Bell, Howard, and Murphy are placed at revenues of roughly 1000, 500, 499, and 0, respectively.)
For the assessment of customers, revenue and loyalty are used as an example.
Additionally, those qualifying attributes are split into two equivalence classes each.
The according attributes and contexts for the customer relationship management are:
• Revenue in dollars per month: The domain for revenue in dollars is defined as
[0. . .1000]. Two equivalence classes [0. . .499] for small revenues and
[500. . .1000] for large revenues are also created.
• Customer loyalty: The domain {bad, weak, good, great} supplies the values for
the Customer loyalty attribute. It is split further into the equivalence classes {bad,
weak} for negative loyalty and {good, great} for positive loyalty.
The partitioning of the revenue and loyalty domains results in the four equiva-
lence classes C1, C2, C3, and C4 shown in Fig. 6.15. The meaning of the classes is
expressed by semantic class names; for instance, customers with little revenue and
weak loyalty are labeled “Don’t invest” in C4; C1 could stand for “Retain customer,”
C2 for “Improve loyalty,” and C3 for “Increase revenue.” It is the database
administrators’ job, in cooperation with the marketing department, to define the
attributes and equivalence classes and to specify them as an extension of the database
schema.
A sharp classification of the customers into these four classes leads to the following problems:

• Customer Bell has barely any incentives to increase revenue or loyalty. They
belong to the premium class C1 and enjoy the corresponding advantages.
• Customer Bell could face an unpleasant surprise, should their revenue drop
slightly or their loyalty rating be reduced. They may suddenly find themselves
in a different customer segment; in an extreme case, they could drop from the
premium class C1 into the low value class C4.
• Customer Howard has a robust revenue and medium customer loyalty, but is
treated as a low value customer. It would hardly be surprising if Howard
investigated their options on the market and moved on.
• A sharp customer segmentation also creates a critical situation for customer
Stewart. They are, at the moment, the most profitable customer with an excellent
reputation, yet the company does not recognize and treat them according to their
customer value.
(Figure: the classification matrix of Fig. 6.15 extended with linguistic variables; the membership functions μ high and μ low on the Revenue axis and μ positive and μ negative on the Loyalty axis yield gradual class memberships with values between 0 and 1, such as 0.33 or 0.66.)
CLASSIFY Object
FROM Table
WITH Classification condition
For example, all customers of the customer table are classified by:

CLASSIFY Customer
FROM Customer table
The following query, in contrast, restricts the result to a single class:

CLASSIFY Customer
FROM Customer table
WITH CLASS IS Increase revenue
This query specifically targets class C3. Bypassing the definition of a class, it is also possible to
select a certain set of objects by using the linguistic descriptions of the equivalence
classes. The following query is an example:
CLASSIFY Customer
FROM Customer table
WITH Revenue IS small AND Loyalty IS strong
This query consists of the identifier of the object to be classified (Customer), the
name of the base table (Customer table), the critical attribute names (Revenue and
Loyalty), the term “small” of the linguistic variable Revenue, and the term “strong”
of the linguistic variable Loyalty.
Based on the example and the explanations above, fuzzy databases can be
characterized as follows:
• The data model is fuzzy-relational, i.e., it accepts imprecise, vague, and uncertain
attribute values.
• Dependencies between attributes are expressed with fuzzy normal forms.
• Relational calculus as well as relational algebra can be extended to fuzzy rela-
tional calculus and fuzzy relational algebra using fuzzy logic.
• Using a classification language enhanced with linguistic variables, fuzzy queries
can be formulated.
Only a few computer scientists have been researching the field of fuzzy logic and
relational database systems over the years (see Bibliography). Their works are
mainly published and acknowledged in the field of fuzzy logic, not in the database
field. It is to be hoped that both fields will grow closer and the leading experts on
database technology will recognize the potential that lies in fuzzy databases and
fuzzy query languages.
Bibliography
Bordogna, G., Pasi, G. (eds.): Recent Issues on Fuzzy Databases. Physica-Verlag (2000)
Bosc, P., Kacprzyk, J. (eds.): Fuzziness in Database Management Systems. Physica-Verlag (1995)
Ceri, S., Pelagatti, G.: Distributed Databases – Principles and Systems. McGraw-Hill (1985)
Chen, G.: Design of fuzzy relational databases based on fuzzy functional dependencies. PhD Thesis
Nr. 84, Leuven, Belgium (1992)
Chen, G.: Fuzzy Logic in Data Modeling – Semantics, Constraints, and Database Design. Kluwer
Academic (1998)
Clocksin, W.F., Mellish, C.S.: Programming in Prolog. Springer (1994)
Dittrich, K.R. (ed.): Advances in Object-Oriented Database Systems. Lecture Notes in Computer
Science, vol. 334. Springer (1988)
Etzion, O., Jajodia, S., Sripada, S. (eds.): Temporal Databases – Research and Practice. Lecture
Notes in Computer Science. Springer (1998)
Inmon, W.H.: Building the Data Warehouse. Wiley (2005)
Kimball, R., Ross, M., Thorntwaite, W., Mundy, J., Becker, B.: The Data warehouse Lifecycle
Toolkit. Wiley (2008)
Lorie, R.A., Kim, W., McNabb, D., Plouffe, W., Meier, A.: Supporting complex objects in a
relational system for engineering databases. In: Kim, W., et al. (eds.) Query Processing in
Database Systems, pp. 145–155. Springer (1985)
Meier, A., Werro, N., Albrecht, M., Sarakinos, M.: Using a fuzzy classification query language for
customer relationship management. Proceedings of the 31st International Conference on Very
Large Databases (VLDB), Trondheim, Norway, 2005, pp. 1089–1096
Meier, A., Schindler, G., Werro, N.: Fuzzy classification on relational databases
(Chapter XXIII). In: Galindo, J. (ed.) Handbook of Research on Fuzzy Information Processing
in Databases, vol. II, pp. 586–614. IGI Global (2008)
Özsu, M.T., Valduriez, P.: Principles of Distributed Database Systems. Prentice Hall (1991)
Petry, F.E.: Fuzzy Databases – Principles and Applications. Kluwer Academic (1996)
Pons, O., Vila, M.A., Kacprzyk, J. (eds.): Knowledge Management in Fuzzy Databases. Physica-
Verlag (2000)
Snodgrass, R.T.: The Temporal Query Language TQuel. ACM Trans. Database Syst. 12(2),
247–298 (1987)
Snodgrass, R.T., et al.: A TSQL2 tutorial. SIGMOD-Rec. 23(3), 27–33 (1994)
Stonebraker, M.: The Ingres Papers. Addison-Wesley (1986)
Stonebraker, M.: Object-Relational DBMS’s – The Next Great Wave. Morgan Kaufmann (1996)
Werro, N.: Fuzzy Classification of Online Customers. Springer (2015)
Williams, R., et al.: R*: an overview of the architecture. In: Scheuermann, P. (ed.) Improving
Database Usability and Responsiveness, pp. 1–27. Academic Press (1982)
Zadeh, L.A.: Fuzzy sets. Inf. Control. 8, 338–353 (1965)
7 NoSQL Databases
In Chaps. 1–5, all aspects were described in detail for relational, graph, and docu-
ment databases. In Chap. 6, we covered post-relational extensions of SQL databases.
Chapter 7 now rounds off the book with an overview of important NoSQL database
systems.
The term NoSQL was first used in 1998 for a database that (although relational)
did not have an SQL interface. NoSQL became of growing importance during the
2000s, especially with the rapid expansion of the Internet. The growing popularity of
global Web services saw an increase in the use of Web-scale databases, since there
was a need for data management systems that could handle the enormous amounts of
data (sometimes in the petabyte range and up) generated by Web services.
SQL database systems are much more than mere data storage systems. They
provide a large degree of processing logic:
NoSQL Database
NoSQL databases usually have the following properties (see also Sect. 1.3.2):
Although the term NoSQL originally referred to database functions that are not
covered by the SQL standard or the SQL language, the phrase “not only SQL” has
become widespread as an explanation of the term. More and more typical NoSQL
systems offer an SQL language interface, and classic relational databases offer
additional functions outside of SQL that can be described as NoSQL functionalities.
The term NoSQL therefore denotes a class of database functionalities that extend and supplement those of the SQL language. Core NoSQL technologies are:

• Key-value stores
• Column-family stores
• Document databases
• Graph databases
These four database models, also called core NoSQL models, are discussed in this
chapter. Other types of NoSQL described in this chapter are the family of XML
databases (Sect. 7.5), search engine databases (Sect. 7.7), and time series databases
(Sect. 7.8).
The simplest way of storing data is assigning a value to a variable or a key. At the
hardware level, CPUs work with registers based on this model; programming
languages use the concept in associative arrays. Accordingly, the simplest database
model possible is data storage that stores a data object as a value for another data
object as key.
In key-value stores, a specific value can be stored for any key with a simple
command, e.g., SET. Below is an example in which data for users of a website is
stored: first name, last name, e-mail, and encrypted password. For instance, the value
John is stored for the key User:U17547:firstname.
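In a Redis-like syntax, the corresponding commands could look as follows; the last name, the key names lastname and pwhash, and the hash value are illustrative:

SET User:U17547:firstname John
SET User:U17547:lastname Doe
SET User:U17547:email john.doe@blue_planet.net
SET User:U17547:pwhash a94a8fe5ccb19ba61c4c0873d391e987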
Data objects can be retrieved with a simple query using the key:
GET User:U17547:email
> john.doe@blue_planet.net
The key space can only be structured with special characters such as colons or
slashes. This allows for the definition of a namespace that can represent a rudimen-
tary data structure. Apart from that, key-value stores do not support any kind of
structure, neither nesting nor references. Key-value stores are schema-less, i.e., data
objects can be stored at any time and in arbitrary formats, without a need for any
metadata objects such as tables or columns to be defined beforehand. Going without
a schema or referential integrity makes key-value stores performant for queries, easy
to partition, and flexible regarding the types of data to be stored.
Key-Value Store
A database is a key-value store if it has the following properties:

• There is a set of keys, each of which identifies exactly one value (a data object in an
arbitrary format).
• Values are written and read exclusively via their key; there is no nesting and there are
no references between values, and no referential integrity is checked.
• The store is schema-free: data objects can be stored at any time without metadata
such as tables or columns being defined beforehand.
Key-value stores have seen a large increase in popularity as part of the NoSQL
trend, since they are scalable for huge amounts of data. As referential integrity is not
checked in key-value stores, it is possible to write and read extensive amounts of data
efficiently. Processing speed can be enhanced even further if the key-value pairs are
buffered in the main memory of the database. Such setups are called in-memory
databases. They employ technologies that allow values to be cached in the main memory
while constantly validating them against the long-term persistent data in the background
memory.

Fig. 7.1 Key-value store with sharding and hash-based key distribution: a primary
in-memory cluster with the shards A, B, and C holding the key-value pairs
There is almost no limit to increasing a key-value store’s scalability with frag-
mentation or sharding of the data content. Partitioning is rather easy in key-value
stores, due to the simple model. Individual computers within the cluster, called
shards, take on only a part of the key space. This allows for the distribution of the
database onto a large number of individual machines. The keys are usually
distributed according to the principles of consistent hashing (see Sect. 5.2.4).
Figure 7.1 shows a distributed architecture for a key-value store: A numerical
value (hash) is generated from a key; using the modulo operator, this value can now
be mapped to one of a defined number of address spaces (hash slots) in order to
determine on which shard within the distributed architecture the value for the key
will be stored. The distributed database can also be copied to additional computers
and updated there to improve partition tolerance, a process called replication. The
original data content in the primary cluster is synchronized with multiple replicated
data sets, the replica clusters.
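A minimal sketch of this hash-based key distribution, assuming CRC32 as the hash function, 16,384 hash slots, and three shards with contiguous slot ranges (all illustrative choices, not a specific product's implementation):

import zlib

NUM_SLOTS = 16384                          # number of hash slots (assumption)
SHARDS = ["shard_A", "shard_B", "shard_C"] # primary cluster with three shards

def hash_slot(key):
    # generate a numerical value (hash) from the key and map it to a slot
    return zlib.crc32(key.encode("utf-8")) % NUM_SLOTS

def shard_for(key):
    # each shard is responsible for a contiguous range of hash slots
    slot = hash_slot(key)
    return SHARDS[slot * len(SHARDS) // NUM_SLOTS]

print(shard_for("User:U17547:email"))      # shard responsible for this key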
Figure 7.1 shows an example of a possible massively distributed high-
performance architecture for a key-value store. The primary cluster contains three
computers (shards A, B, and C). The data is kept directly in the main memory
(RAM) to reduce response times. The data content is replicated to a replica cluster
for permanent storage on a hard drive. Another replica cluster further increases
performance by providing another replicated computer cluster for complex queries
and analyses.
Apart from the efficient sharding of large amounts of data, another advantage of
key-value stores is the flexibility of the data schema. In a relational database, a
pre-existing schema in the shape of a relation with attributes is necessary for any
record to be stored. If there is none, a schema definition must be executed before
saving the data. For database tables with large numbers of records or for the insertion
of heterogeneous data, this is often a lot of work. Key-value stores are schema-free
and therefore highly flexible regarding the type of data to be stored. It is not
necessary to specify a table with columns and data types; rather, the data can simply
be stored under an arbitrary key. On the other hand, the lack of a database schema
often causes clutter in data management.
7.3 Column-Family Stores

Even though key-value stores can process large amounts of data performantly, their
structure is still quite rudimentary. Often, the data matrix needs to be structured with a
schema. Most column-family stores enhance the key-value concept accordingly by
providing additional structure.
In practical use, it has proven more efficient for optimizing read operations
to store the data of relational tables not per row, but per column. This is because it is
rare for all columns of one row to be needed at once; rather, there are groups of columns
that are often read together. Therefore, in order to optimize access, it is useful to
structure the data in such groups of columns—column families—as storage units.
Column-family stores, which are named after this method, follow this model; they
store data not in relational tables, but in enhanced and structured multi-dimensional
key spaces.
Google presented its Bigtable database model for the distributed storage of
structured data in 2008, significantly influencing the development of column-family
stores.
Bigtable
In the Bigtable model, a table is a sparse, distributed, multi-dimensional, sorted map.
It has the following properties:
• The data structure is a map which assigns elements from a domain to elements in a
co-domain.
• The mapping function is sorted, i.e., there is an order relation for the keys
addressing the target elements.
• The addressing is multi-dimensional, i.e., the function has more than one
parameter.
• The data is distributed by the map, i.e., it can be stored on different computers in
different places.
• The map is sparse, i.e., an entry is not required for every possible key.
In Bigtable, a table has three dimensions: it maps an entry of the database for one
row and one column at a certain time to a string value:

(row:string, column:string, time:int64) → string
[Figure: Bigtable-style storage with the column families Contact and Access, column keys
such as Contact:Mail and Contact:Name, example values (doe.john, Max Müller), and the
timestamps t1, t2, and t4 as the third dimension]
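As a rough illustration (a sketch only, not Bigtable's actual implementation), such a sparse, multi-dimensional map can be modeled as a mapping from row key, column key, and timestamp to a string value:

table = {}   # (row key, column key, timestamp) -> string value; sparse: only existing cells are stored

def put(row, column, timestamp, value):
    table[(row, column, timestamp)] = value

def get_latest(row, column):
    # value with the highest timestamp for the given row and column key
    versions = [(t, v) for (r, c, t), v in table.items() if r == row and c == column]
    return max(versions)[1] if versions else None

put("U17547", "Contact:Mail", 1, "doe.john@blue_planet.net")
put("U17547", "Contact:Mail", 2, "john.doe@blue_planet.net")
print(get_latest("U17547", "Contact:Mail"))   # most recent version of the cell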
Column-Family Store
Databases using a data model similar to the Bigtable model are called column-family
stores. They can be defined as NoSQL databases with the following properties:

• The data is stored in multi-dimensional, sparse, sorted maps following the Bigtable
model, addressed by row key, column key, and timestamp.
• Columns that are frequently read together are grouped into column families, which
serve as storage units and as the level for access control and the localization of
distributed data.
• Within a column family, arbitrary column keys can be used at any time without a
predefined schema.
The advantages of column-family stores are their high scalability and availability
due to their massive distribution, just as with key-value stores. Additionally, they
provide a useful structure with a schema offering access control and localization of
distributed data on the column family level; at the same time, they provide enough
flexibility within the column family by making it possible to use arbitrary
column keys.
7.4 Document Databases

A document database stores structured data in self-contained records called documents,
which describe a fact completely, without dependencies on other records; in many
systems, documents are represented in a JSON-like format. The following example shows
the document stored under the key U17547 for a user of a Web portal (cf. the D_USERS
database in Fig. 7.3):

  _id: U17547,
  _rev: 2-82ec54af78febc2790,
  userName: U17547,
  firstName: John,
  lastName: Doe,
  gender: m,
  visitHistory: [
    index: 2015-03-30 07:55:12,
    blogroll: 2015-03-30 07:56:30,
    login: 2015-03-30 08:02:45,
    ...
  ]
Document Database
To summarize, a document store is a database management system with the follow-
ing properties:
• It is a key-value store.
• The data objects stored as values for keys are called documents; the keys are used
for identification.
• The documents contain data structures in the form of recursively nested attribute-
value pairs without referential integrity.
• These data structures are schema-free, i.e., arbitrary attributes can be used in
every document without defining a schema first.
• In contrast to key-value stores, document databases support ad hoc queries not
only via the document key but via any document attribute, as sketched below.
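For instance, in the syntax of a MongoDB-style document database (shown here purely as an illustration; the collection name users is hypothetical), all documents of male users can be selected by an attribute condition instead of a key lookup:

db.users.find({ gender: "m" })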
Queries on a document store can be parallelized and therefore sped up with the
MapReduce procedure (see Sect. 5.4). Such processes are two-phased, where Map
corresponds to grouping (group by) and Reduce corresponds to aggregation (count,
sum, etc.) in SQL.
During the first phase, a map function which carries out a predefined process for
every document is executed, building and returning a map. Such a map is an
associative array with one or several key-value pairs per document. The map
phase can be calculated per document independently from the rest of the data
content, thereby always allowing for parallel processing without dependencies if
the database is distributed among different computers.
In the optional reduce phase, a function is executed to reduce the data, returning
one row per key in the index from the map function and aggregating the
corresponding values. The following example demonstrates how MapReduce can
be used to calculate the number of users, grouped by gender, in the database from
Fig. 7.3.
Because of the absence of a schema, as part of the map function, a check is
executed for every document to find out if the attribute userName exists. If that is the
case, the emit function returns a key-value pair, with the key being the user’s
gender and the value being the number 1. The reduce function then receives the two
different keys, m and f, in the keys array and, for every user document of the respective
gender, a number 1 in the values array. The reduce function returns the
sum of the ones, grouped by key, which equals the respective user count.
// map
function(doc) {
  // check every document for the userName attribute
  if (doc.userName) {
    // key: the user's gender, value: the number 1
    emit(doc.gender, 1);
  }
}

// reduce
function(keys, values) {
  // sum of the ones per key, i.e., per gender
  return sum(values);
}
7.5 XML Databases

XML (eXtensible Markup Language) was developed by the World Wide Web
Consortium (W3C). The content of hypertext documents is marked by tags, just as
in HTML. An XML document is self-describing, since it contains not only the actual
data but also information on the data structure.
<address>
  <street> W Broad Street </street>
  <number> 333 </number>
  <zipCode> 43215 </zipCode>
  <city> Columbus </city>
</address>
The basic building blocks of XML documents are called elements. They consist
of a start tag (in angle brackets <name>) and an end tag (in angle brackets with slash
</name>) with the content of the element in-between. The identifiers of the start and
the end tag have to match.
The tags provide information on the meaning of the specific values and therefore
make statements about the data semantics. Elements in XML documents can be
nested arbitrarily. It is best to use a graph to visualize such hierarchically structured
documents, as shown in the example in Fig. 7.4.
As mentioned above, XML documents also implicitly include information about
the structure of the document. Since it is important for many applications to know the
structure of the XML documents, explicit representations (DTD = document type
definition or XML schema) have been proposed by W3C. An explicit schema shows
which tags occur in the XML document and how they are arranged. This allows for,
e.g., localizing and repairing errors in XML documents. The XML schema is
illustrated here as it has undeniable advantages for use in database systems.
An XML schema and a relational database schema are related as follows:
Usually, relational database schemas can be characterized by three degrees of
element nesting, i.e., the name of the database, the relation names, and the attribute
names. This makes it possible to match a relational database schema to a section of
an XML schema and vice versa.
Figure 7.4 shows the association between an XML document and a relational
database schema. The section of the XML document gives the relation names
DEPARTMENT and ADDRESS, each with their respective attribute names and
the actual data values. The use of keys and foreign keys is also possible in an XML
schema, as explained below.
The basic concept of XML schemas is to define data types and match names and
data types using declarations. This allows for the creation of completely arbitrary
XML documents. Additionally, it is possible to describe integrity rules for the
correctness of XML documents.
Fig. 7.4 Relationship between the XML document Department and a relational database schema

DEPARTMENT
D#   DepartmentName   Address   Website
D3   IT               Add07     www.example.com

ADDRESS
Add#   Street   Number   ZIP code   City
There are a large number of standard data types, such as string, Boolean, integer,
date, time, etc., but apart from that, user-defined data types can also be introduced.
Specific properties of data types can be declared with facets. This allows for the
properties of a data type to be specified, for instance, the restriction of values by an
upper or lower limit, length restrictions, or lists of permitted values:
<xs:simpleType name="city">
  <xs:restriction base="xs:string">
    <xs:maxLength value="20"/>
  </xs:restriction>
</xs:simpleType>
For cities, a simple data type based on the predefined data type string is proposed.
Additionally, the city names cannot consist of more than 20 characters.
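In a similar way, a list of permitted values can be declared with enumeration facets; the following data type for gender codes is a hypothetical example:

<xs:simpleType name="gender">
  <xs:restriction base="xs:string">
    <xs:enumeration value="m"/>
    <xs:enumeration value="f"/>
  </xs:restriction>
</xs:simpleType>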
Several XML editors have been developed that allow for the graphical represen-
tation of an XML document or schema. These editors can be used for both the
declaration of structural properties and the input of data content. By showing or
hiding individual sub-structures, XML documents and schemas can be arranged
neatly.
It is desirable to be able to analyze XML documents or XML databases. Unlike in
relational query languages, selection conditions refer not only to values (value
selection) but also to element structures (structure selection). Other basic operations
of an XML query include the extraction of subelements of an XML document and
the modification of selected subelements. Furthermore, individual elements from
different source structures can be combined to form new element structures. Last but
not least, a suitable query language needs to be able to work with hyperlinks; path
expressions are vital for that.
XQuery, influenced by SQL, various XML languages (e.g., XPath as navigation
language for XML documents), and object-oriented query languages, was proposed
by the W3C. XQuery is an enhancement of XPath, offering the option not only to
query data in XML documents but also to form new XML structures. The basic
elements of XQuery are FOR-LET-WHERE-RETURN expressions: FOR and LET
bind one or more variables to the results of a query of expressions. WHERE clauses
can be used to further restrict the result set, just as in SQL. The result of a query is
shown with RETURN.
Here is a simple example to outline the principles of XQuery: The
XML document “Department” (see Figs. 7.4 and 7.5) is queried for the street names
of the individual departments:
<streetNames>
  { for $Department in //Department return
      $Department/address/street }
</streetNames>
The query above binds the variable $Department to the <Department> nodes
during processing. For each of these bindings, the RETURN expression evaluates
the address and returns the street. The query in XQuery produces the following
result:
<streetNames>
<street> W Broad Street </street>
<street>........... </street>
<street>........... </street>
</streetNames>
Fig. 7.5 Schematic illustration of a native XML database with a data interface (API) for
external applications, a user interface for developers and administrators, file upload, and a
collection of XML documents
In XQuery, variables are marked with a $ sign prefixed to their names, in order to
distinguish them from the names of elements. Unlike in some other programming
languages, variables cannot have values assigned to them directly in XQuery; rather,
expressions are evaluated and their results are bound to the variables. This variable
binding is done in XQuery with the FOR and LET expressions.
In the query example above, no LET expression is specified. Using the WHERE
clause, the result set could be reduced further. The RETURN clause is executed for
every iteration of the FOR loop, but does not necessarily yield a result. The individual
results are combined into a sequence and form the result of the FOR-LET-WHERE-RETURN
expression.
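As an illustration, the query above could be restricted with a WHERE clause; the element name departmentName used here is an assumption about the underlying XML document:

<streetNames>
  { for $Department in //Department
    where $Department/departmentName = "IT"
    return $Department/address/street }
</streetNames>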
XQuery is a powerful query language for hyper documents and is offered for
XML databases as well as some post-relational database systems. In order for
relational database systems to store XML documents, some enhancements in the
storage component have to be applied.
Many relational database systems are nowadays equipped with XML column data
types and therefore the possibility to directly handle XML. This allows for data to be
stored in structured XML columns and for elements of the XML tree to be queried
and modified directly with XQuery or XPath. Around the turn of the millennium,
XML documents for data storage and data communication experienced a boom and
were used for countless purposes, especially Web services. As part of this trend,
several database systems that can directly process data in the form of XML
documents were developed. Particularly in the field of open source, support for
XQuery in native XML databases is far stronger than in relational databases.
XML Database
An XML database is a database with the following properties:

• The data is stored in documents; the database is therefore a document store (see
Sect. 7.4).
• The structured data in the documents is compatible with the XML standard.
• XML technologies such as XPath, XQuery, and XSL/T can be used for querying
and manipulating data.
Native XML databases store data strictly hierarchically in a tree structure. They
are especially suitable if hierarchical data needs to be stored in a standardized format,
for instance, for Web services in service-oriented architectures (SOA). A significant
advantage is the simplified data import into the database; some database systems
even support drag and drop of XML files. Figure 7.5 shows a schematic illustration
of a native XML database. It facilitates reading and writing access to data in a
collection of XML documents for users and applications.
An XML database cannot represent cross-references between nodes. This can be problematic
especially with multi-dimensionally linked data. An XML database therefore is best
suited for data that can be represented in a tree structure as a series of nested
generalizations or aggregations.
7.6 Graph Databases

The fourth and final type of core NoSQL databases differs significantly from the data
models presented up to this point, i.e., the key-value stores, column-family stores,
and document stores. Those three data models forgo database schemas and referen-
tial integrity for the sake of easier fragmentation (sharding). Graph databases,
however, have a structuring schema: that of the property graph presented in Sect.
1.4.1. In a graph database, data is stored as nodes and edges, which belong to a node
type or edge type, respectively, and contain data in the form of attribute-value pairs.
Unlike in relational databases, their schema is implicit, i.e., data objects belonging to
a not-yet existing node or edge type can be inserted directly into the database without
defining the type first. The DBMS implicitly follows the changes in the schema
based on the available information and thereby creates the respective type.
As an example, Fig. 7.6 illustrates the graph database G_USERS, which
represents information on a Web portal with users, Web pages, and the relationships
between them.

Fig. 7.6 Graph database G_USERS with the node types USER (userName, firstName,
lastName) and WEBPAGE (Name) and the edge types FOLLOWS, VISITED (with the
attribute date), and CREATED_BY

As explained in Sect. 1.4.1, the database has a schema with node and
edge types. There are two node types, USER and WEBPAGE, and three edge types,
FOLLOWS, VISITED, and CREATED_BY. The USER node type has the attributes
userName, firstName, and lastName; the node type WEBPAGE has only the attri-
bute Name; and the edge type VISITED has one attribute as well, date with values
from the date domain. It therefore is a property graph.
This graph database stores a similar type of data as the D_USERS document
database in Fig. 7.3; for instance, it also represents users with username, first name,
last name, and the visited Web pages with date. There is an important difference
though: The relationships between data objects are explicitly present as edges, and
referential integrity is ensured by the DBMS.
Graph Database
A graph database is a database management system with the following properties:
• The data and the schema are shown as graphs (see Sect. 2.4) or graph-like
structures, which generalize the concept of graphs (e.g., hypergraphs).
Graph databases are used when data is organized in networks. In these cases, it is
not the individual record that matters, but the connection of all records with each
other, for instance, in social media, but also in the analysis of infrastructure networks
(e.g., water network or electricity grid), in Internet routing, or in the analysis of links
between websites. The advantage of the graph database is the index-free adjacency
property: For every node, the database system can find its direct neighbors without
having to consider all edges, as would be the case in relational databases using a
relationship table. Therefore, the effort for querying the relationships of a node is
constant, independent of the volume of the data. In relational databases, the effort for
determining referenced tuples increases with the number of tuples, even if indexes
are used.
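The effect of index-free adjacency can be sketched as follows; the node identifiers and edges are modeled loosely on Fig. 7.6 and are purely illustrative:

# every node stores its own outgoing edges as pairs of (edge type, target node)
graph = {
    "U17547": [("FOLLOWS", "U17548"), ("VISITED", "index")],
    "U17548": [("FOLLOWS", "U17555"), ("VISITED", "blogroll")],
    "U17555": [("FOLLOWS", "U17547")],
    "index": [],
    "blogroll": [],
}

def neighbors(node, edge_type):
    # only the node's own edge list is read: constant effort per node,
    # independent of the total number of edges in the database
    return [target for etype, target in graph[node] if etype == edge_type]

print(neighbors("U17547", "FOLLOWS"))   # ['U17548']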
Just like relational databases, graph databases need indexes to ensure quick and
direct access to individual nodes and edges via their properties. As illustrated in Sect.
5.2.1, balanced trees (B-trees) are generated for indexing. A tree is a special graph
that does not contain any cycles; therefore, every tree can be represented as a graph.
This is interesting for graph databases, because it means that the index of a graph can
be a subgraph of the same graph. The graph contains its own indexes.
The fragmentation (see Sect. 6.2) of graphs is somewhat more complicated. One
reason why the other types of core NoSQL databases do not ensure relationships
between records is that records can be stored on different computers with fragmen-
tation (sharding) without further consideration, since there are no dependencies
between them. The opposite is true for graph databases. Relationships between
records are the central element of the database. Therefore, when fragmenting a
graph database, the connections between records have to be taken into account,
which often demands domain-specific knowledge. There is, however, no efficient
method to optimally divide a graph into subgraphs. The underlying partitioning problem
is NP-complete, so no efficient exact algorithm is known. As a heuristic,
clustering algorithms can determine highly interconnected partial graphs as
partitions. Today’s graph databases, however, do not yet support sharding.
7.7 Search Engine Databases

In the context of Big Data (see variety, Sects. 1.3 and 5.1), more and more text data
such as Web pages, e-mails, notes, customer feedback, contracts, and publications
are being processed. Search engines are suitable for the efficient retrieval of large
amounts of unstructured and semi-structured data. These are database systems that
enable information retrieval in collections of texts. Due to the spread of Internet
search engines, this concept is known to the general public. Search engines are also
used as database systems in IT practice. A search engine is a special form of
document database that has an inverted index for full-text search, i.e., all fields are
automatically indexed, and each term in the field value automatically receives an
index entry for the fast retrieval of relevant documents matching the search terms.
The basic concepts of search engines are index, document, field, and term. An
index contains a sequence of documents. A document is a sequence of fields. A field
is a named sequence of terms. A term is a string of characters. The same string in two
different fields is considered a different term. Therefore, terms are represented as a
pair of strings, the first denoting the field and the second denoting the text within the
field.
Let’s take a digital library of journal articles as an example. These documents can
be divided into different fields such as title, authors, abstract, keywords, text,
bibliography, and appendices. The fields themselves consist of unstructured and
semi-structured text. This text can be used to identify terms that are relevant to the
query. In the simplest case, spaces and line breaks divide text into terms. The
analyzer process defines which terms are indexed and how. For example, word
combinations can be indexed, and certain terms can be filtered, such as very common
words (so-called stop words).
Internally, a search engine builds an index structure during the so-called indexing
of documents. A term dictionary contains all terms used in all indexed fields of all
documents. This dictionary also contains the number of documents in which the term
occurs, as well as pointers to the term’s frequency data. A second important structure
is the inverted index. This stores statistics about terms to make term-based searches
efficient. For a term, it can list the documents that contain it. This is the inverse of the
natural relationship where documents list terms. For each term in the dictionary, it
stores the keys of all documents that contain that term and the frequency of the term
in that document.
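A highly simplified sketch of these two structures, assuming whitespace tokenization and no further analysis steps (document keys and texts are illustrative):

from collections import defaultdict

documents = {
    "doc1": "graph databases store nodes and edges",
    "doc2": "document databases store schema free documents",
}

# inverted index: term -> {document key -> term frequency in that document}
inverted_index = defaultdict(dict)

for key, text in documents.items():
    for term in text.lower().split():          # simplest case: split on spaces
        inverted_index[term][key] = inverted_index[term].get(key, 0) + 1

# term dictionary entry: number of documents in which the term occurs
print(len(inverted_index["databases"]))        # document frequency of "databases": 2
print(inverted_index["databases"])             # postings: {'doc1': 1, 'doc2': 1}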
There is a possibility to define the structure of the documents, i.e., the fields in the
documents, the data type of the values stored in each of the fields, and the metadata
associated with the document type. It is similar to the table schema of a relational
database. This type of schema definition is often called mapping in search engines.
An inverted index allows efficient querying of the database with terms. Thus, no
query language is needed, but the full-text search is defined directly by entering the
search terms. The inverted index can immediately return all documents that
contain the term or combination of terms. However, this is not sufficient for large
amounts of data. If a term occurs in thousands of documents, the search engine
should sort the document list by relevance. The inverted index and the term
dictionary allow a statistical evaluation of relevance with a simple formula
TF*IDF (TF, term frequency; IDF, inverted document frequency). The relevance
of a term T to a document D can be estimated as follows, using the common TF*IDF
formulation:

relevance(T, D) = TF(T, D) * IDF(T) = TF(T, D) * log( N / DF(T) )

Here, TF(T, D) is the frequency of the term T in the document D, N is the total number of
indexed documents, and DF(T) is the document frequency of the term T, i.e., the number of
documents containing T. The search engine finds this key figure in the inverted
index.
This formula favors documents with frequent mentions of the search term and
prioritizes rarer terms over more frequent terms. The simple formula works surpris-
ingly well in practice. Interestingly, this formula can also be used in reverse to search
for keywords in a given document by indexing a reference corpus for this purpose.
This is a process called keyword extraction.
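A sketch of this relevance estimate, assuming a logarithmic inverse document frequency and a hypothetical postings entry for a single term:

import math

# hypothetical inverted index entry for one term: postings with term frequencies
postings = {"doc1": 1}
num_docs = 2                              # total number of indexed documents (assumption)

tf = postings.get("doc1", 0)              # term frequency TF(T, D)
df = len(postings)                        # document frequency DF(T)
relevance = tf * math.log(num_docs / df)  # TF*IDF weight; rarer terms score higher
print(round(relevance, 2))                # 0.69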
Search Engine Database
To summarize, a search engine database (SDB) is a database system with the following
properties:

• Search engine: The SDB indexes terms in fields of semi-structured and unstruc-
tured data and returns lists of documents sorted by relevance, which contain
search terms in the full text of specific fields.
• Data analysis: SDB provides advanced data analysis tools for pre-processing,
evaluation, and visualization.
• Interfaces: SDB supports advanced data interfaces for database integration with
read and write access.
• Security: SDB supports data protection with users, roles, and access rights.
• Scalability: The SDB can provide short response times even for large amounts of
data with the principle of splitting in a cluster of several computers.
• Fail-safety: SDB can operate multiple redundant databases with the principle of
replication, so that if one instance fails, other instances can continue operation.
7.8 Time Series Databases

Time series databases (TSDB) are optimized for storing and analyzing time-stamped
data. They typically offer the following properties:
• Scalability of write performance: Time series data, e.g., from IoT sensors, is
recorded in real time and at high frequency, which requires scalable writes. Time
series databases must therefore provide high availability and high performance
for both reads and writes during peak loads. Time series can generate large
amounts of data quickly. For example, an experiment at CERN sends 100 GB
of data per second to the database for storage. Traditional databases are not
designed for this scalability. Compared with such traditional systems, TSDBs offer
higher write throughput, faster queries at scale, and better data compression.
• Time-oriented sharding: Data within the same time range is stored on the same
physical part of the database cluster, enabling fast access and more efficient
analysis.
• Time series management: Time series databases contain functions and
operations that are required when analyzing time series data. For example, they
use data retention policies, continuous queries, flexible time aggregation, range
queries, etc. This enhances usability by improving the user experience when
dealing with time-related analytics.
• Highest availability: When collecting time series data, availability at all times is
often critical. The architecture of a database designed for time series data avoids
any downtime for data, even in the event of network partitions or hardware
failures.
• Decision support: Storing and analyzing real-time sensor data in a time series
database enables faster and more accurate adjustments to infrastructure changes.
With the advent of the Internet of Things (IoT), more and more sensor data is
being generated. The IoT is a network of physical devices connected to the Internet,
through which data from the devices’ sensors can be transmitted and collected. This
generates large amounts of data with timestamps or time series. The proliferation of
the IoT has led to a growing interest in time series databases, as they are excellent for
efficiently storing and analyzing sensor data. Other use cases for time series
databases include monitoring software systems such as virtual machines, various
services, or applications; monitoring physical systems such as weather, real estate,
and health data; and also collecting and analyzing data from financial trading
systems. Time series databases can also be used to analyze customer data and in
business intelligence applications to track key metrics and the overall health of the
business.
The key concepts in time series databases are time series, timestamps, metrics,
and categories. A time column is included in each time series and stores discrete
timestamps associated with the records. Other attributes are stored with the
timestamp. Measured values store the actual measured quantities of the time series, such as a
temperature or a device status. The measured values can also be qualified with tags,
such as location or machine type. These categories are indexed to speed up
subsequent aggregated queries. The primary key of a time series consists of the
timestamp and the categories. Thus, there is exactly one tuple of measurements per
timestamp and combination of categories. Retention policies can be defined with the
time series, such as how long it is historized and how often it is replicated in the
cluster for failover. A time series in a TSDB is thus a collection of specific
measurement values on defined category combinations over time, stored with a
common retention policy.
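A minimal sketch of such a record and its primary key; the tag and field names and all values are illustrative:

point = {
    "timestamp": 1680000000,                                # discrete timestamp of the record
    "tags": {"location": "zurich", "machine": "pump_1"},    # indexed categories
    "fields": {"temperature": 21.5, "status": 1},           # measured values
}

# primary key: the timestamp plus the combination of categories
primary_key = (point["timestamp"], tuple(sorted(point["tags"].items())))
print(primary_key)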
Sharding is the horizontal partitioning of data in a database. Each partition is
called a shard. TSDBs store data in so-called shard groups, which are organized
according to retention policies. They store data with timestamps that fall within a
specific time interval. The time interval of the shard group is important for efficient
read and write operations, where the entire data of a shard can be selected highly
efficiently without searching.
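A sketch of time-oriented sharding, assuming shard groups that each cover an interval of one day:

SHARD_GROUP_INTERVAL = 24 * 60 * 60   # one day in seconds (assumption)

def shard_group(timestamp):
    # all records whose timestamps fall into the same interval land in the same shard group
    return timestamp // SHARD_GROUP_INTERVAL

# two timestamps one hour apart within the same day end up in the same shard group
print(shard_group(1680000000) == shard_group(1680000000 + 3600))   # True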
Bibliography
Anderson, J.C., Lehnardt, J., Slater, N.: CouchDB: The Definitive Guide. O'Reilly. http://guide.couchdb.org/editions/1/en/index.html (2010)
Angles, R., Gutierrez, C.: Survey of graph database models. ACM Comput. Surv. 40(1), 1–39
(2008)
Chang, F., Dean, J., Ghemawat, S., Hsieh, W.C., Wallach, D.A., Burrows, M., Chandra, D., Fikes,
A., Gruber, R.E.: Bigtable: a distributed storage system for structured data. ACM Trans.
Comput. Syst. 26(2), 1–26 (2008, Article No. 4)
Charu, A., Haixun, W.: Managing and Mining Graph Data, vol. 40. Springer (2010)
Edlich, S., Friedland, A., Hampe, J., Brauer, B., Brückner, M.: NoSQL – Einstieg in die Welt
nichtrelationaler Web 2.0 Datenbanken. Carl Hanser Verlag (2011)
Fawcett, J., Quin, L.R.E., Ayers, D.: Beginning XML. Wiley (2012)
McCreary, D., Kelly, A.: Making Sense of NoSQL – A Guide for Managers and the Rest of
Us. Manning (2014)
Montag, D.: Understanding Neo4j Scalability. White Paper, Neo Technology (2013)
Naqvi, S.N.Z., Yfantidou, S.: Time Series Databases and InfluxDB. Seminar Thesis, Université Libre de Bruxelles. https://cs.ulb.ac.be/public/_media/teaching/influxdb_2017.pdf (2018)
Perkins, L., Redmond, E., Wilson, J.R.: Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement, 2nd edn. O'Reilly UK, Raleigh, NC (2018)
Redis: Redis Cluster Tutorial. http://redis.io/topics/cluster-tutorial (2015)
Robinson, I., Webber, J., Eifrem, E.: Graph Databases – New Opportunities for Connected Data,
2nd edn. O’Reilly Media (2015)
Sadalage, P.J., Fowler, M.: NoSQL Distilled – A Brief Guide to the Emerging World of Polyglot
Persistence. Addison-Wesley (2013)
Wegrzynek, A.: InfluxDB at CERN and Its Experiments. Case Study, Influxdata. https://www.influxdata.com/customer/cern/ (2018)
Glossary
Concurrency Control Pessimistic concurrency control prevents conflicts between
concurrent transactions from the start, while optimistic concurrency control resets
conflicting transactions after completion.
Cursor Management Cursor management enables the record-by-record
processing of a set of data records in a procedural programming language with
the help of a pointer.
Cypher Cypher is a database language for graph databases, originally from Neo4j.
It has been released with openCypher and is now offered by several graph
database systems. Under the GQL (Graph Query Language) project, the ISO
(International Organization for Standardization) is working to extend and estab-
lish the language as a new international standard.
Database A database is an organized and structured set of records stored and
managed for a common purpose.
Database Language A database language makes it possible to query, manipulate, define,
optimize, scale, and secure databases by specifying database commands. It
includes comprehensive database management functionalities in addition to the
query language.
Database Management System A database management system, or DBMS, is a
software that automates electronic databases. It provides functions for database
definition, creation, query, manipulation, optimization, backup, security, data
protection, scalability, and failover.
Database Schema A database schema is the formal specification of the structure of
a database, such as classes of records and their characteristics, data types, and
integrity constraints.
Database Security Database security is a subcategory of information security that
focuses on maintaining the confidentiality, integrity, and availability of database
systems.
Database System A database system consists of a storage and a management
component. The storage component, i.e., the actual database, is used to store
data and relationships; the management component, called the database manage-
ment system or DBMS, provides functions and language tools for data mainte-
nance and management.
Data Dictionary System Data dictionary systems are used for the description,
storage, and documentation of the data schema, including database structures,
fields, types, etc., and their connections with each other.
Data Independence Data independence in database management systems is
established by separating the data from the application tools via system
functionalities.
Data Lake A data lake is a system of databases and loaders that makes historized
unstructured and semi-structured data from various distributed data repositories
available in its original raw format for data integration and data analysis.
Data Management Data management encompasses all operational, organizational,
and technical functions of the data architecture of data administration and data
technology that organize the use of data as a resource.
Data Mining Data mining is the search for valuable information within data sets
and aims to discover previously unknown data patterns.
Data Model Data models provide a structured description of the data and data
relationships required for an information system.
Data Protection Data protection is the prevention of unauthorized access to and
use of data.
Data Record A data record is an information element which, as a unit, describes a
complex set of facts.
Data Scientist Data scientists are business analytics specialists and experts on tools
and methods for SQL and NoSQL databases, data mining, statistics, and the
visualization of multi-dimensional connections within data.
Data Security Data security includes all technical and organizational safeguards
against the falsification, destruction, and loss of data.
Data Stream A data stream is a continuous flow of digital data with a variable data
rate (records per unit of time). Data in a data stream is in chronological order and
may include audio and video data or series of measurements.
Data Warehouse A data warehouse is a system of databases and loading
applications which provides historized data from various distributed data sets
for data analysis via integration.
Document Database A document database is a NoSQL database which stores
structured data records called documents that describe a fact completely and in a
self-contained way, i.e., without dependencies and relationships. This property
eliminates foreign key lookups and enables efficient sharding and massive scal-
ability for Big Data.
End User End users are employees in the various company departments who work
with the database and have basic IT knowledge.
Entity Entities are equivalent to real-world or abstract objects. They are
characterized by attributes and grouped into entity sets.
Entity-Relationship Model The entity-relationship model is a data model defining
data classes (entity sets) and relationship sets. In graphic representations, entity
sets are depicted as rectangles, relationship sets as rhombi, and attributes as ovals.
Fuzzy Database Fuzzy databases support incomplete, unclear, or imprecise infor-
mation by employing fuzzy logic.
Generalization Generalization is the abstraction process of combining entity sets
into a superordinate entity set. The entity subsets in a generalization hierarchy are
called specializations.
Graph Database Graph databases manage graphs consisting of vertices
representing objects or concepts and edges representing the relationships between
them. Both vertices and edges can have attributes.
Graph-Based Model The graph-based model represents real-world and abstract
information as vertices (objects) and edges (relationships between objects). Both
vertices and edges can have properties, and edges can be either directed or
undirected.
Redundancy Multiple records with the same information in one database are
considered redundancies.
Relational Algebra Relational algebra provides the formal framework for the
relational query languages and includes the set union, set difference, Cartesian
product, project, and select operators.
Relational Model The relational model is a data model that represents both data
and relationships between data as tables.
Replication Replication or mirroring of databases means redundant multiple stor-
age of identical databases with the purpose of fail-safety.
Search Engine Database System A search engine is a system for indexing,
querying, and relevance sorting of semi-structured and unstructured text
documents with full-text search terms. A search engine database is a database
system that, in addition to the pure search engine, provides mechanisms of a
database management system for data interfaces, data analysis, security, scalabil-
ity, and failover.
Selection Selection is a database operation that yields all records from a database
that match the criteria specified by the user.
Sharding Database sharding means splitting the database across multiple
computers in a federation. This is often used for Big Data to process more volume
at higher speed.
SQL SQL (Structured Query Language) is the most important database language. It
has been standardized by ISO (International Organization for Standardization).
SQL Injection SQL injection is a potential security vulnerability in information
systems with SQL databases, where user input is used to inject SQL code that is
processed by the database, thereby making data available or modifying it without
authorization.
Table A table (also called relation) is a set of tuples (records) of certain attribute
categories, with one attribute or attribute combination uniquely identifying the
tuples within the table.
Transaction A transaction is a sequence of operations that is atomic, consistent,
isolated, and durable. Transaction management allows conflict-free simultaneous
work by multiple users.
Tree A tree is a data structure in which every node apart from the root node has
exactly one previous node and where there is a single path from each leaf to
the root.
Two-Phase Locking Protocol The two-phase locking (2PL) protocol prohibits
transactions from acquiring a new lock after a lock on another database object
used by the transaction has already been released.
Vector Clock Vector clocks are not time-keeping tools but counting algorithms
allowing for a partial chronological ordering of events in concurrent processes.
XML XML (eXtensible Markup Language) describes semi-structured data, con-
tent, and form in a hierarchical manner.