Alejandro Vaisman, Esteban Zimányi - Data Warehouse Systems - Design and Implementation (Data-Centric Systems and Applications) - Springer (2022)
Alejandro Vaisman
Esteban Zimányi
Data Warehouse Systems
Design and Implementation
Second Edition
Data-Centric Systems and Applications
Series Editors
Michael J. Carey, University of California, Irvine, CA, USA
Stefano Ceri, Politecnico di Milano, Milano, Italy
Second Edition
Alejandro Vaisman, Instituto Tecnológico de Buenos Aires, Buenos Aires, Argentina
Esteban Zimányi, Université Libre de Bruxelles, Brussels, Belgium
This Springer imprint is published by the registered company Springer-Verlag GmbH, DE part of Springer
Nature.
The registered company address is: Heidelberger Platz 3, 14197 Berlin, Germany
To Andrés and Manuel,
who bring me joy and
happiness day after day
A.V.
To Elena,
the star that shed light upon my path,
with all my love
E.Z.
Foreword to the Second Edition
Dear reader,
Assuming you are looking for a textbook on data warehousing and the
analytical processing of data, I can assure you that you are certainly in the
right spot. In fact, I could easily argue how panoramic and lucid the view
from this spot is, and in the next few paragraphs, this is exactly what I am
going to do.
Assembling a good book from the bits and pieces of writings, slides, and
article commentaries that an author has in his folders is no easy task. Even
more so if the book is intended to serve as a textbook: it requires an extra dose
of love and care for the students who are going to use it (and their instructors,
too, in fact). The book you have at hand is the product of hard work and
deep caring by our two esteemed colleagues, Alejandro Vaisman and Esteban
Zimányi, who have invested a large amount of effort to produce a book that
is (a) comprehensive, (b) up-to-date, (c) easy to follow, and, (d) useful and
to-the-point. While the book also addresses the researcher who, coming
from a different background, wants to enter the area of data warehousing,
as well as the newcomer to data processing, who might prefer to start the
journey of working with data from the neat setup of data cubes, it is
perfectly suited as a textbook for advanced undergraduate and graduate
courses in the area of data warehousing.
The book comprehensively covers all the fundamental modeling issues and
also addresses the practical aspects of querying and populating the ware-
house. The use of concrete examples, consistently revisited throughout the
book, guides the student in understanding the practical considerations, and a set
of exercises helps the instructor with the hands-on design of a course. For what
it’s worth, I have already used the first edition of the book for my graduate
data warehouse course and will certainly switch to the new version in the
years to come.
If you, dear reader, have already read the first edition of the book, you
already know that the first part, covering the modeling fundamentals, and
the second part, covering the practical usage of data warehousing, are both
comprehensive and detailed. To the extent that the fundamentals have not
changed (and are not really expected to change in the future), apart from a set
of extensions spread throughout the first part of the book, the main improve-
ments concern readability on the one hand and technological advances
on the other. Specifically, the dedicated Chapter 7 on practical data analysis,
with many examples over a specific case study, as well as the new topics cov-
ering partitioning and parallel data processing in the physical management
of the data warehouse, provide an even easier path for the novice reader
into the areas of querying and managing the warehouse.
I would like, however, to take the opportunity and direct your attention to
the really new features of this second edition, which are found in the last unit
of the book, concerning advanced areas of data warehousing. This part goes
beyond the traditional data warehousing modeling and implementation and
is practically completely refreshed compared to the first edition of the book.
The chapter on temporal and multiversion warehousing covers the problem
of time encoding for evolving facts and the management of versions. The part
on spatial warehouses has been significantly updated. There is a brand-new
chapter on graph data processing, and its application to graph warehous-
ing and graph OLAP. Last but extremely significant is the crown jewel of the
book: a brand-new chapter on the management of Big Data and the usage of
Hadoop, Spark, and Kylin, as well as the coverage of distributed, in-memory,
columnar, and Not-Only-SQL DBMSs in the context of analytical data pro-
cessing. Recent developments like data processing in the cloud, polystores, and data
lakes are also covered in the chapter.
Based on all that, dear reader, I can only invite you to dive into the con-
tents of the book, feeling certain that, once you have completed its reading
(or maybe targeted parts of it), you will join me in expressing our gratitude
to Alejandro and Esteban for providing such a comprehensive textbook for
the field of data warehousing in the first place, and for keeping it up to date
with recent developments in this, the current, second edition.
Foreword to the First Edition

Having worked with data warehouses for almost 20 years, I was both honored
and excited when two veteran authors in the field asked me to write a foreword
for their new book and sent me a PDF file with the current draft. Already
the size of the PDF file gave me a first impression of a very comprehensive
book, an impression that was heavily reinforced by reading the Table of
Contents. After reading the entire book, I think it is quite simply the most
comprehensive textbook about data warehousing on the market.
The book is very well suited for one or more data warehouse courses,
ranging from the most basic to the most advanced. It has all the features
that are necessary to make a good textbook. First, a running case study,
based on the Northwind database known from Microsoft’s tools, is used to
illustrate all aspects using many detailed figures and examples. Second, key
terms and concepts are highlighted in the text for better reading and under-
standing. Third, review questions are provided at the end of each chapter so
students can quickly check their understanding. Fourth, the many detailed
exercises for each chapter put the presented knowledge into action, yielding
deep learning and taking students through all the steps needed to develop a
data warehouse. Finally, the book shows how to implement data warehouses
using leading industrial and open-source tools, concretely Microsoft’s suite of
data warehouse tools, giving students the essential hands-on experience that
enables them to put the knowledge into practice.
For the complete database novice, there is even an introductory chapter on
standard database concepts and design, making the book self-contained even
for this group. It is quite impressive to cover all this material, usually the topic
of an entire textbook, without making it a dense read. Next, the book provides
a good introduction to basic multidimensional concepts, later moving on to
advanced concepts such as summarizability. A complete overview of the data
warehouse and online analytical processing (OLAP) “architecture stack” is
given. For the conceptual modeling of the data warehouse, a concise and
intuitive graphical notation is used, a full specification of which is given in
an appendix, along with a methodology for the modeling and the translation
to (logical-level) relational schemas.
Later, the book provides a lot of useful knowledge about designing and
querying data warehouses, including a detailed, yet easy to read, description
of the de facto standard OLAP query language: MultiDimensional eXpres-
sions (MDX). I certainly learned a thing or two about MDX in a short time.
The chapter on extract-transform-load (ETL) takes a refreshingly different
approach by using a graphical notation based on the Business Process Mod-
eling Notation (BPMN), thus treating the ETL flow at a higher and more
understandable level. Unlike most other data warehouse books, this book also
provides comprehensive coverage on analytics, including data mining and re-
porting, and on how to implement these using industrial tools. The book even
has a chapter on methodology issues such as requirements capture and the
data warehouse development process, again something not covered by most
data warehouse textbooks.
However, the one thing that really sets this book apart from its peers is
the coverage of advanced data warehouse topics, such as spatial databases
and data warehouses, spatiotemporal or mobility databases and data ware-
houses, and semantic web data warehouses. The book also provides a useful
overview of novel “big data” technologies like Hadoop and novel database
and data warehouse architectures like in-memory database systems, column
store systems, and right-time data warehouses. These advanced topics are a
distinguishing feature not found in other textbooks.
Finally, the book concludes by pointing to a number of exciting directions
for future research in data warehousing, making it an interesting read even
for seasoned data warehouse researchers.
A famous quote by IBM veteran Bruce Lindsay states that “relational
databases are the foundation of Western civilization.” Similarly, I would say
that “data warehouses are the foundation of twenty-first-century enterprises.”
And this book is in turn an excellent foundation for building those data ware-
houses, from the simplest to the most complex.
Happy reading!
Preface

Since the late 1970s, relational database technology has been adopted by most
organizations to store their essential data. However, nowadays, the needs of
these organizations are not the same as they used to be. On the one hand,
increasing market dynamics and competitiveness led to the need to have the
right information at the right time. Managers need to be properly informed
in order to make appropriate decisions and run their business successfully.
On the other hand, data held by organizations are usually scattered among
different systems, each one devised for a particular kind of business activity.
Further, these systems may also be distributed geographically in different
branches of the organization.
Traditional database systems are not well suited for these new require-
ments, since they were devised to support day-to-day operations rather than
for data analysis and decision making. As a consequence, new database tech-
nologies for these specific tasks emerged in the 1990s, namely, data warehous-
ing and online analytical processing (OLAP), which involve architectures,
algorithms, tools, and techniques for bringing together data from heteroge-
neous information sources into a single repository suited for analysis. In this
repository, called a data warehouse, data are accumulated over a period of
time for the purpose of analyzing their evolution and discovering strategic
information such as trends, correlations, and the like. Data warehousing is
a well-established and mature technology used by organizations to improve
their operations and better achieve their objectives.
On the contrary, the book aims at being a main textbook for undergraduate and
graduate computer science courses on data warehousing and OLAP. As such,
it is written in a pedagogical rather than research style to make the work of
the instructor easier and to help the student understand the concepts being
delivered. Researchers and practitioners who are interested in an introduction
to the area of data warehousing will also find in the book a useful reference.
In summary, we aim at providing in-depth coverage of the main topics in the
field, yet keeping a simple and understandable style.
Throughout the book, we cover all the phases of the data warehousing
process, from requirements specification to implementation. Regarding data
warehouse design, we make a clear distinction between the three abstraction
levels of the American National Standards Institute (ANSI) database archi-
tecture, that is, conceptual, logical, and physical, unlike the usual approaches,
which do not distinguish clearly between the conceptual and logical levels. A
strong emphasis is placed on querying using the de facto standard language
MDX (MultiDimensional eXpressions) as well as the popular language DAX
(Data Analysis eXpressions). Though there are many practical books covering
these languages, academic books have largely ignored them. We also provide
in-depth coverage of the extraction, transformation, and loading (ETL) pro-
cesses. In addition, we study how key performance indicators (KPIs) and
dashboards are built on top of data warehouses. An important topic that
we also cover in this book is temporal and multiversion data warehouses, in
which the evolution over time of the data and the schema of a data warehouse
are taken into account. Although there are many textbooks on spatial data-
bases, this is not the case with spatial data warehouses, which we study in
this book, together with mobility data warehouses, which allow the analysis
of data produced by objects that change their position in space and time,
like cars or pedestrians. Data warehousing and OLAP on graph databases
and on the semantic web are also studied. Finally, big data technologies led
to the concept of big data warehouses, which are also covered in this book.
A key characteristic that distinguishes this book from other textbooks is
that we illustrate how the concepts introduced can be implemented using ex-
isting tools. Specifically, throughout the book we develop a case study based
on the well-known Northwind database using representative tools of different
kinds. In particular, the chapter on logical design includes a complete descrip-
tion of how to define an OLAP cube in Microsoft SQL Analysis Services using
both the multidimensional and the tabular models. Similarly, the chapter on
physical design illustrates how to optimize SQL Server and Analysis Services
applications. Further, in the chapter on ETL we give a complete example
of a process that loads the Northwind data warehouse, implemented using
Integration Services. We also use Analysis Services for defining KPIs, and use
Reporting Services to show how dashboards can be implemented. To illus-
trate spatial and spatiotemporal concepts we use the open-source database
PostgreSQL, its spatial extension PostGIS, and its mobility extension Mobil-
ityDB. In this way, the reader can replicate most of the examples and queries presented in the book.
Part I of the book starts with Chap. 1, giving a historical overview of data
warehousing and OLAP. Chapter 2 introduces the main concepts of rela-
tional databases needed in the remainder of the book. We also introduce the
case study that we will use throughout the book, based on the well-known
Northwind database. Data warehouses and the multidimensional model are
introduced in Chap. 3, as well as the suite of tools provided by SQL Server.
Chapter 4 deals with conceptual data warehouse design, while Chap. 5 is
devoted to logical data warehouse design. Part I closes with Chaps. 6 and 7,
which study SQL/OLAP, the extension of SQL with OLAP features, as well
as MDX and DAX.
Part II covers data warehouse implementation issues. This part starts with
Chap. 8, which tackles classical physical data warehouse design, focusing on
indexing, view materialization, and database partitioning. Chapter 9 studies
conceptual modeling and implementation of ETL processes. Finally, Chap. 10
provides a comprehensive method for data warehouse design.
Part III covers advanced data warehouse topics. This part starts with
Chap. 11, which studies temporal and multiversion data warehouses, for both
data and schema evolution of the data warehouse. Then, in Chap. 12, we
study spatial data warehouses and their exploitation, denoted spatial OLAP
(SOLAP), illustrating the problem with a spatial extension of the North-
wind data warehouse denoted GeoNorthwind. We query this data warehouse
using PostGIS, PostgreSQL’s spatial extension. The chapter also covers mo-
bility data warehousing, using MobilityDB, a spatiotemporal extension of
PostgreSQL. Chapters 13 and 14 address OLAP analysis over graph data
represented, respectively, natively using property graphs in Neo4j and using
RDF triples as advocated by the semantic web. Chapter 15 studies how novel
techniques and technologies for distributed data storage and processing can
be applied to the field of data warehousing. Appendix A summarizes the
notations used in this book.
The figure below illustrates the overall structure of the book and the inter-
dependencies between the chapters described above. Readers may refer to this
figure to tailor their use of this book to their own particular interests. The
dependency graph in the figure suggests many of the possible combinations
that can be devised to offer advanced graduate courses on data warehousing.
[Figure: dependency graph between the chapters. Part I, Fundamental Concepts: 1. Introduction; 2. Database Concepts; 3. Data Warehouse Concepts; 4. Conceptual Data Warehouse Design; 5. Logical Data Warehouse Design; 6. Data Analysis in Data Warehouses; 7. Data Analysis in the Northwind Data Warehouse.]
Acknowledgments

We would like to thank Innoviris, the Brussels Institute for Research and In-
novation, which funded Alejandro Vaisman’s work through the OSCB project;
without its financial support, the first edition of this book would never have
been possible. As mentioned above, some content of this book finds its roots
in a previous book written by one of the authors in collaboration with Elzbi-
eta Malinowski. We would like to thank her for all the work we did together
in making the previous book a reality. This gave us the impetus to start this
new book.
Parts of the material included in this book have been previously presented
in conferences or published in journals. At these conferences, we had the
opportunity to discuss with research colleagues from all around the world,
and we exchanged viewpoints about the subject with them. The anonymous
reviewers of these conferences and journals provided us with insightful com-
ments and suggestions that contributed significantly to improve the work
presented in this book. We would like to thank Zineb El Akkaoui, with
whom we have explored the use of BPMN for ETL processes, and Judith
Awiti, who continued this work. A very special thanks to Waqas Ahmed,
a doctoral student of our laboratory, with whom we explored the issue of
temporal and multiversion data warehouses. Waqas also suggested including
tabular modeling and DAX in the second edition of the book, and without
his invaluable help, all the material related to the tabular model and DAX
would not have been possible. A special thanks to Mahmoud Sakr, Arthur
Lesuisse, Mohammed Bakli, and Maxime Schoemans, who worked with one
of the authors in the development of MobilityDB, a spatiotemporal exten-
sion of PostgreSQL and PostGIS that was used for mobility data warehouses.
This work follows that of Benoit Foé, Julien Lusiela, and Xianling Li, who
explored this topic in the context of their master’s theses. Arthur Lesuisse
also provided invaluable help in setting up all the computer infrastructure
we needed, especially for spatializing the Northwind database. He also con-
tributed to enhancing some of the figures of this book. Thanks also to Leticia
Gómez from the Buenos Aires Technological Institute for her help on the im-
plementation of graph data warehouses and for her advice on the topic of big
data technologies. Bart Kuijpers, from Hasselt University, also worked with
us during our research on graph data warehousing and OLAP. We also want
to thank Lorena Etcheverry, who contributed with comments, exercises, and
solutions in Chap. 14.
Special thanks go to Panos Vassiliadis, professor at the University of Ioan-
nina in Greece, who kindly agreed to write the foreword for this second edi-
tion. Finally, we would like to warmly thank Ralf Gerstner of Springer for his
continued interest in this book. The enthusiastic welcome given to our book
proposal for the first edition and the continuous encouragements to write the
second edition gave us enormous impetus to pursue our project to its end.
Contents
1 Introduction
 1.1 An Overview of Data Warehousing
 1.2 Emerging Data Warehousing Technologies
 1.3 Review Questions
2 Database Concepts
 2.1 Database Design
 2.2 The Northwind Case Study
 2.3 Conceptual Database Design
 2.4 Logical Database Design
  2.4.1 The Relational Model
  2.4.2 Normalization
  2.4.3 Relational Query Languages
 2.5 Physical Database Design
 2.6 Summary
 2.7 Bibliographic Notes
 2.8 Review Questions
 2.9 Exercises
References
Glossary
Index
Part I
Fundamental Concepts
Chapter 1
Introduction
integrated means that the contents of a data warehouse result from the inte-
gration of data from various operational and external systems. Nonvolatile
indicates that a data warehouse accumulates data from operational systems
for a long period of time. Thus, data modification and removal are not al-
lowed in data warehouses, and the only operation allowed is the purging of
obsolete data that is no longer needed. Finally, time varying emphasizes
that a data warehouse keeps track of how its data have evolved over time,
for instance, to know the evolution of sales over the last months or years.
The basic concepts of databases are studied in Chap. 2. The design of oper-
ational databases is typically performed in four phases: requirements spec-
ification, conceptual design, logical design, and physical design. Dur-
ing the requirements specification process, the needs of users at various levels
of the organization are collected. The specification obtained serves as a basis
for creating a database schema capable of responding to user queries. Data-
bases are designed using a conceptual model, such as the entity-relationship
(ER) model, which describes an application without taking into account im-
plementation considerations. The resulting design is then translated into a
logical model, which is an implementation paradigm for database applica-
tions. Nowadays, the most-used logical model for databases is the relational
model. Finally, physical design particularizes the logical model for a specific
implementation platform in order to produce a physical model.
Relational databases must be highly normalized in order to guarantee con-
sistency under frequent updates and a minimum level of redundancy. This
is usually achieved at the expense of a higher cost of querying, because nor-
malization implies partitioning the database into multiple tables. Several au-
thors have pointed out that this design paradigm is not appropriate for data
warehouse applications. Data warehouses must aim at ensuring a deep under-
standing of the underlying data and deliver good performance for complex
analytical queries. This sometimes requires a lesser degree of normalization
or even no normalization at all. To account for these requirements, a dif-
ferent model was needed. Thus, multidimensional modeling was adopted for
data warehouse design. Multidimensional modeling, studied in Chap. 3,
represents data as a collection of facts linked to several dimensions. A fact
represents the focus of analysis (e.g., analysis of sales in stores) and typically
includes attributes called measures, usually numeric values, that allow a
quantitative evaluation of various aspects of an organization. Dimensions
are used to study the measures from several perspectives. For example, a
store dimension might help to analyze sales activities across various stores,
a time dimension can be used to analyze changes in sales over various peri-
ods of time, and a location dimension can be used to analyze sales according
to the geographical distribution of stores. Dimensions typically include at-
tributes that form hierarchies, which allow users to explore measures at
various levels of detail. Examples of hierarchies are month–quarter–year in
the time dimension and city–state–country in the location dimension.
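In relational terms, such a multidimensional model is typically organized as a star schema, with a central fact table referencing the dimension tables. The following minimal SQL sketch illustrates the sales example above; all table and column names here are hypothetical and do not belong to the Northwind case study used in the book.

-- Hypothetical dimension tables: one row per store and one row per day
CREATE TABLE StoreDim (
    StoreKey    INTEGER PRIMARY KEY,
    StoreName   VARCHAR(60),
    City        VARCHAR(40),
    State       VARCHAR(40),
    Country     VARCHAR(40) );

CREATE TABLE DateDim (
    DateKey     INTEGER PRIMARY KEY,
    FullDate    DATE,
    MonthNo     INTEGER,
    QuarterNo   INTEGER,
    YearNo      INTEGER );

-- Fact table: one row per store and day, holding the measures
CREATE TABLE SalesFact (
    StoreKey    INTEGER REFERENCES StoreDim (StoreKey),
    DateKey     INTEGER REFERENCES DateDim (DateKey),
    Quantity    INTEGER,
    SalesAmount DECIMAL(10,2),
    PRIMARY KEY (StoreKey, DateKey) );

Measures such as Quantity and SalesAmount live in the fact table, while the hierarchies month–quarter–year and city–state–country are encoded as attributes of the dimension tables; later chapters develop this representation in detail.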
Chapter 2
Database Concepts
This chapter introduces the basic database concepts, covering modeling, de-
sign, and implementation aspects. Section 2.1 begins by describing the con-
cepts underlying database systems and the typical four-step process used for
designing them, starting with requirements specification, followed by concep-
tual, logical, and physical design. These steps allow a separation of concerns,
where requirements specification gathers the requirements about the appli-
cation and its environment, conceptual design targets the modeling of these
requirements from the perspective of the users, logical design develops an im-
plementation of the application according to a particular database technology,
and physical design optimizes the application with respect to a particular im-
plementation platform. Section 2.2 presents the Northwind case study that we
will use throughout the book. In Sect. 2.3, we review the entity-relationship
model, a popular conceptual model for designing databases. Section 2.4 is
devoted to the most used logical model of databases, the relational model.
Finally, physical design considerations for databases are covered in Sect. 2.5.
The aim of this chapter is to provide the necessary knowledge to under-
stand the remaining chapters in this book, making it self-contained. However,
we do not intend to be comprehensive and refer the interested reader to the
many textbooks on the subject.
The entity-relationship (ER) model is one of the most often used conceptual
models for designing database applications. Although there is general agree-
ment about the meaning of the various concepts of the ER model, a number of
different visual notations have been proposed for representing these concepts.
Appendix A shows the notations we use in this book.
Figure 2.1 shows the ER model for the Northwind database. We next
introduce the main ER concepts using this figure.
[Fig. 2.1 Conceptual schema of the Northwind database. It contains the entity types Shippers, Orders, Employees, Customers, Products, Suppliers, Categories, Territories, and Regions, related through relationship types such as ShippedVia, Managed, ReportsTo, Place, OrderDetails, Supplies, HasCategory, EmployeeTerritories, and Belongs, together with their attributes and cardinalities.]
Entity types that do not have an identifier of their own are called weak
entity types and are represented with a double line on their name box. In
contrast, regular entity types that do have an identifier are called strong
entity types. In Fig. 2.1, there are no weak entity types. However, note that
the relationship OrderDetails between Orders and Products can be modeled as
shown in Fig. 2.2.
[Fig. 2.3 A generalization with supertype Employees and subtypes PermanentEmployees (with additional attribute Salary) and TemporaryEmployees (with attributes ContractExpiration and ContractAmount); the generalization is total and disjoint.]
Finally, it is usual that people refer to the same concept using several
different perspectives with different abstraction levels. The generalization
(or is-a) relationship captures such a mental process. It relates two entity
types, called the supertype and the subtype, meaning that both types rep-
resent the same concept at different levels of detail. The Northwind database
does not include a generalization. To give an example, consider Fig. 2.3, in
which we have a supertype Employees, and two subtypes, PermanentEmploy-
ees and TemporaryEmployees. The former has an additional attribute Salary,
and the latter has attributes ContractExpiration and ContractAmount.
Generalization has three essential characteristics. The first one is pop-
ulation inclusion, meaning that every instance of the subtype is also an
instance of the supertype. In our example, this means that every temporary
employee is also an employee of the Northwind company. The second char-
acteristic is inheritance, meaning that all characteristics of the supertype
(e.g., attributes and roles) are inherited by the subtype. Thus, in our exam-
ple, temporary employees also have, for instance, a name and a title. Finally,
the third characteristic is substitutability, meaning that each time an in-
stance of the supertype is required (e.g., in an operation or in a query), an
instance of the subtype can be used instead.
A generalization can be either total or partial, depending on whether
every instance of the supertype is also an instance of one of the subtypes.
In Fig. 2.3, the generalization is total, since employees are either permanent
or temporary. On the other hand, a generalization can be either disjoint or
overlapping, depending on whether an instance may belong to one or several
subtypes. In our example, the generalization is disjoint, since a temporary
employee cannot be a permanent one.
2.4 Logical Database Design

In this section, we describe the most used logical data model for databases,
that is, the relational model. We also study two well-known query languages
for the relational model: the relational algebra and SQL.
Relational databases have been successfully used for several decades for stor-
ing information in many application domains. In spite of alternative database
technologies that have appeared in the last decades, the relational model is
still the most often used approach for storing the information that is crucial
for the day-to-day operation of an organization.
Much of the success of the relational model, introduced by Codd in 1970,
is due to its simplicity, intuitiveness, and its foundation on a solid formal
theory: The relational model builds on the concept of a mathematical relation,
which can be seen as a table of values and is based on set theory and first-
order predicate logic. This mathematical foundation allowed the design of
declarative query languages, and a rich spectrum of optimization techniques
that led to efficient implementations. Note that, in spite of this, it was only in the
early 1980s that the first commercial relational DBMSs (RDBMSs) appeared.
The relational model has a simple data structure, a relation (or table)
composed of one or several attributes (or columns). Thus, a relational
schema describes the structure of a set of relations. Figure 2.4 shows a
relational schema that corresponds to the conceptual schema of Fig. 2.1. As
Fig. 2.4 Relational schema of the Northwind database that corresponds to the concep-
tual schema in Fig. 2.1
we will see later in this section, this relational schema is obtained by applying
a set of translation rules to the corresponding ER schema. The relational
schema of the Northwind database is composed of a set of relations, such
as Employees, Customers, and Products. Each of these relations is composed
of several attributes. For example, EmployeeID, FirstName, and LastName are
some attributes of the relation Employees. In what follows, we use the notation
R.A to indicate the attribute A of relation R.
In the relational model, each attribute is defined over a domain, or data
type, that is, a set of values with an associated set of operations, the most
typical ones being integer, float, date, and string. One important restriction to
the model is that attributes must be atomic and monovalued. Thus, complex
attributes like Name of the entity type Employees in Fig. 2.1 must be split into
atomic values, like FirstName and LastName in the table of the same name
in Fig. 2.4. Therefore, a relation R is defined by a schema R(A1 : D1, A2 : D2, . . . , An : Dn),
where R is the name of the relation and each attribute Ai is defined over the
domain Di. The relation R is associated with a set of tuples (or rows, if we see
the relation as a table) of the form (t1, t2, . . . , tn). This set of tuples is a subset
of the Cartesian product D1 × D2 × · · · × Dn, and it is sometimes called the
instance or extension of R. The degree (or arity) of a relation is the number
of attributes n of its relation schema.
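For instance, under this definition, the relation Shippers of Fig. 2.4 could be written as Shippers(ShipperID : integer, CompanyName : string, Phone : string). Its degree is 3, and each of its tuples, for example an illustrative tuple (1, ‘Speedy Express’, ‘(503) 555-9831’), is an element of the Cartesian product integer × string × string.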
The relational model allows several types of integrity constraints to be
defined declaratively.
• An attribute may be defined as being non-null, meaning that null val-
ues (or blanks) are not allowed in that attribute. In Fig. 2.4, only the
attributes marked with a cardinality (0,1) allow null values.
• One or several attributes may be defined as a key, that is, it is not al-
lowed that two different tuples of the relation have identical values in
such columns. In Fig. 2.4, keys are underlined. A key composed of sev-
eral attributes is called a composite key, otherwise it is a simple key.
In Fig. 2.4, the table Employees has a simple key, EmployeeID, while the
table EmployeeTerritories has a composite key, composed of EmployeeID
and TerritoryID. In the relational model, each relation must have a pri-
mary key and may have other alternate keys. Further, the attributes
composing the primary key do not accept null values.
• Referential integrity specifies a link between two tables (or between a
table and itself), where a set of attributes in one table, called the foreign key,
references the primary key of the other table. This means that the values
in the foreign key must also exist in the primary key. In Fig. 2.4, referential
integrity constraints are represented by arrows from the referencing table
to the table that is referenced. For example, the attribute EmployeeID
in table Orders references the primary key of the table Employees. This
ensures that every employee appearing in an order also appears in the
table Employees. Note that referential integrity may involve foreign keys
and primary keys composed of several attributes.
• Finally, a check constraint defines a predicate that must be valid
when adding or updating a tuple in a relation. For example, a check
constraint can be used to verify that in table Orders the values of at-
tributes OrderDate and RequiredDate for a given order are such that
OrderDate ≤ RequiredDate. Note that many DBMSs restrict check con-
straints to a single tuple: references to data stored in other tables or
in other tuples of the same table are not allowed. Therefore, check con-
straints can be used only to verify simple constraints.
The above declarative integrity constraints do not suffice to express the
many constraints that exist in any application domain. Such constraints must
then be implemented using triggers. A trigger is a named event-condition-
action rule that is automatically activated when a relation is modified. Trig-
gers can also be used to compute derived attributes, such as attribute Num-
berOrders in table Products in Fig. 2.4. A trigger will update the value of the
attribute each time there is an insert, update, or delete in table OrderDetails.
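As an illustration, the following is a minimal sketch of such a trigger in SQL Server’s T-SQL dialect; the trigger name and the exact way NumberOrders is computed (here, the number of orders in which a product appears) are our own assumptions, not a prescription of the book.

CREATE TRIGGER TR_OrderDetails_NumberOrders
ON OrderDetails
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    -- Recompute the derived attribute only for the products affected by the modification
    UPDATE P
    SET NumberOrders = (
        SELECT COUNT(DISTINCT D.OrderID)
        FROM OrderDetails D
        WHERE D.ProductID = P.ProductID )
    FROM Products P
    WHERE P.ProductID IN (
        SELECT ProductID FROM inserted
        UNION
        SELECT ProductID FROM deleted )
END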
[Figure: relations Orders (OrderID, OrderDate, RequiredDate, ShippedDate, ...) and OrderDetails (OrderID, LineNo, UnitPrice, Quantity, Discount, SalesAmount).]
Fig. 2.6 Three possible translations of the schema in Fig. 2.3 (a) Using Rule 7a; (b)
Using Rule 7b; (c) Using Rule 7c
Note that the generalization type (total vs. partial and disjoint vs. over-
lapping) may preclude one of the above three approaches. For example,
the third possibility is not applicable for partial generalizations. Also,
note that the semantics of the partial, total, disjoint, and overlapping
characteristics are not fully captured by this translation mechanism. The
conditions must be implemented when populating the relational tables.
For example, assume a table T , and two tables T1 and T2 resulting from
the mapping of a total and overlapping generalization. Referential in-
tegrity does not fully capture the semantics. It must be ensured, among
other conditions, that when deleting an element from T , this element
is also deleted from T1 and T2 (since it can exist in both tables). Such
constraints are typically implemented with triggers.
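As an illustration of one of these options, the generalization of Fig. 2.3 could be mapped to a supertype table plus one table per subtype sharing the same key, as in the following sketch (the column types are our own assumptions; Fig. 2.6 shows the three translations precisely):

CREATE TABLE Employees (
    EmployeeID  INTEGER PRIMARY KEY,
    FirstName   VARCHAR(30),
    LastName    VARCHAR(30),
    Title       VARCHAR(30) );

-- Population inclusion: every permanent or temporary employee is also an employee
CREATE TABLE PermanentEmployees (
    EmployeeID  INTEGER PRIMARY KEY REFERENCES Employees (EmployeeID),
    Salary      DECIMAL(10,2) );

CREATE TABLE TemporaryEmployees (
    EmployeeID  INTEGER PRIMARY KEY REFERENCES Employees (EmployeeID),
    ContractExpiration DATE,
    ContractAmount     DECIMAL(10,2) );

As noted above, these referential integrity constraints alone do not enforce that the generalization is total and disjoint; such conditions would typically be checked with triggers.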
Applying these mapping rules to the ER schema given in Fig. 2.1 yields
the relational schema shown in Fig. 2.4. Note that the above rules apply in
the general case; however, other mappings are possible. For example, binary
one-to-one and one-to-many relationships may be represented by a table of
their own, using Rule 5. The choice between these alternative representations depends
on the characteristics of the particular application at hand.
It must be noted that there is a significant difference in expressive power
between the ER model and the relational model. This difference may be
explained by the fact that the ER model is a conceptual model aimed at ex-
pressing concepts as closely as possible to the users’ perspective, whereas the
relational model is a logical model targeted toward particular implementation
platforms. Several ER concepts do not have a correspondence in the relational
model, and thus they must be expressed using only the available concepts in
the model, that is, relations, attributes, and the related constraints. This
translation implies a semantic loss in the sense that data invalid in an ER
schema are allowed in the corresponding relational schema, unless the latter
is supplemented by additional constraints. Many of such constraints must
be manually coded by the user using mechanisms such as triggers or stored
procedures. Furthermore, from a user’s perspective, the relational schema is
much less readable than the corresponding ER schema. This is crucial when
one is considering schemas with hundreds of entity or relationship types and
thousands of attributes. This is not a surprise, since this was the reason for
devising conceptual models back in the 1970s, that is, the aim was to better
understand the semantics of large relational schemas.
2.4.2 Normalization
For example, assume that in relation OrderDetails in Fig. 2.7a, each prod-
uct, no matter the order, is associated with a discount percentage. Here, the
discount information for a product p will be repeated for all orders in which
p appears. Thus, this information will be redundant. To solve this problem,
the attribute Discount must be removed from the table OrderDetails and must
be added to the table Products in order to store only once the information
about the product discounts.
Consider now the relation Products in Fig. 2.7b, which is a variation of the
relation with the same name in Fig. 2.4. In this case, we have included the
category information (name, description, and picture) in the Products rela-
tion. It is easy to see that such information about a category is repeated for
each product with the same category. Therefore, when, for example, the de-
scription of a category needs to be updated, we must ensure that all tuples in
the relation Products, corresponding to the same category, are also updated,
otherwise there will be inconsistencies. To solve this problem, the attributes
describing a category must be removed from the table, and a table Categories
like the one in Fig. 2.4 must be used to store the data about categories.
Finally, let us analyze the relation EmployeeTerritories in Fig. 2.7c, where an
additional attribute KindOfWork has been added with respect to the relation
with the same name in Fig. 2.4. Assume that an employee can do many kinds
of work, independently of the territories in which she carries out her work.
Thus, the information about the kind of work of an employee will be repeated
as many times as the number of territories she is assigned to. To solve this
problem, the attribute KindOfWork must be removed from the table, and a
table EmpWork relating employees to the kinds of work they perform must
be added.
Dependencies and normal forms are used to describe the redundancies
above. A functional dependency is a constraint between two sets of at-
tributes in a relation. Given a relation R and two sets of attributes X and Y
in R, a functional dependency X → Y holds if and only if, in all the tuples
of the relation, each value of X is associated with at most one value of Y .
In this case it is said that X determines Y . Note that a key is a particular
case of a functional dependency, where the set of attributes composing the
key functionally determines all of the attributes in the relation.
The redundancies in Fig. 2.7a,b can be expressed by means of functional
dependencies. For example, in the relation OrderDetails in Fig. 2.7a, there
is the functional dependency ProductID → Discount. Also, in the relation
Products in Fig. 2.7b, we have the functional dependencies ProductID →
CategoryName and CategoryName → Description.
The redundancy in the relation EmployeeTerritories in Fig. 2.7c is captured
by another kind of dependency. Given two sets of attributes X and Y in a rela-
tion R, a multivalued dependency X →→ Y holds if the value of X deter-
mines a set of values for Y , which is independent of R\XY , where ‘\’ indicates
the set difference. In this case we say that X multidetermines Y . In the rela-
tion in Fig. 2.7c, the multivalued dependency EmployeeID →→ KindOfWork holds.
Relational Algebra
The relational algebra is a collection of operations for manipulating relations.
These operations can be of two kinds: unary, which receive as argument a
relation and return another relation, or binary, which receive as argument
two relations and return a relation. As the operations always return relations,
the algebra is closed, and operations can be combined in order to compute the
answer to a query. Another classification of the operations is as follows. Ba-
sic operations cannot be derived from any combination of other operations,
while derived operations are a shorthand for a sequence of basic operations,
defined for making queries easier to express. In what follows, we describe the
relational algebra operations. We start first with the unary operations.
The projection operation, denoted by πC1 ,...,Cn (R), returns the columns
C1 , . . . , Cn from the relation R. Thus, it can be seen as a vertical partition of
R into two relations: one containing the columns mentioned in the expression,
and the other containing the remaining columns.
For the database given in Fig. 2.4, an example of a projection is:
πFirstName,LastName,HireDate (Employees).
This operation returns the three specified attributes from the Employees table.
The selection operation, denoted by σϕ (R), returns the tuples from the
relation R that satisfy the Boolean condition ϕ. In other words, it partitions a
table horizontally into two sets of tuples: the ones that do satisfy the condition
and the ones that do not. Therefore, the structure of R is kept in the result.
A selection operation over the database given in Fig. 2.4 is:
σHireDate≥‘01/01/2012’∧HireDate≤‘31/12/2014’ (Employees).
This operation returns the employees hired between 2012 and 2014.
Since the result of a relational algebra operation is a relation, it can be used
as input for another operation. To make queries easier to read, sometimes it
is useful to use temporary relations to store intermediate results. We will use
the notation T ← Q to indicate that relation T stores the result of query Q.
Thus, combining the two previous examples, we can ask for the first name,
last name, and hire date of all employees hired between 2012 and 2014. The
query reads:
Temp1 ← σHireDate≥‘01/01/2012’∧HireDate≤‘31/12/2014’ (Employees)
Result ← πFirstName,LastName,HireDate (Temp1).
The rename operation, denoted by ρA1 →B1 ,...,Ak →Bk (R), returns a rela-
tion where the attributes A1 , . . . , Ak in R are renamed to B1 , . . . , Bk , respec-
tively. Therefore, the resulting relation has the same tuples as the relation
R, although the schema of both relations is different.
We present next the binary operations, which are based on classic oper-
ations of set theory. Some of these operations require that the relations
be union compatible. Two relations R1 (A1 , . . . , An ) and R2 (B1 , . . . , Bn )
are said to be union compatible if they have the same degree n and for all
i = 1, . . . , n, the domains of Ai and Bi are equal.
The three operations union, intersection, and difference on two union-
compatible relations R1 and R2 are defined as follows:
• The union operation, denoted by R1 ∪ R2 , returns the tuples that are in
R1 , in R2 , or in both, removing duplicates.
• The intersection operation, denoted by R1 ∩ R2 , returns the tuples that
are in both R1 and in R2 .
• The difference operation, denoted by R1 \ R2 , returns the tuples that
are in R1 but not in R2 .
If the relations are union compatible, but the attribute names differ, by con-
vention the attribute names of the first relation are kept in the result.
The union can be used to express queries like “Identifier of employees from
the UK or who are reported by an employee from the UK,” which reads:
UKEmps ← σCountry=‘UK’ (Employees)
Result1 ← πEmployeeID (UKEmps)
Result2 ← ρ ReportsTo→EmployeeID (πReportsTo (UKEmps))
Result ← Result1 ∪ Result2.
Relation UKEmps contains the employees from the UK. Result1 contains the
projection of the former over EmployeeID, and Result2 contains the Employ-
eeID of the employees reported by an employee from the UK. The union of
Result1 and Result2 yields the desired result.
The intersection can be used to express queries like “Identifier of employ-
ees from the UK who are reported by an employee from the UK,” which is
obtained by replacing the last expression above by the following one:
Result ← Result1 ∩ Result2.
Finally, the difference can be used to express queries like “Identifier of em-
ployees from the UK who are not reported by an employee from the UK,”
which is obtained by replacing the expression above by the following one:
Result ← Result1 \ Result2.
The Cartesian product combines each product with all the suppliers. We are
only interested in the rows that relate a product to its supplier. For this, we
filter the meaningless tuples, select the ones corresponding to suppliers from
Brazil, and project the column we want, that is, ProductName:
Temp4 ← σSupplierID=SupID (Temp3)
Result ← πProductName (σCountry=‘Brazil’ (Temp4)).
Using the join operation, the query “Name of the products supplied by
suppliers from Brazil” will read:
Temp1 ← ρSupplierID→SupID (Suppliers)
Result ← πProductName (σCountry=‘Brazil’ (Products ⋈SupplierID=SupID Temp1)).
Note that the join combines the Cartesian product in Temp3 and the selection
in Temp4 in a single operation, making the expression much more concise.
There are a number of variants of the join operation. An equijoin is a join
R1 ⋈ϕ R2 such that condition ϕ states only equality comparisons. A natural
join, denoted by R1 ∗ R2, is a type of equijoin that states the equality between
all the attributes with the same name in R1 and R2 . The resulting table is
defined over the schema R1 ∪R2 (i.e., all the attributes in R1 and R2 , without
duplicates). In the case that there are no attributes with the same names, a
cross join is performed.
For example, the query “List all product names and category names” reads:
Temp ← Products ∗ Categories
Result ← πProductName,CategoryName (Temp).
The first query performs the natural join between relations Products and
Categories. The attributes in Temp are all the attributes in Products, plus all
the attributes in Categories, except for CategoryID, which is in both relations,
so only one of them is kept. The second query performs the final projection.
The joins introduced above are known as inner joins, since tuples that do
not match the join condition are eliminated. In many practical cases we need
to keep in the result all the tuples of one or both relations, independently
of whether or not they verify the join condition. For these cases, a set of
operations, called outer joins, were defined. There are three kinds of outer
joins: left outer join, right outer join, and full outer join.
The left outer join, denoted by R1 ⟕ϕ R2, performs the join as defined
above, but instead of keeping only the matching tuples, it keeps every tuple
in R1 (the relation of the left of the operation). If a tuple in R1 does not
satisfy the join condition, the tuple is kept, and the attributes of R2 in the
result are filled with null values.
As an example, the query “Last name of employees, together with the last
name of their supervisor, or null if the employee has no supervisor,” reads in
relational algebra:
Supervisors ← ρEmployeeID→SupID,LastName→SupLastName (Employees)
Result ← πEmployeeID,LastName,SupID,SupLastName (
Employees ⟕ReportsTo=SupID Supervisors).
The resulting table has tuples such as (2, Fuller, NULL, NULL), which corre-
spond to employees who do not report to any other employee.
The right outer join, denoted by R1 ⟖ϕ R2, is analogous to the left
outer join, except that the tuples that are kept are the ones in R2. The full
outer join, denoted by R1 ⟗ϕ R2, keeps all the tuples in both R1 and R2.
With respect to the left outer join shown above, the resulting table has in
addition tuples such as (NULL, NULL, 1, Davolio), which correspond to em-
ployees who do not supervise any other employee.
The division, denoted by R1 ÷ R2 , is used to express queries involv-
ing universal quantification, that is, those that are typically stated using the
word all. It returns tuples from R1 that have combinations with all tuples
from R2 . More precisely, given two tables R1 (A1 , . . . , Am , B1 , . . . , Bn ) and
R2 (B1 , . . . , Bn ), R1 ÷ R2 returns the tuples (a1 , . . . , am ) such that R1 con-
tains a tuple (a1 , . . . , am , b1 , . . . , bn ) for every tuple (b1 , . . . , bn ) in R2 .
For example, the query “Suppliers that have products in all categories” is
written as follows.
SuppCat ← πCompanyName, CategoryID (Suppliers ∗ Products ∗ Categories)
Result ← SuppCat ÷ πCategoryID (Categories)
Table SuppCat finds the pairs of supplier and category such that the sup-
plier has at least one product from the category. The result is obtained by
dividing this table by a table containing the identifiers of the categories.
SQL
SQL (structured query language) is the most common language for creating,
manipulating, and retrieving data from relational DBMSs. SQL is composed
of several sublanguages. The data definition language (DDL) is used to
define the schema of a database. The data manipulation language (DML)
is used to query a database and to modify its content (i.e., to add, update,
and delete data in a database). In what follows, we present a summary of the
main features of SQL that we will use in this book. For a detailed description,
we refer the reader to the references provided at the end of this chapter.
Below we show the SQL DDL command for defining table Orders in the
schema of Fig. 2.4 (only some of the attributes are shown). The basic DDL state-
ment is CREATE TABLE, which creates a relation and defines the data types
of the attributes, the primary and foreign keys, and the constraints.
CREATE TABLE Orders (
OrderID INTEGER PRIMARY KEY,
CustomerID INTEGER NOT NULL,
EmployeeID INTEGER NOT NULL,
OrderDate DATE NOT NULL,
...
FOREIGN KEY (CustomerID) REFERENCES Customers(CustomerID),
FOREIGN KEY (ShippedVia) REFERENCES Shippers(ShipperID),
FOREIGN KEY (EmployeeID) REFERENCES Employees(EmployeeID),
CHECK (OrderDate <= RequiredDate) )
SQL provides a DROP TABLE statement for deleting a table, and an
ALTER TABLE statement for modifying the structure of a table.
The DML part of SQL is used to insert, update, and delete tuples from
the database tables. For example, the following INSERT statement
INSERT INTO Shippers(CompanyName, Phone)
VALUES ('Federal Express', '02 752 75 75')
adds a new shipper in the Northwind database. This tuple is modified by the
following UPDATE statement:
UPDATE Shippers
SET CompanyName = 'Fedex'
WHERE CompanyName = 'Federal Express'
SQL also provides statements for retrieving data from the database. The
basic structure of an SQL expression is:
SELECT ⟨list of attributes⟩
FROM ⟨list of tables⟩
WHERE ⟨condition⟩
where ⟨list of attributes⟩ indicates the attribute names whose values are to be
retrieved by the query, ⟨list of tables⟩ is a list of the relation names that will
be included in the query, and ⟨condition⟩ is a Boolean expression that must
be satisfied by the tuples in the result. The semantics of an SQL expression
SELECT R.A, S.B
FROM R, S
WHERE R.B = S.A
is given by the relational algebra expression πR.A,S.B (σR.B=S.A (R × S)),
that is, the SELECT clause is analogous to a projection π, the WHERE clause
is a selection σ, and the FROM clause indicates the Cartesian product ×
between all the tables included in the clause.
It is worth noting that an SQL query, as opposed to a relational algebra one,
returns a bag (or multiset), which may contain duplicates. Therefore, the keyword DISTINCT
must be used to remove duplicates in the result. For example, the query
“Countries of customers” must be written:
SELECT DISTINCT Country
FROM Customers
This query returns the set of countries of the Northwind customers, without
duplicates. If the DISTINCT keyword is removed from the above query, then
it would return as many results as the number of customers in the database.
As another example, the query “Identifier, first name, and last name of the
employees hired between 2012 and 2014,” which we presented when discussing
the projection and selection operations, reads in SQL:
SELECT EmployeeID, FirstName, LastName
FROM Employees
WHERE HireDate >= '2012-01-01' and HireDate <= '2014-12-31'
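The union operation is expressed in SQL with the UNION keyword. The following is a minimal sketch of the earlier relational algebra example “Identifier of employees from the UK or who are reported by an employee from the UK”; the book’s exact formulation may differ:

SELECT EmployeeID
FROM Employees
WHERE Country = 'UK'
UNION
SELECT ReportsTo AS EmployeeID
FROM Employees
WHERE Country = 'UK'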
The UNION in the above query removes duplicates in the result, whereas the
UNION ALL keeps them, that is, if an employee from the UK is reported by
at least one employee from the UK, it will appear twice in the result.
The join operation can be implemented as a projection of a selection over
the Cartesian product of the relations involved. However, in general, it is
easier and more efficient to use the join operation. For example, the query
“Name of the products supplied by suppliers from Brazil” can be written as:
SELECT ProductName
FROM Products P, Suppliers S
WHERE P.SupplierID = S.SupplierID AND Country = 'Brazil'
The outer join operation must be explicitly stated in the FROM clause. For
example, the query “First name and last name of employees, together with
the first name and last name of their supervisor, or null if the employee has
no supervisor” can be implemented using the LEFT OUTER JOIN operation.
SELECT E.FirstName, E.LastName, S.FirstName, S.LastName
FROM Employees E LEFT OUTER JOIN Employees S
ON E.ReportsTo = S.EmployeeID
Analogously, we can use the FULL OUTER JOIN operation to also include
in the answer the employees who do not supervise anybody.
When there is a GROUP BY clause, the SELECT clause must contain only
aggregates or grouping attributes. The HAVING clause is analogous to the
WHERE clause, except that the condition is applied over each group rather
than over each tuple. Finally, the result can be sorted with the ORDER BY
clause, where every attribute in the list can be ordered either in ascending
or descending order by specifying ASC or DESC, respectively.
We next present some examples of aggregate SQL queries. We start with
the query “Total number of orders handled by each employee, in descending
order of number of orders. Only list employees with more than 100 orders.”
SELECT EmployeeID, COUNT(*) AS OrdersByEmployee
FROM Orders
GROUP BY EmployeeID
HAVING COUNT(*) > 100
ORDER BY COUNT(*) DESC
Consider now the query “For customers from Germany, list the total quan-
tity of each product ordered. Order the result by customer ID, in ascending
order, and by quantity of product ordered, in descending order.”
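A possible formulation of this query, using the Northwind tables Customers, Orders, and OrderDetails, is sketched below:
SELECT C.CustomerID, D.ProductID, SUM(D.Quantity) AS TotalQty
FROM Customers C, Orders O, OrderDetails D
WHERE C.CustomerID = O.CustomerID AND O.OrderID = D.OrderID AND
   C.Country = 'Germany'
GROUP BY C.CustomerID, D.ProductID
ORDER BY C.CustomerID ASC, TotalQty DESC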
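A query may also contain another query, called a nested query or subquery, in its WHERE clause. For example, the query “Identifier and name of products ordered by customers from Germany” could be written along the following lines, using the IN predicate:
SELECT ProductID, ProductName
FROM Products
WHERE ProductID IN (
   SELECT D.ProductID
   FROM Orders O, OrderDetails D, Customers C
   WHERE O.OrderID = D.OrderID AND O.CustomerID = C.CustomerID AND
      C.Country = 'Germany' )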
The inner query computes the products ordered by customers from Germany.
This returns a bag of product identifiers. The outer query scans the Products
table, and for each tuple, it compares the product identifier with the set
of identifiers returned by the inner query. If the product is in the set, the
product identifier and the product name are listed.
The query above can be formulated using the EXISTS predicate, yielding
what are referred to as correlated nested queries, as follows:
SELECT ProductID, ProductName
FROM Products P
WHERE EXISTS (
SELECT *
FROM Orders O JOIN Customers C ON O.CustomerID = C.CustomerID
   JOIN OrderDetails D ON O.OrderID = D.OrderID
WHERE C.Country = 'Germany' AND D.ProductID = P.ProductID )
In the outer query we define an alias (or variable) P. For each tuple in Prod-
ucts, the variable P in the inner query is instantiated with the values in such
tuple; if the result set of the inner query instantiated in this way is not empty,
the EXISTS predicate evaluates to true, and the values of ProductID and Pro-
ductName are listed. The process is repeated for all tuples in Products.
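For example, the query “Identifier and name of products that have never been ordered” could be written with the NOT EXISTS predicate along the following lines:
SELECT ProductID, ProductName
FROM Products P
WHERE NOT EXISTS (
   SELECT *
   FROM OrderDetails D
   WHERE D.ProductID = P.ProductID )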
Here, the NOT EXISTS predicate will evaluate to true if, when P is instanti-
ated in the inner query, the query returns the empty set.
The division operator we introduced in the previous section is not explicitly
implemented in SQL. It must be expressed by nesting two NOT EXISTS
predicates. For example, the query “Names of suppliers who have products
in all categories” can be written as follows:
SELECT S.CompanyName
FROM Suppliers S
WHERE NOT EXISTS (
SELECT *
FROM Categories C
WHERE NOT EXISTS (
SELECT *
FROM Products P
WHERE P.CategoryID = C.CategoryID AND
S.SupplierID = P.SupplierID ) )
This query uses double negatives to express the query condition. Indeed, it
can be read as follows: “Suppliers for which there is no category such that no
product of the category is supplied by the supplier.”
In SQL, a view is just a query that is stored in the database with an
associated name. Thus, views are like virtual tables. A view can be created
from one or many tables or other views.
Views can be used for various purposes. They are used to structure data in
a way that users find natural or intuitive. They can also be used to restrict
access to data such that users can have access only to the data they need.
Finally, views can also be used to summarize data from various tables, which
can be used, for example, to generate reports.
Views are created with the CREATE VIEW statement. To create a view, a
user must have appropriate system privileges to modify the database schema.
Once a view is created, it can then be used in a query as any other table.
For example, the following statement creates a view CustomerOrders that
computes for each customer and order the total amount of the order.
CREATE VIEW CustomerOrders AS
SELECT O.CustomerID, O.OrderID,
SUM(D.Quantity * D.UnitPrice) AS Amount
FROM Orders O, OrderDetails D
WHERE O.OrderID = D.OrderID
GROUP BY O.CustomerID, O.OrderID
This view is used in the next query to compute for each customer the maxi-
mum amount among all her orders.
SELECT CustomerID, MAX(Amount) AS MaxAmount
FROM CustomerOrders
GROUP BY CustomerID
As we will see in Chap. 8, views can be materialized, that is, they can be
physically stored in a database.
A common table expression (CTE) is a temporary table defined within
an SQL statement. Such temporary tables can be seen as views within the
scope of the statement. A CTE is typically used when a user does not have
the necessary privileges for creating a view.
For example, the following query combines in a single statement the view
definition and the subsequent query given above.
WITH CustomerOrders AS (
SELECT O.CustomerID, O.OrderID,
SUM(D.Quantity * D.UnitPrice) AS Amount
FROM Orders O, OrderDetails D
WHERE O.OrderID = D.OrderID
GROUP BY O.CustomerID, O.OrderID )
SELECT CustomerID, MAX(Amount) AS MaxAmount
FROM CustomerOrders
GROUP BY CustomerID
Note that several temporary tables can be defined in the WITH clause.
For optimal disk efficiency, the database block size must be equal to, or be a
multiple of, the disk block size.
DBMSs reserve a storage area in the main memory that holds several
database pages, which can be accessed for answering a query without reading
those pages from the disk. This area is called a buffer. When a request is
issued to the database, the query processor checks if the required data records
are included in the pages already loaded in the buffer. If so, data are read
from the buffer and/or modified. In the latter case, the modified pages are
marked as such and eventually written back to the disk. If the pages needed to
answer the query are not in the buffer, they are read from the disk, probably
replacing existing ones in the buffer (if it is full, which is normally the case)
using well-known algorithms, for example, replacing the least recently used
pages with the new ones. In this way, the buffer acts as a cache that the
DBMS can access to avoid going to disk, enhancing query performance.
File organization is the physical arrangement of data in a file into records
and blocks on secondary storage. There are three main types of file organiza-
tion. In a heap (or unordered) file organization, records are placed in the
file in the order in which they are inserted. This makes insertion very effi-
cient. However, retrieval is relatively slow, since the various pages of the file
must be read in sequence until the required record is found. Sequential (or
ordered) files have their records sorted on the values of one or more fields,
called ordering fields. Ordered files allow fast retrieval of records, provided
that the search condition is based on the sorting attribute. However, inserting
and deleting records in a sequential file are problematic, since the order must
be maintained. Finally, hash files use a hash function that calculates the
address of the block (or bucket) in which a record is to be stored, based on
the value of one or more attributes. Within a bucket, records are placed in
order of arrival. A collision occurs when a bucket is filled to its capacity and
a new record must be inserted into that bucket. Hashing provides the fastest
possible access for retrieving an arbitrary record given the value of its hash
field. However, collision management degrades the overall performance.
Independently of the particular file organization, additional access struc-
tures called indexes are used to speed up the retrieval of records in response
to search conditions. Indexes provide efficient ways to access the records based
on the indexing fields that are used to construct the index. Any field(s) of
the file can be used to create an index, and multiple indexes on different fields
can be constructed in the same file.
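Although the SQL standard does not define statements for managing indexes, most systems provide a CREATE INDEX statement for this purpose (the exact syntax, in particular for specifying a clustered index, varies across systems). For example, an index on the CustomerID attribute of the Orders table could be created as follows:
CREATE INDEX IX_Orders_CustomerID ON Orders(CustomerID)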
There are many different types of indexes. We describe below some cate-
gories of indexes according to various criteria.
• One categorization of indexes distinguishes between clustered and non-
clustered indexes, also called primary and secondary indexes. In
a clustered index, the records in the data file are physically ordered ac-
cording to the field(s) on which the index is defined. This is not the case
for a nonclustered index. A file can have at most one clustered index and
in addition can have several nonclustered indexes.
2.6 Summary
This chapter introduced the background database concepts that will be used
throughout the book. We started by describing database systems and the
usual steps followed for designing them, that is, requirements specification,
conceptual design, logical design, and physical design. Then, we presented the
Northwind case study, which was used to illustrate the different concepts in-
troduced throughout the chapter. We presented the entity-relationship model,
a well-known conceptual model. With respect to logical models, we studied
the relational model and also gave the mapping rules that are used to translate a conceptual schema into a relational schema.
2.7 Bibliographic Notes
For a general overview of all the concepts covered in this chapter, we refer the
reader to the textbooks [70, 79]. An overall view of requirements engineering
is given in [59]. Conceptual database design is covered in [171] although it is
based on UML [37] instead of the entity-relationship model. Logical database
design is covered in [230]. A thorough overview of the components of the
SQL:1999 standard is given in [151, 153], and later versions of the standard
are described in [133, 152, 157, 272]. Physical database design is detailed
in [140].
2.8 Review Questions
2.11 Illustrate with examples the different types of redundancy that may
occur in a relation. How can redundancy in a relation induce problems
in the presence of insertions, updates, and deletions?
2.12 What is the purpose of functional and multivalued dependencies?
What is the difference between them?
2.13 Describe the different operations of the relational algebra. Elaborate
on the difference between the several types of joins. How can a join
be expressed in terms of other operations of the relational algebra?
2.14 What is SQL? What are the sublanguages of SQL?
2.15 What is the general structure of SQL queries? How can the semantics
of an SQL query be expressed with the relational algebra?
2.16 Discuss the differences between the relational algebra and SQL. Why
is relational algebra an operational language, whereas SQL is a declar-
ative language?
2.17 Explain what duplicates are in SQL and how they are handled.
2.18 Describe the general structure of SQL queries with aggregation and
sorting. State the basic aggregation operations provided by SQL.
2.19 What are subqueries in SQL? Give an example of a correlated sub-
query.
2.20 What are common table expressions in SQL? What are they needed
for?
2.21 What is the objective of physical database design? Explain some fac-
tors that can be used to measure the performance of database appli-
cations and the trade-offs that have to be resolved.
2.22 Explain different types of file organization. Discuss their respective
advantages and disadvantages.
2.23 What is an index? Why are indexes needed? Explain the various types
of indexes.
2.24 What is clustering? What is it used for?
2.9 Exercises
Exercise 2.1. A French horse race fan wants to set up a database to analyze
the performance of the horses as well as the betting payoffs.
A racetrack is described by a name (e.g., Hippodrome de Chantilly), a
location (e.g., Chantilly, Oise, France), an owner, a manager, a date opened,
and a description. A racetrack hosts a series of horse races.
A horse race has a name (e.g., Prix Jean Prat), a category (i.e., Group 1,
2, or 3), a race type (e.g., thoroughbred flat racing), a distance (in meters), a
track type (e.g., turf right-handed), qualification conditions (e.g., 3-year-old
excluding geldings), and the first year it took place.
A meeting is held on a certain date and a racetrack and is composed of
one or several races. For a meeting, the following information is kept: weather
(e.g., sunny, stormy), temperature, wind speed (in km per hour), and wind
direction (N, S, E, W, NE, etc.).
Each race of a meeting is given a number and a departure time and has
a number of horses participating in it. The application must keep track of
the purse distribution, that is, how the amount of prize money is distributed
among the top places (e.g., first place: €228,000, second place: €88,000, etc.),
and the time of the fastest horse.
Each race at a date offers several bet types (e.g., tiercé, quarté+) each
type offering zero or more bet options (e.g., in order, in any order, and bonus
for the quarté+). The payoffs are given for a bet type and a base amount
(e.g., quarté+ for €2) and specify for each option the win amount and the
number of winners.
A horse has a name, a breed (e.g., thoroughbred), a sex, a foaling date
(i.e., birth date), a gelding date (i.e., castration date for male horses, if any),
a death date (if any), a sire (i.e., father), a dam (i.e., mother), a coat color
(e.g., bay, chestnut), an owner, a breeder, and a trainer.
A horse that participates in a race with a jockey is assigned a number and
carries a weight according to the conditions attached to the race or to equalize
the difference in ability between the runners. Finally, the arrival place and
the margin of victory of the horses are kept by the application.
a. Design an ER schema for this application. If you need additional infor-
mation, you may look at the various existing French horse racing web
sites.
b. Translate the ER schema above into the relational model. Indicate the
keys of each relation, the referential integrity constraints, and the non-
null constraints.
Exercise 2.2. A Formula One fan club wants to set up a database to keep
track of the results of all the seasons since the first Formula One World
championship in 1950.
A season is held on a year, between a starting and an ending date, has a
number of races, and is described by a summary and a set of regulations. A
race has a number (stating its order in a season), an official name (e.g., 2013
Formula One Shell Belgian Grand Prix), a race date, a race time (expressed
in both local and UTC time), a description of the weather when the race
took place, the pole position (consisting of driver name and time realized),
and the fastest lap (consisting of driver name, time, and lap number).
Each race of a season belongs to a Grand Prix (e.g., Belgian Grand Prix),
for which the following information is kept: active years (e.g., 1950–1956,
1958, etc. for the Belgian Grand Prix), total number of races (58 races as
of 2013 for the Belgian Grand Prix), and a short historical description. The
race of a season is held on a circuit, described by its name (e.g., Circuit
de Spa-Francorchamps), location (e.g., Spa, Belgium), type (such as race,
road, street), number of laps, circuit length, race distance (the latter two
expressed in kilometers), and lap record (consisting of time, driver, and year).
Notice that the course of the circuits may be modified over the years. For
example, the Spa-Francorchamps circuit was shortened from 14 to 7 km in
1979. Further, a Grand Prix may use several circuits over the years, as was
the case for the Belgian Grand Prix.
A team has a name (e.g., Scuderia Ferrari), one or two bases (e.g.,
Maranello, Italy), and one or two current principals (e.g., Stefano Domeni-
cali). In addition, a team keeps track of its debut (the first Grand Prix en-
tered), the number of races competed, the number of world championships
won by constructor and by driver, the highest race finish (consisting of place
and number of times), the number of race victories, the number of pole po-
sitions, and the number of fastest laps. A team competing in a season has a
full name, which typically includes its current sponsor (e.g., Scuderia Ferrari
Marlboro from 1997 to 2011), a chassis (e.g., F138), an engine (e.g., Ferrari
056), and a tyre brand (e.g., Bridgestone).
For each driver, the following information is kept: name, nationality, birth
date and birth place, number of races entered, number of championships won,
number of wins, number of podiums, total points in the career, number of pole
positions, number of fastest laps, highest race finish (consisting of place and
number of times), and highest grid position (consisting of place and number
of times). Drivers are hired by teams competing in a season as either main
drivers or test drivers. Each team has two main drivers and usually two test
drivers, but the number of test drivers may vary from none to six. In addition,
although a main driver is usually associated with a team for the whole season, he or she
may only participate in some of the races of the season. A team participating
in a season is assigned two consecutive numbers for its main drivers, where
the number 1 is assigned to the team that won the constructor’s world title
the previous season. Further, the number 13 is usually not given to a car; it
only appeared once, in the Mexican Grand Prix in 1963.
A driver participating in a Grand Prix must participate in a qualifying
session, which determines the starting order for the race. The results kept
for a driver participating in the qualifying session are the position and the
time realized for the three parts (called Q1, Q2, and Q3). Finally, the results
kept for a driver participating in a race are the following: position (may be
optional), number of laps, time, the reason why the driver retired or was
disqualified (both may be optional) and the number of points (scored only
for the top eight finishers).
a. Design an ER schema for this application. Note any unspecified require-
ments and integrity constraints, and make appropriate assumptions to
make the specification complete. If you need additional information, you
may look at the various existing Formula One web sites.
b. Translate the ER schema above into the relational model. Indicate the
keys of the relations, the referential integrity constraints, and the
non-null constraints.
Exercise 2.3. Consider the following queries for the Northwind database.
Write queries (a)–(g) in relational algebra and all the queries in SQL.
a. Name, address, city, and region of employees.
b. Name of employees and name of customers located in Brussels related
through orders that are sent by Speedy Express.
c. Title and name of employees who have sold at least one of the products
“Gravad Lax” or “Mishi Kobe Niku.”
d. Name and title of employees as well as the name and title of the employee
to whom they report.
e. Name of products that were sold by employees or purchased by customers
located in London.
f. Name of employees and name of the city where they live for employees
who have sold to customers located in the same city.
g. Names of products that have not been ordered.
h. Names of customers who bought all products.
i. Name of categories and the average price of products in each category.
j. Identifier and name of the companies that provide more than three prod-
ucts.
k. Identifier, name, and total sales of employees ordered by employee iden-
tifier.
l. Name of employees who sell the products of more than seven suppliers.
Chapter 3
Data Warehouse Concepts
This chapter introduces the basic concepts of data warehouses. A data ware-
house is a particular database targeted toward decision support. It takes data
from various operational databases and other data sources and transforms it
into new structures that fit better for the task of performing business anal-
ysis. Data warehouses are based on a multidimensional model, where data
are represented as hypercubes, with dimensions corresponding to the vari-
ous business perspectives and cube cells containing the measures to be an-
alyzed. In Sect. 3.1, we study the multidimensional model and present its
main characteristics and components. Section 3.2 gives a detailed description
of the most common operations for manipulating data cubes. In Sect. 3.3, we
present the main characteristics of data warehouse systems and compare them
against operational databases. The architecture of data warehouse systems
is described in detail in Sect. 3.4. As we shall see, in addition to the data
warehouse itself, data warehouse systems are composed of back-end tools,
which extract data from the various sources to populate the warehouse, and
front-end tools, which are used to extract the information from the ware-
house and present it to users. We finish in Sect. 3.5 by describing SQL Server,
a representative suite of business intelligence tools.
3.1 Multidimensional Model
The importance of data analysis has been steadily increasing from the early
1990s, as organizations in all sectors are being required to improve their
decision-making processes in order to maintain their competitive advantage.
Traditional database systems like the ones studied in Chap. 2 do not satisfy
the requirements of data analysis. They are designed and tuned to support
the daily operations of an organization, and their primary concern is to en-
sure fast, concurrent access to data. This requires transaction processing and
concurrency control capabilities, as well as recovery techniques that guarantee data consistency.
Fig. 3.1 A three-dimensional cube for sales data with dimensions Product, Time, and
Customer, and a measure Quantity
3.1.1 Hierarchies
The user may want to see the sales figures at a finer granularity, such as at the month level, or at
a coarser granularity, such as at the customer’s country level. Hierarchies
allow this possibility by defining a sequence of mappings relating lower-level,
detailed concepts to higher-level, more general concepts. Given two related
levels in a hierarchy, the lower level is called the child and the higher level is
called the parent. The hierarchical structure of a dimension is called the di-
mension schema, while a dimension instance comprises the members at all
levels in a dimension. Figure 3.2 shows the simplified hierarchies for our cube
example. In the next chapter, we give full details of how dimension hierarchies
are modeled. In the Product dimension, products are grouped in categories.
For the Time dimension, the lowest granularity is Day, which aggregates into
Month, which in turn aggregates into Quarter, Semester, and Year. Similarly,
for the Customer dimension, the lowest granularity is Customer, which aggre-
gates into City, State, Country, and Continent. It is usual to represent the top
of the hierarchy with a distinguished level called All.
Fig. 3.2 Hierarchies of the Product, Time, and Customer dimensions of the cube in Fig. 3.1
At the instance level, Fig. 3.3 shows an example of the Product dimen-
sion.1 Each product at the lowest level of the hierarchy can be mapped to
a corresponding category. All categories are grouped under a member called
all, which is the only member of the distinguished level All. This member is
used for obtaining the aggregation of measures for the whole hierarchy, that
is, for obtaining the total sales for all products.
In real-world applications, there exist many kinds of hierarchies. For ex-
ample, the hierarchy depicted in Fig. 3.3 is balanced, since there is the same
number of levels from each individual product to the root of the hierarchy. In
Chaps. 4 and 5, we shall study these and other kinds of hierarchies in detail,
covering both their conceptual representation and their implementation in
current data warehouse and OLAP systems.
1 Note that, as indicated by the ellipses, not all nodes of the hierarchy are shown.
3.1.2 Measures
• Additive measures can be meaningfully summarized using addition
along all dimensions. For example, the measure Quantity in the cube of Fig. 3.1 is additive:
it can be summarized when the hierarchies in the Product, Time, and
Customer dimensions are traversed.
• Semiadditive measures can be meaningfully summarized using addi-
tion along some, but not all, dimensions. As a typical example, inventory
quantities cannot be meaningfully aggregated along the Time dimension,
for instance, adding the inventory quantities for two different quarters.
• Nonadditive measures cannot be meaningfully summarized using ad-
dition across any dimension. Typical examples are item price, cost per
unit, and exchange rate.
The aggregation functions to be used in the various dimensions must be
defined for each measure. This is particularly important in the case of semi-
additive and nonadditive measures. For example, a semiadditive measure
representing inventory quantities can be aggregated computing the average
along the Time dimension and the sum along other dimensions. Averaging
can also be used for aggregating nonadditive measures such as item price or
exchange rate. However, depending on the semantics of the application, other
functions such as the minimum, maximum, or count could be used instead.
To allow users to interactively explore the data cube at different granularities,
optimization techniques based on aggregate precomputation are used.
Incremental aggregation mechanisms avoid computing the whole aggregation
from scratch each time the data warehouse is queried. However, this is not
always possible, since this depends on the aggregate function used. This leads
to another classification of measures, which we explain next.
• Distributive measures are defined by an aggregation function that can
be computed in a distributed way. Suppose that the data are partitioned
into n sets, and that the aggregate function is applied to each set, giving
n aggregated values. The function is distributive if the result of applying
it to the whole data set is the same as the result of applying a function
(not necessarily the same) to the n aggregated values. The usual aggre-
gation functions such as the count, sum, minimum, and maximum are
distributive. However, the distinct count function is not. For instance,
if we partition the set of measure values {3, 3, 4, 5, 8, 4, 7, 3, 8} into the
subsets {3, 3, 4}, {5, 8, 4}, and {7, 3, 8}, summing up the result of the dis-
tinct count function applied to each subset gives us a result of 8, while
the answer over the original set is 5.
• Algebraic measures are defined by an aggregation function that can be
expressed as a scalar function of distributive ones. A typical example of an
algebraic aggregation function is the average, which can be computed by
dividing the sum by the count, the latter two functions being distributive.
• Holistic measures are measures that cannot be computed from other
subaggregates. Typical examples include the median, the mode, and the
rank. Holistic measures are expensive to compute, especially when data
are modified, since they must be computed from scratch.
3.2 OLAP Operations
Roll-up
The roll-up operation aggregates measures along a dimension hierarchy to
obtain measures at a coarser granularity. The syntax for this operation is:
ROLLUP(CubeName, (Dimension → Level)*, AggFunction(Measure)*)
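For example, a roll-up of the original cube in Fig. 3.4a to the Country level along the Customer dimension, summing up the Quantity measure, can be written along the following lines:
ROLLUP(Sales, Customer → Country, SUM(Quantity))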
The result is shown in Fig. 3.4b. While the original cube contained four
values in the Customer dimension, one for each city, the new cube contains
two values, one for each country. The remaining dimensions are not affected.
Thus, the values in cells pertaining to Paris and Lyon in a given quarter and
category contribute to the aggregation of the corresponding values for France.
The computation of the cells pertaining to Germany proceeds analogously.
When querying a cube, a usual operation is to roll-up a few dimensions to
particular levels and to remove the other dimensions through a roll-up to the
All level. In a cube with n dimensions, this can be obtained by applying n
successive roll-up operations. The ROLLUP* operation provides a shorthand
notation for this sequence of operations. The syntax is as follows:
ROLLUP*(CubeName, [(Dimension → Level)*], AggFunction(Measure)*)
Fig. 3.4 OLAP operations. (a) Original cube; (b) Roll-up to the Country level; (c)
Drill-down to the Month level; (d) Sort product by name; (e) Pivot; (f) Slice
on City=‘Paris’
Fig. 3.4 OLAP operations (continued). (g) Dice on City=‘Paris’ or ‘Lyon’ and Quar-
ter=‘Q1’ or ‘Q2’; (h) Dice on Quantity > 15; (i) Cube for 2011; (j) Drill-across;
(k) Percentage change; (l) Total sales by quarter and city
Fig. 3.4 OLAP operations (continued). (m) Maximum sales by quarter and city; (n)
Top two sales by quarter and city; (o) Top 70% sales by city and category
ordered by ascending quarter; (p) Top 70% sales by city and category ordered
by descending quantity; (q) Rank quarter by category and city ordered by
descending quantity
Fig. 3.4 OLAP operations (continued). (r) Three-month moving average; (s) Year-to-
date sum; (t) Union of the original cube and another cube with data from
Spain; (u) Difference of the original cube and the cube in Fig. 3.4n
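Consider, for example, an expression along the lines of
ROLLUP*(Sales, Time → Quarter, SUM(Quantity))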
which performs a roll-up along the Time dimension to the Quarter level and
the other dimensions (in this case Customer and Product) to the All level. On
the other hand, if the dimensions are not specified as in
ROLLUP*(Sales, SUM(Quantity))
all the dimensions of the cube will be rolled-up to the All level, yielding a
single cell containing the overall sum of the Quantity measure.
In this case, a new measure ProdCount will be added to the cube. We will see
below other ways to add measures to a cube.
In many real-world situations hierarchies are recursive, that is, they con-
tain a level that rolls-up to itself. A typical example is a supervision hierarchy
over employees. Such hierarchies are discussed in detail in Chap. 4. The par-
ticularity of such hierarchies is that the number of levels of the hierarchy
is not fixed at the schema level, but it depends on its members. The RE-
CROLLUP operation is used to aggregate measures over recursive hierarchies
by iteratively performing roll-ups over the hierarchy until the top level is
reached. The syntax of this operation is as follows:
RECROLLUP(CubeName, Dimension → Level, Hierarchy, AggFunction(Measure)*)
Drill down
The drill-down operation performs the inverse of the roll-up operation, that
is, it goes from a more general level to a more detailed level in a hierarchy.
The syntax of this operation is as follows:
DRILLDOWN(CubeName, (Dimension → Level)*)
where Dimension → Level is the level in a dimension to which the drill down
is performed.
For example, in the cube shown in Fig. 3.4b, the sales of category Seafood
in France are significantly higher in the first quarter compared to the other
ones. Thus, we can take the original cube and apply a drill-down along the
Time dimension to the Month level to find out whether this high value oc-
curred during a particular month, as follows
DRILLDOWN(Sales, Time → Month)
As shown in Fig. 3.4c, we discover that, for some reason yet unknown, sales
in January soared both in Paris and in Lyon.
Sort
The sort operation returns a cube where the members of a dimension have
been sorted. The syntax of the operation is as follows:
SORT(CubeName, Dimension, (Expression [ {ASC | DESC | BASC | BDESC} ])*)
where the members of Dimension are sorted according to the value of Expres-
sion either in ascending or descending order. In the case of ASC or DESC,
members are sorted within their parent (i.e., respecting the hierarchies),
whereas, in the case of BASC or BDESC the sorting is performed across all
members (i.e., irrespective of the hierarchies). ASC is the default option.
For example, the following expression
SORT(Sales, Product, ProductName)
sorts the members of the Product dimension in ascending order of their name,
as shown in Fig. 3.4d. Here, ProductName is an attribute of products. When
the cube contains only one dimension, the members can be sorted based on its
measures. For example, if SalesByQuarter is obtained from the original cube
by aggregating sales by quarter for all cities and all categories, the expression
SORT(SalesByQuarter, Time, Quantity DESC)
sorts the quarters in descending order of the sales quantity.
Pivot
The pivot (or rotate) operation rotates the axes of a cube to provide an
alternative presentation of the data. The syntax of the operation is as follows:
PIVOT(CubeName, (Dimension → Axis)*)
where the axes are specified as {X, Y, Z, X1, Y1, Z1, . . .}.
In our example, to see the cube with the Time dimension on the x axis,
we can rotate the axes of the original cube as follows
PIVOT(Sales, Time → X, Customer → Y, Product → Z)
Slice
The slice operation removes a dimension from a cube (i.e., a cube of n − 1
dimensions is obtained from a cube of n dimensions) by selecting one instance
in a dimension level. The syntax of this operation is:
SLICE(CubeName, Dimension, Level = Value)
where the Dimension will be dropped by fixing a single Value in the Level.
The other dimensions remain unchanged.
In our example, to visualize the data only for Paris, we apply a slice
operation as follows:
SLICE(Sales, Customer, City = 'Paris')
The result is the subcube of Fig. 3.4f, a two-dimensional matrix where each
column represents the evolution of the sales quantity by category and quar-
ter, that is, a collection of time series. The slice operation assumes that the
granularity of the cube is at the specified level of the dimension (in the ex-
ample above, at the city level). Thus, a granularity change by means of a
roll-up or drill-down operation is often needed prior to the slice operation.
Dice
The dice operation keeps the cells in a cube that satisfy a Boolean condition
ϕ. The syntax for this operation is
DICE(CubeName, Condition)
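In our example, we can select the cells corresponding to Paris and Lyon during the first two quarters with an expression along these lines:
DICE(Sales, (City = 'Paris' OR City = 'Lyon') AND (Quarter = 'Q1' OR Quarter = 'Q2'))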
and the result is shown in Fig. 3.4g. As another example, we can select the
cells of the original cube that have a measure greater than 15 as
DICE(Sales, Quantity > 15)
Rename
The rename operation returns a cube where some schema elements or mem-
bers have been renamed. The syntax is:
RENAME(CubeName, ({SchemaElement | Member} → NewName)*)
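For example, an expression along the lines of
RENAME(Sales, Sales → Sales2011, Quantity → Quantity2011)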
renames the cube in Fig. 3.4a and its measure. As another example
RENAME(Sales, Customer.all → AllCustomers)
renames the all member of the Customer dimension.
Drill across
The drill-across operation combines cells from two data cubes that have
the same schema and instances, using a join condition. The syntax of the
operation is:
DRILLACROSS(CubeName1, CubeName2, [Condition]).
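For example, to compare the sales of each month with those of the subsequent month, we can proceed in two steps, sketched below (the notation used for referring to the Time dimension in the join condition is only indicative):
RENAME(Sales, Sales → Sales1, Quantity → PrevMonthQuantity)
DRILLACROSS(Sales1, Sales, Sales1.Month + 1 = Sales.Month)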
In the first step, we create a temporary cube Sales1 by renaming the measure.
In the second step, we perform the drill across of the two cubes by combining a
cell in Sales1 with the cell in Sales corresponding to the subsequent month. As
already stated, the join condition above corresponds to an outer join. Notice
that the Sales cube in Fig. 3.4a contains measures for a single year. Thus, in
the result above the cells corresponding to January and December will contain
a null value in one of the two measures. As we will see in Sect. 4.4, when the
cube contains measures for several years, the join condition must take into
account that measures of January must be joined with those of December of
the preceding year. Notice also that the cube has three dimensions and the
join condition in the query above pertains to only one dimension. For the
other dimensions it is supposed that an outer equijoin is performed.
Add measure
The add measure operation adds new measures to the cube computed from
other measures or dimensions. The syntax for this operation is as follows:
ADDMEASURE(CubeName, (NewMeasure = Expression, [AggFunction])* )
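For example, assume a cube Sales2011-2012 that contains the sales of two consecutive years as measures Quantity2011 and Quantity2012 (such a cube can be obtained by renaming and drilling across the cubes of the two years). The percentage change of the quantity sold can then be added with an expression along these lines:
ADDMEASURE(Sales2011-2012, QuantityPC = (Quantity2012 - Quantity2011) / Quantity2011)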
Drop measure
The drop measure operation removes one or several measures from a cube.
The syntax is as follows:
DROPMEASURE(CubeName, Measure*)
For example, given the result of the add measure above, the cube illustrated
in Fig. 3.4k is expressed by:
DROPMEASURE(Sales2011-2012, Quantity2011, Quantity2012)
Aggregation
We have seen that the roll-up operation aggregates measures when displaying
the cube at a coarser level. On the other hand, we also need to aggregate
measures of a cube at the current granularity, that is, without performing a
roll-up operation. The syntax for this operation is as follows:
AggFunction(CubeName, Measure) [BY Dimension*]
Usual aggregation functions are SUM, AVG, COUNT, MIN, and MAX. In ad-
dition to these, we use extended versions of MIN and MAX, which have an
additional argument that is used to obtain the n minimum or maximum
values. Further, TOPPERCENT and BOTTOMPERCENT select the members
of a dimension that cumulatively account for x percent of a measure. Also,
RANK and DENSERANK are used to rank the members of a dimension ac-
cording to a measure. We show next examples of these functions.
In our example, given the original cube in Fig. 3.4a, the total sales by
quarter and city can be computed as follows
SUM(Sales, Quantity) BY Time, Customer
This will yield the two-dimensional cube in Fig. 3.4l. On the other hand, to
obtain the total sales by quarter we can write
SUM(Sales, Quantity) BY Time
which returns a one-dimensional cube with values for each quarter. In the
query above, a roll-up along the Customer dimension up to level All is per-
formed before the aggregation. Finally, to obtain the overall sales we write
SUM(Sales, Quantity)
Aggregation functions such as the maximum and the minimum require special care in
OLAP since these filtering aggregation functions must not only compute the aggregated value but also determine the associated dimension members. As an
example, to obtain the best-selling employee, we must compute the maximum
sales amount and identify who is the employee that performed best.
When applying an aggregation operation, the dimension members in the
resulting cube will depend on the aggregation function used. For example,
given the cube in Fig. 3.4a, the total overall quantity is obtained as
SUM(Sales, Quantity)
This yields a single cell whose coordinates for the three dimensions will be
all equal to all. Also, when computing the overall maximum quantity as:
MAX(Sales, Quantity)
we obtain the cell with value 47 and coordinates Q4, Condiments, and Paris
(we suppose that cells that are hidden in Fig. 3.4a contain a smaller value
for this measure). Similarly, the following expression
SUM(Sales, Quantity) BY Time, Customer
returns the total sales by quarter and customer, resulting in the cube given
in Fig. 3.4l. This cube has three dimensions, where the Product dimension
only contains the member all. On the other hand,
MAX(Sales, Quantity) BY Time, Customer
will yield the cube in Fig. 3.4m, where only the cells containing the maximum
by time and customer will have values, while the other ones will be filled with
null values. Similarly, the two maximum quantities by product and customer
as shown in Fig. 3.4n can be obtained as follows:
MAX(Sales, Quantity, 2) BY Time, Customer
Notice that in the example above, we requested the two maximum quantities
by time and customer. If in the cube there are two or more cells that tie for
the last place in the limited result set, then the number of cells in the result
could be greater than two. For example, this is the case in Fig. 3.4n for Berlin
and Q1, where there are three values in the result, that is, 33, 25, and 25.
To compute top or bottom percentages, the order of the cells must be
specified. For example, to compute the top 70% of the measure quantity by
city and category ordered by quarter, as shown in Fig. 3.4o, we can write
TOPPERCENT(Sales, Quantity, 70) BY City, Category ORDER BY Quarter ASC
The operation computes the running sum of the sales by city and category
starting with the first quarter and continues until the target percentage is
reached. In this example, the sales in the first three quarters cover the required
70%. Similarly, the top 70% of the measure quantity by city and category
ordered by quantity, shown in Fig. 3.4p, can be obtained by
TOPPERCENT(Sales, Quantity, 70) BY City, Category ORDER BY Quantity DESC
The rank operation also requires the specification of the order of the cells.
As an example, to rank quarters by category and city order by descending
quantity, as shown in Fig. 3.4q, we can write
RANK(Sales, Time) BY Category, City ORDER BY Quantity DESC
The rank and the dense rank operations differ in the case of ties. The former
assigns the same rank to ties. For example, in Fig. 3.4q there is a tie in the
quarters for Seafood and Köln, where Q2 and Q4 are in the first rank and Q3
and Q1 are in the third and fourth ranks, respectively. Using dense rank, Q3
and Q1 would be in the second and third ranks, respectively.
We often need to compute measures where the value of a cell is obtained by
aggregating the measures of several nearby cells. Examples of these include
moving average and year-to-date computations. For this, we need to define a
subcube that is associated with each cell, and perform the aggregation over
this subcube. These functions correspond to the window functions in SQL
that will be described in Chap. 5. For example, given the cube in Fig. 3.4c,
the 3-month moving average in Fig. 3.4r can be obtained by
ADDMEASURE(Sales, MovAvg3M = AVG(Quantity) OVER Time 2 CELLS PRECEDING)
Here, the moving average for January is equal to the measure in January,
since there are no previous cells. Analogously, the measure for February is
the average of the values of January and February. Finally, the average for the
remaining months is computed from the measure value of the current month
and the two preceding ones. In the window functions it is supposed that the
members of the dimension over which the window is constructed are already
sorted. For this, a sort operation can be applied prior to the application of
the window aggregate function.
Similarly, to compute the year-to-date sum in Fig. 3.4s we can write
ADDMEASURE(Sales, YTDQuantity = SUM(Quantity) OVER Time
ALL CELLS PRECEDING)
Here, the aggregation function is applied to a window that contains the cur-
rent cell and all the previous ones, as indicated by ALL CELLS PRECEDING.
Union
The union operation merges two cubes that have the same schema but dis-
joint instances. The syntax of the operation is:
UNION(CubeName1, CubeName2).
If SalesSpain has the same schema as our original cube but contains only
the sales to Spanish customers, the cube in Fig. 3.4t is obtained by
UNION(Sales, SalesSpain)
Difference
Given two cubes with the same schema, the difference operation removes
the cells in a cube that exist in another one. The syntax of the operation is:
DIFFERENCE(CubeName1, CubeName2).
In our example, we can remove from the original cube in Fig. 3.4a the cells
of the top-two sales by quarter and city shown in Fig. 3.4n, which is called
TopTwoSales, as follows
DIFFERENCE(Sales, TopTwoSales)
Drill Through
Finally, the drill-through operation allows one to move from data at the
bottom level in a cube to data in the operational systems from which the cube
was derived. This could be used, for example, if one were trying to determine
the reason for outlier values in a data cube. Formally, the drill through is not
an OLAP operator since its result is not a multidimensional cube.
Summary of the OLAP operations:
Add measure: Adds new measures to a cube computed from other measures or dimensions.
Aggregation operations: Aggregate the cells of a cube, possibly after performing a grouping of cells.
Dice: Keeps the cells of a cube that satisfy a Boolean condition over dimension levels, attributes, and measures.
Difference: Removes the cells of a cube that are in another cube. Both cubes must have the same schema.
Drill-across: Merges two cubes that have the same schema and instances using a join condition.
Drill-down: Disaggregates measures along a hierarchy to obtain data at a finer granularity. It is the opposite of the roll-up operation.
Drop measure: Removes measures from a cube.
Pivot: Rotates the axes of a cube to provide an alternative presentation of its data.
Recursive roll-up: Performs an iteration of roll-ups over a recursive hierarchy until the top level is reached.
Rename: Renames one or several schema elements of a cube.
Roll-up: Aggregates measures along a hierarchy to obtain data at a coarser granularity. It is the opposite of the drill-down operation.
Roll-up*: Shorthand notation for a sequence of roll-up operations.
Slice: Removes a dimension from a cube by selecting one instance in a dimension level.
Sort: Orders the members of a dimension according to an expression.
Union: Combines the cells of two cubes that have the same schema but disjoint members.
OLTP transactions typically access only a few records (e.g., those of a particular
sales order). On the other hand, data structures for OLAP must support
complex aggregation queries, thus requiring access to all the records in one
or more tables, resulting in long, complex SQL queries. Further, OLAP sys-
tems are less frequently accessed than OLTP systems (e.g., a system handling
purchase orders is frequently accessed, while performing an analysis of orders
may not be that frequent). Also, data warehouse records are usually accessed
in read mode (lines 5–8). OLTP systems usually have a short query response
time, provided the appropriate indexing structures are defined, while complex
OLAP queries can take longer time to complete (line 9).
OLTP systems normally have a high number of concurrent accesses and
require locking or other concurrency management mechanisms to ensure safe
transaction processing (lines 10–11). On the other hand, OLAP systems are
read only, and thus queries can be submitted and computed concurrently.
Also, the number of concurrent users in an OLAP system is usually low.
Finally, OLTP systems are modeled using UML or some variation of the
ER model studied in Chap. 2, since such models lead to a highly normalized
schema, adequate for databases that support frequent transactions, to guar-
antee consistency and reduce redundancy. OLAP designers use the multi-
dimensional model, which, at the logical level (as we will see in Chap. 5),
leads in general to a denormalized database schema, with a high level of
redundancy, which favors query processing (lines 12–14).
3.4 Data Warehouse Architecture
We now show the general data warehouse architecture that will be used
throughout the book. This architecture is composed of several tiers, depicted
in Fig. 3.5 and described next.
Fig. 3.5 Typical data warehouse architecture: operational databases and external sources
feed, through the ETL process, the enterprise data warehouse and the data marts,
which are accessed through an OLAP server by reporting, statistical, OLAP, and
data mining tools
The front-end tier in Fig. 3.5 is used for data analysis and visualization.
It contains client tools that allow users to exploit the contents of the data
warehouse. Typical client tools include the following:
• OLAP tools allow interactive exploration and manipulation of the ware-
house data. They facilitate the formulation of complex queries that may
involve large amounts of data. These queries are called ad hoc queries,
since the system has no prior knowledge about them.
• Reporting tools enable the production, delivery, and management of
reports, which can be paper-based, interactive, or web-based. Reports use
predefined queries, that is, queries asking for specific information in a
specific format that are performed on a regular basis. Modern reporting
techniques include key performance indicators and dashboards.
• Statistical tools are used to analyze and visualize the cube data using
statistical methods.
• Data mining tools allow users to analyze data in order to discover valu-
able knowledge such as patterns and trends; they also allow predictions
to be made on the basis of current data.
In Chap. 7, we show some of the tools used to exploit the data warehouse,
like data analysis tools, key performance indicators, and dashboards.
Some of the components illustrated in Fig. 3.5 can be missing in a real en-
vironment. In some situations there is only an enterprise data warehouse
without data marts or, alternatively, an enterprise data warehouse does not
exist. Building an enterprise data warehouse is a complex task that is very
costly in time and resources. In contrast, a data mart is typically easier to
build than an enterprise warehouse. However, when several data marts are
created independently, they need to be integrated into a data warehouse for
the entire enterprise, which is usually complicated.
In other situations, an OLAP server does not exist and/or the client tools
directly access the data warehouse. This is indicated by the arrow connecting
the data warehouse tier to the front-end tier. This situation is illustrated in
Chap. 7, where the same queries for the Northwind case study are expressed
both in MDX and DAX (targeting the OLAP server) and in SQL. In an
extreme situation, there is neither a data warehouse nor an OLAP server.
This is called a virtual data warehouse, which defines a set of views over
operational databases that are materialized for efficient access. The arrow
connecting the data sources to the front-end tier depicts this situation. A
virtual data warehouse, although easy to build, does not contain historical
data, does not contain centralized metadata, and does not have the ability
to clean and transform the data. Furthermore, a virtual data warehouse can
severely impact the performance of operational databases.
Finally, a data staging area may not be needed when the data in the source
systems conforms very closely to the data in the warehouse. This situation
arises when there are few data sources having high-quality data, which is
rarely the case in real-world situations.
3.5 Overview of Microsoft SQL Server BI Tools
Nowadays, there is a wide offering of business intelligence tools. The major data-
base providers, such as Microsoft, Oracle, IBM, and Teradata, have their own
suite of such tools. Other popular tools include SAP, MicroStrategy, Qlik, and
Tableau. In addition to the above commercial tools, there are also open-source
tools, of which Pentaho is the most popular one. In this book, we have chosen
a representative suite of tools for illustrating the topics presented: Microsoft’s
SQL Server tools. We briefly describe next these tools, and provide references
to other well-known business intelligence tools in the bibliographic notes.
Microsoft SQL Server provides an integrated platform for building analyt-
ical applications. It is composed of three main components, described below.
• Analysis Services is used to define, query, update, and manage analyti-
cal databases. It comes in two modes: multidimensional and tabular. The
difference between them stems from their underlying paradigm (multidi-
mensional or relational). Each mode has an associated query language,
MDX and DAX, respectively. In this book, we cover both modes and
their associated languages MDX and DAX in Chaps. 5, 6, and 7 when we
define and query the analytical database for the Northwind case study.
• Integration Services supports ETL processes previously introduced. It
is used to extract data from a variety of data sources, to combine, clean,
and summarize this data, and, finally, to populate a data warehouse with
the resulting data. We cover Integration Services when we describe the
ETL process for the Northwind case study in Chap. 9.
• Reporting Services is used to define, generate, store, and manage re-
ports. Reports can be built from various types of data sources, including
data warehouses and OLAP cubes, and can be personalized and delivered
in a variety of formats. Users can view reports with a variety of clients,
such as web browsers or mobile applications. Clients access reports via
Reporting Services’ server component. We will explain Reporting Services
when we build dashboards for the Northwind case study in Chap. 7.
Several tools can be used for developing and managing these components.
Visual Studio is a development platform that supports Analysis Services, Integration Services, and Reporting Services projects.
3.6 Summary
3.7 Bibliographic Notes
Basic data warehouse concepts can be found in the classic books by Kim-
ball [129] and by Inmon [117]. In particular, the definition of data warehouses
we gave in Sect. 3.3 is from Inmon. The notion of hypercube underlying the
multidimensional model was studied in [94], where the roll-up and cube op-
erations were defined for SQL. Hierarchies in OLAP are studied in [144].
The notion of summarizability of measures was defined in [138] and has been
studied, for example, in [109, 110, 166]. Other classifications of measures are
given in [94, 129]. More details on these concepts are given in Chap. 5, where
we also give further references.
There is not yet a standard definition of the OLAP operations, in a similar
way as the relational algebra operations are defined for the relational alge-
bra. Many different algebras for OLAP have been proposed in the literature,
each one defining different sets of operations. A comparison of these OLAP
algebras is given in [202], where the authors advocate the need for a refer-
ence algebra for OLAP. The definition of the operations we presented in this
chapter was inspired from [50].
For SQL Server, the books devoted to Analysis Services [108], Integration
Services [52], and Reporting Services [135] cover extensively these compo-
nents. The tabular model in Microsoft Analysis Services is studied in [204],
while DAX is covered in [205].
3.9 Exercises
Chapter 4
Conceptual Data Warehouse Design
The advantages of using conceptual models for designing databases are well
known. Conceptual models facilitate communication between users and de-
signers since they do not require knowledge about the underlying implemen-
tation platform. Further, schemas developed using conceptual models can
be mapped to various logical models, such as relational, object-oriented, or
even graph models, thus simplifying responses to changes in the technology
used. Conceptual models also facilitate database maintenance and evolution,
since they focus on users’ requirements; as a consequence, they provide better
support for subsequent changes in the logical and physical schemas.
In this chapter, we study conceptual modeling for data warehouses. In
particular, we base our presentation on the MultiDim model, which can be
used to represent the data requirements of data warehouse and OLAP ap-
plications. The definition of the model is given in Sect. 4.1. Since hierarchies
are essential for exploiting data warehouse and OLAP systems to their full
capabilities, in Sect. 4.2 we consider various kinds of hierarchies that ex-
ist in real-world situations. We classify these hierarchies, giving a graphical
representation of them and emphasizing the differences between them. We
also present advanced aspects of conceptual modeling in Sect. 4.3. Finally, in
Sect. 4.4, we revisit the OLAP operations that we presented in Chap. 3 by
addressing a set of queries to the Northwind data warehouse.
Fig. 4.1 Conceptual schema of the Northwind data warehouse
The levels involved in a fact indicate the granularity of the measures, that
is, the level of detail at which measures are represented.
Measures are aggregated along dimensions when performing roll-up op-
erations. As shown in Fig. 4.1, the aggregation function associated with a
measure can be specified next to the measure name, where the SUM aggre-
gation function is assumed by default. In Chap. 3 we classified measures as
additive, semiadditive, or nonadditive. We assume by default that mea-
sures are additive. For semiadditive and nonadditive measures, we include
the symbols ‘+!’ and ‘+/’, respectively. For example, in Fig. 4.1 the measures
Quantity and UnitPrice are, respectively, additive and semiadditive. Further,
measures and level attributes may be derived, if they are calculated on the
basis of other measures or attributes in the schema. We use the symbol ‘/’ for
indicating them. For example, in Fig. 4.1 the measure NetAmount is derived.
A hierarchy comprises several related levels. Given two related levels of
a hierarchy, the lower level is called the child and the higher level is called
the parent. Thus, the relationships composing hierarchies are called parent-
child relationships. The cardinalities in parent-child relationships indi-
cate the minimum and the maximum number of members in one level that
can be related to a member in another level. For example, in Fig. 4.1 the
child level Product is related to the parent level Category with a one-to-many
cardinality, which means that a product belongs to only one category and
that a category can have many products.
A dimension may contain several hierarchies, each one expressing a par-
ticular criterion used for analysis purposes; thus, we include the hierarchy
name to differentiate them. For example, in Fig. 4.1, the Employee dimension has two hierarchies, namely, Territories and Supervision. When the user
is not interested in employing a hierarchy for aggregation purposes, she will
represent all the attributes in a single level. This is the case of the attributes
City, Region, and Country in the Employee dimension in Fig. 4.1.
Levels in a hierarchy are used to analyze data at various granularities,
or levels of detail. For example, the Product level contains information about
products, while the Category level may be used to see these products from the
more general perspective of the categories to which they belong. The level in
a hierarchy that contains the most detailed data is called the leaf level. The
name of the leaf level defines the dimension name, except for the case where
the same level participates several times in a fact, in which case the role name
defines the dimension name. These are called role-playing dimensions.
The level in a hierarchy representing the most general data is called the root
level. It is usual (but not mandatory) to represent the root of a hierarchy
using a distinguished level called All, which contains a single member, denoted
by all. The decision of including this level in multidimensional schemas is
left to the designer. In the remainder, we do not show the All level in the
hierarchies (except when we consider it necessary for clarity of presentation),
since we consider that it is meaningless in conceptual schemas.
The identifier attributes of a parent level define how child members are
grouped. For example, in Fig. 4.1, CategoryID in the Category level is an
identifier attribute, used for grouping different product members during the
roll-up operation from the Product to the Category levels. However, in the
case of many-to-many parent-child relationships, we need to determine how
to distribute the measures from a child to its parent members. For this, we
can use a distributing attribute. For example, in Fig. 4.1, the relation-
ship between Employee and City is many-to-many (i.e., an employee can be
assigned to several cities). A distributing attribute can be used to store the
percentage of time that an employee devotes to each city.
Finally, it is sometimes the case that two or more parent-child relation-
ships are exclusive. This is represented using the symbol ‘⊗’. An example is
given in Fig. 4.1, where states can be aggregated either into regions or into
countries. Thus, according to their type, states participate in only one of the
relationships departing from the State level.
4.2 Hierarchies
A balanced hierarchy has only one path at the schema level, where all
levels are mandatory. An example is given by hierarchy Product → Category
in Fig. 4.1. At the instance level, the members form a tree where all the
branches have the same length, as shown in Fig. 3.3. All parent members
have at least one child member, and a child member belongs exactly to one
parent member. For example, in Fig. 3.3 each category is assigned at least
one product, and a product belongs to only one category.
An unbalanced hierarchy has only one path at the schema level, where at
least one level is not mandatory. Therefore, at the instance level there can
be parent members without associated child members. Figure 4.2a shows a
hierarchy schema where a bank is composed of several branches, a branch
may have agencies, and an agency may have ATMs. As a consequence, at the
instance level the members represent an unbalanced tree, that is, the branches
of the tree have different lengths, since some parent members do not have
associated child members. For example, Fig. 4.2b shows a branch with no
agency, and several agencies with no ATM. As for balanced hierarchies, the
cardinalities imply that every child member belongs to at most one parent
member. For example, in Fig. 4.2 every agency belongs to one branch. These
hierarchies are useful either when facts may come at different granularities
(a case we study later), or the same hierarchy is used by different facts at
different levels of granularity. For example, one fact may be associated with
ATMs and another one with agencies.
Fig. 4.2 An unbalanced hierarchy. (a) Schema; (b) Examples of instances
Fig. 4.3 Instances of the parent-child hierarchy in the Northwind data warehouse
Fig. 4.4 A generalized hierarchy for customers. (a) Schema; (b) Examples of instances
Fig. 4.7 An alternative hierarchy for the Date dimension. (a) Schema; (b) Examples of instances
Fig. 4.8 Parallel independent hierarchies (ProductGroups and Distributor) in the Product dimension
Fig. 4.9 Parallel dependent hierarchies (SalesOrganization and StoreLocation) in the Store dimension
Fig. 4.10 Parallel dependent hierarchies leading to different parent members of the
shared level
Fig. 4.12 Double-counting problem arising when aggregating a measure along a nonstrict hierarchy: the sales of Janet Leverling are counted in each of the cities Atlanta, Orlando, and Tampa
This approach causes incorrect aggregated results, since the employee’s sales
are counted three times instead of only once.
One solution to the double-counting problem would be to transform a
nonstrict hierarchy into a strict one by creating a new member for each
set of parent members participating in a many-to-many relationship. In our
example, a new member that represents the three cities Atlanta, Orlando, and
Tampa will be created. However, a new member must also be created at the
state level, since two cities belong to the state of Florida, and one to Georgia.
Another solution would be to ignore the existence of several parent members
and to choose one of them as the primary member. For example, we may
choose the city of Atlanta. However, neither of these solutions corresponds to
the users’ analysis requirements, since in the former, artificial categories are
introduced, and in the latter, some pertinent analysis scenarios are ignored.
Fig. 4.13 A nonstrict hierarchy with a distributing attribute
Figure 4.13 shows a nonstrict hierarchy where employees may work in several sections. The schema includes
a measure that represents an employee’s overall salary, that is, the sum of
the salaries paid in each section. Suppose that an attribute stores the per-
centage of time for which an employee works in each section. In this case,
we annotate this attribute in the relationship with an additional symbol ‘÷’
indicating that it is a distributing attribute determining how measures
are divided between several parent members in a many-to-many relationship.
Choosing an appropriate distributing attribute is important in order to
avoid approximate results when aggregating measures. For example, suppose
that in Fig. 4.13 the distributing attribute represents the percentage of time
that an employee works in a specific section. If the employee has a higher
position in one section, although she works less time in that section she may
earn a higher salary. Thus, applying the percentage of time as a distributing
attribute for measures representing an employee’s overall salary may not give
an exact result. Note also that in cases where the distributing attribute is
unknown, it can be approximated by considering the total number of par-
ent members with which the child member is associated. In the example of
Fig. 4.12, since Janet Leverling is associated with three cities, one third of
the value of the measure will be accounted for each city.
Fig. 4.14 Transforming a nonstrict hierarchy into a strict one with an additional di-
mension
Figure 4.14 shows another solution to the problem of Fig. 4.13 where
we transformed a nonstrict hierarchy into independent dimensions. However,
this solution corresponds to a different conceptual schema, where the focus
of analysis has been changed from employees’ salaries to employees’ salaries
by section. Note that this solution can only be applied when the exact dis-
tribution of the measures is known, for instance, when the amounts of salary
paid for working in the different sections are known. It cannot be applied to
nonstrict hierarchies without a distributing attribute, as in Fig. 4.11.
Although the solution in Fig. 4.14 aggregates correctly the Salary mea-
sure when applying the roll-up operation from the Section to the Division
levels, the problem of double counting of the same employee is still present.
Suppose that we want to use the schema in Fig. 4.14 to calculate the num-
ber of employees by section or by division; this value can be calculated by
counting the instances of employees in the fact. The example in Fig. 4.15a
considers five employees who are assigned to various sections. Counting the
number of employees who work in each section gives correct results. However,
the aggregated values for each section cannot be reused for calculating the
number of employees in every division, since some employees (E1 and E2 in
Fig. 4.15a) will be counted twice and the total result will give a value equal
to 7 (Fig. 4.15b) instead of 5.
In summary, nonstrict hierarchies can be handled in several ways:
• Transforming a nonstrict hierarchy into a strict one:
– Creating a new parent member for each group of parent members
linked to a single child member in a many-to-many relationship.
– Choosing one parent member as the primary member and ignoring
the existence of other parent members.
– Replacing the nonstrict hierarchy by two independent dimensions.
• Including a distributing attribute.
• Calculating approximate values of a distributing attribute.
Since each solution has its advantages and disadvantages and requires spe-
cial aggregation procedures, the designer must select the appropriate solution
according to the situation at hand and users’ requirements.
In this section, we discuss particular modeling issues, namely, facts with mul-
tiple granularities, many-to-many dimensions, and links between facts, and
show how they can be represented in the MultiDim model.
As shown in Fig. 4.16, this situation can be modeled using exclusive rela-
tionships between the various granularity levels. The issue is to get correct
analysis results when fact data are registered at multiple granularities.
Fig. 4.17 A schema for the analysis of bank account balances, where the fact is related to a many-to-many dimension Client
In this case, the link relating the Date level and the AccountHolders fact can
be eliminated. Alternatively, this situation can be modeled with a nonstrict
hierarchy as shown in Fig. 4.19b.
Fig. 4.19 Two possible decompositions of the fact in Fig. 4.17. (a) Creating two facts;
(b) Including a nonstrict hierarchy
Even though the solutions proposed in Fig. 4.19 eliminate the double-
counting problem, the two schemas require programming effort for queries
that ask for information about individual clients. In Fig. 4.19a a drill-across
operation (see Sect. 3.2) between the two facts is needed, while in Fig. 4.19b
special procedures for aggregation in nonstrict hierarchies must be applied.
In the case of Fig. 4.19a, since the two facts represent different granularities,
queries with drill-across operations are complex, demanding a conversion ei-
ther from a finer to a coarser granularity (e.g., grouping clients to know who
holds a specific balance in an account) or vice versa (e.g., distributing a bal-
ance between different account holders). Note also that the two schemas in
Fig. 4.19 could represent the information about the percentage of ownership
of accounts by customers (if this is known). This could be represented by
a measure in the AccountHolders fact in Fig. 4.19a and by a distributing
attribute in the many-to-many relationship in Fig. 4.19b.
Sometimes we need to define a link between two related facts even if they
share dimensions. Fig. 4.21a shows an example where the facts Order and
Delivery share dimensions Customer and Date, while each fact has specific
dimensions, that is, Employee in Order and Shipper in Delivery. As indicated
by the link between the two facts, suppose that several orders can be delivered
by a single delivery and that a single order (e.g., containing many products)
can be delivered by several deliveries. Possible instances of the above facts
and their link are shown in Fig. 4.21b. In the figure, the links between the fact
instances and the members of dimensions Customer and Date are not shown.
Notice that, even if the facts share dimensions, an explicit link between the
facts is needed to keep the information of how orders were delivered. Indeed,
neither Customer nor Date can be used for this purpose.
Fig. 4.21 An excerpt of a conceptual schema for analyzing orders and deliveries. (a)
Schema; (b) Examples of instances
Note that the link between the two facts could instead have a one-to-many cardinality; in this case an order would be delivered by only one delivery, but one delivery may concern multiple orders.
The ROLLUP* operation specifies the levels at which each of the dimen-
sions Customer, OrderDate, and Product are rolled-up. For the other dimen-
sions, a roll-up to All is performed. The SUM operation aggregates the mea-
sure SalesAmount. All other measures of the cube are removed from the result.
Query 4.2. Yearly sales amount for each pair of customer and supplier coun-
tries.
ROLLUP*(Sales, OrderDate → Year, Customer → Country,
Supplier → Country, SUM(SalesAmount))
Query 4.4. Monthly sales growth per product, that is, total sales per product
compared to those of the previous month.
Sales1 ← ROLLUP*(Sales, OrderDate → Month, Product → Product,
SUM(SalesAmount))
Sales2 ← RENAME(Sales1, SalesAmount → PrevMonthSalesAmount)
Sales3 ← DRILLACROSS(Sales2, Sales1,
( Sales1.OrderDate.Month > 1 AND
Sales2.OrderDate.Month+1 = Sales1.OrderDate.Month AND
Sales2.OrderDate.Year = Sales1.OrderDate.Year ) OR
( Sales1.OrderDate.Month = 1 AND Sales2.OrderDate.Month = 12 AND
Sales2.OrderDate.Year+1 = Sales1.OrderDate.Year ) )
Result ← ADDMEASURE(Sales3, SalesGrowth =
SalesAmount - PrevMonthSalesAmount )
Again, we first apply a ROLLUP operation, make a copy of the resulting cube,
and then join the two cubes with the DRILLACROSS operation. However,
in the join condition two cases must be considered. In the first one, for the
months starting from February (Month > 1) the cells to be merged must
be consecutive and belong to the same year. In the second case, the cell
corresponding to January must be merged with the one of December from
the previous year. In the last step we compute a new measure SalesGrowth as
the difference between the sales amount of the two corresponding months.
Here, we roll-up all the dimensions in the cube, except Employee, to the All
level, while aggregating the measure SalesAmount. Then, the MAX operation
is applied while specifying that cells with the top three values of the measure
are kept in the result.
In this query, we roll-up the dimensions of the cube as specified. Then, the
MAX operation is applied after grouping by Product and OrderDate.
Query 4.7. Countries that account for top 50% of the sales amount.
Sales1 ← ROLLUP*(Sales, Customer → Country, SUM(SalesAmount))
Result ← TOPPERCENT(Sales1, Customer, 50) ORDER BY SalesAmount DESC
Here, we roll-up the Customer dimension to Country level and the other di-
mensions to the All level. Then, the TOPPERCENT operation selects the
countries that cumulatively account for top 50% of the sales amount.
Query 4.8. Total sales and average monthly sales by employee and year.
Sales1 ← ROLLUP*(Sales, Employee → Employee, OrderDate → Month,
SUM(SalesAmount))
Result ← ROLLUP*(Sales1, Employee → Employee, OrderDate → Year,
SUM(SalesAmount), AVG(SalesAmount))
Here, we first roll-up the cube to the Employee and Month levels by summing
the SalesAmount measure. Then, we perform a second roll-up to the Year level
to obtain the overall sales and the average of monthly sales.
Query 4.9. Total sales amount and discount amount per product and month.
Sales1 ← ADDMEASURE(Sales, TotalDisc = Discount * Quantity * UnitPrice)
Result ← ROLLUP*(Sales1, Product → Product, OrderDate → Month,
SUM(SalesAmount), SUM(TotalDisc))
Here, we first compute a new measure TotalDisc from three other measures.
Then, we roll-up the cube to the Product and Month levels.
Query 4.11. Moving average over the last 3 months of the sales amount by
product category.
Sales1 ← ROLLUP*(Sales, Product → Category, OrderDate → Month,
SUM(SalesAmount))
Result ← ADDMEASURE(Sales1, MovAvg3M = AVG(SalesAmount) OVER
OrderDate 2 CELLS PRECEDING)
Query 4.12. Personal sales amount made by an employee compared with the
total sales amount made by herself and her subordinates during 2017.
Sales1 ← SLICE(Sales, OrderDate.Year = 2017)
Sales2 ← ROLLUP*(Sales1, Employee → Employee, SUM(SalesAmount))
Sales3 ← RENAME(Sales2, SalesAmount → PersonalSales)
Sales4 ← RECROLLUP(Sales2, Employee → Employee, Supervision,
    SUM(SalesAmount))
Result ← DRILLACROSS(Sales4, Sales3)
We first restrict the data in the cube to the year 2017. Then, we perform
the aggregation of the sales amount measure by employee, obtaining the
sales figures independently of the supervision hierarchy. In the third step the
obtained measure is renamed, after which we apply the recursive roll-up that
iterates over the supervision hierarchy, aggregating children to parent until
the top level is reached. The last step obtains the cube with both measures.
Query 4.13. Total sales amount, number of products, and sum of the quan-
tities sold for each order.
ROLLUP*(Sales, Order → Order, SUM(SalesAmount),
COUNT(Product) AS ProductCount, SUM(Quantity))
Here, we roll-up all the dimensions, except Order, to the All level, while adding
the SalesAmount and Quantity measures and counting the number of products.
Query 4.14. For each month, total number of orders, total sales amount,
and average sales amount by order.
Sales1 ← ROLLUP*(Sales, OrderDate → Month, Order → Order,
SUM(SalesAmount))
Result ← ROLLUP*(Sales1, OrderDate → Month, SUM(SalesAmount),
AVG(SalesAmount) AS AvgSales, COUNT(Order) AS OrderCount)
Here we first roll-up to the Month and Order levels. Then, we roll-up to remove
the Order dimension and obtain the requested measures.
Query 4.15. For each employee, total sales amount, number of cities, and
number of states to which she is assigned.
ROLLUP*(Sales, Employee → State, SUM(SalesAmount), COUNT(DISTINCT City)
AS NoCities, COUNT(DISTINCT State) AS NoStates)
4.5 Summary
4.8 Exercises
Exercise 4.2. Design a MultiDim schema for the telephone provider appli-
cation in Ex. 3.1.
Exercise 4.3. Design a MultiDim schema for the train application in Ex. 3.2.
Exercise 4.4. Design a MultiDim schema for the university application given
in Ex. 3.3 taking into account the different granularities of the time dimen-
sion.
Exercise 4.5. Design a MultiDim schema for the French horse race applica-
tion given in Ex. 2.1. With respect to the races, the application must be able
to display different statistics about the prizes won by owners, by trainers,
by jockeys, by breeders, by horses, by sires (i.e., fathers), and by damsires
(i.e., maternal grandfathers). With respect to the bets, the application must
be able to display different statistics about the payoffs by type, by race, by
racetrack, and by horses.
Exercise 4.7. Design a MultiDim schema for the Formula One application
given in Ex. 2.2. With respect to the races, the application must be able
to display different statistics about the prizes won by drivers, by teams, by
circuit, by Grand Prix, and by season.
(Figure for this exercise: a Sales cube with dimensions Customer, Store, Product, Promotion, and Date, measures Store Sales, Store Cost, and Unit Sales, and derived measures Sales Average and Profit)
c. All measures for stores in the states of California and Washington sum-
marized at the city level.
d. All measures, including the derived ones, for stores in the state of Cali-
fornia summarized at the state and city levels.
e. Sales average in 2017 by store state and store type.
f. Sales profit in 2017 by store and semester.
g. Sales profit percentage in 2017 by store, quarter, and semester.
5 Logical Data Warehouse Design
Conceptual models are useful to design database applications since they favor
the communication between the stakeholders in a project. However, concep-
tual models must be translated into logical ones for their implementation on a
database management system. In this chapter, we study how the conceptual
multidimensional model studied in the previous chapter can be represented
in the relational model. We start in Sect. 5.1 by describing the three logical
models for data warehouses, namely, relational OLAP (ROLAP), multidi-
mensional OLAP (MOLAP), and hybrid OLAP (HOLAP). In Sect. 5.2, we
focus on the relational representation of data warehouses and study four typi-
cal implementations: the star, snowflake, starflake, and constellation schemas.
In Sect. 5.3, we present the rules for mapping a conceptual multidimensional
model (in our case, the MultiDim model) to the relational model. Section 5.4
discusses how to represent the time dimension. Sections 5.5 and 5.6 study how
hierarchies, facts with multiple granularities, and many-to-many dimensions
can be implemented in the relational model. Section 5.7 is devoted to the
study of slowly changing dimensions, which arise when dimensions in a data
warehouse are updated. In Sect. 5.8, we study how a data cube can be repre-
sented in the relational model and how it can be queried using SQL. Finally,
to illustrate these concepts, we show in Sect. 5.9 how the Northwind data
warehouse can be implemented in Analysis Services using both the multidi-
mensional and the tabular models. For brevity, we refer to them, respectively,
as Analysis Services Multidimensional and Analysis Services Tabular.
Fig. 5.1 An example of a star schema
while in the snowflake schema in Fig. 5.2 we need an extra join, as follows:
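The following is a minimal sketch of such a query (assuming, as in Fig. 5.1, a Sales fact table with a StoreKey foreign key and an Amount measure); aggregating sales by state over the snowflake tables of Fig. 5.2 requires traversing the Store, City, and State tables:
-- Extra joins through City and State are needed in the snowflake schema
SELECT T.StateName, SUM(F.Amount) AS TotalAmount
FROM Sales F JOIN Store S ON F.StoreKey = S.StoreKey
     JOIN City C ON S.CityKey = C.CityKey
     JOIN State T ON C.StateKey = T.StateKey
GROUP BY T.StateName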
Fig. 5.2 An example of a snowflake schema
Fig. 5.4 Relational representation of the Northwind data warehouse in Fig. 4.1
In Fig. 5.4, the dimension Order, whose attributes are stored in the fact table, is called a fact (or degenerate) dimension. The fact table also contains
five attributes representing the measures UnitPrice, Quantity, Discount, Sales-
Amount, and Freight. Finally, note that the many-to-many parent-child rela-
tionship between Employee and Territory is mapped to the table Territories,
containing two foreign keys.
With respect to keys, in the Northwind data warehouse of Fig. 5.4 we
have illustrated the two possibilities for defining the keys of dimension lev-
els, namely, generating surrogate keys and keeping the database key as data
warehouse key. For example, Customer has a surrogate key CustomerKey and
a database key CustomerID. On the other hand, SupplierKey in Supplier is a
database key. The choice of one among these two solutions is addressed in
the ETL process that we will see in Chap. 9.
The general mapping rules given in Sect. 5.3 do not capture the specific se-
mantics of the various kinds of hierarchies described in Sect. 4.2. In addition,
for some kinds of hierarchies, alternative logical representations exist. In this
section, we study in detail the logical representation of hierarchies.
Fig. 5.5 Relations for a balanced hierarchy. (a) Snowflake structure; (b) Flat table
Fig. 5.6 Transforming the unbalanced hierarchy in Fig. 4.2b into a balanced one using
placeholders
Applying the mapping rules to a parent-child hierarchy yields tables containing all attributes of a level, and an additional foreign key
relating child members to their corresponding parent. For example, the table
Employee in Fig. 5.4 shows the relational representation of the parent-child
hierarchy in Fig. 4.1. As can be expected, operations over such a table are
more complex. In particular, recursive queries are necessary for traversing a
parent-child hierarchy. While recursive queries are supported in SQL and in
MDX, this is not the case for DAX. As we will see in Sect. 5.9.2, it is thus
necessary to flatten the hierarchical structure to a regular hierarchy made up
of one column for each possible level of the hierarchy.
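As an illustration, the following is a minimal sketch (assuming that the Employee table of Fig. 5.4 has a self-referencing SupervisorKey column) of a recursive SQL query that traverses the supervision hierarchy and computes the depth of each employee:
WITH Supervision(EmployeeKey, FirstName, LastName, SupervisorKey, HierLevel) AS (
    -- Anchor: employees without a supervisor (the roots of the hierarchy)
    SELECT EmployeeKey, FirstName, LastName, SupervisorKey, 0
    FROM Employee
    WHERE SupervisorKey IS NULL
    UNION ALL
    -- Recursive step: attach each employee to her supervisor, one level deeper
    SELECT E.EmployeeKey, E.FirstName, E.LastName, E.SupervisorKey, S.HierLevel + 1
    FROM Employee E JOIN Supervision S ON E.SupervisorKey = S.EmployeeKey )
SELECT * FROM Supervision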
Generalized hierarchies account for the case where dimension members are of
different kinds, and each kind has a specific aggregation path. For example,
in Fig. 4.4, customers can be either companies or persons, where companies
are aggregated through the path Customer → Sector → Branch, while persons
are aggregated through the path Customer → Profession → Branch.
As for balanced hierarchies, two approaches can be used for representing
generalized hierarchies at the logical level: create a table for each level, lead-
ing to snowflake schemas, or create a single table for all the levels, where
null values are used for attributes that do not pertain to specific members
(e.g., tuples for companies will have null values in attributes corresponding
to persons). A mix of these two approaches is also possible: create one table
for the common levels and another table for the specific ones. Finally, we
could also use separate fact and dimension tables for each path. In all these
approaches we must keep metadata about which tables compose the different
aggregation paths, while we need to specify additional constraints to ensure
correct queries (e.g., to avoid grouping Sector with Profession in Fig. 4.4).
Fig. 5.7 Relations for the generalized hierarchy in Fig. 4.4
Fig. 5.8 Improved relational representation of the generalized hierarchy in Fig. 4.4
An example of the relations for the hierarchy in Fig. 4.4 is given in Fig. 5.8.
The table Customer includes two kinds of foreign keys: one that indicates
the next specialized hierarchy level (SectorKey and ProfessionKey), which is
obtained by applying Rules 1 and 3b in Sect. 5.3; the other kind of foreign
key corresponds to the next joining level (BranchKey), which is obtained by
applying Rule 4 above. The discriminating attribute CustomerType, which can
take the values Person and Company, indicates the specific aggregation path
of members to facilitate aggregations. Finally, check constraints must be
specified to ensure that only one of the foreign keys for the specialized levels
may have a value, according to the value of the discriminating attribute.
ALTER TABLE Customer ADD CONSTRAINT CustomerTypeCK
CHECK ( CustomerType IN ('Person', 'Company') )
ALTER TABLE Customer ADD CONSTRAINT CustomerPersonFK
CHECK ( (CustomerType != 'Person') OR
( ProfessionKey IS NOT NULL AND SectorKey IS NULL ) )
ALTER TABLE Customer ADD CONSTRAINT CustomerCompanyFK
CHECK ( (CustomerType != 'Company') OR
( ProfessionKey IS NULL AND SectorKey IS NOT NULL ) )
The schema in Fig. 5.8 allows choosing alternative paths for analysis. One
possibility is to use the paths that include the specific levels, for example
Profession or Sector. Another possibility is to only access the levels that are
common to all members, for example, to analyze all customers, whatever their
type, using the hierarchy Customer and Branch. As with the snowflake struc-
ture, one disadvantage of this structure is the need to apply join operations
between several tables. However, an important advantage is the expansion of
the analysis possibilities that it offers.
The mapping above can also be applied to ragged hierarchies since these
hierarchies are a special case of generalized hierarchies. This is illustrated in
Fig. 5.4 where the City level has two foreign keys to the State and Country
levels. Nevertheless, since in a ragged hierarchy there is a unique path where
some levels can be skipped, another solution is to embed the attributes of
an optional level in the splitting level. This is also shown in Fig. 5.4, where
the level State has two optional attributes corresponding to the Region level.
Finally, another solution would be to transform the hierarchy at the instance
level by including placeholders in the missing intermediate levels, as it is
done for unbalanced hierarchies in Sect. 5.5.2. In this way, a ragged hierarchy
would be converted into a balanced one.
Note that while generalized and alternative hierarchies can be easily distinguished at the conceptual level (see Figs. 4.4a and 4.7), this distinction cannot be made at the logical level (compare Figs. 5.7 and 5.9).
Fig. 5.9 Relations for the alternative hierarchy in Fig. 4.7
Fig. 5.10 Relations for the parallel dependent hierarchies in Fig. 4.10
For example, in Fig. 5.10 table States contains all states where an employee
lives, works, or both. Therefore, aggregating along the path Employee → City
→ State will yield states where no employee lives. If we do not want these
states in the result, we can create a view named StateLives containing only
the states where at least one employee lives.
Finally, note also that both alternative and parallel dependent hierarchies
can be easily distinguished at the conceptual level (Figs. 4.7 and 4.10); how-
ever, their logical representations (Figs. 5.9 and 5.10) look similar in spite of
several characteristics that differentiate them, as explained in Sect. 4.2.5.
Applying the mapping rules given in Sect. 5.3 to nonstrict hierarchies creates
relations for the levels and an additional relation, called a bridge table, for
the many-to-many relationship between them. An example for the hierarchy
in Fig. 4.13 is given in Fig. 5.11, where the bridge table EmplSection repre-
sents the many-to-many relationship. If the parent-child relationship has a
distributing attribute (as in Fig. 4.13), it will be represented in the bridge
table as an additional attribute, which stores the values required for measure
distribution. However, this requires a special aggregation procedure that uses
the distributing attribute.
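For illustration, the following sketch (table and column names are assumed from the example, and the Percentage values are taken to be fractions that sum to 1 per employee) weights the Salary measure by the distributing attribute when rolling up to the Section level:
-- Distribute each employee's salary among sections using the bridge table
SELECT T.SectionName, SUM(P.Salary * B.Percentage) AS TotalSalary
FROM Payroll P
     JOIN EmplSection B ON P.EmployeeKey = B.EmployeeKey
     JOIN Section T ON B.SectionKey = T.SectionKey
GROUP BY T.SectionName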
Fig. 5.11 Relations for the nonstrict hierarchy in Fig. 4.13
Compared with the alternative of transforming the nonstrict hierarchy into independent dimensions (Fig. 4.14), the bridge-table approach differs in the following respects:
• Data structure and size: Bridge tables require less space than creating
additional dimensions. In the latter case, the fact table grows if child
members are related to many parent members. The additional foreign key
in the fact table also increases the space required. In addition, for bridge
tables, information about the parent-child relationship and distributing
attribute (if it exists) must be stored separately.
• Performance and applications: For bridge tables, join operations, calcu-
lations, and programming effort are needed to aggregate measures cor-
rectly, while in the case of additional dimensions, measures in the fact
table are ready for aggregation along the hierarchy. Bridge tables are thus
appropriate for applications that have a few nonstrict hierarchies. They
are also adequate when the information about measure distribution does
not change with time. On the contrary, additional dimensions can easily
represent changes in time of measure distribution.
Finally, yet another option consists in transforming the many-to-many
relationship into a one-to-many one by defining a “primary” relationship,
that is, to convert the nonstrict hierarchy into a strict one, to which the
corresponding mapping for simple hierarchies is applied (as explained in
Sect. 4.3.2).
Two approaches can be used for the logical representation of facts with mul-
tiple granularities. The first one consists in using multiple foreign keys, one
for each alternative granularity, in a similar way as it was explained for gen-
eralized hierarchies in Sect. 5.5.3. The second approach consists in removing
granularity variation at the instance level with the help of placeholders, in a
similar way as explained for unbalanced hierarchies in Sect. 5.5.2.
Consider the example of Fig. 4.16, where measures are registered at multi-
ple granularities. Figure 5.12 shows the relational schema resulting from the
first solution above, where the Sales fact table is related to both the City and
the State levels through referential integrity constraints. In this case, both at-
tributes CityKey and StateKey are optional, and constraints must be specified
to ensure that only one of the foreign keys may have a value.
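Such a constraint can be sketched, for instance, as a check constraint on the Sales table of Fig. 5.12 (the constraint name is arbitrary):
-- Exactly one of the two optional foreign keys must be non-null
ALTER TABLE Sales ADD CONSTRAINT SalesGranularityCK
CHECK ( ( CityKey IS NOT NULL AND StateKey IS NULL ) OR
        ( CityKey IS NULL AND StateKey IS NOT NULL ) )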
Figure 5.13 shows an example of instances for the second solution above,
where placeholders are used for facts that refer to nonleaf levels. There are
Fig. 5.12 Relations for the fact with multiple granularities in Fig. 4.16
two possible cases illustrated by the two placeholders in the figure. In the first
case, a fact member points to a nonleaf member that has children. In this
case, placeholder PH1 represents all cities other than the existing children. In
the second case, a fact member points to a nonleaf member without children.
In this case, placeholder PH2 represents all (unknown) cities of the state.
Fig. 5.13 Using placeholders for the fact with multiple granularities in Fig. 4.16
The mapping to the relational model given in Sect. 5.3, applied to many-to-many dimensions, creates relations representing the fact, the dimension levels, and a bridge table (BalanceClient in Fig. 5.14) representing the many-to-many relationship between the fact and the Client dimension.
Fig. 5.14 Relations for the many-to-many dimension in Fig. 4.17, including the bridge table BalanceClient
Fig. 5.15 Relations for the schema with a link between facts in Fig. 4.21
Fig. 5.16 Examples of instances of the facts and their link in Fig. 5.15
The bridge table OrderDelivery is needed because the link between the two facts is many-to-many. If instead each order were delivered by exactly one delivery, while a delivery may concern several orders, a bridge table is no longer needed and the DeliveryKey should be added to the Order fact table.
A link between fact tables is commonly used for combining the data
from two different cubes through a join operation. In the example shown
in Fig. 5.17, the information about orders and deliveries is combined to en-
able analysis of the entire sales process. Notice that only one copy of the
CustomerKey is kept (since it is assumed that it is the same in both orders
and deliveries), while the DateKeys of both facts are kept, and these are re-
named OrderDateKey and DeliveryDateKey. Notice that since the relationship
between the two facts is many-to-many, this may induce the double-counting
problem to which we referred in Sect. 4.2.6.
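A sketch of such a join over the tables of Fig. 5.15 follows (the column list is illustrative; Order is quoted since it is a reserved word in SQL):
-- Combine the two facts through the OrderDelivery bridge table
SELECT O.CustomerKey, O.DateKey AS OrderDateKey, D.DateKey AS DeliveryDateKey,
       O.Amount, D.Freight
FROM "Order" O
     JOIN OrderDelivery B ON O.OrderKey = B.OrderKey
     JOIN Delivery D ON B.DeliveryKey = D.DeliveryKey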
Fig. 5.17 Drill-across of the facts through their link in Fig. 5.16
So far, we have assumed that new data arriving at the warehouse only
corresponds to facts, which means dimensions are stable, and their data do
not change. However, in many real-world situations, dimensions can change
both at the structure and the instance level. Structural changes occur, for
example, when an attribute is deleted from the data sources and therefore it
is no longer available. As a consequence, this attribute should also be deleted
from the dimension table. Changes at the instance level can be of two kinds.
First, when a correction must be made to the dimension tables due to an
error, the new data should replace the old one. Second, when the contextual
conditions of an analysis scenario change, the contents of dimension tables
must change accordingly. We cover these two latter cases in this section.
We introduce the problem by means of a simplified version of the North-
wind data warehouse, where we consider a Sales fact table related only to
the dimensions Date, Employee, Customer, and Product, and a SalesAmount
measure. We assume a star (denormalized) representation of table Product,
and thus category data are embedded in this table. Below, we show instances
of the Sales fact table and the Product dimension table.
DateKey EmployeeKey CustomerKey ProductKey SalesAmount
t1 e1 c1 p1 100
t2 e2 c2 p1 100
t3 e1 c3 p3 100
t4 e2 c4 p4 100
As we said above, new tuples will be entered into the Sales fact table as
new sales occur. But other updates are also likely to occur. For example, when
a new product starts to be commercialized by the company, a new tuple in
Product must be inserted. Also, data about a product may be wrong, and in
this case, the corresponding tuples must be corrected. Finally, the category
of a product may change as a result of a new commercial or administrative
policy. Assuming that these kinds of changes are not frequent, when the
dimensions are designed so that they support them, they are called slowly
changing dimensions.
In the scenario above, consider a query asking for the total sales per em-
ployee and product category, expressed as follows:
SELECT S.EmployeeKey, P.CategoryName, SUM(SalesAmount) AS SalesAmount
FROM Sales S, Product P
WHERE S.ProductKey = P.ProductKey
GROUP BY S.EmployeeKey, P.CategoryName
Suppose now that, at an instant t after t4 (the date of the last sale shown
in the fact table above), the category of product prod1 changed to cat2, that is, there is a reclassification of the product with respect to its category.
The trivial solution of updating the category of the product to cat2 does not
keep track of the previous category of a product. As a consequence, if the
user poses the same query as above, and the fact table has not been changed
in the meantime, she would expect to get the same result, but since all the
sales occurred before the reclassification, she would get the following result:
EmployeeKey CategoryName SalesAmount
e1 cat2 200
e2 cat2 200
This result is incorrect since the products affected by the category change
were already associated with sales data. In contrast, if the new category were the result of an error correction (i.e., the actual category of prod1 is
cat2), this result would be correct. In the former case, obtaining the correct
answer requires preserving the results obtained when prod1 had category cat1 and ensuring that new aggregations are computed with the new category cat2.
Three basic ways of handling slowly changing dimensions have been pro-
posed in the literature. The simplest one, called type 1, consists in overwriting
the old value of the attribute with the new one, which implies that we lose the
history of the attribute. This approach is appropriate when the modification
is due to an error in the dimension data.
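In the running example, a type 1 change would simply overwrite the stored category, for instance (a sketch over the denormalized Product table, assuming the key value p1 shown in the fact table):
-- Type 1: overwrite the old value; the attribute history is lost
UPDATE Product
SET CategoryName = 'cat2'
WHERE ProductKey = 'p1'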
In the second solution, called type 2, the tuples in the dimension table
are versioned, and a new tuple is inserted each time a change takes place.
In our example, we would enter a new row for product prod1 in the Product
table with its new category cat2 so that all sales prior to t will contribute to
the aggregation to cat1, while the ones that occurred after t will contribute
to cat2. This solution requires a surrogate key in the dimension in addition to the business key, so that all versions of a dimension member have different surrogate keys but the same business key. In our example, we keep the business key in a column ProductID and the surrogate key in a column ProductKey. Furthermore, it is also necessary to extend the table Product with
two attributes indicating the validity interval of the tuple, let us call them
From and To. The table Product would look like the following:
ProductKey ProductID ProductName UnitPrice CategoryName Description From To
k1 p1 prod1 10.00 cat1 desc1 2010-01-01 2011-12-31
k11 p1 prod1 10.00 cat2 desc2 2012-01-01 9999-12-31
k2 p2 prod2 12.00 cat1 desc1 2010-01-01 9999-12-31
... ... ... ... ... ... ... ...
In the table above, the first two tuples correspond to the two versions of
product prod1, with ProductKey values k1 and k11. The value 9999-12-31 in
the To attribute indicates that the tuple is still valid; this is a usual notation
in temporal databases. Since the same product participates in the fact table
with as many surrogates as there are attribute changes, the business key
keeps track of all the tuples that pertain to the same product. For example,
the business key will be used when counting the number of different products
sold by the company over specific time periods. Notice that since a new
record is inserted every time an attribute value changes, the dimension can
grow considerably, decreasing the performance during join operations with
the fact table. More sophisticated techniques have been proposed to address
this, and below we will comment on them.
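For instance, the following sketch (over the simplified Sales and Product tables above) counts the distinct products sold using the business key, so that the two versions of prod1 are counted only once:
-- COUNT over the business key ProductID, not the surrogate key ProductKey
SELECT COUNT(DISTINCT P.ProductID) AS NbProductsSold
FROM Sales S JOIN Product P ON S.ProductKey = P.ProductKey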
In the type 2 approach, sometimes an additional attribute is added to ex-
plicitly indicate which is the current row. The table below shows an attribute
denoted by RowStatus, telling which is the current value for product prod1.
ProductKey ProductID ProductName UnitPrice CategoryName Description From To RowStatus
k1 p1 prod1 10.00 cat1 desc1 2010-01-01 2011-12-31 Expired
k11 p1 prod1 10.00 cat2 desc2 2012-01-01 9999-12-31 Current
··· ··· ··· ··· ··· ··· ··· ··· ···
In a snowflake representation, a change in an attribute at the product level creates a new version only in the Product table, and the Category table remains unchanged. However, if the change occurs at
an upper level in the hierarchy, this change needs to be propagated downward
in the hierarchy. For example, suppose that the description of category cat1
changes, as reflected in the following table:
CategoryKey CategoryID CategoryName Description From To
l1 c1 cat1 desc1 2010-01-01 2011-12-31
l11 c1 cat1 desc11 2012-01-01 9999-12-31
l2 c2 cat2 desc2 2010-01-01 9999-12-31
l3 c3 cat3 desc3 2010-01-01 9999-12-31
l4 c4 cat4 desc4 2010-01-01 9999-12-31
This change must be propagated to the Product table so that all sales prior
to the change refer to the old version of category cat1 (with key l1), while
the new sales must point to the new version (with key l11), as shown below:
ProductKey ProductID ProductName UnitPrice CategoryKey From To
k1 p1 prod1 10.00 l1 2010-01-01 2011-12-31
k11 p1 prod1 10.00 l11 2012-01-01 9999-12-31
k2 p2 prod2 12.00 l1 2010-01-01 2011-12-31
k21 p2 prod2 12.00 l11 2012-01-01 9999-12-31
k3 p3 prod3 13.50 l2 2010-01-01 9999-12-31
k4 p4 prod4 15.00 l2 2011-01-01 9999-12-31
In the third solution, called type 3, an additional column is added to the dimension table to store the previous value of an attribute alongside its current value. Note that only the two last versions of the attribute can be represented in this solution and that the validity interval of the tuples is not stored.
It is worth noticing that it is possible to apply the three solutions above,
or combinations of them, to the same dimension. For example, we may apply
correction (type 1), tuple versioning (type 2), or attribute addition (type 3)
for various attributes in the same dimension table.
In addition to these three classic approaches to handle slowly changing
dimensions, more sophisticated (although more difficult to implement) solu-
tions have been proposed. We briefly comment on them next.
The type 4 approach aims at handling very large dimension tables and
attributes that change frequently. This situation can make the dimension tables grow to a point where even browsing the dimension can become very
slow. Thus, a new dimension, called a minidimension, is created to store the
most frequently changing attributes. For example, assume that in the Product
dimension there are attributes SalesRanking and PriceRange, which are likely
to change frequently, depending on the market conditions. Thus, we will
create a new dimension called ProductFeatures, with key ProductFeaturesKey,
and the attributes SalesRanking and PriceRange, as follows:
ProductFeaturesKey SalesRanking PriceRange
pf1 1 1-100
pf2 2 1-100
··· ··· ···
pf200 7 500-600
As can be seen, there will be one row in the minidimension for each unique
combination of SalesRanking and PriceRange encountered in the data, not one
row per product.
The key ProductFeaturesKey must be added to the fact table Sales as a
foreign key. In this way, we prevent the dimension from growing with every change
in the sales ranking score or price range of a product, and the changes are
actually captured by the fact table. For example, assume that product prod1
initially has sales ranking 2 and price range 1-100. A sale of this product will
be entered in the fact table with a value of ProductFeaturesKey equal to pf2.
If later the sales ranking of the product goes up to 1, the subsequent sales
will be entered with ProductFeaturesKey equal to pf1.
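As an illustration, at load time the fact row for a sale of prod1 could obtain its minidimension key as follows (a sketch using the example values above):
-- Look up the minidimension row matching the product's current features
SELECT PF.ProductFeaturesKey
FROM ProductFeatures PF
WHERE PF.SalesRanking = 2 AND PF.PriceRange = '1-100'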
The type 5 approach is an extension of type 4, where the primary dimen-
sion table is extended with a foreign key to the minidimension table. In the
current example, the Product dimension will look as follows:
will be used for current values. In order to support current analysis we need
an additional view, called CurrentProduct, which keeps only current values of
the Product dimension as follows:
ProductID ProductName UnitPrice CategoryKey
p1 prod1 10.00 l2
p2 prod2 12.00 l1
p3 prod3 13.50 l2
p4 prod4 15.00 l2
A variant of this approach uses the surrogate key as the key of the current
dimension, eliminating the need for two different foreign keys in the fact
table.
Several data warehouse platforms provide some support for slowly chang-
ing dimensions, typically type 1 to type 3. However, these solutions are not
satisfactory, since they require considerable programming effort for their cor-
rect manipulation. As we will see in Chap. 11, temporal data warehouses are
a more general solution to this problem since they provide built-in temporal
semantics to data warehouses.
Fig. 5.18 (a) A data cube with two dimensions, Product and Customer. (b) A fact table
representing the same data
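Minimal sketches of a GROUP BY ROLLUP and a GROUP BY CUBE query over the Sales fact table of Fig. 5.18b (column names as in the running example) are:
-- Roll-up: subtotals by product plus the overall total
SELECT ProductKey, CustomerKey, SUM(SalesAmount) AS SalesAmount
FROM Sales
GROUP BY ROLLUP(ProductKey, CustomerKey)

-- Cube: in addition, subtotals by customer
SELECT ProductKey, CustomerKey, SUM(SalesAmount) AS SalesAmount
FROM Sales
GROUP BY CUBE(ProductKey, CustomerKey)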
Figure 5.19 shows the result of the GROUP BY ROLLUP and the GROUP BY
CUBE queries above. In the case of roll-up, in addition to the detailed data,
we can see the total amount by product and the overall total. For example,
the total sales for product p1 is 305. If we also need the totals by customer,
we would need the cube computation, performed by the second query.
Fig. 5.19 Operators (a) GROUP BY ROLLUP and (b) GROUP BY CUBE
Actually, the ROLLUP and CUBE operators are simply shorthands for
a more powerful operator, called GROUPING SETS, which is used to pre-
cisely specify the aggregations to be computed. For example, the GROUP BY
ROLLUP query above can be written using GROUPING SETS as follows:
SELECT ProductKey, CustomerKey, SUM(SalesAmount)
FROM Sales
GROUP BY GROUPING SETS((ProductKey, CustomerKey), (ProductKey), ())
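A first SQL feature to address OLAP queries is window partitioning, which computes an aggregate over a group (partition) of rows related to the current row. A minimal sketch of such a query, comparing each sale with the maximum amount sold of the same product, is:
-- For each row, MAX is computed over the partition of its product
SELECT ProductKey, CustomerKey, SalesAmount,
       MAX(SalesAmount) OVER (PARTITION BY ProductKey) AS MaxAmount
FROM Sales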
The result of the query is given in Fig. 5.20. The first three columns are
obtained from the initial Sales table. The fourth one is obtained as follows.
For each tuple, a window is defined, called partition, containing all the tuples
pertaining to the same product. The attribute SalesAmount is then aggregated
over this group using the corresponding function (in this case MAX) and the
result is written in the MaxAmount column. Note that the first three tuples,
corresponding to product p1, have a MaxAmount of 105, that is, the maximum
amount sold of this product to customer c2.
A second SQL feature to address OLAP queries is called window order-
ing. It is used to order the rows within a partition. This feature is useful, in
particular, to compute rankings. Two common functions applied in this respect are ROW_NUMBER and RANK. For example, the next query shows how each product ranks in the sales of each customer. For this, we
can partition the table by customer, and apply the ROW_NUMBER function
as follows:
SELECT ProductKey, CustomerKey, SalesAmount, ROW_NUMBER() OVER
(PARTITION BY CustomerKey ORDER BY SalesAmount DESC) AS RowNo
FROM Sales
The result is shown in Fig. 5.21a. The first tuple, for example, was evaluated
by opening a window with all the tuples of customer c1, ordered by the sales
amount. We see that product p1 is the one most demanded by customer c1.
Fig. 5.20 Sales of products to customers compared with the maximum amount sold for
that product
Fig. 5.21 (a) Ranking of products in the sales of customers; (b) Ranking of customers
in the sales of products
We could instead partition by product, and study how each customer ranks
in the sales of each product, using the function RANK.
SELECT ProductKey, CustomerKey, SalesAmount, RANK() OVER
(PARTITION BY ProductKey ORDER BY SalesAmount DESC) AS Rank
FROM Sales
As shown in the result given in Fig. 5.21b, the first tuple was evaluated
opening a window with all the tuples with product p1, ordered by the sales
amount. We can see that customer c2 is the one with highest purchases of
p1, and customers c3 and c1 are in the second place, with the same ranking.
A third kind of feature of SQL for OLAP is window framing, which
defines the size of the partition. This is used to compute statistical functions
over time series, like moving averages. To give an example, let us assume that
we add two columns Year and Month to the Sales table. The following query
computes the 3-month moving average of sales by product.
SELECT ProductKey, Year, Month, SalesAmount, AVG(SalesAmount) OVER
(PARTITION BY ProductKey ORDER BY Year, Month
ROWS 2 PRECEDING) AS MovAvg3M
FROM Sales
The result is shown in Fig. 5.22a. For each tuple, the query evaluator opens
a window that contains the tuples pertaining to the current product. Then,
it orders the window by year and month and computes the average over the
current tuple and the preceding two ones, provided they exist. For example,
in the first tuple, the average is computed over the current tuple (there is no
preceding tuple), while in the second tuple, the average is computed over the
current tuple and the preceding one. Finally, in the third tuple, the average
is computed over the current tuple and the two preceding ones.
Fig. 5.22 (a) Three-month moving average of the sales per product (b) Year-to-date
sum of the sales per product
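The year-to-date computation in Fig. 5.22b presumably comes from a similar window query that partitions by product and year and accumulates all preceding rows, along these lines:
SELECT ProductKey, Year, Month, SalesAmount, SUM(SalesAmount) OVER
    (PARTITION BY ProductKey, Year ORDER BY Month
     ROWS UNBOUNDED PRECEDING) AS YTD
FROM Sales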
The result is shown in Fig. 5.22b. For each tuple, the query evaluator opens a
window that contains the tuples pertaining to the current product and year
ordered by month. Unlike in the previous query, the aggregation function
SUM is applied to all the tuples before the current tuple, as indicated by
ROWS UNBOUNDED PRECEDING.
It is worth noting that queries that use window functions can be expressed
without them, although the resulting queries are harder to read and may be
less efficient. For example, the query above computing the year-to-date sales
can be equivalently written as follows:
SELECT S1.ProductKey, S1.Year, S1.Month, S1.SalesAmount,
    SUM(S2.SalesAmount) AS YTD
FROM Sales S1, Sales S2
WHERE S1.ProductKey = S2.ProductKey AND
    S1.Year = S2.Year AND S1.Month >= S2.Month
GROUP BY S1.ProductKey, S1.Year, S1.Month, S1.SalesAmount
Of course, there are many other functions appropriate for OLAP provided
by SQL, which the interested reader can find in the standard.
Data Sources
A data warehouse retrieves its data from one or several data stores. A data
source contains connection information to a data store, which includes the
location of the server, a login and password, a method to retrieve the data,
and security permissions. Analysis Services supports multiple types of data
sources. If the source is a relational database, then SQL is used by default
to query the database. In our example, there is a single data source that
connects to the Northwind data warehouse.
• In the Date table, the named calculations FullSemester, FullQuarter, and FullMonth combine the semester, quarter, or month with the year; an example expression is sketched below.
• In the Sales fact table, the named calculation OrderLineDesc combines the
order number and the order line using the expression
CONVERT(CHAR(5),OrderNo) + ' - ' + CONVERT(CHAR(1),OrderLineNo)
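Similarly, assuming the Quarter and Year columns of the Date table, a named calculation such as FullQuarter could be defined with an expression along these lines:
'Q' + CONVERT(CHAR(1), Quarter) + ' ' + CONVERT(CHAR(4), Year)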
Fig. 5.23 The data source view for the Northwind data warehouse
Dimensions
Analysis Services supports several types of dimensions as follows:
• A regular dimension has a direct one-to-many link between a fact table
and a dimension table. An example is the dimension Product.
• A reference dimension is indirectly related to the fact table through
another dimension. An example is the Geography dimension, which is
related to the Sales fact table through the Customer and Supplier dimen-
sions. Reference dimensions can be chained together, for instance, one
can define another reference dimension from the Geography dimension.
• In a role-playing dimension, a single fact table is related to a di-
mension table more than once. Examples are the dimensions OrderDate,
DueDate, and ShippedDate, which all refer to the Date dimension. A role-
playing dimension is stored once and used multiple times.
• A fact dimension, also referred to as degenerate dimension, is similar
to a regular dimension but the dimension data are stored in the fact table.
An example is the dimension Order.
• In a many-to-many dimension, a fact is related to multiple dimension
members and a member is related to multiple facts. In the Northwind data
warehouse, there is a many-to-many relationship between Employees and
Cities, which is represented in the bridge table Territories. This table must
be defined as a fact table in Analysis Services, as we will see later.
Dimensions can be defined either from a DSV, which provides data for
the dimension, or from preexisting templates provided by Analysis Services.
An example of the latter is the time dimension, which does not need to be
defined from a data source. Dimensions can be built from one or more tables.
In order to define dimensions, we need to discuss how hierarchies are han-
dled in Analysis Services. In the next section, we provide a more detailed dis-
cussion on this topic. In Analysis Services, there are two types of hierarchies.
Attribute hierarchies correspond to a single column in a dimension ta-
ble, for instance, attribute ProductName in dimension Product. On the other
hand, multilevel (or user-defined) hierarchies are derived from two or
more attributes, each attribute being a level in the hierarchy, for instance
Product and Category. An attribute can participate in more than one multi-
level hierarchy, for instance, a hierarchy Product and Brand in addition to the
previous one. Analysis Services supports three types of multilevel hierarchies,
depending on how the members of the hierarchy are related to each other:
balanced, ragged, and parent-child hierarchies. We will explain how to define
these hierarchies in Analysis Services later in this section.
We illustrate how to define the different kinds of dimensions supported
by Analysis Services using the Northwind data warehouse. We start with
a regular dimension, namely, the Product dimension, shown in Fig. 5.24.
The right pane defines the tables in the DSV from which the dimension is
created. The attributes of the dimension are given in the left pane. Finally,
the hierarchy Categories, composed of the Category and Product levels, is
shown in the central pane. The attributes CategoryKey and ProductKey are
used for defining these levels. However, in order to show friendly names when
browsing the hierarchy, the NameColumn property of these attributes is set
to CategoryName and ProductName, respectively.
We next explain how to define the Date dimension. As shown in Fig. 5.26,
the dimension has a hierarchy named Calendar, which is defined using the
attributes Year, Semester, Quarter, Month, and Date. As can be seen, the last
two attributes have been renamed in the hierarchy. To enable MDX time func-
tions with time dimensions, the Type property of the dimension must be set
to Time. Further, it is necessary to identify which attributes in a time dimen-
sion correspond to the typical subdivision of time. This is done by defining
the Type property of the attributes of the dimension. Thus, the attributes
DayNbMonth, MonthNumber, Quarter, Semester, and Year are, respectively, of
type DayOfMonth, MonthOfYear, QuarterOfYear, HalfYearOfYear, and Year.
Attributes in hierarchies must have a one-to-many relationship to their
parents in order to ensure correct roll-up operations. For example, a quarter
must roll-up to its semester. In Analysis Services, this is stated by defin-
ing a key for each attribute composing a hierarchy. By default, this key is
set to the attribute itself, which implies that, for example, years are unique.
Nevertheless, in the Northwind data warehouse, attribute MonthNumber has
values such as 1 and 2, and thus, a given value appears in several quarters.
Therefore, it is necessary to specify that the key of the attribute is a combi-
nation of MonthNumber and Year. This is done by defining the KeyColumns
property of the attribute, as shown in Fig. 5.27. Further, in this case, the
NameColumn property must also be set to the attribute that is shown when
browsing the hierarchy, that is, FullMonth. This should be done similarly for
attributes Quarter and Semester.
Fig. 5.27 Definition of the key for attribute MonthNumber in the Calendar hierarchy
These relationships between attributes for the Date dimension are given
in Fig. 5.28, and correspond to the concept of functional dependencies. In
Analysis Services, there are two types of relationships: flexible relationships,
which can evolve over time (e.g., a product can be assigned to a new category),
and rigid ones, which cannot (e.g., a month is always related to its year). The
relationships shown in Fig. 5.28 are rigid, as indicated by the solid arrow
head. Figure 5.29 shows some members of the Calendar hierarchy. As can be
seen, the named calculations FullSemester (e.g., S2 1997), FullQuarter (e.g.,
Q2 1997), and FullMonth (e.g., January 1997) are displayed.
The definition of the fact dimension Order follows steps similar to those for
the other dimensions, except that the source table for the dimension is the
fact table. The key of the dimension will be composed of the combination
of the order number and the line number. Therefore, the named calculation
OrderLineDesc will be used in the NameColumn property when browsing the
dimension. Also, we must indicate that this is a degenerate dimension when
defining the cube. We will explain this later in this section.
Finally, in many-to-many dimensions, as in the case of City and Em-
ployee, we also need to indicate that the bridge table Territories is actually
defined as a fact table, so Analysis Services can take care of the double-
counting problem. This is also done when defining the cube.
Hierarchies
What we have generically called hierarchies in Chap. 4 and in the present one
are referred to in Analysis Services as user-defined or multilevel hierarchies.
Multilevel hierarchies are defined by means of dimension attributes, and these
attributes may be stored in a single table or in several tables. Therefore, both
the star and the snowflake schema representations are supported in Analysis
Services Multidimensional. As we will see later in this chapter, this is not the
case in Analysis Services Tabular.
Balanced hierarchies are supported by Analysis Services. Examples of
these hierarchies are the Date and the Product dimensions.
Analysis Services does not support unbalanced hierarchies. We have
seen in Sect. 5.5.2 several solutions to cope with them. On the other hand,
Analysis Services supports parent-child hierarchies, which are a special case of unbalanced hierarchies.
Cubes
In Analysis Services, a cube is built from one or several DSVs. A cube consists
of one or more dimensions from dimension tables and one or more measure
groups from fact tables. A measure group is composed of a set of measures.
The facts in a fact table are mapped as measures in a cube. Analysis Services
allows multiple fact tables in a single cube. In this case, the cube typically
contains multiple measure groups, one from each fact table.
Fig. 5.31 Definition of the Northwind cube in Analysis Services Multidimensional. (a)
Measure groups; (b) Dimensions; (c) Schema of the cube
Figure 5.31 shows the definition of the Northwind cube in Analysis Ser-
vices. As can be seen in Fig. 5.31a, Analysis Services adds a new measure
to each measure group, in our case Sales Count and Territories Count, which
counts the number of fact members associated with each member of each dimen-
sion. Thus, Sales Count would count the number of sales for each customer,
supplier, product, and so on. Similarly, Territories Count would count the
number of cities associated with each employee.
Analysis Services also supports semiadditive
measures, that is, measures that can be aggregated in some dimensions but
not in others. Recall that we defined such measures in Sect. 3.1.2. Analysis
Services provides several functions for semiadditive measures such as Avera-
geOfChildren, FirstChild, LastChild, FirstNonEmpty, and LastNonEmpty.
The aggregation function associated with each measure must be defined with
the AggregationFunction property. The default aggregation function is SUM,
and this is suitable for all measures in our example, except for Unit Price and
Discount. Since these are semiadditive measures, their aggregation should be
AverageOfChildren, which computes, for a member, the average of its children.
The FormatString property is used to state the format in which the mea-
sures will be displayed. For example, measure Unit Price is of type money,
and thus, its format will be $###,###.00, where a ‘#’ displays a digit or
nothing, a ‘0’ displays a digit or a zero, and ‘,’ and ‘.’ are, respectively, thou-
sand and decimal separators. The format string for measure Quantity will be
###,##0. Finally, the format string for measure Discount will be 0.00%,
where the percent symbol ‘%’ specifies that the measure is a percentage and
includes a multiplication by a factor of 100.
Further, we can define the default measure of the cube, in our case Sales
Amount. As we will see in Chap. 6, if an MDX query does not specify the
measure to be displayed, then the default measure will be used.
Data Sources
A tabular data model integrates data from one or several data sources, which
can be relational databases, multidimensional databases, data feeds, text files,
etc. Depending on the data source, specific connection and authentication
information must be provided.
When importing data from a relational database, we can either choose the
tables and views from which to import the data or write queries that specify
the data to import. To limit the data in a table, it is possible to filter columns
or rows that are not needed. A common best practice when defining a tabular
model is to define database views that are used instead of tables for loading
data into the model. Such views decouple the physical database structure
from the tabular data model and, for instance, can combine information from
several tables that corresponds to a single entity in the tabular data model.
However, this requires appropriate access rights to create the views.
An essential question in a tabular model is to decide whether to use a
snowflake dimension from the source data or denormalize the source tables
into a single model table. Generally, the benefits of a single model table out-
weigh the benefits of multiple model tables, in particular since in the tabular
model it is not possible to create a hierarchy that spans several tables, and
also since it is more efficient from a performance viewpoint. However, the
storage of redundant denormalized data can result in increased model stor-
age size, particularly for very large dimension tables. Therefore, the optimal
decision depends on the volume of data and the usability requirements.
In the tabular model we will thus import the above views and we will
rename the first three ones as Product, Customer, and Supplier. We also import
the tables Employee, Territories, Sales, Date, and Shipper.
Storage Modes
Analysis Services Tabular has two storage modes. By default, it uses an in-
memory columnar database, referred to as Vertipaq, which stores a copy
of the data read from the data source when the data model is refreshed.
As we will analyze in Chap. 15, in-memory means that all the data reside
in RAM whereas columnar means that the data of individual columns are
stored independently from each other. In addition, data are compressed to
reduce the scan time and the memory required. An alternative storage mode,
referred to as DirectQuery, transforms a tabular model into a metadata
layer on top of an external database. It converts a DAX query to a tabular
model into one or more SQL queries to the external relational database, which
reduces the latency between updates in the data source and the availability
of such data in the analytical database. Which of the two storage modes to
choose depends on the application requirements and the available hardware.
Relationships
When importing tables from a relational database, the import wizard infers
relationships from foreign key constraints. However, this is not the case for the
views we created above, which are not linked to the Sales table. We therefore
need to create these relationships as shown in Fig. 5.35, where a relationship
has several characteristics that define its cardinality, filter direction, and ac-
tive state, which we describe below. Figure 5.36 shows the tabular model of
the Northwind data warehouse after defining these relationships.
Relationships in a tabular model are based on a single column. At least
one of the two tables involved in a relationship should use a column that
Fig. 5.36 Tabular model of the Northwind data warehouse in Analysis Services
has a unique value for each row of the table. This is usually the primary key
of a table, but unlike in foreign keys, this can also be any candidate key.
If a relationship is based on multiple columns, which is supported in foreign
keys, they should be combined in a single column to create the corresponding
relationship. This can be achieved for example through a calculated column
that concatenates the values of the columns composing the relationship.
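For instance, using the OrderNo and OrderLineNo columns of the Sales table, such a combined key could be defined with a calculated column along these lines (the column name OrderLineKey is illustrative):
Sales[OrderLineKey] = FORMAT ( Sales[OrderNo], "0" ) & "-" & FORMAT ( Sales[OrderLineNo], "0" )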
The cardinality of most relationships in a tabular model is one-to-many,
where for each row of the table in the one side, called the lookup table,
there are zero, one, or more rows in the other table, referred to as the related
table. An example is the relationship between the Sales and Customer tables
shown in Fig. 5.35. As shown in the figure, the cardinality may be many-to-
one, one-to-many, or one-to-one. The latter option applies when the columns on
both sides of a relationship are candidate keys for their tables.
The uniqueness constraint in a relationship only affects the lookup table.
Any value in the column of the related table that does not have a
corresponding value in the lookup table is mapped to a special blank row,
which is automatically added to the lookup table to handle all the unmatched
values in the related table. For example, such a special blank row would
aggregate all the rows of the Sales table that do not have a corresponding
row in the Customer table. This additional row exists only if there is at least
one row in the Sales fact table that has no matching row in the Customer table. In a
one-to-many relationship, the functions RELATEDTABLE and RELATED can
be used in a row-related calculation of the, respectively, lookup and related
tables.
In one-to-one relationships, both sides of the relationship have a unique-
ness constraint in the column involved. However, it is not guaranteed that
all values in one column are present in the other column. In one-to-one rela-
tionships, both tables can act as lookup tables and as related tables at the
same time. For this reason, both functions RELATEDTABLE and RELATED
can be used in row-related calculations for both tables.
Relationships in a tabular model are used for filter propagation, where
a filter applied to any column of a table propagates through relationships to
other tables following the filter direction of the relationships. The direction
of a filter propagation is visible in the diagram view, as shown in Fig. 5.36.
By default, a one-to-many relationship propagates the filter from the lookup
table to the related table, which is referred to as a unidirectional relation-
ship. For example, the relationship between Sales and Customer automatically
propagates a filter applied to one or more of the columns of the Customer table
to the Sales table. This can be modified by defining the relationship as bidi-
rectional, which enables filter propagation in both directions. For example,
as shown in Fig. 5.36, the two relationships linking the bridge table Territo-
ries are bidirectional. In addition, the filter propagation defined in a model
can be modified in a DAX expression using the CROSSFILTER function. It is
worth noting that a one-to-one relationship always has a bidirectional filter
propagation. To change this behavior, the relationship type must be changed
to one-to-many.
In a tabular model there can only be one active relationship between two
tables. Consider for example the three role-playing dimensions between the
Sales and the Date tables. As can be seen in Fig. 5.36, only one of the three
relationships is active, which is the one defined by the OrderDate column
and represented by the solid line; the other two are inactive, as represented
by the dotted lines. An inactive relationship is not automatically used in
a DAX expression unless it is activated by using the USERELATIONSHIP
function. This function activates the relationship in the context of a single
expression without altering the state of the relationships for other calculations
and without modifying the underlying data model. More generally, there
cannot be multiple paths between any two tables in the model. Therefore,
when this happens, one or several relationships are automatically deactivated.
This was one of the reasons for defining the views for the Customer and the
Supplier in the relational database, which are then imported in the tabular
model and used as normal tables. Otherwise it would not be possible to access,
for example, both the countries of customers and the countries of suppliers
without having to explicitly write DAX expressions. However, this approach
typically implies introducing data redundancy through denormalization.
This column can be created directly in Visual Studio using the grid view,
where the expression is written in the formula textbox and the column is
renamed after right-clicking on it. As another example, a calculated column
NetAmount can be defined in the Sales table as follows.
Sales[NetAmount] = Sales[SalesAmount] - Sales[Freight]
A calculated column behaves like any other column. The DAX expression
defining a calculated column operates in the context of the current row and
cannot access other rows of the table. Calculated columns are computed
during database processing and then stored in the model. This contrasts
with SQL, where calculated columns are computed at query time and do not
use memory.
On the other hand, measures are used to aggregate values from many
rows in a table. Suppose that we want to compute the percentage of the
net amount with respect to the sales amount. We cannot use a calculated
column for this. Indeed, although this would compute the right value at the
row level, it would provide erroneous values when aggregating this measure
since it would compute the sum of the ratios. Instead, we need to define a
measure that computes the ratio on the aggregates as follows.
Sales[Net Amount %] := DIVIDE ( SUM(Sales[NetAmount]), SUM(Sales[SalesAmount]) )
Fig. 5.37 Displaying the Total Sales measure by customer country in Power BI
Date Dimension
A tabular model requires a date dimension in order to enable time-related
functions in DAX such as year-to-date aggregations. A date dimension is a
table that has a column of the date data type with the following character-
istics: it must have one record per day; it must cover at least the range between
the minimum and maximum dates in the date field to which it will be applied,
although it may extend beyond both ends; and it must have no gaps, which
means that it may include dates for which no facts are recorded.
Fig. 5.38 Setting the Date table in the Northwind tabular model
Figure 5.38 shows how we specify this for the Date table in the Northwind
example. We can also create calculated columns in this table as follows.
'Date'[SemesterName] = "S" & 'Date'[Semester]
'Date'[QuarterName] = "Q" & 'Date'[Quarter]
Hierarchies
Hierarchies are essential to enable the analysis of measures at various gran-
ularities. We create next a Calendar hierarchy for the Date dimension in
the Northwind data warehouse. We use the columns Date, MonthName,
QuarterName, SemesterName, and Year to define the hierarchy. As shown in
Fig. 5.39a, the columns may be renamed with user-friendly names. It is considered
good practice to hide the columns participating in a hierarchy, as
well as the columns that are used to sort other columns. This can be done
for example using the properties as shown in Fig. 5.39b.
Fig. 5.39 (a) Defining the Calendar hierarchy (b) Setting the Hidden and the Sort by
Column properties
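The Supervision parent-child hierarchy of the Employee dimension relies on a calculated column that encodes, for each employee, the chain of supervisor keys; a minimal sketch, assuming the parent key column is named SupervisorKey, is:
Employee[EmployeePath] = PATH ( Employee[EmployeeKey], Employee[SupervisorKey] )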
The above column uses the PATH function to obtain a delimited text with
the keys of all the supervisors of each row, starting from the topmost one.
For example, for the employee Michael Suyama with key 6 it will return the
value 2|5|6.
We then need to define three columns named Level 1, Level 2, and Level 3
for the levels. The Level 1 column is defined as follows.
Level 1 =
VAR LevelKey = PATHITEM ( Employee[EmployeePath], 1, INTEGER )
RETURN LOOKUPVALUE ( Employee[EmployeeName], Employee[EmployeeKey], LevelKey )
The PATHITEM function returns the employee key from the path at the po-
sition specified by the second argument counting from left to right, whereas
the LOOKUPVALUE function returns the name of the employee whose key is
equal to the value of the variable LevelKey. The other two columns are defined
similarly by changing the value of the second argument of the PATHITEM
function. Figure 5.40 shows these calculated columns, which should be hid-
den. A hierarchy named Supervision containing these three levels should be
defined and the attribute Hide Members of the hierarchy should be set to Hide
Blank Members to enable a proper visualization in client tools.
5.10 Summary
In this chapter, we have studied the problem of logical design of data ware-
houses, specifically relational data warehouse design. Several alternatives
were discussed: the star, snowflake, starflake, and constellation schemas. As in
the case of operational databases, we provided rules for translating concep-
tual multidimensional schemas into logical schemas. Particular importance
was given to the logical representation of the various kinds of hierarchies
studied in Chap. 4. The problem of slowly changing dimensions was also
addressed in detail. We then explained how the OLAP operations can be
implemented in the relational model using SQL, and also reviewed the ad-
vanced features SQL provides to support OLAP queries. We concluded the
chapter by showing how we can implement the Northwind data warehouse in
Microsoft Analysis Services using both the multidimensional and the tabular
models.
5.13 Exercises
Exercise 5.1. Consider the data warehouse of the telephone provider given
in Ex. 3.1. Draw a star schema diagram for the data warehouse.
Exercise 5.2. For the star schema obtained in the previous exercise, write
in SQL the queries given in Ex. 3.1.
Exercise 5.3. Consider the data warehouse of the train application given
in Ex. 3.2. Draw a snowflake schema diagram for the data warehouse with
hierarchies for the train and station dimensions.
Exercise 5.4. For the snowflake schema obtained in the previous exercise,
write in SQL the queries given in Ex. 3.2.
Exercise 5.5. Consider the university data warehouse described in Ex. 3.3.
Draw a constellation schema for the data warehouse taking into account the
different granularities of the time dimension.
Exercise 5.6. For the constellation schema obtained in the previous exercise,
write in SQL the queries given in Ex. 3.3.
Exercise 5.7. Translate the MultiDim schema obtained for the French horse
race application in Ex. 4.5 into the relational model.
Exercise 5.8. Translate the MultiDim schema obtained for the Formula One
application in Ex. 4.7 into the relational model.
Exercise 5.9. Implement in Analysis Services a multidimensional model for
the Foodmart data warehouse given in Fig. 5.41.
Exercise 5.10. Implement in Analysis Services a tabular model for the Food-
mart data warehouse given in Fig. 5.41.
Exercise 5.11. The Research and Innovative Technology Administration
(RITA)1 coordinates the US Department of Transportation’s (DOT) research
programs. It collects several statistics about many kinds of transportation
means, including the information about flight segments between airports sum-
marized by month. It is possible to download a set of CSV files in ZIP format,
one per year, ranging from 1990 up to the present. These files include information
about the scheduled and actually departed flights, the number of seats
sold, the freight transported, and the distance traveled, among other data.
The RITA web site describes all fields in detail.
Construct an appropriate data warehouse schema for the above applica-
tion. Analyze the input data and motivate the choice of your schema.
1 http://www.transtats.bts.gov/
Two fundamental concepts in MDX are tuples and sets. We illustrate them
using the example cube given in Fig. 6.1.
[Fig. 6.1: a three-dimensional cube with dimensions Customer (City): Paris, Lyon, Berlin, Köln; Time (Quarter): Q1–Q4; and Product (Category): Beverages, Condiments, Produce, Seafood]
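A tuple identifies a cell of the cube by specifying one member from each dimension. For instance, the cell holding the first-quarter sales of beverages to customers in Paris can be referenced by a tuple such as:
(Product.Category.Beverages, Date.Quarter.Q1, Customer.City.Paris)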
Notice that in the above expression we stated the coordinate for each of
the three dimensions in the format Dimension.Level.Member. As we will see
later, in MDX there are several ways to specify a member of a dimension.
In particular, the order of the members is not significant, and the previous
tuple can also be stated as follows:
(Date.Quarter.Q1, Product.Category.Beverages, Customer.City.Paris)
Since a tuple points to a single cell, it follows that each member in the
tuple must belong to a different dimension.
On the other hand, a set is a collection of tuples defined using the same
dimensions. For example, the following set
{ (Product.Category.Beverages, Date.Quarter.Q1, Customer.City.Paris),
(Product.Category.Beverages, Date.Quarter.Q1, Customer.City.Lyon) }
points to the previous cell with value 21 and the one behind it with value 12.
It is worth noting that a set may have one or even zero tuples.
A tuple does not need to specify a member from every dimension. Thus,
the tuple
(Customer.City.Paris)
points to the slice of the cube composed of the sixteen front cells of the cube,
that is, the sales of product categories in Paris, while the tuple
(Customer.City.Paris, Product.Category.Beverages)
points to the four cells at the front and left of the cube, that is, the sales of
beverages in Paris. If a member for a particular dimension is not specified,
then the default member for the dimension is implied. Typically, the default
member is the All member, which has the aggregated value for the dimension.
However, as we will see later, the default member can be also the current
member in the scope of a query.
Let us see now how tuples interact with hierarchies. Suppose that in our
cube we have a hierarchy in the Customer dimension with levels Customer,
City, State, and Country. In this case, the following tuple
(Customer.Country.France, Product.Category.Beverages, Date.Quarter.Q1)
uses the member France at the Country level and points to the single cell that
holds the value for the total sales of beverages in France in the first quarter.
In MDX, measures act much like dimensions. Suppose that in our cube
we have three measures UnitPrice, Discount, and SalesAmount. In this case,
the Measures dimension, which exists in every cube, contains three members,
and thus, we can specify the measure we want as in the following tuple.
(Customer.Country.France, Product.Category.Beverages, Date.Quarter.Q1,
Measures.SalesAmount)
The axis specification states the axes of a query as well as the members
selected for each of these axes. There can be up to 128 axes in an MDX
query. Each axis is given a number: 0 for the x-axis, 1 for the y-axis, 2 for the
z-axis, and so on. The first axes have predefined names, namely, COLUMNS,
ROWS, PAGES, CHAPTERS, and SECTIONS. Otherwise, the axes can be
referenced using the AXIS(number) or the number naming convention, where
AXIS(0) corresponds to COLUMNS, AXIS(1) corresponds to ROWS, and so
on. It is worth noting that query axes cannot be skipped, that is, a query
that includes an axis must not exclude lower-numbered axes. For example, a
query cannot have a ROWS axis without a COLUMNS axis.
The slicer specification in the WHERE clause is optional. If not specified,
the query returns the default measure for the cube. Unless we want to display
the Measures dimension, most queries have a slicer specification.
The simplest form of an axis specification consists in taking the members of
the required dimension, including those of the Measures dimension, as follows:
SELECT [Measures].MEMBERS ON COLUMNS,
[Customer].[Country].MEMBERS ON ROWS
FROM Sales
This query displays all the measures for customers, summarized at the coun-
try level. In MDX the square brackets are optional, except for a name with
embedded spaces, with numbers, or that is an MDX keyword, where they are
required. In the following, we omit unnecessary square brackets.
The above query will show a row with only null values for countries that
do not have customers. The next query uses the NONEMPTY function to
remove such values.
SELECT Measures.MEMBERS ON COLUMNS,
NONEMPTY(Customer.Country.MEMBERS) ON ROWS
FROM Sales
Although in this case the use of the NONEMPTY function and the NON
EMPTY keyword yields the same result, there are slight differences between
the two, which are beyond the scope of this introduction to MDX.
The above query displayed all measures that are stored in the cube. How-
ever, derived measures such as Net Amount, which we defined in Sect. 5.9.1,
will not appear in the result. If we want this to happen, we should use the
ALLMEMBERS keyword.
SELECT Measures.ALLMEMBERS ON COLUMNS,
Customer.Country.MEMBERS ON ROWS
FROM Sales
6.1.3 Slicing
Consider now the query below, which shows all measures by year.
SELECT Measures.MEMBERS ON COLUMNS,
[Order Date].Year.MEMBERS ON ROWS
FROM Sales
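To restrict this result to a given country, a slicer can be added in the WHERE clause, for instance:
SELECT Measures.MEMBERS ON COLUMNS,
    [Order Date].Year.MEMBERS ON ROWS
FROM Sales
WHERE (Customer.Country.Belgium)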
The added condition tells the query to return the values of all measures for
all years but only for customers who live in Belgium. Thus, we can see how
the WHERE clause behaves differently than in SQL.
Multiple members from different hierarchies can be added to the WHERE
clause. The previous query can be restricted to customers who live in Belgium
and who bought products in the category beverages as shown next.
SELECT Measures.MEMBERS ON COLUMNS,
[Order Date].Year.MEMBERS ON ROWS
FROM Sales
WHERE (Customer.Country.Belgium, Product.Categories.Beverages)
To use multiple members from the same hierarchy, we need to include a set
in the WHERE clause. For example, the following query shows the values of
all measures for all years for customers who bought products in the category
beverages and live in either Belgium or France.
SELECT Measures.MEMBERS ON COLUMNS,
[Order Date].Year.MEMBERS ON ROWS
FROM Sales
WHERE ( { Customer.Country.Belgium, Customer.Country.France },
Product.Categories.Beverages )
Using a set in the WHERE clause implicitly aggregates values for all members
in the set. In this case, the query shows aggregated values for Belgium and
France in each cell.
Consider now the following query, which requests the sales amount of
customers by country and by year.
SELECT [Order Date].Year.MEMBERS ON COLUMNS,
Customer.Country.MEMBERS ON ROWS
FROM Sales
WHERE Measures.[Sales Amount]
6.1.4 Navigation
The result of the query above contains aggregated values for all the years,
including the All column. The CHILDREN function can be used to display
only the values for the individual years (and not the All member) as follows:
SELECT [Order Date].Year.CHILDREN ON COLUMNS, ...
The attentive reader may wonder why the member All does not appear in the
rows of the above result. The reason is that the expression
Customer.Country.MEMBERS
selects the members of the Country level of the Geography hierar-
chy of the Customer dimension. Since the All member is the topmost member
of the hierarchy, above the members of the Continent level, it is not a member
of the Country level and does not appear in the result. Let us explain this
further. As we have seen in Chap. 5, every attribute of a dimension defines
an attribute hierarchy. Thus, there is an All member in each hierarchy of a
dimension, for both the user-defined hierarchies and the attribute hierarchies.
Since the dimension Customer has an attribute hierarchy Company Name, if
in the above query we use the expression
Customer.[Company Name].MEMBERS
the result will contain the All member, in addition to the names of all the
customers. Using CHILDREN instead will not show the All member.
It is also possible to select a single member or an enumeration of members
of a dimension. An example is given in the following query.
SELECT [Order Date].Year.MEMBERS ON COLUMNS,
{ Customer.Country.France,Customer.Country.Italy } ON ROWS
FROM Sales
WHERE Measures.[Sales Amount]
The MEMBERS and CHILDREN functions seen above do not provide the
ability to drill down to a lower level in a hierarchy. For this, the function
DESCENDANTS can be used. For example, the following query
SELECT [Order Date].Year.MEMBERS ON COLUMNS,
DESCENDANTS(Customer.Germany, Customer.City) ON ROWS
FROM Sales
WHERE Measures.[Sales Amount]
• AFTER displays values from the Customer level, since it is the only level after
City.
• SELF_AND_AFTER displays values from the City and Customer levels.
• BEFORE_AND_AFTER displays values from the Country to the Customer
levels, excluding the former.
• SELF_BEFORE_AFTER displays values from the Country to the Customer
levels.
• LEAVES displays values from the City level as above, since this is the only
leaf level between Country and City. On the other hand, if LEAVES is used
without specifying the level, as in the following expression:
DESCENDANTS(Customer.Geography.Germany, , LEAVES)
the values are displayed for the leaf members descending from Germany, that is, for the individual customers.
The function ANCESTOR can be used to obtain the result for an ancestor
at a specified level, as shown next.
SELECT Measures.[Sales Amount] ON COLUMNS,
ANCESTOR(Customer.Geography.[Du monde entier], Customer.Geography.State)
ON ROWS
FROM Sales
Although an MDX query can display up to 128 axes, most OLAP tools are
only able to display two axes, that is, two-dimensional tables. In this case, the
CROSSJOIN function can be used to combine several dimensions in a single
axis. In order to obtain the sales amount for product categories by country
and by quarter in a matrix format, we need to combine the customer and
time dimensions in a single axis as shown next.
SELECT Product.Category.MEMBERS ON COLUMNS,
CROSSJOIN(Customer.Country.MEMBERS,
[Order Date].Calendar.Quarter.MEMBERS) ON ROWS
FROM Sales
WHERE Measures.[Sales Amount]
Alternatively, we can use the cross join operator ‘*’ as shown next.
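SELECT Product.Category.MEMBERS ON COLUMNS,
    Customer.Country.MEMBERS * [Order Date].Calendar.Quarter.MEMBERS ON ROWS
FROM Sales
WHERE Measures.[Sales Amount]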
More than two cross joins can be applied, as shown in the following query.
SELECT Product.Category.MEMBERS ON COLUMNS,
Customer.Country.MEMBERS * [Order Date].Calendar.Quarter.MEMBERS *
Shipper.[Company Name].MEMBERS ON ROWS
FROM Sales
WHERE Measures.[Sales Amount]
6.1.6 Subqueries
As stated above, the WHERE clause applies a slice to the cube. This clause
can be used to select the measures or the dimensions to be displayed. For
example, the following query shows the sales amount for the beverages and
condiments categories.
SELECT Measures.[Sales Amount] ON COLUMNS,
[Order Date].Calendar.Quarter.MEMBERS ON ROWS
FROM Sales
WHERE { Product.Category.Beverages, Product.Category.Condiments }
Instead of using a slicer in the WHERE clause of the above query, we can define
a subquery in the FROM clause as follows:
SELECT Measures.[Sales Amount] ON COLUMNS,
[Order Date].Calendar.Quarter.MEMBERS ON ROWS
FROM ( SELECT { Product.Category.Beverages,
Product.Category.Condiments } ON COLUMNS
FROM Sales )
This query displays the sales amount for each quarter, while the subquery
restricts the cube to the beverages and condiments product categories. Notice
that, unlike in SQL, the outer query can mention attributes that
are not selected in the subquery.
Nevertheless, there is a fundamental difference between using the WHERE
clause and using subqueries. When we include the product category hierarchy
in the WHERE clause it cannot appear on any axis, but this is not the case
in the subquery approach as the following query shows.
SELECT Measures.[Sales Amount] ON COLUMNS,
[Order Date].Calendar.Quarter.MEMBERS * Product.Category.MEMBERS
ON ROWS
FROM ( SELECT { Product.Category.Beverages, Product.Category.Condiments }
ON COLUMNS
FROM Sales )
The subquery may include more than one dimension, as shown next.
SELECT Measures.[Sales Amount] ON COLUMNS,
[Order Date].Calendar.Quarter.MEMBERS * Product.Category.MEMBERS
ON ROWS
FROM ( SELECT ( { Product.Category.Beverages, Product.Category.Condiments },
{ [Order Date].Calendar.[Q1 2017],
[Order Date].Calendar.[Q2 2017] } ) ON COLUMNS
FROM Sales )
We can also nest several subquery expressions, which are used to express
complex multistep filtering operations, as is done in the following query,
which asks for the sales amount by quarter for the top two selling countries
for the beverages and condiments product categories.
SELECT Measures.[Sales Amount] ON COLUMNS,
[Order Date].Calendar.[Quarter].Members ON ROWS
FROM ( SELECT TOPCOUNT(Customer.Country.MEMBERS, 2,
Measures.[Sales Amount]) ON COLUMNS
FROM ( SELECT { Product.Category.Beverages,
Product.Category.Condiments } ON COLUMNS
FROM Sales ) )
This query uses the TOPCOUNT function, which sorts a set in descending
order with respect to the expression given as third parameter and returns the
specified number of elements with the highest values. Notice that although we
could have used a single nesting, the expression above is easier to understand.
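Calculated members are defined in the WITH clause of a query using the general form
WITH MEMBER Parent.MemberName AS ⟨expression⟩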
where Parent is the parent of the new calculated member and MemberName
is its name. Similarly, named sets are used to define new sets as follows:
WITH SET SetName AS ⟨expression⟩
Calculated members and named sets defined using the WITH clause as
above remain within the scope of a query. To make them visible within the
scope of a session and thus visible to all queries in that session, or within the
scope of a cube, a CREATE statement must be used. In the sequel, we will
only show examples of calculated members and named sets defined within
queries. Calculated members and named sets are computed at run time and
thus, there is no penalty in the processing of the cube or in the number of
aggregations to be stored.
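As an illustration of the FORMAT_STRING property discussed next, a calculated measure showing the net amount as a percentage of the sales amount might be defined along these lines (the measure names are taken from earlier sections; the original example may differ):
WITH MEMBER Measures.[Net Amount %] AS
    Measures.[Net Amount] / Measures.[Sales Amount],
    FORMAT_STRING = '#0.00%'
SELECT { Measures.[Sales Amount], Measures.[Net Amount %] } ON COLUMNS,
    Customer.Country.MEMBERS ON ROWS
FROM Sales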
Here, FORMAT_STRING specifies the display format to use for the new cal-
culated member. In the format expression above, a ‘#’ displays a digit or
nothing, while a ‘0’ displays a digit or a zero. The use of the percent sym-
bol ‘%’ specifies that the calculation returns a percentage and includes a
multiplication by a factor of 100.
We can also create a calculated member in a dimension, as shown next.
WITH MEMBER Product.Categories.[All].[Meat & Fish] AS
Product.Categories.[Meat/Poultry] + Product.Categories.[Seafood]
SELECT { Measures.[Unit Price], Measures.Quantity, Measures.Discount,
Measures.[Sales Amount] } ON COLUMNS,
Category.ALLMEMBERS ON ROWS
FROM Sales
The query above creates a calculated member equal to the sum of the
Meat/Poultry and Seafood categories. Being a child of the All member of
the hierarchy Categories of the Product dimension, it will thus belong to the
Category level.
In the following query, we define a named set Nordic Countries composed
of the countries Denmark, Finland, Norway, and Sweden.
WITH SET [Nordic Countries] AS
{ Customer.Country.Denmark, Customer.Country.Finland,
Customer.Country.Norway, Customer.Country.Sweden }
SELECT Measures.MEMBERS ON COLUMNS,
[Nordic Countries] ON ROWS
FROM Sales
In the above example, the named set is defined by enumerating its members
and thus, it is a static named set even if defined in the scope of a session or a
cube, since its result does not need to be reevaluated when the cube is updated. On
the contrary, a dynamic named set is evaluated any time there are changes
to the scope. As an example of a dynamic named set, the following query
displays several measures for the top five selling products.
WITH SET TopFiveProducts AS
TOPCOUNT ( Product.Categories.Product.MEMBERS, 5,
Measures.[Sales Amount] )
SELECT { Measures.[Unit Price], Measures.Quantity, Measures.Discount,
Measures.[Sales Amount] } ON COLUMNS,
TopFiveProducts ON ROWS
FROM Sales
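The discussion of the IIF function below refers to a calculated member expressing the sales of each geography member as a percentage of the sales of its parent; a minimal sketch (the member and measure names are assumed) is:
WITH MEMBER Measures.[Percentage Sales] AS
    IIF((Measures.[Sales Amount], Customer.Geography.CURRENTMEMBER.PARENT) = 0, 1,
        Measures.[Sales Amount] /
        (Measures.[Sales Amount], Customer.Geography.CURRENTMEMBER.PARENT)),
    FORMAT_STRING = '#0.00%'
SELECT { Measures.[Sales Amount], Measures.[Percentage Sales] } ON COLUMNS,
    { Customer.Geography.[All], Customer.Geography.Country.MEMBERS } ON ROWS
FROM Sales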
The IIF function has three parameters: the first one is a Boolean condition,
the second one is the value returned if the condition is true, and the third
one is the value returned if the condition is false. Thus, since the All member
has no parent, the value of the measure sales amount for its parent will be
equal to 0, and in this case a value of 1 will be given for the percentage sales.
The GENERATE function iterates through the members of a set, using a
second set as a template for the resultant set. Suppose we want to display the
sales amount by category for all customers in Belgium and France. To avoid
enumerating in the query all customers for each country, the GENERATE
function can be used as follows:
SELECT Product.Category.MEMBERS ON COLUMNS,
GENERATE({Customer.Belgium, Customer.France},
DESCENDANTS(Customer.Geography.CURRENTMEMBER, Customer))
ON ROWS
FROM Sales
WHERE Measures.[Sales Amount]
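The PREVMEMBER function compares a member with the one preceding it in its hierarchy. The format expression discussed below belongs to a calculated member comparing the net amount of each month with that of the previous month; a sketch (measure names assumed from earlier sections) is:
WITH MEMBER Measures.[Net Amount Growth] AS
    Measures.[Net Amount] - (Measures.[Net Amount],
        [Order Date].Calendar.CURRENTMEMBER.PREVMEMBER),
    FORMAT_STRING = '$###,##0.00; $-###,##0.00'
SELECT { Measures.[Net Amount], Measures.[Net Amount Growth] } ON COLUMNS,
    [Order Date].Calendar.Month.MEMBERS ON ROWS
FROM Sales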
The format expression above defines two formats, the first one for positive
numbers and the second one for negative numbers. Using NEXTMEMBER in
the expression above would show the net amount for each month compared with
that of the following month. Since the Northwind cube contains sales data
starting from July 2016, the growth for the first month cannot be measured,
and thus it would be equal to the net amount. In this case, a value of zero is
used for the previous period that is beyond the range of the cube.
In the query above, instead of the PREVMEMBER function we can use
the LAG(n) function, which returns the member located a specified number
of positions before a given member along the member's hierarchy. If
the number given is negative, a subsequent member is returned, and if it is zero
the current member is returned. Thus, PREVMEMBER, NEXTMEMBER, and CURRENTMEMBER can be
replaced with LAG(1), LAG(-1), and LAG(0), respectively. A similar function
called LEAD exists, such that LAG(n) is equivalent to LEAD(-n).
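A related measure can compare each quarter with the same quarter of the previous year using the PARALLELPERIOD function; a sketch (measure names assumed) is:
WITH MEMBER Measures.[Previous Year] AS
    (Measures.[Net Amount], PARALLELPERIOD([Order Date].Calendar.Quarter, 4))
MEMBER Measures.[Net Amount Growth] AS
    Measures.[Net Amount] - Measures.[Previous Year]
SELECT { Measures.[Net Amount], Measures.[Previous Year],
    Measures.[Net Amount Growth] } ON COLUMNS,
    [Order Date].Calendar.Quarter.MEMBERS ON ROWS
FROM Sales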
Here, the PARALLELPERIOD function selects the member that is four quar-
ters (i.e., a year) prior to the current quarter. Since the Northwind cube
contains sales data starting from July 2016, the above query will show for
the first four quarters a null value for measure Previous Year and the same
value for the measures Net Amount and Net Amount Growth.
The functions OPENINGPERIOD and CLOSINGPERIOD return, respec-
tively, the first or last sibling among the descendants of a member at a given
level. For example, the difference between the sales quantity of a month and
that of the opening month of the quarter can be obtained as follows:
WITH MEMBER Measures.[Quantity Difference] AS
(Measures.[Quantity]) - (Measures.[Quantity],
OPENINGPERIOD([Order Date].Calendar.Month,
[Order Date].Calendar.CURRENTMEMBER.PARENT))
SELECT { Measures.[Quantity], Measures.[Quantity Difference] } ON COLUMNS,
[Order Date].Calendar.[Month] ON ROWS
FROM Sales
Cumulative aggregates are computed with the SUM function, written
SUM(Set [, Numeric Expression]),
which returns the sum of a numeric expression evaluated over a set. For
example, the sum of sales amount for Italy and Greece can be displayed with
the following expression.
SUM({Customer.Country.Italy, Customer.Country.Greece}, Measures.[Sales Amount])
In the expression below, the measure to be displayed is the sum of the current
time member over the year level.
SUM(PERIODSTODATE([Order Date].Calendar.Year,
[Order Date].Calendar.CURRENTMEMBER), Measures.[Sales Amount])
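The remarks that follow assume two calculated measures, YTDSales and QTDSales, built with this pattern over the Year and Quarter levels, respectively; for example:
WITH MEMBER Measures.YTDSales AS
    SUM(PERIODSTODATE([Order Date].Calendar.Year,
        [Order Date].Calendar.CURRENTMEMBER), Measures.[Sales Amount])
MEMBER Measures.QTDSales AS
    SUM(PERIODSTODATE([Order Date].Calendar.Quarter,
        [Order Date].Calendar.CURRENTMEMBER), Measures.[Sales Amount])
SELECT { Measures.YTDSales, Measures.QTDSales } ON COLUMNS,
    [Order Date].Calendar.Month.MEMBERS ON ROWS
FROM Sales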
As the Northwind data warehouse contains sales data starting in July 2016,
the value of both YTDSales and QTDSales for August 2016 is the sum of Sales
Amount of July and August 2016. On the other hand, the value of YTDSales
for December 2016 is the sum of Sales Amount from July to December 2016,
while the value of QTDSales for December 2016 is the sum of Sales Amount
from October to December 2016.
The xTD (YTD, QTD, MTD, and WTD) functions refer to year-, quarter-,
month-, and week-to-date periods. They are only applicable to a time dimen-
sion (which was not the case for the other functions we have seen so far).
The xTD functions are equivalent to the PERIODSTODATE function with a level
specified. YTD specifies a year level, QTD specifies a quarter level, and so
on. For example, in the query above, the measure YTDSales can be defined
instead by the following expression:
SUM(YTD([Order Date].Calendar.CURRENTMEMBER), Measures.[Sales Amount])
Moving averages are used to solve very common business problems. They
are well suited to track the behavior of temporal series, such as financial
indicators or stock market data. As these data change very rapidly, moving
averages are used to smooth out the variations and discover general trends.
However, choosing the period over which smoothing is performed is essential,
because if the period is too long the average will be flat and will not be useful
to discover any trend, whereas a too short period will show too many peaks
and troughs to highlight general trends.
The LAG function we have seen in the previous section, combined with
the range operator ‘:’, helps us to write moving averages in MDX. The range
operator returns a set of members made of two given members and all the
members in between. Thus, for computing the 3-month moving average of
the number of orders we can write the following query:
WITH MEMBER Measures.MovAvg3M AS
AVG([Order Date].Calendar.CURRENTMEMBER.LAG(2):
[Order Date].Calendar.CURRENTMEMBER,
Measures.[Order No]), FORMAT_STRING = '###,##0.00'
SELECT { Measures.[Order No], Measures.MovAvg3M } ON COLUMNS,
[Order Date].Calendar.Month.MEMBERS ON ROWS
FROM Sales
6.1.10 Filtering
As its name suggests, filtering is used to reduce the number of axis members
that are displayed. This is to be contrasted with slicing, as specified in the
WHERE clause, since slicing does not affect selection of the axis members,
but rather the values that go into them.
We have already seen the most common form of filtering, where the mem-
bers of an axis that have no values are removed with the NON EMPTY clause.
The FILTER function filters a set according to a specified condition. Suppose
we want to show the sales amount in 2017 by city and by product category,
but only for top-performing cities, defined as those whose sales
amount exceeds $25,000. A filter would be defined as follows:
SELECT Product.Category.MEMBERS ON COLUMNS,
FILTER(Customer.City.MEMBERS, (Measures.[Sales Amount],
[Order Date].Calendar.[2017]) > 25000) ON ROWS
FROM Sales
WHERE (Measures.[Sales Amount], [Order Date].Calendar.[2017])
As another example, the following query shows customers that in 2017 had
a profit margin below that of their city.
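A sketch of such a query, assuming the profit margin is approximated from the sales amount and the freight (the exact definition of Profit% may differ), is:
WITH MEMBER Measures.[Profit%] AS
    (Measures.[Sales Amount] - Measures.Freight) / Measures.[Sales Amount],
    FORMAT_STRING = '#0.00%'
MEMBER Measures.[Profit%City] AS
    (Measures.[Profit%], Customer.Geography.CURRENTMEMBER.PARENT),
    FORMAT_STRING = '#0.00%'
SELECT { Measures.[Profit%], Measures.[Profit%City] } ON COLUMNS,
    FILTER(Customer.Geography.Customer.MEMBERS,
        (Measures.[Profit%], [Order Date].Calendar.[2017]) <
        (Measures.[Profit%City], [Order Date].Calendar.[2017])) ON ROWS
FROM Sales
WHERE [Order Date].Calendar.[2017]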
Here, Profit% computes the profit percentage of the current member, and
Profit%City applies Profit% to the parent of the current member, that is, the
profit margin of the city to which the customer belongs.
6.1.11 Sorting
In the above queries, the countries are displayed in the hierarchical or-
der determined by the Continent level (the topmost level of the Geography
hierarchy), that is, first the European countries, then the North American
countries, and so on. If we wanted the countries sorted by their name, we can
use the ORDER function, whose syntax is given next.
ORDER(Set, Expression [, ASC | DESC | BASC | BDESC])
The expression can be a numeric or string expression. The default sort order
is ASC. The ‘B’ prefix indicates that the hierarchical order can be broken.
The hierarchical order first sorts members according to their position in the
hierarchy, and then it sorts each level. The nonhierarchical order sorts mem-
bers in the set independently of the hierarchy. In the previous query, the set
of countries can be ordered regardless of the hierarchy in the following way:
SELECT Measures.MEMBERS ON COLUMNS,
ORDER(Customer.Geography.Country.MEMBERS,
Customer.Geography.CURRENTMEMBER.NAME, BASC) ON ROWS
FROM Sales
Here, the property NAME returns the name of a level, dimension, member,
or hierarchy. A similar property, UNIQUENAME, returns the corresponding
unique name. The answer to this query will show the countries in alphabetical
order, that is, Argentina, Australia, Austria, and so on.
It is often the case that the ordering is based on an actual measure. To
sort the query above based on the sales amount, we can proceed as follows:
SELECT Measures.MEMBERS ON COLUMNS,
ORDER(Customer.Geography.Country.MEMBERS,
Measures.[Sales Amount], BDESC) ON ROWS
FROM Sales
The function TOPCOUNT can also be used to answer the previous query.
SELECT Measures.MEMBERS ON COLUMNS,
TOPCOUNT(Customer.Geography.City.MEMBERS, 3,
Measures.[Sales Amount]) ON ROWS
FROM Sales
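The discussion below refers to a query that groups the three best-selling cities into one member and aggregates the remaining ones into another; a sketch of such a query (member names taken from the description) is:
WITH SET SetTop3Cities AS
    TOPCOUNT(Customer.Geography.City.MEMBERS, 3, Measures.[Sales Amount])
MEMBER Customer.Geography.[All].[Top 3 Cities] AS AGGREGATE(SetTop3Cities)
MEMBER Customer.Geography.[All].[Other Cities] AS
    (Customer.Geography.[All]) - (Customer.Geography.[All].[Top 3 Cities])
SELECT Measures.MEMBERS ON COLUMNS,
    { SetTop3Cities, Customer.Geography.[All].[Top 3 Cities],
      Customer.Geography.[All].[Other Cities], Customer.Geography.[All] } ON ROWS
FROM Sales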
The query starts by selecting the three best-selling cities and denotes this
set SetTop3Cities. Then, it adds two members to the Geography hierarchy.
The first one, called Top 3 Cities, contains the aggregation of the measures of
the elements in the set SetTop3Cities. The other member, called Other Cities,
contains the difference between the measures of the member Customer.[All]
and the measures of the member Top 3 Cities. The AGGREGATE function
aggregates each measure using the aggregation operator specified for each
measure. Thus, for measures Unit Price and Discount the average is used,
while for the other measures the sum is applied.
Other functions exist for top filter processing. The TOPPERCENT and
TOPSUM functions return the top elements whose cumulative total is at
least a specified percentage or a specified value, respectively. For example,
the next query displays the list of cities whose cumulative sales amount accounts for 30%
of all the sales.
SELECT Measures.[Sales Amount] ON COLUMNS,
{ TOPPERCENT(Customer.Geography.City.MEMBERS, 30,
Measures.[Sales Amount]), Customer.Geography.[All] } ON ROWS
FROM Sales
MDX provides many aggregation functions. We have seen already the SUM
and AVG functions. Other functions, like MEDIAN, MAX, MIN, VAR, and
STDDEV compute, respectively, the median, maximum, minimum, variance,
and standard deviation of tuples in a set based on a numeric value. For
example, the following query analyzes each product category to see the total,
maximum, minimum, and average sales amount for a 1-month period in 2017.
Our next query computes the maximum sales by category as well as the
month in which they occurred.
WITH MEMBER Measures.[Maximum Sales] AS
MAX(DESCENDANTS([Order Date].Calendar.Year.[2017],
[Order Date].Calendar.Month), Measures.[Sales Amount])
MEMBER Measures.[Maximum Period] AS
TOPCOUNT(DESCENDANTS([Order Date].Calendar.Year.[2017],
[Order Date].Calendar.Month), 1, Measures.[Sales Amount]).ITEM(0).NAME
SELECT { [Maximum Sales], [Maximum Period] } ON COLUMNS,
Product.Categories.Category.MEMBERS ON ROWS
FROM Sales
Here, the TOPCOUNT function obtains the tuple corresponding to the maxi-
mum sales amount. Then, the ITEM function retrieves the first member from
the specified tuple, and the NAME function obtains the name of this member.
The COUNT function counts the number of tuples in a set. It has an op-
tional parameter, with values INCLUDEEMPTY or EXCLUDEEMPTY, which
states whether to include or exclude empty cells. For example, the COUNT
function can be used to compute the number of customers that purchased
a particular product category. This can be done by counting the number of
tuples obtained by joining the sales amount and customer names. Excluding
empty cells is necessary to restrict the count to those customers for which
there are sales in the corresponding product category as shown below.
WITH MEMBER Measures.[Customer Count] AS
COUNT({Measures.[Sales Amount] *
[Customer].[Company Name].MEMBERS}, EXCLUDEEMPTY)
SELECT { Measures.[Sales Amount], Measures.[Customer Count] } ON COLUMNS,
Product.Category.MEMBERS ON ROWS
FROM Sales
6.2.1 Expressions
DAX is a typed language that supports the following data types: integer
numbers, floating-point numbers, currency (a fixed decimal number with four
digits of fixed precision that is stored as an integer), datetimes, Boolean,
string, and binary large object (BLOB). The type system determines the
resulting type of an expression based on the terms used in it.
Functions are used to perform calculations on a data model. Most func-
tions have input arguments, or parameters, and these may be required or
optional. A function returns a value when it is executed, which may be a
single value or a table. DAX provides multiple functions to perform calcu-
lations, including date and time functions, logical functions, statistical func-
tions, mathematical and trigonometric functions, financial functions, etc.
DAX provides several types of operators, namely, arithmetic (+, -, *, /),
comparison (=, <>, >, etc.), text concatenation (&), and logical operators
(&& and ||, corresponding to logical and and logical or). Operators are over-
loaded, that is, an operator behaves differently depending on its arguments.
Expressions are constructed with elements of a data model (such as
tables, columns, or measures), functions, operators, or constants. They are
used to define measures, calculated columns, calculated tables, or queries. An
expression for a measure or a calculated column must return a scalar value,
such as a number or a string, while an expression for a calculated table or a
query must return a table. Recall that we have already used DAX expressions
for defining measures and calculated columns when we defined the tabular
model for the Northwind case study in Sect. 5.9.2.
A table in a data model includes both columns and measures. These can
be referenced in an expression, for example, as 'Sales'[Quantity], where the
quotes can be omitted if the table name does not start with a number, does
not contain spaces, and is not a reserved word like Date or Sum. In addition,
the table name can be omitted if the expression is written in the same context
as the table that includes the referenced column or measure. However, it is
common practice to write the table name for a column reference, such as the
example above, and omit the table name for a measure reference, such as
[Sales Amount]. This improves readability and maintainability, since column
and measure references have different calculation semantics.
Measures are used to aggregate values from multiple rows in a table. An
example is given next, which uses the SUM aggregate function.
Sales[Sales Amount] := SUM( Sales[SalesAmount] )
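A calculated column, in contrast, is computed row by row from an expression. As a minimal sketch (assuming, for illustration only, that the Employee table has a BirthDate column), the age of an employee could be defined as the following calculated column.
Employee[Age] = INT( YEARFRAC( Employee[BirthDate], TODAY() ) )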
Here, TODAY returns the current date, YEARFRAC returns the year fraction
representing the number of days between two dates, and INT rounds a number
down to the nearest integer. Calculated columns are computed during data-
base processing and stored in the model. This contrasts with SQL, where
calculated columns are computed at query time and do not use memory. As
a consequence, computing complex calculated columns in DAX is done at
process time and not query time, resulting in a better user experience.
Variables can be used to avoid repeating the same subexpression. The
following example classifies customers according to their total sales.
Customer[Class] =
VAR TotalSales = SUM( Sales[SalesAmount] )
RETURN
SWITCH( TRUE, TotalSales > 1000, "A", TotalSales > 100, "B", "C" )
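The next paragraph refers to a calculated table that builds the Date table of the model; a sketch of such a definition, assuming for illustration that the Sales table exposes an OrderDate column of type date, is the following.
'Date' =
VAR MinYear = YEAR( MIN( Sales[OrderDate] ) )
VAR MaxYear = YEAR( MAX( Sales[OrderDate] ) )
RETURN
ADDCOLUMNS(
    FILTER( CALENDARAUTO(),
        YEAR( [Date] ) >= MinYear && YEAR( [Date] ) <= MaxYear ),
    "Year", YEAR( [Date] ),
    "MonthNumber", MONTH( [Date] ),
    "MonthName", FORMAT( [Date], "mmmm" ) )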
The CALENDARAUTO function searches in all dates of the data model and
returns a table with a single column named Date containing all the dates
from the first day of the year of the minimum date to the last day of the
year of the maximum date. Since this may include irrelevant dates, such as
customers’ birth dates, the FILTER function fixes the date range to the years
in the existing transactions of the Sales table. Finally, the ADDCOLUMNS
function returns a table with new columns specified by the expressions.
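As a minimal illustration, consider the following sketch of a measure that uses a filter argument of CALCULATE to restrict the computation to European customers.
Sales[Sales Amount Europe] :=
    CALCULATE( SUM( Sales[SalesAmount] ), Customer[Continent] = "Europe" )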
In this case, the CALCULATE function proceeds as follows: (1) saves the
current filter context, (2) evaluates each filter argument and produces for
each of them the list of valid values for the column, (3) uses the new condition
to replace the existing filters in the model, (4) computes the expression of
the first argument in the new filter context, and (5) restores the original filter
context and returns the computed result.
Another task performed by the CALCULATE function is to transform any
existing row context into an equivalent filter context, as illustrated by the
example sketched below.
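As a minimal sketch, assume a calculated column defined on the Customer table as follows.
Customer[Customer Sales] = CALCULATE( SUM( Sales[SalesAmount] ) )
The expression is evaluated in the row context of each customer; CALCULATE transforms this row context into a filter context, so the sum only considers the sales of the current customer. Without CALCULATE, the same expression would return the total sales of all customers in every row.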
6.2.3 Queries
An ORDER BY clause can be added to sort the result of the above query in
descending order of total sales. An optional START AT keyword, used inside
an ORDER BY clause, defines the value at which the query results begin.
The optional DEFINE clause is used to define measures or variables that
are local to the query. Definitions can reference other definitions that appear
before or after the current definition. The next example displays all measures
by category and supplier.
DEFINE
MEASURE Sales[Sales Amount] = SUM( [SalesAmount] )
MEASURE Sales[Freight] = SUM( [Freight] )
MEASURE Sales[Avg Unit Price] = AVERAGE( [UnitPrice] )
MEASURE Sales[Quantity] = SUM( [Quantity] )
MEASURE Sales[Avg Discount] = AVERAGE( [Discount] )
EVALUATE
SUMMARIZECOLUMNS(
Product[CategoryName], Supplier[CompanyName], "Sales Amount", [Sales Amount],
"Freight", [Freight], "Avg Unit Price", [Avg Unit Price],
"Quantity", [Quantity], "Avg Discount", [Avg Discount] )
ORDER BY [CategoryName], [CompanyName]
Notice that not all the measures above are additive, which is why we used
the AVERAGE aggregate function for Unit Price and Discount.
Consider now the next query, which shows the total, minimum, and max-
imum sales amount by country and category.
DEFINE
MEASURE Sales[Sales Amount] = SUM( [SalesAmount] )
MEASURE Sales[Min Amount] = MIN( [SalesAmount] )
MEASURE Sales[Max Amount] = MAX( [SalesAmount] )
EVALUATE
SUMMARIZECOLUMNS(
Customer[Country], Product[CategoryName], "Sales Amount", [Sales Amount],
"Min Amount", [Min Amount], "Max Amount", [Max Amount] )
ORDER BY [Country]
The above query will not show the rows without sales for a country and a
category. Recall that to achieve this in MDX we need to explicitly add the
NON EMPTY keyword. If we want to show the rows without sales we need
to explicitly replace the measures with a zero value as shown next.
MEASURE Sales[Sales Amount] =
IF( ISBLANK( SUM( [SalesAmount] ) ), 0, SUM( [SalesAmount] ) )
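A roll-up query computing the sales by customer country and state, with subtotals per country and a grand total, can be sketched as follows (based on the filtered variant shown in the next section).
DEFINE
MEASURE Sales[Sales Amount] = SUM( [SalesAmount] )
EVALUATE
SUMMARIZE(
    Customer, ROLLUP( Customer[Country], Customer[State] ),
    "Sales Amount", [Sales Amount] )
ORDER BY [Country], [State]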
Note that the ROLLUP function must be used with the SUMMARIZE function,
which is similar to the SUMMARIZECOLUMNS function introduced above.
6.2.4 Filtering
The following query displays the sales amount by product category, year, and quarter.
DEFINE
MEASURE Sales[Sales Amount] = SUM( [SalesAmount] )
EVALUATE
SUMMARIZECOLUMNS(
Product[CategoryName], 'Date'[Year], 'Date'[Quarter], "Sales Amount", [Sales Amount] )
ORDER BY [CategoryName], [Year], [Quarter]
Suppose that in the above query we would like to focus only on sales to
European customers. This can be done as follows.
DEFINE
MEASURE Sales[Sales Amount] = SUM( [SalesAmount] )
EVALUATE
SUMMARIZECOLUMNS(
Product[CategoryName], 'Date'[Year], 'Date'[Quarter],
FILTER( Customer, Customer[Continent] = "Europe" ),
"Sales Amount", [Sales Amount] )
ORDER BY [CategoryName], [Year], [Quarter]
The FILTER function returns the rows of a table that satisfy a condition.
In this case, the result of the query will have the same number of rows as
the previous one without a filter, although the aggregated values will only
consider sales to European customers. On the other hand, the following query
focuses only on the beverages and condiments categories.
DEFINE
MEASURE Sales[Sales Amount] = SUM( [SalesAmount] )
EVALUATE
SUMMARIZECOLUMNS(
Product[CategoryName], 'Date'[Year], 'Date'[Quarter],
FILTER( Product, Product[CategoryName] IN { "Condiments", "Beverages" } ),
"Sales Amount", [Sales Amount] )
ORDER BY [CategoryName], [Year], [Quarter]
In this case, the resulting table will display only rows pertaining to one of
the two categories. Notice that several filters can be applied at the same time
as shown in the next query.
DEFINE
MEASURE Sales[Sales Amount] = SUM( [SalesAmount] )
EVALUATE
SUMMARIZECOLUMNS(
Product[CategoryName], 'Date'[Year], 'Date'[Quarter],
FILTER( Customer, Customer[Continent] = "Europe" ),
FILTER( Product, Product[CategoryName] IN { "Condiments", "Beverages" } ),
FILTER( 'Date', 'Date'[Year] = 2017 ),
"Sales Amount", [Sales Amount] )
ORDER BY [CategoryName], [Year], [Quarter]
Consider again the roll-up query of the previous section, which asks for
the sales by customer state and country. Using filtering we can restrict that
query to only European customers as shown next.
DEFINE
MEASURE Sales[Sales Amount] = SUM( [SalesAmount] )
EVALUATE
SUMMARIZE(
FILTER( Customer, Customer[Continent] = "Europe" ),
ROLLUP( Customer[Country], Customer[State] ),
"Sales Amount", [Sales Amount] )
ORDER BY [Country], [State]
Consider now the following query, which displays the percentage profit of
sales for Nordic countries.
DEFINE
MEASURE Sales[Sales Amount] = SUM( [SalesAmount] )
MEASURE Sales[Freight] = SUM( [Freight] )
MEASURE Sales[Profit %] = ( [Sales Amount] - [Freight] )/ [Sales Amount]
EVALUATE
SUMMARIZECOLUMNS(
Customer[Country],
FILTER( Customer, Customer[Country] IN
{ "Denmark", "Finland", "Norway", "Sweden" } ),
"Sales Amount", [Sales Amount], "Freight", [Freight], "Profit %", [Profit %] )
ORDER BY [Country]
The filter above, like all the previous ones, is static in the sense that it
always shows the aggregate values for the given countries. Suppose now that
we would like to display only the countries with an overall sales amount
greater than $25,000. In this case, the filter would be dynamic, since different
countries will be shown depending on the contents of the database.
DEFINE
MEASURE Sales[Sales Amount] = SUM( [SalesAmount] )
MEASURE Sales[Freight] = SUM( [Freight] )
MEASURE Sales[Profit %] = ( [Sales Amount] - [Freight] ) / [Sales Amount]
EVALUATE
FILTER(
SUMMARIZECOLUMNS(
Customer[Country], "Sales Amount", [Sales Amount], "Freight", [Freight],
"Profit %", [Profit %] ),
[Sales Amount] > 25000 )
ORDER BY [Sales Amount] DESC
FILTER(
SUMMARIZECOLUMNS(
Customer[City], FILTER( 'Date', 'Date'[Year] = 2017 ),
"Sales Amount", Sales[Sales Amount] ),
[Sales Amount] > 25000 ),
"Sales Amount", Sales[Sales Amount] )
ORDER BY [City], [CategoryName]
Recall from Sect. 6.2.2 that the CALCULATE function evaluates an expression
in a context modified by filters, in this case restricted to European customers.
As a more elaborate example, the following query classifies orders in three
categories according to their total sales amount and displays by month the
total number of orders as well as the number of orders per category.
DEFINE
MEASURE Sales[Sales Amount] = SUM( Sales[SalesAmount] )
MEASURE Sales[Order Count] = DISTINCTCOUNT( Sales[OrderNo] )
MEASURE Sales[Order Count A] =
COUNTROWS(
FILTER( VALUES( Sales[OrderNo] ), CALCULATE( [Sales Amount] <= 1000 ) ) )
MEASURE Sales[Order Count B] =
COUNTROWS(
FILTER( VALUES( Sales[OrderNo] ),
CALCULATE( [Sales Amount] > 1000 && [Sales Amount] <= 10000 ) ) )
MEASURE Sales[Order Count C] =
COUNTROWS(
FILTER( VALUES( Sales[OrderNo] ), CALCULATE( [Sales Amount] > 10000 ) ) )
EVALUATE
SUMMARIZECOLUMNS( 'Date'[MonthNumber], 'Date'[Year],
"Order Count", [Order Count], "Order Count A", [Order Count A],
"Order Count B", [Order Count B], "Order Count C", [Order Count C] )
ORDER BY [Year], [MonthNumber]
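The next paragraph refers to a query that displays, for each city, its sales amount and the percentage it represents with respect to the sales of its country; a sketch of such a query (the FORMAT call and column names are assumptions) is the following.
DEFINE
MEASURE Sales[Sales Amount] = SUM( [SalesAmount] )
MEASURE Sales[Sales Amount %] = SUM( [SalesAmount] ) /
    CALCULATE( SUM( Sales[SalesAmount] ), ALL( Customer[City] ) )
EVALUATE
SUMMARIZECOLUMNS(
    Customer[City], Customer[Country],
    "Sales Amount", [Sales Amount],
    "Sales Amount %", FORMAT( [Sales Amount %], "Percent" ) )
ORDER BY [Country], [City]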
The ALL function removes any filter from the customer city column. Since
the customer country is used in the SUMMARIZECOLUMNS function as a
grouping column, this has the effect of setting a context filter, so the ratio is
computed for all cities of that country. If we remove this column, the ratio
would instead be computed with respect to all cities. The query also shows
the FORMAT function, which formats output values, in this case percentages.
If we would like to restrict the above query so that it only shows data for
European cities, this can be done by adding a filter as follows.
DEFINE
MEASURE Sales[Sales Amount] = SUM( [SalesAmount] )
MEASURE Sales[Sales Amount %] = SUM( [SalesAmount] ) /
CALCULATE( SUM( Sales[SalesAmount] ), ALLSELECTED( Customer[City] ) )
EVALUATE
SUMMARIZECOLUMNS(
Customer[City], Customer[Country],
FILTER( Customer, Customer[Continent] = "Europe" ),
"Sales Amount", [Sales Amount], "Sales Amount %", [Sales Amount %] )
ORDER BY [Country], [City]
Notice that we used the ALLSELECTED function, which removes any context
filters from the customer city column while keeping explicit filters. Using ALL
instead would give the wrong results. This is just an example of the complex-
ities of context changes, whose explanation goes beyond this introduction to
DAX.
The following query shows the customers that in 2017 had profit margins
below that of their city.
DEFINE
MEASURE Sales[Sales Amount] = SUM( [SalesAmount] )
MEASURE Sales[Net Amount] = SUM( [SalesAmount] ) - SUM( [Freight] )
MEASURE Sales[Profit %] = [Net Amount] / [Sales Amount]
MEASURE Sales[Profit % City] =
CALCULATE( [Profit %], ALL( Customer[CompanyName] ) )
EVALUATE
FILTER(
SUMMARIZECOLUMNS(
Customer[CompanyName], Customer[City], FILTER( 'Date', 'Date'[Year] = 2017 ),
"Sales Amount", [Sales Amount], "Net Amount", [Net Amount],
"Profit %", [Profit %], "Profit % City", [Profit % City] ),
NOT ISBLANK([Profit %]) && [Profit %] < [Profit % City] )
ORDER BY [CompanyName]
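The next paragraph refers to a query that compares the net amount of each quarter with that of the same quarter in the previous year; a sketch consistent with that description (the measure names follow the text below) is the following.
DEFINE
MEASURE Sales[Net Amount] = SUM( [SalesAmount] ) - SUM( [Freight] )
MEASURE Sales[Net Amount PY] =
    CALCULATE( [Net Amount], PARALLELPERIOD( 'Date'[Date], -4, QUARTER ) )
MEASURE Sales[Net Amount Growth] = [Net Amount] - [Net Amount PY]
EVALUATE
SUMMARIZECOLUMNS(
    'Date'[Year], 'Date'[Quarter],
    "Net Amount", [Net Amount], "Net Amount PY", [Net Amount PY],
    "Net Amount Growth", [Net Amount Growth] )
ORDER BY [Year], [Quarter]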
The PARALLELPERIOD function above shifts the dates four quarters (that
is, one year) back in time. The above query could be written by using the
DATEADD function. Nevertheless, the two functions are not equivalent, since
PARALLELPERIOD returns full periods at the month, quarter, and year gran-
ularity, while DATEADD also works at the day granularity. Notice also that
since the Northwind cube contains sales data starting from July 2016, for the
first four quarters the value of Net Amount PY will be blank and thus, the
values of Net Amount and Net Amount Growth will be equal.
As another example, the following query computes the difference between
the sales quantity of a month and that of the opening month of the quarter.
DEFINE
MEASURE Sales[Total Qty] = SUM( [Quantity] )
MEASURE Sales[Total Qty SoQ Diff] =
VAR SoQ = STARTOFQUARTER( 'Date'[Date] )
RETURN
[Total Qty] - CALCULATE( SUM( [Quantity] ),
DATESBETWEEN( 'Date'[Date], SoQ, ENDOFMONTH( SoQ ) ) )
EVALUATE
SUMMARIZECOLUMNS(
'Date'[MonthName], 'Date'[Year], 'Date'[MonthNumber], "Total Qty", [Total Qty],
"Total Qty SoQ Diff", [Total Qty SoQ Diff] )
ORDER BY [Year], [MonthNumber]
DEFINE
MEASURE Sales[No Orders] = DISTINCTCOUNT( Sales[OrderNo] )
MEASURE Sales[MovAvg3M] =
CALCULATE(
AVERAGEX( VALUES( 'Date'[MonthNumber] ), [No Orders] ),
DATESINPERIOD( 'Date'[Date], MAX( 'Date'[Date] ), -3, MONTH ) )
EVALUATE
SUMMARIZECOLUMNS(
'Date'[MonthName], 'Date'[Year], 'Date'[MonthNumber],
"No Orders", [No Orders], "MovAvg3M", [MovAvg3M] )
ORDER BY [Year], [MonthNumber]
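The next paragraph refers to a query that obtains the last order date of each customer; mirroring the shipped-date variant given afterwards, it can be sketched as follows.
DEFINE
MEASURE Customer[Last Order Date] =
    MAXX( ADDCOLUMNS( Sales, "Order Date", RELATED( 'Date'[Date] ) ),
        [Order Date] )
EVALUATE
SUMMARIZECOLUMNS(
    Customer[CompanyName],
    "Last Order Date", FORMAT( [Last Order Date], "yyyy-mm-dd" ) )
ORDER BY [CompanyName]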
The ADDCOLUMNS function adds to the Sales table a new column containing
the order date using the RELATED function. Recall from Sect. 5.9.2 that in
the tabular model the active relationship between tables Sales and Date is
based on the order date. If we want instead to obtain the last shipped date
of customers, this can be done as shown next.
DEFINE
MEASURE Customer[Last Shipped Date] =
MAXX( ADDCOLUMNS( Sales, "Shipped Date",
LOOKUPVALUE( 'Date'[Date], 'Date'[DateKey], Sales[ShippedDateKey] ) ),
[Shipped Date] )
EVALUATE
SUMMARIZECOLUMNS (
Customer[CompanyName],
"Last Shipped Date", FORMAT( [Last Shipped Date], "yyyy-mm-dd" ) )
ORDER BY [CompanyName]
The LOOKUPVALUE function searches for the date value in table Date that
corresponds to the shipped date key value in table Sales.
All the measures we have seen until now use the active relationships be-
tween the Date and the Sales tables. The following query uses the
USERELATIONSHIP function to exploit the inactive relationship based on the
shipped date.
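A sketch of such a query, assuming the model defines an inactive relationship between Sales[ShippedDateKey] and 'Date'[DateKey], is the following.
DEFINE
MEASURE Sales[Sales Amount Shipped] =
    CALCULATE( SUM( Sales[SalesAmount] ),
        USERELATIONSHIP( Sales[ShippedDateKey], 'Date'[DateKey] ) )
EVALUATE
SUMMARIZECOLUMNS( 'Date'[Year], "Sales Amount Shipped", [Sales Amount Shipped] )
ORDER BY [Year]
The USERELATIONSHIP modifier activates the shipped-date relationship for the duration of the CALCULATE call, so the sales amount is aggregated by shipped date rather than by order date.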
The TOPN function returns a given number of top rows from a table
according to a specified expression.
As a more elaborate example, the next query displays the top three cities
based on sales amount together with their combined sales, the combined sales
of all the other cities, and the overall sales of all cities.
DEFINE
MEASURE Sales[Sales Amount] = SUM( [SalesAmount] )
VAR Top3Cities =
TOPN( 3,
SUMMARIZECOLUMNS( Customer[City], "Sales Amount", [Sales Amount] ),
[Sales Amount], DESC )
VAR TotalTop3Cities =
ROW("City", "Top 3 Cities",
"Sales Amount", CALCULATE( [Sales Amount], Top3Cities ) )
VAR AllCities =
ROW("City", "All Cities",
"Sales Amount", CALCULATE( [Sales Amount] ) )
VAR OtherCities =
ROW("City", "Other Cities",
"Sales Amount", CALCULATE( [Sales Amount] ) -
6.2 Introduction to DAX 193
Table Top3Cities uses the TOPN function to select from the result of the SUM-
MARIZECOLUMNS function the three rows with the highest sales amount
values. Table TotalTop3Cities uses the ROW function to compute a single-
row table aggregating the sales amount value of the top three cities. Table
AllCities computes a single-row table containing the overall sales amount of
all cities. Table OtherCities computes a single-row table that subtracts the
overall sales amount of the top-three cities from the overall sales amount of
all cities. Finally, the main query uses the UNION function to form the union
of all the previously computed partial results.
Ranking is another essential analytics requirement. DAX provides the
RANKX function for this purpose. For example, the following query ranks
the customer countries with respect to the sales amount figures.
DEFINE
MEASURE Sales[Sales Amount] = SUM ( [SalesAmount] )
EVALUATE
ADDCOLUMNS(
SUMMARIZECOLUMNS( Customer[Country], "SalesAmount", [Sales Amount] ),
"Country Rank", RANKX( ALL( Customer[Country] ), [Sales Amount] ) )
ORDER BY [Country Rank]
The first ranking, Category Rank, computes for each country the rank of
the categories. For example, that would give as result that in Argentina the
category with the highest sales is confections. The columns stated in the
ORDER BY clause above allow us to visualize this. On the other hand, Country
Rank computes for each category the rank of the countries. For example, that
would give as result that for the category confections the country with highest
sales is United States. In order to better visualize this second ranking, the
ORDER BY in the above query should be changed as follows.
The FILTER function selects the best category for each country, while the
SELECTCOLUMNS function projects out the rank column and renames the
column with the best category.
DAX provides several functions for combining tables. Consider the following
query, which lists and counts the cities assigned to employees.
DEFINE
MEASURE Employee[Nb Cities] = COUNT( EmployeeGeography[City] )
MEASURE Employee[Cities] =
CONCATENATEX( VALUES( EmployeeGeography[City] ), [City], ", " )
EVALUATE
FILTER(
SUMMARIZECOLUMNS(
Employee[EmployeeName], "Nb Cities", [Nb Cities], "Cities", [Cities] ),
[EmployeeName] <> BLANK() )
ORDER BY [EmployeeName]
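The next paragraph refers to a query that counts, for each customer country, the number of suppliers located in that country by means of the TREATAS function; a sketch of such a query is the following.
DEFINE
MEASURE Sales[Nb Suppliers] = COUNT( Supplier[CompanyName] )
EVALUATE
SUMMARIZECOLUMNS(
    Customer[Country],
    "Nb Suppliers",
        CALCULATE( [Nb Suppliers],
            TREATAS( VALUES( Customer[Country] ), Supplier[Country] ) ) )
ORDER BY [Country]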
The function TREATAS modifies the data lineage and makes customer coun-
tries behave as supplier countries when calculating the measure. Thus, by
using customer country as a grouping column in SUMMARIZECOLUMNS, we
obtain the number of suppliers for that country.
As said above, in the tabular model there is no single table containing
all countries; instead, countries are associated with customers, suppliers,
and employees independently. We need to use set operations to combine the
countries appearing in three different tables. DAX provides the following set
operations: UNION, INTERSECT, and EXCEPT, the latter one being the set
difference. For example, the following query computes the union without du-
plicates of the customer and supplier countries.
EVALUATE
DISTINCT( UNION( VALUES( Customer[Country] ), VALUES( Supplier[Country] ) ) )
ORDER BY [Country]
As shown in the query, the UNION function keeps duplicate values, which are
removed with the DISTINCT function. It is worth mentioning that the result
of the union above has no data lineage.
DAX provides the following functions for performing various kinds of joins:
NATURALINNERJOIN, NATURALLEFTOUTERJOIN, and CROSSJOIN. The
first two functions join the tables on common columns of the same name,
while the last function performs a Cartesian product.
NATURALINNERJOIN and NATURALLEFTOUTERJOIN only join columns
from the same source table. In order to join two columns with the same name
that have no relationship, it is necessary to either use the TREATAS function
explained above or to write the column using an expression that breaks the
data lineage. This is shown in the next query, which computes the number
of customers and suppliers for all customer and supplier countries.
DEFINE
MEASURE Sales[Nb Customers] = COUNT( Customer[CompanyName] )
MEASURE Sales[Nb Suppliers] = COUNT( Supplier[CompanyName] )
VAR T1 =
SELECTCOLUMNS(
SUMMARIZECOLUMNS(
Customer[Country], "Nb Customers", [Nb Customers] ),
"Country", [Country] & "", "Nb Customers", [Nb Customers] )
VAR T2 =
SELECTCOLUMNS(
SUMMARIZECOLUMNS(
Supplier[Country], "Nb Suppliers", [Nb Suppliers] ),
"Country", [Country] & "", "Nb Suppliers", [Nb Suppliers] )
EVALUATE
DISTINCT(
UNION(
NATURALLEFTOUTERJOIN( T1, T2 ),
SELECTCOLUMNS(
NATURALLEFTOUTERJOIN( T2, T1 ),
"Country", [Country], "Nb Customers", [Nb Customers],
"Nb Suppliers", [Nb Suppliers] ) ) )
ORDER BY [Country]
6.3 Key Performance Indicators
expected figures? What are the sales goals for employees? What is the sales
trend? To obtain this information, business users must define a collection of
indicators and monitor them in a timely manner in order to raise alerts when
things deviate from the expected path. For example, they can devise a sales indicator that
shows the sales over the current analysis period (e.g., quarter), and how these
sales figures compare against an expected value or company goal. Indicators
of this kind are called key performance indicators (KPIs).
KPIs are complex measurements used to estimate the effectiveness of an
organization in carrying out its activities and to monitor the performance
of its processes and business strategies. KPIs are traditionally defined with
respect to a business strategy and business objectives, delivering a global
overview of the company status. They are usually included in dashboards
and reports (which will be discussed below), providing a detailed view of each
specific area of the organization. Thus, business users can assess and manage
organizational performance using KPIs. To support decision making, KPIs
typically have a current value which is compared against a target value, a
threshold value and a minimum value. All these values are usually normalized,
to facilitate interpretation.
There have been many proposals for classifying KPIs. The simplest one
is to classify them according to the industry in which they are applied. In
this way, we have, for instance, agriculture KPIs, education and training
KPIs, finance KPIs, and so on. Another simple classification is based on
the functional area where they are applied. Thus, we have accounting KPIs,
corporate services KPIs, finance KPIs, human resources KPIs, and so on.
KPIs can also be classified along the time dimension, depending on whether
they consider the future, the present, or the past.
• Leading KPIs reflect expectations about what can happen in the future.
An example is expected demand.
• Coincident KPIs reflect what is currently happening. An example is num-
ber of current orders.
• Lagging KPIs reflect what happened in the past. Examples include cus-
tomer satisfaction or earnings before interest and taxes.
Another way to classify KPIs refers to whether the indicator measures
characteristics of the input or the output of a process.
• Input KPIs measure resources invested in or used to generate business
results. Examples include headcount or cost per hour.
• Output KPIs reflect the overall results or impact of the business activity
to quantify performance. An example is customer retention.
the attribute for the same period last year. This ratio is then compared
against an expected sales growth value set as a company goal.
• The indicator must be clearly owned by a department or company office,
that is, there must be an individual or a group that is clearly
accountable for keeping the indicator on track.
• The metric must be measurable, which means all elements must be quan-
tifiable.
• The indicator can be produced in a timely manner. To be useful for decision making,
we must be able to produce a KPI at regular predefined intervals, in a
way such that it can be analyzed together with other KPIs in the set of
indicators.
• The indicator must be aligned with the company goals. KPIs should lead
to the achievement of the global company goals. For example, a global
goal can be to grow 10% per year.
• The number of KPIs must remain manageable, and decision makers must
not be overwhelmed by a large number of indicators.
We will apply the above guidelines to define a collection of indicators for
the Northwind case study in Sect. 7.5.
6.4 Dashboards
managers, on the other hand, must achieve daily or weekly performance goals,
and require not only a narrower time frame and kind of data, but also, if cur-
rent rates are off-target, the ability to quickly investigate the amount and
cause of variation of a parameter. Business analysts have a much broader set
of needs. Rather than knowing what they are looking for, they often approach
performance data with ad hoc questions; therefore, they may require a time
frame ranging from just a few hours to many weeks.
In order to design a dashboard that complies with the needs of the intended
audience, the visual elements and interactions must be carefully chosen. Fac-
tors such as placement, attention, cognitive load, and interactivity contribute
greatly to the effectiveness of a dashboard.
A dashboard is meant to be viewed at a glance, so the visual
elements to be shown must be arranged in a display that can be viewed all
at once on a single screen, without having to scroll or navigate through multiple
pages, thus minimizing the effort of viewing information. In addition, important
information must be noticed quickly. From a designer’s viewpoint, it is crucial
to know who the users of the dashboard will be and what their goals are, in
order to determine to which of the categories defined above the dashboard
belongs. This information is typically obtained through user interviews.
To design a dashboard that can be effective and usable for its audience,
we need to choose data visualizations that convey the information clearly, are
easy to interpret, avoid excessive use of space, and are attractive and legible.
For example, dashboards may provide the user with visualizations that allow
data comparison. Line graphs, bar charts, and bullet bars are effective visual
metaphors to use for quick comparisons. Analytical dashboards should pro-
vide interactivity, such as filtering or drill-down exploration. A scatter plot
can provide more detail behind comparisons by showing patterns created by
individual data points.
Operational dashboards should display any variations that would require
action in a way that is quickly and easily noticeable. KPIs are used for effec-
tively showing the comparison and drawing attention to data that indicate
that action is required. A KPI must be set up to show where data falls within
a specified range, so if a value falls below or above a threshold the visual el-
ement utilizes color coding to draw attention to that value. Typically, red
is used to show when performance has fallen below a target, green indicates
good performance, and yellow can be used to show that no action is required.
If multiple KPIs are used in a dashboard, the color coding must be used con-
sistently for the different KPIs, so a user does not have to go through the
extra work of decoding color codes for KPIs that have the same meaning.
For example, we must use the same shade of red for all KPIs on a dashboard
that show if a measure is performing below a threshold.
We must avoid including distracting tools in a dashboard, like motion and
animations. Also, using too many colors, or colors that are too bright, is
distracting and must be avoided. Dashboard visualization should be easy to
interpret and self-explanatory. Thus, only important text (like graph titles,
category labels, or data values) should be placed on the dashboard. While a
dashboard may have a small area, text should not be made so small that it
is difficult to read. A good way to test readability is through test users.
We will apply the above guidelines to define a dashboard for the Northwind
case study in Sect. 7.6.
6.5 Summary
The first part of this chapter was devoted to querying data warehouses. For
this, we used three different languages, MDX, DAX, and SQL. Both MDX
and DAX can be used as an expression language for defining, respectively,
multidimensional and tabular models, as well as a query language for ex-
tracting data from them. In this chapter, we limited ourselves to their use as
a query language and introduced the main functionalities of MDX and DAX
through examples. We continued by studying key performance indicators,
gave a classification of them, and provided guidelines for their definition.
The chapter finished with the study of dashboards. We characterized different
types of dashboards, and gave guidelines for their definition.
6.6 Bibliographic Notes
MDX was first introduced in 1997 by Microsoft as part of the OLE DB for
OLAP specification. After the commercial release of Microsoft OLAP Services
in 1998 and Microsoft Analysis Services in 2000, MDX was adopted by a
wide range of OLAP vendors, both at the server and the client side. The latest
version of the OLE DB for OLAP specification was issued by Microsoft in
1999. In Analysis Services 2005 Microsoft added some MDX extensions like
subqueries. There are many books about MDX; a recent one is [189]. MDX
is also covered, although succinctly, in general books covering OLAP tools,
such as [108].
Self-service business intelligence is an approach to data analytics that en-
ables business users to access and work with corporate information in order to
create personalized reports and analytical queries on their own, without the
involvement of IT specialists. In order to realize this vision, Microsoft intro-
duced the Business Intelligence Semantic Model (BISM), which we presented
in Chap. 3. The BISM supports two models: the traditional multidimensional
model and a new tabular model. The tabular model was designed to be sim-
pler and easier to understand by users familiar with Excel and the relational
data model. In addition, Microsoft has created a new query language to query
the tabular model. This language, called DAX (Data Analysis Expressions),
is a new functional language that is an extension of the formula language in
Excel. In this chapter, we covered DAX, even though, at the time of writing,
it is only supported by Microsoft tools. Books entirely devoted to the tabular
model and DAX are [204, 205], although these topics are also covered in the
book [108] already cited above. Data analysis with Power BI and also with
Excel is covered in [74].
The field of visual analytics has produced a vast amount of research results
(see for example the work by Andrienko et al. [9]). However, most of the books
on KPIs and dashboards are oriented to practitioners. A typical reference on
KPIs is the book by Parmenter [183]. Scientific articles on KPIs are, e.g., [21,
28, 58, 61]. Dashboard design is studied in depth by Few, a specialist in data
visualization [75]. Practical real-world cases are presented in [260]. The design
and implementation of dashboards for Power BI is covered in [170]. Finally,
Microsoft Reporting Services is described (among other books) in [242].
6.7 Review Questions
6.1 Describe what MDX is and what it is used for. Describe the two modes
supported by MDX.
6.2 Define what tuples and sets are in MDX.
6.3 Describe the basic syntax of MDX queries and the various clauses
that compose an MDX query. Which clauses are required and
which are optional?
6.4 Describe conceptually how an MDX query is executed by specifying
the order in which the different clauses composing the query are
evaluated.
6.5 Define the slicing operation in MDX. How does this operation differ
from the filtering operation specified in SQL in the WHERE clause?
6.6 Why is navigation essential for querying multidimensional databases?
Give examples of navigation functions in MDX and exemplify their
use in common queries.
6.7 What is a cross join in MDX? For which purpose is a cross join needed?
Establish similarities and differences between the cross join in MDX
and the various types of join in SQL.
6.8 What is subcubing in MDX? Does subcubing provide additional ex-
pressive power to the language?
6.9 Define calculated members and named sets. What are they needed for?
State the syntax for defining them in an MDX query.
6.10 Why is time series analysis important in many business scenarios?
Give examples of functionality that is provided by MDX for time
series analysis.
6.11 What is filtering and how does this differ from slicing?
6.12 How do you do sorting in MDX? What are the limitations of MDX in
this respect?
6.13 Give examples of MDX functions that are used for top and bottom
analysis. How do they differ from similar functions provided by SQL?
6.14 Describe the main differences between MDX and SQL.
Chapter 7
Data Analysis in the Northwind Data Warehouse
This chapter provides a practical overview of the data analysis topics pre-
sented in Chap. 6, namely, querying, key performance indicators, and dash-
boards. These topics are illustrated in the Northwind case study using several
Microsoft tools: Analysis Services, Reporting Services, and Power BI.
The beginning of the chapter is devoted to the topic of querying data ware-
houses. For this, we revisit the queries already presented in Sect. 4.4, which
address representative data analysis needs. In Sect. 7.1 we query the multidi-
mensional database using MDX, in Sect. 7.2 we query the tabular database
using DAX, and in Sect. 7.3 we query the relational data warehouse using
SQL. This allows us to compare the main features of these languages in
Sect. 7.4. We continue the chapter by defining in Sect. 7.5 a set of key per-
formance indicators for the Northwind case study and implementing them in
the multidimensional and tabular models. Finally, we conclude the chapter
by defining in Sect. 7.6 a dashboard for the Northwind case study.
7.1 Querying the Multidimensional Model in MDX
Query 7.1. Total sales amount per customer, year, and product category.
SELECT [Order Date].Year.CHILDREN ON COLUMNS,
NON EMPTY Customer.[Company Name].CHILDREN *
Product.[Category Name].CHILDREN ON ROWS
FROM Sales
WHERE Measures.[Sales Amount]
Here, we display the years on the column axis and we use a cross join of
the Customer and Category dimensions to display both dimensions in the
row axis. We use the CHILDREN function instead of MEMBERS to prevent
displaying the All members of the three dimensions involved in the query.
The NON EMPTY keyword is used to avoid displaying combinations of customers
and categories without any sales.
Query 7.2. Yearly sales amount for each pair of customer and supplier
countries.
SELECT [Order Date].Year.MEMBERS ON COLUMNS,
NON EMPTY Customer.Country.MEMBERS *
Supplier.Country.MEMBERS ON ROWS
FROM Sales
WHERE Measures.[Sales Amount]
In this query, we use a cross join of the Customer and Supplier dimensions to
display the pair of countries from both dimensions in the row axis.
Query 7.3. Monthly sales by customer state compared to those of the previ-
ous year.
WITH MEMBER Measures.[Previous Year] AS
(Measures.[Sales Amount],
PARALLELPERIOD([Order Date].Calendar.Month,12)),
FORMAT_STRING = '$###,##0.00'
SELECT { Measures.[Sales Amount], Measures.[Previous Year] } ON COLUMNS,
NON EMPTY ORDER(Customer.Geography.State.MEMBERS,
Customer.Geography.CURRENTMEMBER.NAME, BASC) *
[Order Date].Calendar.Month.MEMBERS ON ROWS
FROM Sales
In this query, we do a cross join of the Customer and Order Date dimensions
to display the states and months on the row axis. We use the ORDER function
to sort the states of the customers in alphabetical order irrespective of the
Geography hierarchy. The calculated measure Previous Year computes the sales
amount of the same month of the previous year for the current state and
month using the PARALLELPERIOD function. The format for displaying the
new measure is also defined.
Query 7.4. Monthly sales growth per product, that is, total sales per product
compared to those of the previous month.
WITH MEMBER Measures.[Previous Month] AS
(Measures.[Sales Amount],
[Order Date].Calendar.CURRENTMEMBER.PREVMEMBER),
FORMAT_STRING = '$###,##0.00'
MEMBER Measures.[Sales Growth] AS
(Measures.[Sales Amount]) - (Measures.[Previous Month]),
FORMAT_STRING = '$###,##0.00; $-###,##0.00'
SELECT { Measures.[Sales Amount], Measures.[Previous Month],
Measures.[Sales Growth] } ON COLUMNS,
NON EMPTY ORDER(Product.Categories.Product.MEMBERS,
Product.Categories.CURRENTMEMBER.NAME, BASC) *
[Order Date].Calendar.Month.MEMBERS ON ROWS
FROM Sales
In this query, we do a cross join of the Product and Order Date dimensions
to display the products and months on the row axis. The calculated mea-
sure Previous Month computes the sales amount of the previous month of
the current category and month, while the calculated measure Sales Growth
computes the difference of the sales amount of the current month and the
one of the previous month.
Here, we use the TOPCOUNT function to find the three employees who have
the highest value of the sales amount measure. We use the CHILDREN func-
tion instead of MEMBERS since otherwise the All member will appear in the
first position, as it contains the total sales amount of all employees.
The calculated measure Top Sales computes the maximum value of sales
amount for the current year among all employees. The calculated measure
Top Employee uses the function TOPCOUNT to obtain the tuple composed
of the current year and the employee with highest sales amount. The ITEM
function retrieves the first member of the specified tuple. Since this member
is a combination of year and employee, ITEM is applied again to obtain the
employee. Finally, the NAME function retrieves the name of the employee.
Query 7.7. Countries that account for top 50% of the sales amount.
SELECT Measures.[Sales Amount] ON COLUMNS,
{ Customer.Geography.[All],
TOPPERCENT([Customer].Geography.Country.MEMBERS, 50,
Measures.[Sales Amount]) } ON ROWS
FROM Sales
In this query, we use the TOPPERCENT function for selecting the countries
whose cumulative total is at least the specified percentage. We can see in
the answer below that the sum of the values for the three listed countries
slightly exceeds 50% of the sales amount.
Query 7.8. Total sales and average monthly sales by employee and year.
WITH MEMBER Measures.[Avg Monthly Sales] AS
AVG(DESCENDANTS([Order Date].Calendar.CURRENTMEMBER,
[Order Date].Calendar.Month),Measures.[Sales Amount]),
FORMAT_STRING = '$###,##0.00'
SELECT { Measures.[Sales Amount], Measures.[Avg Monthly Sales] } ON COLUMNS,
Employee.[Full Name].CHILDREN *
[Order Date].Calendar.Year.MEMBERS ON ROWS
FROM Sales
In this query, we cross join the Employee and Order Date dimensions to display
the employee name and the year on the row axis. The calculated measure Avg
Monthly Sales computes the average of sales amount of the current employee
for all months of the current year.
Query 7.9. Total sales amount and discount amount per product and month.
WITH MEMBER Measures.[TotalDisc] AS
Measures.Discount * Measures.Quantity * Measures.[Unit Price],
FORMAT_STRING = '$###,##0.00'
SELECT { Measures.[Sales Amount], [TotalDisc] } ON COLUMNS,
NON EMPTY ORDER(Product.Categories.Product.MEMBERS,
Product.Categories.CURRENTMEMBER.NAME, BASC) *
[Order Date].Calendar.Month.MEMBERS ON ROWS
FROM Sales
In this query, we cross join the Product and Order Date dimensions to display
the product and the month on the row axis. The calculated measure TotalDisc
multiplies the discount, quantity, and unit price measures to compute the
total discount amount of the current product and month.
Query 7.11. Moving average over the last 3 months of the sales amount by
product category.
WITH MEMBER Measures.MovAvg3M AS
AVG([Order Date].Calendar.CURRENTMEMBER.LAG(2):
[Order Date].Calendar.CURRENTMEMBER,
Measures.[Sales Amount]), FORMAT_STRING = '$###,##0.00'
SELECT [Order Date].Calendar.Month.MEMBERS ON COLUMNS,
Product.[Category].MEMBERS ON ROWS
FROM Sales
WHERE (Measures.MovAvg3M)
Here, we use the LAG function and the range operator ‘:’ to construct the set
composed of the current month and its preceding 2 months. Then, we take
the average of the measure Sales Amount over these 3 months.
Query 7.12. Personal sales amount made by an employee compared with the
total sales amount made by herself and her subordinates during 2017.
WITH MEMBER Measures.[Personal Sales] AS
(Employee.Supervision.DATAMEMBER, [Measures].[Sales Amount]),
FORMAT_STRING = '$###,##0.00'
SELECT { Measures.[Personal Sales], Measures.[Sales Amount] } ON COLUMNS,
ORDER(Employee.Supervision.MEMBERS - Employee.Supervision.[All],
Employee.Supervision.CURRENTMEMBER.NAME, BASC) ON ROWS
FROM Sales
WHERE [Order Date].Calendar.Year.[2017]
Query 7.13. Total sales amount, number of products, and sum of the quan-
tities sold for each order.
WITH MEMBER Measures.[NbProducts] AS
COUNT(NONEMPTY([Order].[Order No].CURRENTMEMBER *
[Order].[Order Line].MEMBERS))
SELECT { Measures.[Sales Amount], NbProducts, Quantity } ON COLUMNS,
[Order].[Order No].CHILDREN ON ROWS
FROM Sales
In this query, we use the fact (or degenerate) dimension Order, which is
defined from the fact table Sales in the data warehouse. The dimension has
two attributes, the order number and the order line, and the order number is
displayed on the rows axis. In the calculated measure NbProducts, a cross join
is used to obtain the order lines associated to the current order. By counting
the elements in this set, we can obtain the number of distinct products of
the order. Finally, the measures Sales Amount, NbProducts, and Quantity are
displayed on the column axis.
Query 7.14. For each month, total number of orders, total sales amount,
and average sales amount by order.
WITH MEMBER Measures.AvgSales AS
Measures.[Sales Amount]/Measures.[Order No],
FORMAT_STRING = '$###,##0.00'
SELECT { Measures.[Order No], [Sales Amount], AvgSales } ON COLUMNS,
NON EMPTY [Order Date].Calendar.Month.MEMBERS ON ROWS
FROM Sales
This query displays the months of the Order Date dimension on the row axis,
and the measures Order No, Sales Amount, and AvgSales on the column axis,
the latter being a calculated measure. For Sales Amount, the roll-up operation
computes the sum of the values in a month. For the Order No measure, since
in the cube definition the aggregate function associated to the measure is
DistinctCount, the roll-up operation computes the number of orders within
a month. Notice that for computing the average in the calculated measure
AvgSales we divided the two measures Sales Amount and Order No. If we used
instead AVG(Measures.[Sales Amount]), the result obtained would correspond to
the Sales Amount. Indeed, the average would be applied to a set containing as
its only element the measure of the current month.
Query 7.15. For each employee, total sales amount, number of cities, and
number of states to which she is assigned.
WITH MEMBER NoCities AS
Measures.[Territories Count]
MEMBER NoStates AS
DISTINCTCOUNT(Employee.[Full Name].CURRENTMEMBER *
City.Geography.State.MEMBERS)
SELECT { Measures.[Sales Amount], Measures.NoCities, Measures.NoStates }
ON COLUMNS, Employee.[Full Name].CHILDREN ON ROWS
FROM Sales
7.2 Querying the Tabular Model in DAX
Query 7.1. Total sales amount per customer, year, and product category.
DEFINE
MEASURE Sales[Sales Amount] = SUM( [SalesAmount] )
EVALUATE
SUMMARIZECOLUMNS(
Customer[CompanyName], 'Date'[Year], Product[CategoryName],
"Sales Amount", [Sales Amount] )
ORDER BY [CompanyName], [Year], [CategoryName]
Query 7.2. Yearly sales amount for each pair of customer and supplier coun-
tries.
DEFINE
MEASURE Sales[Sales Amount] = SUM( [SalesAmount] )
EVALUATE
SUMMARIZECOLUMNS(
Customer[Country], Supplier[Country], 'Date'[Year],
"Sales Amount", [Sales Amount] )
ORDER BY Customer[Country], Supplier[Country], [Year]
In this query, we define the measure Sales Amount as before and aggregate it
for each pair of customer and supplier countries and per year.
Query 7.3. Monthly sales by customer state compared to those of the previ-
ous year.
DEFINE
MEASURE Sales[Sales Amount] = SUM( [SalesAmount] )
MEASURE Sales[Sales Amount PY] =
CALCULATE( [Sales Amount], DATEADD('Date'[Date], -1, YEAR ) )
EVALUATE
SUMMARIZECOLUMNS(
Customer[State], 'Date'[MonthName], 'Date'[Year], 'Date'[MonthNumber],
"Sales Amount", [Sales Amount], "Sales Amount PY", [Sales Amount PY] )
ORDER BY [State], [Year], [MonthNumber]
In this query, we define the measure Sales Amount as before and we define
the measure Sales Amount PY by using the DATEADD function. We then
aggregate both measures by customer state and month. The above query
requires that the data model defines the Date table as the reference table for
time-intelligence calculations. Alternatively, we can define the measure using
standard DAX functions as shown next.
MEASURE Sales[Sales Amount PY] =
CALCULATE( [Sales Amount],
FILTER( ALL( 'Date' ),
'Date'[MonthNumber] = MAX( 'Date'[MonthNumber] ) &&
'Date'[Year] = MAX( 'Date'[Year] ) - 1 ) )
Here, we define the measure by filtering all the dates up to those in the
current month in the previous year.
Query 7.4. Monthly sales growth per product, that is, total sales per product
compared to those of the previous month.
DEFINE
MEASURE Sales[Sales Amount] = SUM( [SalesAmount] )
MEASURE Sales[Sales Amount PM] =
CALCULATE( [Sales Amount], DATEADD('Date'[Date], -1, MONTH ) )
MEASURE Sales[Sales Growth] = [Sales Amount] - [Sales Amount PM]
EVALUATE
SUMMARIZECOLUMNS(
Product[ProductName], 'Date'[MonthName], 'Date'[Year], 'Date'[MonthNumber],
"Sales Amount", [Sales Amount], "Sales Amount PM", [Sales Amount PM],
"Sales Growth", [Sales Growth] )
ORDER BY [ProductName], [Year], [MonthNumber]
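Query 7.5 (finding the three employees with the highest sales amount) is discussed next; a sketch of the corresponding DAX query (the original formulation may differ) is the following.
DEFINE
MEASURE Sales[Sales Amount] = SUM( [SalesAmount] )
EVALUATE
TOPN( 3,
    SUMMARIZECOLUMNS( Employee[EmployeeName], "Sales Amount", [Sales Amount] ),
    [Sales Amount], DESC )
ORDER BY [Sales Amount] DESC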
Here, we use the TOPN function to find the three employees who have the
highest value of the Sales Amount measure. Recall that in Sect. 5.9.2 we
defined the calculated column EmployeeName as the concatenation of the
first name and last name of employees.
The Employee Rank measure uses the RANKX function to rank the em-
ployees according to the Sales Amount measure. In the function SUMMA-
RIZECOLUMNS, the column Top Sales is defined, which has values only for
those employees who are ranked first. Finally, the FILTER function keeps the
rows having a non-blank Top Sales column.
Query 7.7. Countries that account for top 50% of the sales amount.
DEFINE
MEASURE Sales[Sales Amount] = SUM( [SalesAmount] )
MEASURE Sales[Country Sales] =
CALCULATE( [Sales Amount], ALL( Customer[Country] ) )
MEASURE Sales[Perc Sales] = [Sales Amount] / [Country Sales]
MEASURE Sales[Cumul Sales] =
CALCULATE( [Sales Amount],
FILTER(
ADDCOLUMNS(
ALL( Customer[Country] ), "Sales", [Sales Amount] ),
[Sales Amount] >=
MINX( VALUES ( Customer[Country] ), [Sales Amount] ) ) )
MEASURE Sales[Cumul Perc] = [Cumul Sales] / [Sales Amount]
VAR Total =
SUMMARIZECOLUMNS(
Customer[Country], "Sales Amount", [Sales Amount],
"Perc Sales", [Perc Sales], "Cumul Sales", [Cumul Sales],
"Cumul Perc", [Cumul Perc] )
EVALUATE
FILTER( Total,
RANKX( Total, [Cumul Sales], , ASC ) <=
COUNTROWS( FILTER( Total, [Cumul Perc] < 0.5 ) ) + 1 )
ORDER BY [Cumul Perc]
Query 7.8. Total sales and average monthly sales by employee and year.
DEFINE
MEASURE Sales[Sales Amount] = SUM( [SalesAmount] )
MEASURE Sales[Avg Monthly Sales] = [Sales Amount] /
COUNTROWS(
FILTER(
SUMMARIZE(
FILTER( ALL( 'Date' ), 'Date'[Year] = MAX( 'Date'[Year] ) ),
'Date'[MonthNumber], "Monthly Sales", [Sales Amount] ),
[Sales Amount] > 0 ) )
EVALUATE
SUMMARIZECOLUMNS(
Employee[EmployeeName], 'Date'[Year], "Sales Amount", [Sales Amount],
"Avg Monthly Sales", [Avg Monthly Sales] )
ORDER BY [EmployeeName], [Year]
Query 7.9. Total sales amount and discount amount per product and month.
DEFINE
MEASURE Sales[Sales Amount] = SUM( [SalesAmount] )
MEASURE Sales[Total Discount] =
SUMX( Sales, Sales[Quantity] * Sales[Discount] * Sales[UnitPrice] )
EVALUATE
SUMMARIZECOLUMNS(
'Product'[ProductName], 'Date'[MonthName], 'Date'[Year], 'Date'[MonthNumber],
"Sales Amount", [Sales Amount], "Total Discount", [Total Discount] )
ORDER BY [ProductName], [Year], [MonthNumber]
Query 7.11. Moving average over the last 3 months of the sales amount by
product category.
DEFINE
MEASURE Sales[Sales Amount] = SUM( [SalesAmount] )
MEASURE Sales[MovAvg3M] =
CALCULATE(
AVERAGEX( VALUES ( 'Date'[YearMonth] ), [Sales Amount] ),
DATESINPERIOD( 'Date'[Date], MAX( 'Date'[Date] ), -3, MONTH ) )
EVALUATE
SUMMARIZECOLUMNS(
Product[CategoryName], 'Date'[MonthName], 'Date'[Year], 'Date'[MonthNumber],
"Sales Amount", [Sales Amount], "MovAvg3M", [MovAvg3M] )
ORDER BY [CategoryName], [Year], [MonthNumber]
Here, we use the DATESINPERIOD function to select all the dates three
months before the current one. These will be used as context for the CALCU-
LATE function. The AVERAGEX function computes the Sales Amount measure
for each YearMonth value (in the context of the measure it is at most three
values), and these are then averaged.
Query 7.12. Personal sales amount made by an employee compared with the
total sales amount made by herself and her subordinates during 2017.
DEFINE
MEASURE Sales[Sales Amount] = SUM( [SalesAmount] )
MEASURE Sales[Subordinates Sales] =
CALCULATE( [Sales Amount],
FILTER( ALL( Employee ),
PATHCONTAINS( Employee[SupervisionPath],
SELECTEDVALUE( Employee[EmployeeKey] ) ) ) )
EVALUATE
SUMMARIZECOLUMNS(
Employee[EmployeeName], FILTER( 'Date', 'Date'[Year] = 2017 ),
"Personal Amount", [Sales Amount], "Subordinates Amount", [Subordinates Sales] )
ORDER BY [EmployeeName]
Query 7.13. Total sales amount, number of products, and sum of the quan-
tities sold for each order.
DEFINE
MEASURE Sales[Sales Amount] = SUM( [SalesAmount] )
MEASURE Sales[NbProducts] = COUNTROWS( VALUES ( Sales[ProductKey] ) )
MEASURE Sales[Quantity] = SUM( Sales[Quantity] )
EVALUATE
SUMMARIZECOLUMNS(
Sales[OrderNo], "Sales Amount", [Sales Amount], "NbProducts", [NbProducts],
"Quantity", [Quantity] )
ORDER BY [OrderNo]
This query addresses the fact dimension Order, which is defined from the fact
table Sales. In the measure NbProducts we use the function COUNTROWS to
obtain the number of distinct products of the order. The measure Quantity
aggregates the quantity values for the products. Finally, the main query shows
the requested measures.
Query 7.14. For each month, total number of orders, total sales amount,
and average sales amount by order.
DEFINE
MEASURE Sales[Sales Amount] = SUM( [SalesAmount] )
MEASURE Sales[Nb Orders] = COUNTROWS( SUMMARIZE( Sales, Sales[OrderNo] ) )
MEASURE Sales[AvgSales] = DIVIDE ( [Sales Amount], [Nb Orders] )
EVALUATE
SUMMARIZECOLUMNS(
'Date'[MonthName], 'Date'[Year], 'Date'[MonthNumber], "Nb Orders", [Nb Orders],
"Sales Amount", [Sales Amount], "AvgSales", [AvgSales] )
ORDER BY [Year], [MonthNumber]
The measure Nb Orders computes the number of orders while the measure
AvgSales divides the measure Sales Amount by the previous measure.
Query 7.15. For each employee, total sales amount, number of cities, and
number of states to which she is assigned.
DEFINE
MEASURE Sales[Sales Amount] = SUM( [SalesAmount] )
MEASURE Sales[NoCities] =
CALCULATE( COUNTROWS( Territories ),
CROSSFILTER( Territories[EmployeeKey], Employee[EmployeeKey], BOTH ) )
MEASURE Sales[NoStates] =
CALCULATE( COUNTROWS( VALUES( EmployeeGeography[State] ) ),
CROSSFILTER( Territories[EmployeeKey], Employee[EmployeeKey], BOTH ) )
EVALUATE
FILTER(
SUMMARIZECOLUMNS(
Employee[FirstName], Employee[LastName], "Sales Amount", [Sales Amount],
"NoCities", [NoCities], "NoStates", [NoStates] ),
[FirstName] <> BLANK() )
ORDER BY [FirstName], [LastName]
7.3 Querying the Relational Data Warehouse in SQL
Given the schema of the Northwind data warehouse in Fig. 5.4, we revisit
the queries of the previous sections in SQL. This is of particular importance
because some OLAP tools automatically translate MDX or DAX queries into
SQL and send them to a relational server.
To simplify month-related computations, we create the next two views.
CREATE VIEW YearMonth AS
SELECT DISTINCT Year, MonthNumber, MonthName
FROM Date
CREATE VIEW YearMonthPM AS
WITH YearMonthPrevMonth AS (
SELECT Year, MonthNumber, MonthName,
LAG(Year * 100 + MonthNumber) OVER (ORDER BY
Year * 100 + MonthNumber) AS PrevMonth
FROM YearMonth )
SELECT Year, MonthNumber, MonthName, PrevMonth / 100 AS PM_Year,
PrevMonth % 100 AS PM_MonthNumber
FROM YearMonthPrevMonth
View YearMonth contains all the years and months from the Date dimen-
sion. View YearMonthPM associates with each year and month in the Date
dimension the previous month. For this, table YearMonth-
PrevMonth uses the LAG window function to associate each month with the
previous one represented by a numerical expression in the format YYYYMM.
For example, the expression associated with January 2017 would be 201612.
Finally, the main query splits this expression into the year and the month.
Query 7.1. Total sales amount per customer, year, and product category.
SELECT C.CompanyName, D.Year, A.CategoryName,
SUM(SalesAmount) AS SalesAmount
FROM Sales S, Customer C, Date D, Product P, Category A
WHERE S.CustomerKey = C.CustomerKey AND S.OrderDateKey = D.DateKey AND
S.ProductKey = P.ProductKey AND P.CategoryKey = A.CategoryKey
GROUP BY C.CompanyName, D.Year, A.CategoryName
ORDER BY C.CompanyName, D.Year, A.CategoryName
Here, we join the fact table with the involved dimension tables and aggregate
the results by company, year, and category.
Query 7.2. Yearly sales amount for each pair of customer and supplier
countries.
SELECT CO.CountryName AS CustomerCountry, SO.CountryName AS SupplierCountry,
D.Year, SUM(SalesAmount) AS SalesAmount
FROM Sales F, Customer C, City CC, State CS, Country CO,
Supplier S, City SC, State SS, Country SO, Date D
WHERE F.CustomerKey = C.CustomerKey AND C.CityKey = CC.CityKey AND
CC.StateKey = CS.StateKey AND CS.CountryKey = CO.CountryKey AND
F.SupplierKey = S.SupplierKey AND S.CityKey = SC.CityKey AND
SC.StateKey = SS.StateKey AND SS.CountryKey = SO.CountryKey AND
F.OrderDateKey = D.DateKey
GROUP BY CO.CountryName, SO.CountryName, D.Year
ORDER BY CO.CountryName, SO.CountryName, D.Year
Here, the tables of the geography dimension are joined twice with the fact
table to obtain the countries of the customer and the supplier.
Query 7.3. Monthly sales by customer state compared to those of the previ-
ous year.
WITH StateMonth AS (
SELECT DISTINCT StateName, Year, MonthNumber, MonthName
FROM Customer C, City Y, State S, YearMonth M
WHERE C.CityKey = Y.CityKey AND Y.StateKey = S.StateKey ),
SalesStateMonth AS (
SELECT StateName, Year, MonthNumber,
SUM(SalesAmount) AS SalesAmount
FROM Sales F, Customer C, City Y, State S, Date D
WHERE F.CustomerKey = C.CustomerKey AND
C.CityKey = Y.CityKey AND Y.StateKey = S.StateKey AND
F.OrderDateKey = D.DateKey
GROUP BY S.StateName, D.Year, D.MonthNumber )
SELECT S.StateName, S.MonthName, S.Year, M1.SalesAmount,
M2.SalesAmount AS SalesAmountPY
FROM StateMonth S LEFT OUTER JOIN SalesStateMonth M1 ON
S.StateName = M1.StateName AND
S.Year = M1.Year AND
S.MonthNumber = M1.MonthNumber
LEFT OUTER JOIN SalesStateMonth M2 ON
S.StateName = M2.StateName AND
S.Year - 1 = M2.Year AND
S.MonthNumber = M2.MonthNumber
WHERE M1.SalesAmount IS NOT NULL OR M2.SalesAmount IS NOT NULL
ORDER BY S.StateName, S.Year, S.MonthNumber
The query starts by defining a table StateMonth that computes the Cartesian
product of all the customer states and all the months in the view YearMonth. Then,
table SalesStateMonth computes the monthly sales by state. Finally, the main
query performs twice a left outer join of the table StateMonth with the table
SalesStateMonth to compute the result.
Query 7.4. Monthly sales growth per product, that is, total sales per product
compared to those of the previous month.
WITH ProdYearMonthPM AS (
SELECT DISTINCT ProductName, Year, MonthNumber, MonthName,
PM_Year, PM_MonthNumber
FROM Product, YearMonthPM ),
SalesProdMonth AS (
SELECT ProductName, Year, MonthNumber,
SUM(SalesAmount) AS SalesAmount
FROM Sales S, Product P, Date D
WHERE S.ProductKey = P.ProductKey AND S.OrderDateKey = D.DateKey
GROUP BY ProductName, Year, MonthNumber )
SELECT P.ProductName, P.MonthName, P.Year, S1.SalesAmount,
S2.SalesAmount AS SalesPrevMonth, COALESCE(S1.SalesAmount,0) -
COALESCE(S2.SalesAmount,0) AS SalesGrowth
We group the sales by employee and apply the SUM aggregation to each
group. The result is then sorted in descending order of the aggregated sales
and the TOP function is used to obtain the first three tuples.
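A minimal sketch of such a query, using SQL Server's TOP clause over the Sales
and Employee tables of the Northwind data warehouse, could be written as follows:
SELECT TOP(3) FirstName + ' ' + LastName AS EmployeeName,
SUM(SalesAmount) AS SalesAmount
FROM Sales S, Employee E
WHERE S.EmployeeKey = E.EmployeeKey
GROUP BY FirstName, LastName
ORDER BY SUM(SalesAmount) DESC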
The table SalesProdYearEmp computes the yearly sales by product and em-
ployee. In the query, we select the tuples of this table such that the total sales
equals the maximum total sales for the product and the year.
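A possible sketch of this query, defining the table SalesProdYearEmp mentioned
above and then selecting, for each product and year, the employee with the
maximum total sales, is the following:
WITH SalesProdYearEmp AS (
SELECT P.ProductName, D.Year,
E.FirstName + ' ' + E.LastName AS EmployeeName,
SUM(SalesAmount) AS SalesAmount
FROM Sales S, Employee E, Product P, Date D
WHERE S.EmployeeKey = E.EmployeeKey AND S.ProductKey = P.ProductKey AND
S.OrderDateKey = D.DateKey
GROUP BY P.ProductName, D.Year, E.FirstName, E.LastName )
SELECT ProductName, Year, EmployeeName, SalesAmount
FROM SalesProdYearEmp S1
WHERE SalesAmount = (
SELECT MAX(SalesAmount)
FROM SalesProdYearEmp S2
WHERE S1.ProductName = S2.ProductName AND S1.Year = S2.Year )
ORDER BY ProductName, Year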
Query 7.7. Countries that account for top 50% of the sales amount.
WITH TotalSales AS (
SELECT SUM(SalesAmount) AS SalesAmount
FROM Sales),
SalesCountry AS (
SELECT CountryName, SUM(SalesAmount) AS SalesAmount
FROM Sales S, Customer C, City Y, State T, Country O
WHERE S.CustomerKey = C.CustomerKey AND
C.CityKey = Y.CityKey AND Y.StateKey = T.StateKey AND
T.CountryKey = O.CountryKey
GROUP BY CountryName ),
CumulSalesCountry AS (
SELECT S.*, SUM(SalesAmount) OVER (ORDER BY SalesAmount DESC
ROWS UNBOUNDED PRECEDING) AS CumulSales
FROM SalesCountry S )
SELECT C.CountryName, C.SalesAmount, C.SalesAmount / T.SalesAmount AS
PercSales, C.CumulSales, C.CumulSales / T.SalesAmount AS CumulPerc
FROM CumulSalesCountry C, TotalSales T
WHERE CumulSales <= (
SELECT MIN(CumulSales) FROM CumulSalesCountry
WHERE CumulSales >= (
SELECT 0.5 * SUM(SalesAmount) FROM SalesCountry ) )
ORDER BY SalesAmount DESC
The table SalesCountry aggregates the sales by country. In the table Cumul-
SalesCountry, for each row in the previous table, we define a window contain-
ing all the rows sorted in decreasing value of sales amount, and compute the
sum of the current row and all the preceding rows in the window. Finally, in
the main query, we select the countries in CumulSalesCountry whose
cumulative sales amount is less than or equal to the minimum cumulative value
that is greater than or equal to 50% of the total sales amount.
Query 7.8. Total sales and average monthly sales by employee and year.
WITH MonthlySalesEmp AS (
SELECT E.FirstName + ' ' + E.LastName AS EmployeeName,
D.Year, D.MonthNumber, SUM(SalesAmount) AS SalesAmount
FROM Sales S, Employee E, Date D
WHERE S.EmployeeKey = E.EmployeeKey AND
S.OrderDateKey = D.DateKey
GROUP BY E.FirstName, E.LastName, D.Year, D.MonthNumber )
SELECT EmployeeName, Year, SUM(SalesAmount) AS SalesAmount,
AVG(SalesAmount) AS AvgMonthlySales
FROM MonthlySalesEmp
GROUP BY EmployeeName, Year
ORDER BY EmployeeName, Year
Query 7.9. Total sales amount and discount amount per product and month.
SELECT P.ProductName, D.Year, D.MonthNumber,
SUM(S.SalesAmount) AS SalesAmount,
SUM(S.UnitPrice * S.Quantity * S.Discount) AS TotalDisc
FROM Sales S, Date D, Product P
WHERE S.OrderDateKey = D.DateKey AND S.ProductKey = P.ProductKey
GROUP BY P.ProductName, D.Year, D.MonthNumber
ORDER BY P.ProductName, D.Year, D.MonthNumber
Here, we group the sales by product and month. Then, the SUM aggregation
function is used to obtain the total sales and the total discount amount.
This query follows an approach similar to Query 7.4 so that the year-to-date
aggregation is computed over all the months, including those for which there
are no sales for a category. Table SalesCategoryMonth aggregates the sales
amount by category and month. Table CategorySales collects all categories in
the previous table. Table CategoryMonth computes the Cartesian product of
the previous table with the YearMonth view defined at the beginning of this
section containing all the months in the Date dimension. In the main query, we
perform a left outer join of the previous table with SalesCategoryMonth and,
for each row in the result, define a window containing all the rows with the
same category and year, sort the rows in the window by month, and compute
the sum of the current row and all the preceding rows in the window.
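A sketch of such a query, using the temporary table names mentioned in this
description together with the YearMonth view (the alias YTDSalesAmount is ours),
could be:
WITH SalesCategoryMonth AS (
SELECT CategoryName, Year, MonthNumber,
SUM(SalesAmount) AS SalesAmount
FROM Sales S, Product P, Category C, Date D
WHERE S.ProductKey = P.ProductKey AND P.CategoryKey = C.CategoryKey AND
S.OrderDateKey = D.DateKey
GROUP BY CategoryName, Year, MonthNumber ),
CategorySales AS (
SELECT DISTINCT CategoryName
FROM SalesCategoryMonth ),
CategoryMonth AS (
SELECT *
FROM CategorySales, YearMonth )
SELECT C.CategoryName, C.MonthName, C.Year, S.SalesAmount,
SUM(S.SalesAmount) OVER (PARTITION BY C.CategoryName, C.Year
ORDER BY C.MonthNumber ROWS UNBOUNDED PRECEDING) AS YTDSalesAmount
FROM CategoryMonth C LEFT OUTER JOIN SalesCategoryMonth S ON
C.CategoryName = S.CategoryName AND C.Year = S.Year AND
C.MonthNumber = S.MonthNumber
ORDER BY C.CategoryName, C.Year, C.MonthNumber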
Query 7.11. Moving average over the last 3 months of the sales amount by
product category.
WITH SalesCategoryMonth AS ( ... ),
CategorySales AS ( ... ),
CategoryMonth AS ( ... )
SELECT C.CategoryName, C.MonthName, C.Year, SalesAmount,
AVG(SalesAmount) OVER (PARTITION BY C.CategoryName ORDER BY
C.Year, C.MonthNumber ROWS 2 PRECEDING) AS MovAvg3M
FROM CategoryMonth C LEFT OUTER JOIN SalesCategoryMonth S ON
C.CategoryName = S.CategoryName AND C.Year = S.Year AND
C.MonthNumber = S.MonthNumber
ORDER BY C.CategoryName, C.Year, C.MonthNumber
This query is a variation of the previous one and it defines the same temporary
tables, which are not repeated here. In the query, we perform a left outer join
of the CategoryMonth table with the SalesCategoryMonth table and, for each
row of the result, we define a window containing all the tuples with the same
category, order the tuples in the window by year and month, and compute
the average of the current row and the two preceding ones.
Query 7.12. Personal sales amount made by an employee compared with the
total sales amount made by herself and her subordinates during 2017.
WITH Supervision AS (
SELECT EmployeeKey, SupervisorKey
FROM Employee
WHERE SupervisorKey IS NOT NULL
UNION ALL
SELECT E.EmployeeKey, S.SupervisorKey
FROM Supervision S, Employee E
WHERE S.EmployeeKey = E.SupervisorKey ),
SalesEmp2017 AS (
SELECT EmployeeKey, SUM(S.SalesAmount) AS PersonalSales
FROM Sales S, Date D
WHERE S.OrderDateKey = D.DateKey AND D.Year = 2017
GROUP BY EmployeeKey ),
SalesSubord2017 AS (
SELECT SupervisorKey AS EmployeeKey,
SUM(S.SalesAmount) AS SubordinateSales
FROM Sales S, Supervision U, Date D
WHERE S.EmployeeKey = U.EmployeeKey AND
S.OrderDateKey = D.DateKey AND D.Year = 2017
GROUP BY SupervisorKey )
SELECT FirstName + ' ' + LastName AS EmployeeName, S1.PersonalSales,
COALESCE(S1.PersonalSales + S2.SubordinateSales, S1.PersonalSales)
AS PersSubordSales
FROM Employee E JOIN SalesEmp2017 S1 ON E.EmployeeKey = S1.EmployeeKey
LEFT OUTER JOIN SalesSubord2017 S2 ON
S1.EmployeeKey = S2.EmployeeKey
ORDER BY EmployeeName
Query 7.13. Total sales amount, number of products, and sum of the quan-
tities sold for each order.
SELECT OrderNo, SUM(SalesAmount) AS SalesAmount,
MAX(OrderLineNo) AS NbProducts, SUM(Quantity) AS Quantity
FROM Sales
GROUP BY OrderNo
ORDER BY OrderNo
Recall that the Sales fact table contains both the order number and the order
line number, which constitute a fact dimension. In the query, we group the
sales by order number, and then we apply the SUM and MAX aggregation
functions to obtain the requested values.
Query 7.14. For each month, total number of orders, total sales amount,
and average sales amount by order.
WITH OrderAgg AS (
SELECT OrderNo, OrderDateKey, SUM(SalesAmount) AS SalesAmount
FROM Sales
GROUP BY OrderNo, OrderDateKey )
SELECT Year, MonthNumber, COUNT(OrderNo) AS NoOrders,
SUM(SalesAmount) AS SalesAmount, AVG(SalesAmount) AS AvgAmount
FROM OrderAgg O, Date D
WHERE O.OrderDateKey = D.DateKey
GROUP BY Year, MonthNumber
ORDER BY Year, MonthNumber
Table OrderAgg computes the sales amount of each order. Note that we also
need to keep the key of the time dimension, which will be used in the main
query to join the fact table and the time dimension table. Then, grouping
the tuples by year and month, we compute the aggregated values requested.
Query 7.15. For each employee, total sales amount, number of cities, and
number of states to which she is assigned.
SELECT FirstName + ' ' + LastName AS EmployeeName,
SUM(SalesAmount) / COUNT(DISTINCT CityName) AS TotalSales,
COUNT(DISTINCT CityName) AS NoCities,
COUNT(DISTINCT StateName) AS NoStates
FROM Sales F, Employee E, Territories T, City C, State S
WHERE F.EmployeeKey = E.EmployeeKey AND E.EmployeeKey = T.EmployeeKey AND
T.CityKey = C.CityKey AND C.StateKey = S.StateKey
GROUP BY FirstName, LastName
ORDER BY EmployeeName
We can see that the sum of the SalesAmount measure is multiplied by the
percentage to account for the double-counting problem.
7.4 Comparison of MDX, DAX, and SQL
In the preceding sections, we used MDX, DAX, and SQL for querying the
Northwind data warehouse. In this section, we compare these languages.
At first glance, the syntax of the three languages looks similar. Also, their
functionality is similar since, indeed, we expressed the same set of queries in
the three languages. However, there are some fundamental differences be-
tween SQL, MDX, and DAX, which we discuss next.
The main difference between SQL, MDX, and DAX is the ability of MDX
to reference multiple dimensions. Although it is possible to use SQL exclu-
sively to query cubes, MDX provides commands that are designed specifically
to retrieve multidimensional data with almost any number of dimensions. Since
DAX is based on the tabular model, it is naturally harder for it to handle
multiple dimensions and hierarchies, as we have discussed in Chap. 6. In some
sense this is similar to SQL, given that the latter refers to only two dimensions,
columns and rows. Nevertheless, this fundamental difference between
the three languages somewhat fades in practice, since most OLAP tools have
difficulty displaying a result set with more than two dimensions (although, of
course, different line formats, colors, etc., are used for this). We have seen
in our example queries the use of the cross join operator to combine several
dimensions in one axis when we needed to analyze measures across more than
two dimensions in MDX, for instance. In SQL, the SELECT clause is used to
define the column layout for a query. However, in MDX the SELECT clause
is used to define several axis dimensions. The approach in DAX is different,
since there is no SELECT clause, of course.
In SQL, the WHERE clause is used to filter the data returned by a query,
whereas in MDX, the WHERE clause is used to provide a slice of the data
returned by a query. While the two concepts are similar, they are not equiva-
lent. In an SQL query, the WHERE clause contains an arbitrary list of conditions
over attributes which may or may not be returned in the result set, in order to narrow down
the scope of the data that are retrieved. In MDX, however, the concept of a
slice implies a reduction in the number of dimensions, and thus, each mem-
ber in the WHERE clause must identify a distinct portion of data from a
different dimension. Furthermore, unlike in SQL, the WHERE clause in MDX
cannot filter what is returned on an axis of a query. To filter what appears
on an axis of a query, we can use functions such as FILTER, NONEMPTY,
and TOPCOUNT. This is also the approach in DAX, where the values that
we do not want to display are filtered out using the FILTER function; DAX has
no WHERE clause whatsoever, nor any notion of slicing.
We next compare the queries over the Northwind data warehouse as ex-
pressed in MDX in Sect. 7.1, DAX in Sect. 7.2, and SQL in Sect. 7.3.
Consider Query 7.1. A first observation is that, in SQL, the joins between
tables must be explicitly indicated in the query, whereas they are implicit in
MDX and DAX. Also, in SQL, an inner join will remove empty combinations,
whereas in MDX and DAX, NON EMPTY and ISBLANK, respectively, must
be specified to achieve this. On the other hand, outer joins are needed in SQL
if we want to show empty combinations.
Furthermore, in SQL the aggregations needed for the roll-up operations
must be explicitly stated through the GROUP BY and the aggregation func-
tions in the SELECT clause, while in MDX the aggregation functions are
stated in the cube definition and they are automatically performed upon
roll-up operations, and in DAX this is done when the measures are computed
with the MEASURE statement. Finally, in SQL the display format must be
stated in the query, while in MDX this is stated in the cube definition, and
in DAX in the model definition.
Consider now the comparison of measures of the current period with re-
spect to those of a previous period, such as the previous month or the same
month in the previous year. An example is given in Query 7.3. In MDX this
can be done with calculated members using the WITH MEMBER clause. In
SQL a temporary table is defined in the WITH clause in which the aggrega-
tions needed for the roll-up operation are performed for each period, and an
outer join is needed in the main query to obtain the measure of the current
period together with that of a previous period. Nevertheless, as shown in
Query 7.4, obtaining the previous month in SQL is somewhat complex since
we must account for two cases depending on whether the previous month is in
the same year or in the previous year. In this sense, the approach of DAX is
similar to that of MDX, and the MEASURE statement computes the sales in
the previous period (like MDX does using WITH). Therefore, in both DAX
and MDX this computation is simpler and more flexible, although it
requires more semantics to be included in the query languages, which makes
them more cryptic for non-expert users.
query handles the problem naturally through the use of the bridge table when
the cube was created. DAX in this case requires extra handling because of
the limitations imposed by the directionality of the relationships between the
employees and the territories they are assigned to.
To conclude, Table 7.1 summarizes some of the advantages and disadvan-
tages of the three languages.
numeric values from a cube. The expressions for the status and the trend
should return a value between −1 and 1. This value is used to display a
graphical indicator of the KPI property. The weight controls the contribution
of the KPI to its parent KPI, if it has one. Analysis Services creates hidden
calculated members on the Measures dimension for each of the KPI properties
above; although hidden, these members can still be used in an MDX expression.
We show next how to implement the Sales performance KPI defined in the
previous section. Recall that we want to monitor the monthly sales amount
with respect to the goal of achieving 15% growth year over year. Let us now
give more detail about what the users want. The performance is considered
satisfactory if the actual sales amount is at least 95% of the goal. If the sales
amount is within 85–95% of the goal, there should be an alert. If the sales
amount drops under 85% of the goal, immediate actions are needed to change
the trend. These alerts and calls to action are commonly associated with the
use of KPIs. We are also interested in the trends associated with the sales
amount; if the sales amount is 20% higher than expected, this is great news
and should be highlighted. Similarly, if the sales amount is 20% lower than
expected, then we have to deal immediately with the situation.
The MDX query that computes the goal of the KPI is given next:
WITH MEMBER Measures.SalesPerformanceGoal AS
CASE
WHEN ISEMPTY(PARALLELPERIOD([Order Date].Calendar.Month, 12,
[Order Date].Calendar.CurrentMember))
THEN Measures.[Sales Amount]
ELSE 1.15 * ( Measures.[Sales Amount],
PARALLELPERIOD([Order Date].Calendar.Month, 12,
[Order Date].Calendar.CurrentMember) )
END, FORMAT_STRING = '$###,##0.00'
SELECT { [Sales Amount], SalesPerformanceGoal } ON COLUMNS,
[Order Date].Calendar.Month.MEMBERS ON ROWS
FROM Sales
The CASE statement sets the goal to the actual monthly sales if the corre-
sponding month of the previous year is not included in the time frame of the
cube. Since the sales in the Northwind data warehouse started in July 2016,
this means that the goal until June 2017 is set to the actual sales.
We can use Visual Studio to define the above KPI, which we name Sales
Performance. For this, we need to provide MDX expressions for each of the
above properties as follows:
• Value: The measure defining the KPI is [Measures].[Sales Amount].
• Goal: The goal to increase 15% over last year’s sales amount is given by
FORMAT(CASE ... END, '$###,##0.00')
between −1 and 1. The KPI browser displays a red traffic light when the
status is −1, a yellow traffic light when the status is 0, and a green traffic
light when the status is 1. The MDX expression is given next:
CASE
WHEN KpiValue("Sales Performance") / KpiGoal("Sales Performance") >= 0.95
THEN 1
WHEN KpiValue("Sales Performance") / KpiGoal("Sales Performance") < 0.85
THEN -1
ELSE 0
END
The KpiValue and KpiGoal functions above retrieve, respectively, the ac-
tual and the goal values of the given KPI. The expression computes the
status by dividing the actual value by the goal value. If it is 95% or more,
the value of the status is 1, if it is less than 85%, the value of the status
is −1, otherwise, the value is 0.
• Trend: Among the available graphical indicators, we select the status
arrow. The associated MDX expression is given next:
CASE
WHEN ( KpiValue("Sales Performance") - KpiGoal("Sales Performance") ) /
KpiGoal("Sales Performance") > 0.2
THEN 1
WHEN ( KpiValue("Sales Performance") - KpiGoal("Sales Performance") ) /
KpiGoal("Sales Performance") <= -0.2
THEN -1
ELSE 0
END
This expression computes the trend by subtracting the goal from the
value, then dividing by the goal. If there is a shortfall of 20% or more the
value of the trend is −1, if there is a surplus of 20% or more the value of
the trend is 1, otherwise, the value is 0.
• Weight: We leave it empty.
Figure 7.1 shows the KPI for November and December 2017. As can be seen,
while the figures for the month of December achieved the goal, this was not
the case for the month of November.
Now that the KPI is defined, we can issue an MDX query such as the next
one to display the KPI.
SELECT { Measures.[Sales Amount], Measures.[Sales Performance Goal],
Measures.[Sales Performance Trend] } ON COLUMNS,
{ [Order Date].Calendar.Month.[November 2017],
[Order Date].Calendar.Month.[December 2017] } ON ROWS
FROM Sales
Fig. 7.1 Display of the Sales Performance KPI. (a) November 2017; (b) December 2017
'Date'[MonthNumber], 'Date'[Year],
"Sales Amount", FORMAT( [Sales Amount], "$###,##0.00" ),
"Sales Target", FORMAT( [Sales Target], "$###,##0.00" ),
"Sales Status", [Sales Status], "Sales Trend", [Sales Trend] )
ORDER BY [Year], [MonthNumber]
In the Sales Target measure, the variable SalesAmountPY computes the sales
amount in the previous year, so that the target is set to the sales amount if
there is no value for the previous year, otherwise the target is set to a 15%
increase of the sales amount in the previous year. The Sales Status measure
returns a value of 1 if the ratio between the actual value and the target
value of the KPI is 95% or more, −1 if this ratio is less than 85%,
and 0 otherwise. Finally, the Sales Trend measure returns a value of 1 if the
difference between the actual value and the target value divided by the latter
is 20% or more, −1 if it is −20% or less, and 0 otherwise.
Now that we have defined the DAX expressions for the KPI, we can use
them to display the KPI in PowerBI Desktop. For this we need to install the
Tabular Editor so that we can define the KPI in the model. In the Tabular
Editor we need to transform the Sales Amount measure into a KPI as shown
in Fig. 7.2. The values can then be visualized as shown in Fig. 7.3.
7.6 Dashboards for the Northwind Case Study
In this section, we show how to define a dashboard for the Northwind case
study. Since the database contains sales until May 5, 2018, we assume that
we are currently at that date.
We want to display in a dashboard a group of indicators to monitor the
performance of several sectors of the company. The dashboard will contain:
• A graph showing the evolution of the total sales per month, together with
the total sales in the same month for the previous year.
• A gauge monitoring the percentage variation of total sales in the last
month with respect to the same month of the previous year. The goal is
to obtain a 5% increase.
• Another graph showing the shipping costs. The graph reports, for each month,
the total freight cost with respect to the total sales. The goal is that the
shipping costs remain below 5% of the sales amount.
• A gauge showing the year-to-date shipping costs to sales ratio. This is
the KPI introduced in Sect. 6.3.
• A table analyzing the performance of the sales force of the company.
It will list the three employees with the poorest sales performance as a
percentage of their target sales as of the current date. For each one of
them, we compute the total sales and the percentage with respect to the
Fig. 7.4 Defining the Northwind dashboard in Reporting Services using Visual Studio
Figure 7.4 shows the definition of the Northwind dashboard using Visual
Studio. As shown in the left of the figure, the data source is the Northwind
data warehouse. There are five datasets, one for each element of the dash-
board. Each dataset has an associated SQL query and a set of fields, which
correspond to the columns returned by the SQL query. The query shown in
the dialog box corresponds to the top left chart of the report. Figure 7.5
shows the resulting dashboard. We explain below its different components.
The top left chart shows the monthly sales together with those of the same
month of the previous year. The SQL query is given next.
WITH MonthlySales AS (
SELECT Year(D.Date) AS Year, Month(D.Date) AS Month,
SUBSTRING(DATENAME(month, D.Date), 1, 3) AS MonthName,
SUM(S.SalesAmount) AS MonthlySales
FROM Sales S, Date D
WHERE S.OrderDateKey = D.DateKey
GROUP BY Year(D.Date), Month(D.Date), DATENAME(month, D.Date) )
SELECT MS.Year, MS.Month, MS.MonthName, MS.MonthlySales,
PYMS.MonthlySales AS PreviousYearMonthlySales
FROM MonthlySales MS, MonthlySales PYMS
WHERE PYMS.Year = MS.Year - 1 AND MS.Month = PYMS.Month
In the above query, table MonthlySales computes the monthly sales amount.
Notice that the column MonthName computes the first three letters of the
month name, which is used for the labels of the x-axis of the chart. Then,
the main query joins table MonthlySales twice to obtain the result.
The top right gauge shows the percentage of the last month’s sales with
respect to the sales in the same month of the previous year. Recall that the
last order in the database was placed on May 5, 2018, and thus we would
like to show the figures for the last complete month, that is, April 2018. The
gauge defines a range (shown at the interior of the scale) with a gradient from
white to light blue and ranging from 0% to 115%. This corresponds to the
KPI targeting a 15% increase of the monthly sales amount with respect to
the same month of the previous year. The query for the gauge is given next:
WITH MonthlySales AS (
SELECT Year(D.Date) AS Year, Month(D.Date) AS Month,
SUM(S.SalesAmount) AS MonthlySales
FROM Sales S, Date D
WHERE S.OrderDateKey = D.DateKey
GROUP BY Year(D.Date), Month(D.Date) ),
LastMonth AS (
SELECT Year(MAX(D.Date)) AS Year, Month(MAX(D.Date)) AS MonthNumber
FROM Sales S, Date D
WHERE S.OrderDateKey = D.DateKey )
SELECT MS.Year, MS.Month,
MS.MonthlySales,
PYMS.MonthlySales AS PYMonthlySales,
(MS.MonthlySales - PYMS.MonthlySales) / PYMS.MonthlySales AS Percentage
FROM LastMonth L, YearMonthPM Y, MonthlySales MS, MonthlySales PYMS
WHERE L.Year = Y.Year AND L.MonthNumber = Y.MonthNumber AND
Y.PM_Year = MS.Year AND Y.PM_MonthNumber = MS.Month AND
PYMS.Year = MS.Year - 1 AND MS.Month = PYMS.Month
The query for the middle left chart, which shows the shipping costs with
respect to the total sales by month is given next.
SELECT Year(D.Date) AS Year, Month(D.Date) AS Month,
SUBSTRING(DATENAME(mm, D.Date), 1, 3) AS MonthName,
SUM(S.SalesAmount) AS TotalSales, SUM(S.Freight) AS TotalFreight,
SUM(S.Freight) / SUM(S.SalesAmount) AS Percentage
FROM Sales S, Date D
WHERE S.OrderDateKey = D.DateKey
GROUP BY Year(D.Date), Month(D.Date), DATENAME(mm, D.Date)
ORDER BY Year, Month, DATENAME(mm, D.Date)
Here we compute the total sales and the total freight cost by month, as well as
the freight cost as a percentage of the sales amount.
The gauge in the middle right of Fig. 7.5 shows the year-to-date shipping
costs to sales ratio. The range of the gauge reflects the KPI used for moni-
toring shipping costs, targeted at remaining below 5% of the sales amount.
The corresponding query is given next.
WITH LastMonth AS (
SELECT Year(MAX(D.Date)) AS Year, Month(MAX(D.Date)) AS Month
FROM Sales S, Date D
WHERE S.OrderDateKey = D.DateKey )
SELECT SUM(S.SalesAmount) AS TotalSales, SUM(S.Freight) AS TotalFreight,
SUM(S.Freight) / SUM(S.SalesAmount) AS Percentage
FROM LastMonth L, YearMonthPM Y, Sales S, Date D
WHERE L.Year = Y.Year AND L.Month = Y.MonthNumber AND
S.OrderDateKey = D.DateKey AND Year(D.Date) = Y.PM_Year AND
Month(D.Date) <= Y.PM_MonthNumber
Table LastDay computes the year and the day-of-the-year number of the last
order (124 for May 4, 2018 in our example). Table TgtSales computes the
target sales that employees must achieve for the current year, as a 5% increase
of the sales amount for the previous year. Table ExpSales computes the year-
to-date sales as well as the expected sales, obtained by multiplying the former
by a factor accounting for the number of days remaining in the year. Finally,
in the main query we join the last two tables with the Employee table, compute
the expected percentage, and display the three lowest-performing employees.
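A sketch of such a query, assuming SQL Server's DATEPART function for the
day-of-the-year number and extrapolating the year-to-date sales by the factor 365
divided by that number (aliases such as ExpectedPerc are ours), could be:
WITH LastDay AS (
SELECT Year(MAX(D.Date)) AS Year,
DATEPART(dayofyear, MAX(D.Date)) AS DayOfYear
FROM Sales S, Date D
WHERE S.OrderDateKey = D.DateKey ),
TgtSales AS (
SELECT S.EmployeeKey, 1.05 * SUM(S.SalesAmount) AS TargetSales
FROM Sales S, Date D, LastDay L
WHERE S.OrderDateKey = D.DateKey AND D.Year = L.Year - 1
GROUP BY S.EmployeeKey ),
ExpSales AS (
SELECT S.EmployeeKey, SUM(S.SalesAmount) AS YTDSales,
SUM(S.SalesAmount) * 365.0 / MAX(L.DayOfYear) AS ExpectedSales
FROM Sales S, Date D, LastDay L
WHERE S.OrderDateKey = D.DateKey AND D.Year = L.Year
GROUP BY S.EmployeeKey )
SELECT TOP(3) FirstName + ' ' + LastName AS EmployeeName,
X.YTDSales, X.ExpectedSales / T.TargetSales AS ExpectedPerc
FROM Employee E, TgtSales T, ExpSales X
WHERE E.EmployeeKey = T.EmployeeKey AND E.EmployeeKey = X.EmployeeKey
ORDER BY ExpectedPerc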
In this section, we show how the dashboard of the previous section can be
implemented in Power BI. Figure 7.6 shows the resulting dashboard. We
explain next the DAX queries for various components.
The top left chart shows the monthly sales compared with those of the
same month in the previous year. The required measures are given next.
[Sales Amount] = SUM ( [SalesAmount] )
[PY Sales] = CALCULATE( [Sales Amount], SAMEPERIODLASTYEAR( 'Date'[Date] ) )
The top right gauge shows the percentage change in sales between the last
month and the same month in the previous year. The required measures are
given next.
[LastOrderDate] = CALCULATE( MAX( 'Date'[Date] ), FILTER( ALL( 'Sales' ), TRUE ) )
[LM Sales] =
VAR LastOrderEOMPM = CALCULATE ( MAX ( 'Date'[YearMonth] ),
FILTER ( 'Date', 'Date'[Date] = EOMONTH ( [LastOrderDate], -1 ) ) )
RETURN
CALCULATE ( [Sales Amount],
FILTER( ALL( 'Date' ), 'Date'[YearMonth] = LastOrderEOMPM ) )
[LMPY Sales] =
VAR LastOrderEOMPY = CALCULATE ( MAX ( 'Date'[YearMonth] ),
FILTER ( 'Date', 'Date'[Date] = EOMONTH ( [LastOrderDate], -13 ) ) )
RETURN
CALCULATE ( [Sales Amount],
FILTER( ALL( 'Date' ), 'Date'[YearMonth] = LastOrderEOMPY ) )
[Perc Change Sales] = DIVIDE([LM Sales] - [LMPY Sales], [LMPY Sales], 0)
[Perc Change Sales Min] = 0.0
[Perc Change Sales Max] = 2.0
[Perc Change Sales Target] = 1.15
LastOrderDate computes the date of the last order. LM Sales computes the
sales of the month preceding the last order date, while LMPY Sales com-
putes the sales of the same month in the previous year. For this, we use the
EOMONTH function, which computes the last day of the month, a specified
number of months in the past or in the future. Perc Change Sales uses the two
previous measures to compute the percentage change. In addition to this, we
also set the minimum, maximum, and target values for the gauge.
Next, we show the measures for the middle left chart, which displays the
shipping costs with respect to the total sales by month.
[Total Freight] = SUM( [Freight] )
[Total Freight to Sales Ratio] = DIVIDE( [Total Freight], [Sales Amount], 0 )
The middle right gauge shows the year-to-date shipping costs to sales ratio.
In addition to computing this measure, we need to set the minimum, maximum,
and target values for the gauge as given next.
[YTD Freight] = CALCULATE( [Total Freight],
FILTER( 'Date', 'Date'[Year] = MAX( 'Date'[Year] ) ) )
[YTD Sales] = CALCULATE( [Sales Amount],
FILTER( 'Date', 'Date'[Year] = MAX( 'Date'[Year] ) ) )
[YTD Freight to Sales Ratio] = DIVIDE( [YTD Freight], [YTD Sales], 0 )
[Freight to Sales Ratio Min Value] = 0.0
[Freight to Sales Ratio Max Value] = 0.2
[Freight to Sales Ratio Target Value] = 0.05
Finally, the measures for the bottom table, which shows the forecast for the
three lowest-performing employees at the end of the year, are given next:
The first measure computes the day-of-the-year number for the last order
date (124 for May 4, 2018 in our example). The expected sales are obtained
by multiplying the year-to-date sales by a factor accounting for the number of
days remaining until the end of the year. The target sales are computed as a
5% increase of the previous year’s sales, while the expected quota is computed
as the ratio between the expected sales and the target sales. Finally, the last
measure computes the rank of the employees with respect to the expected
quota. Notice that the filter to display only three employees is done in the
Power BI interface.
7.7 Summary
The first part of this chapter was devoted to illustrating how MDX, DAX,
and SQL can be used for querying data warehouses. For this, we addressed
a series of queries to the Northwind data warehouse. We concluded the first
part of the chapter by comparing the expressiveness of MDX, DAX, and
SQL, highlighting the advantages and disadvantages of these languages. We
continued the chapter by illustrating how to define KPIs for the Northwind
case study in both Analysis Services Multidimensional and Tabular. Finally,
we concluded the chapter by illustrating how to create dashboards for the
Northwind case study using Microsoft Reporting Services and Power BI.
7.8 Review Questions
7.1 What are key performance indicators or KPIs? What are they used
for? Detail the conditions a good KPI must satisfy.
7.2 Define a collection of KPIs using an example of an application domain
that you are familiar with.
7.3 Explain the notion of dashboard. Compare the different definitions
for dashboards.
7.4 What types of dashboards do you know? How would you use them?
7.5 Comment on the dashboard design guidelines.
7.9 Exercises
Exercise 7.1. Using the multidimensional model for the Foodmart data
warehouse defined in Ex. 5.9, write in MDX the queries given in Ex. 4.9.
Exercise 7.2. Using the tabular model for the Foodmart data warehouse
defined in Ex. 5.10, write in DAX the queries given in Ex. 4.9.
Exercise 7.3. Using the relational schema of the Foodmart data warehouse
in Fig. 5.41, write in SQL the queries given in Ex. 4.9.
Exercise 7.6. The Foodmart company wants to define a set of KPIs based
on its data warehouse. The finance department wants to monitor the overall
performance of the company stores, to check the percentage of the stores
accountable for 85% of the total sales. The sales department wants to monitor
the evolution of the sales cost. It also wants to measure the monthly rate of
new customers. Propose KPIs that can help the departments in these tasks.
Define these KPIs together with the goals that they are aimed to evaluate.
Exercise 7.8. Define in Analysis Services Tabular the KPIs of Ex. 7.6.
Exercise 7.9. Using the Foodmart data warehouse, define in Reporting Ser-
vices a dashboard to display the best five customers (regarding sales amount)
for the last year, the best five selling products for the last year, the evolution
in the last 2 years of the product sales by family, and the evolution in the
last 2 years of the promotion sales against nonpromoted sales.
In this section, we give an overview of the three classic techniques for im-
proving data warehouse performance: materialized views, indexing, and par-
titioning. Later in the chapter we study these techniques in detail.
As we studied in Chap. 2, a view in the relational model is just a query
that is stored in the database with an associated name, and which can then
be used like a normal table. This query can involve base tables (i.e., tables
physically stored in the database) and/or other views. A materialized view
is a view that is physically stored in a database. Materialized views enhance
query performance by precalculating costly operations such as joins and ag-
gregations and storing the results in the database. In this way, queries that
only need to access materialized views will be executed faster. Obviously, the
increased query performance is achieved at the expense of storage space.
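For example, in a system with native support for materialized views, such as
PostgreSQL or Oracle (SQL Server offers indexed views for the same purpose),
a view precomputing the total sales by product and date could be declared as
follows, where the view name is ours:
CREATE MATERIALIZED VIEW SalesByProductDate AS
SELECT ProductKey, DateKey, SUM(SalesAmount) AS SumSales
FROM Sales
GROUP BY ProductKey, DateKey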
A typical problem with materialized views is keeping them up to date, since
all modifications to the underlying base tables must be propagated into the view.
Whenever possible, updates to materialized views are performed in an incremental
way, avoiding recalculating the whole view from scratch. This implies captur-
ing the modifications to the base tables and determining how they influence
the content of the view. Much research work has been done in the area of
view maintenance. We study the most classic ones in this chapter.
In a data warehouse, given that the number of aggregates grows exponen-
tially with the number of dimensions and hierarchies, normally not all possi-
ble aggregations can be precalculated and materialized. Thus, an important
problem in designing a data warehouse is the selection of materialized
views. The goal is to select an appropriate set of views that minimizes the
total query response time and the cost of maintaining the selected views,
given a limited amount of resources such as storage space or materialization
time. Many algorithms have been designed for selection of materialized views
and currently some commercial DBMSs provide tools that tune the selection
of materialized views on the basis of previous queries to the data warehouse.
Once the views to be materialized have been defined, the queries addressed
to a data warehouse must be rewritten in order to best exploit such views
to improve query response time. This process, known as query rewriting,
tries to use the materialized views as much as possible, even if they only
partially fulfill the query conditions. Selecting the best rewriting for a query
is a complex process, in particular for queries involving aggregations. Many
algorithms have been proposed for query rewriting in the presence of materi-
alized views. These algorithms impose various restrictions on the given query
and the potential materialized views so that the rewriting can be done.
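As an illustration, assume that the materialized view SalesByProductDate sketched
above is available. A query asking for the total sales by category can then be
rewritten to read the much smaller view instead of the Sales fact table:
-- Original query over the fact table
SELECT CategoryKey, SUM(SalesAmount) AS SalesAmount
FROM Sales S, Product P
WHERE S.ProductKey = P.ProductKey
GROUP BY CategoryKey
-- Equivalent rewriting over the materialized view
SELECT CategoryKey, SUM(SumSales) AS SalesAmount
FROM SalesByProductDate V, Product P
WHERE V.ProductKey = P.ProductKey
GROUP BY CategoryKey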
A drawback of the materialized view approach is that it requires one to an-
ticipate the queries to be materialized. However, data warehouse queries are
often ad hoc and cannot always be anticipated. As queries which are not pre-
calculated must be computed at run time, indexing methods are required to
ensure effective query processing. Traditional indexing techniques for OLTP
systems are not appropriate for multidimensional data. Indeed, most OLTP
transactions access only a small number of tuples, and the indexing techniques
used are designed for this situation. Since data warehouse queries typically
access a large number of tuples, alternative indexing mechanisms are needed.
Two common types of indexes for data warehouses are bitmap indexes
and join indexes. Bitmap indexes are a special kind of index, particularly
useful for columns with a low number of distinct values (i.e., low cardinality
attributes), although several compression techniques eliminate this limita-
tion. On the other hand, join indexes materialize a relational join between
two tables by keeping pairs of row identifiers that participate in the join. In
data warehouses, join indexes relate the values of dimensions to rows in the
fact table. For example, given a fact table Sales and a dimension Client, a join
index maintains for each client a list of row identifiers of the tuples recording
the sales to this client. Join indexes can be combined with bitmap indexes,
as we will see in this chapter.
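For instance, Oracle supports bitmap join indexes; an index relating each client
name to the corresponding rows of the fact table could be created as follows
(table and attribute names are hypothetical):
CREATE BITMAP INDEX SalesClientBJI
ON Sales ( Client.ClientName )
FROM Sales, Client
WHERE Sales.ClientKey = Client.ClientKey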
Partitioning is a mechanism used in relational databases to improve the
efficiency of queries. It consists of dividing the contents of a relation into
several files that can then be processed more efficiently.
ple, a table can be partitioned such that the most often used attributes are
stored in one partition, while other less often used attributes are kept in an-
other partition. Another partitioning scheme in data warehouses is based on
time, where each partition contains data about a particular time period, for
instance, a year or a range of months.
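For example, using the declarative range partitioning of PostgreSQL, and
assuming that the surrogate key of the Date dimension encodes dates as integers
of the form yyyymmdd, a fact table could be partitioned by year as sketched next:
CREATE TABLE SalesPart (
    ProductKey INT, CustomerKey INT, OrderDateKey INT,
    SalesAmount DECIMAL(10,2) )
PARTITION BY RANGE (OrderDateKey);
CREATE TABLE SalesPart2017 PARTITION OF SalesPart
    FOR VALUES FROM (20170101) TO (20180101);
CREATE TABLE SalesPart2018 PARTITION OF SalesPart
    FOR VALUES FROM (20180101) TO (20190101);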
In the following sections we study these techniques in detail.
It is clear that inserting a tuple like (p2, c3, 110) in the table Sales would
have no effect on the view, since the tuple does not satisfy the view condition.
However, the insertion of the tuple (p2, c3, 160) would possibly modify the
view. An algorithm can easily update it without accessing the base relation,
basically adding the product if it is not already in the view.
Let us now analyze the deletion of a tuple from Sales, for example,
(p2, c3, 160). We cannot delete p2 from the view without checking whether p2 has
been ordered by some other customer in a quantity greater than 150,
which requires scanning the relation Sales.
In summary, although in some cases an insertion can be handled by accessing
only the materialized view, a deletion always requires further information.
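To make the example concrete, the view (the products ordered in a quantity
greater than 150) and the incremental handling of insertions can be sketched as
follows, where the names LargeSalesProducts and Delta_Sales are ours and
Delta_Sales holds the newly inserted Sales tuples:
-- The materialized view, stored as a table
SELECT DISTINCT ProductKey
INTO LargeSalesProducts
FROM Sales
WHERE Quantity > 150
-- Propagation of insertions without accessing the base relation Sales
INSERT INTO LargeSalesProducts
SELECT DISTINCT ProductKey
FROM Delta_Sales
WHERE Quantity > 150 AND
    ProductKey NOT IN ( SELECT ProductKey FROM LargeSalesProducts )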
Consider now a view FoodCustomers which includes a join. The view con-
tains the customers that ordered at least one product in the food category (we
use the simplified and denormalized Product dimension defined in Sect. 5.7).
FoodCustomers = πCustomerKey (σCategoryName=‘Food’ (Product) ∗ Sales)
If we insert the tuple (p3, c4, 170) in table Sales, we cannot know if c4 will
be in the view FoodCustomers (of course assuming that it is not in the view
already) unless we check in the base relations whether or not p3 is in the food
category.
The above examples show the need to characterize the kinds of view
maintenance problems in terms of the kind of update and of the operations
in the view definition. Two main classes of algorithms for view maintenance
have been studied in the database literature:
• Algorithms using full information, which means the views and the base
relations.
• Algorithms using partial information, namely, the materialized views and
the key constraints.
Fig. 8.1 An example of the counting algorithm: (a) Instance of the Sales relation; (b)
View FoodCustomers, including the number of possible derivations of each tu-
ple; (c) View FoodCustomers after the deletion of (p1, c2, 100)
Suppose that we delete tuple (p1, c2, 100) from Sales. Although c2 in
FoodCustomers is derived from the deleted tuple, it also has an alterna-
tive derivation, through (p2, c2, 50). Thus, deleting (p1, c2, 100) does not
prevent c2 from being in the view. The counting algorithm computes a relation
∆− (FoodCustomers) which contains the tuples that can be derived from
(p1, c2, 100), and therefore affected by the deletion of such tuple, and adds a
−1 to each tuple. In this example, ∆− (FoodCustomers) will contain the tuples
The first query computes the effect on the view of the changes to Product,
and the second one does the same with the changes to Shipper. Consider
the two relations Product and Shipper in Fig. 8.2a,b, as well as ∆+ (Product)
in Fig. 8.2c containing the tuples inserted in Product. When we insert a
matching tuple like (p3, MP3, s2), the projection of the left outer join with
Shipper would be (p3, MP3, s2, DHL). In this case, the algorithm should also
delete (NULL, NULL, s2, DHL) (because (s2, DHL) now has a matching tu-
ple), together with adding (p3, MP3, s2, DHL). For the tuple (p4, PC, NULL)
is inserted into Product, the left outer join between (p4, PC, NULL) and Ship-
per, yields (p4, PC, NULL, NULL), which is inserted into the view. Figure 8.2e
shows the final state of the view.
Fig. 8.2 An example of maintenance of a full outer join view. (a) Table Product; (b)
Table Shipper; (c) ∆+ (Product); (d) View ProductShipper; (e) Resulting view
after the insertions
Suppose that c3 is in the view and we delete the tuple (p1, c3, 100) from the
relation Sales. We could not delete c3 from the view without checking if this
customer ordered another food product. If in the base relations we find that
there is another tuple in Sales of the form (p, c3, q), such that p is in the food
category, then c3 will remain in the view. Thus, the view FoodCustomers is
not self-maintainable with respect to deletions on Sales. Analogously, this
view is not self-maintainable with respect to insertions into any of the two
base relations, because for any tuple inserted, for example, into Sales, we
must check if the product is in the food category (except if c3 is already in
the view, in which case nothing should be done).
We say that an attribute is distinguished in a view V if it appears in the
SELECT clause of the view definition. An attribute A belonging to a relation
R is exposed in a view V if A is used in a predicate in V. For example, in the
view FoodCustomers above, CustomerKey is a distinguished attribute, while
CategoryName is an exposed attribute of Product. We briefly present
some well-known results in view maintenance theory.
shown in Fig. 8.3a. Notice that the tuple (NULL, NULL) is excluded from this
projection. The tables ∆+ (Product) and ∆− (Product) denoting, respectively,
the tuples inserted and deleted from Product, are shown in Fig. 8.3b,c. Since
the view is self-maintainable, we can join these delta tables with Proj_Shipper
instead of Shipper, thus avoiding to access the base relations. The joins be-
tween delta tables and Proj_Shipper are shown in Fig. 8.3d,e. Finally, the
result of both joins is merged with the original view and the side effects are
addressed. For example, when inserting (p3, MP3, s2, DHL), we must delete
(NULL, NULL, s2, DHL). Analogously, when deleting (p1, TV, s1, Fedex), we
must insert (NULL, NULL, s1, Fedex). Figure 8.3f shows the final result.
Fig. 8.3 An example of self-maintenance of a full outer join view. (a) Proj_Shipper;
(b) ∆+ (Product); (c) ∆− (Product); (d) the join of ∆+ (Product) with Proj_Shipper;
(e) the join of ∆− (Product) with Proj_Shipper; (f) Final result
The Count attribute is added in order to maintain the view in the presence
of deletions, as we will explain later. In the propagate phase, we define two
tables, ∆+ (Sales) and ∆− (Sales), which store the insertions and deletions to
the fact table, and a view where the net changes to the summary tables are
stored. The latter is called a summary-delta table, which in this example is
created as follows:
CREATE VIEW SD_DailySalesSum(ProductKey, DateKey,
SD_SumQuantity, SD_Count) AS
WITH Temp AS (
( SELECT ProductKey, DateKey,
Quantity AS _Quantity, 1 AS _Count
FROM ∆+ (Sales) )
UNION ALL
( SELECT ProductKey, DateKey,
-1 * Quantity AS _Quantity, -1 AS _Count
FROM ∆− (Sales) ) )
SELECT ProductKey, DateKey, SUM(_Quantity), SUM(_Count)
FROM Temp
GROUP BY ProductKey, DateKey
In the temporary table Temp of the view definition, we can see that for each
tuple in ∆+ (Sales) we store a 1 in the _Count attribute, while for each tuple in
∆− (Sales) we store a −1. Analogously, the Quantity attribute values are multi-
plied by 1 or −1 depending on whether they are retrieved from ∆+ (Sales) or ∆− (Sales),
respectively. Then, in the main SELECT clause, the SD_SumQuantity at-
tribute contains the net sum of the quantity for each combination of Product-
Key and DateKey, while SD_Count contains the net number of tuples in the
view corresponding to such combination.
During the refresh phase, we apply to the summary table the net changes
stored in the summary-delta table. Below we give a general scheme of the
refresh algorithm valid when the aggregate function is SUM.
Refresh Algorithm
INPUT: Summary-delta table SD_DailySalesSum
Summary table DailySalesSum
OUTPUT: Updated summary table DailySalesSum
BEGIN
For each tuple T in SD_DailySalesSum DO
    IF NOT EXISTS (
        SELECT *
        FROM DailySalesSum D
        WHERE D.ProductKey = T.ProductKey AND D.DateKey = T.DateKey )
    THEN insert tuple T into DailySalesSum
    ELSE IF the Count value of the matching tuple plus T.SD_Count = 0
        THEN delete the matching tuple from DailySalesSum
        ELSE add T.SD_SumQuantity and T.SD_Count to the SumQuantity and
            Count values of the matching tuple
END DO
END
Consider now the case where the aggregate function is MAX. Figure 8.5a shows
the original view DailySalesMax, and Fig. 8.5b shows the summary-delta table.
As can be seen, we need a column for keeping the maximum value in the tuples
inserted or deleted, as well as another column counting the number of insertions
or deletions of tuples
Fig. 8.4 An example of the propagate and refresh algorithm with aggregate function
SUM. (a) Original view DailySalesSum; (b) ∆+ (Sales); (c) ∆− (Sales); (d)
Summary-delta table SD_DailySalesSum; (e) View DailySalesSum after update
having the maximum value. Thus, the first four tuples in the summary-delta
table correspond to insertions, while the last three correspond to deletions
since the count value is negative. The view for creating the summary-delta
table is given next.
CREATE VIEW SD_DailySalesMax(ProductKey, DateKey,
SD_MaxQuantity, SD_Count) AS (
SELECT ProductKey, DateKey, Quantity, COUNT(*)
FROM ∆+ (Sales) S1
WHERE Quantity = (
SELECT MAX(Quantity)
FROM ∆+ (Sales) S2
WHERE S1.ProductKey = S2.ProductKey AND
S1.DateKey = S2.DateKey )
GROUP BY ProductKey, DateKey
UNION ALL
SELECT ProductKey, DateKey, Quantity, -1 * COUNT(*)
FROM ∆− (Sales) S1
WHERE Quantity = (
SELECT MAX(Quantity)
FROM ∆− (Sales) S2
WHERE S1.ProductKey = S2.ProductKey AND
S1.DateKey = S2.DateKey )
GROUP BY ProductKey, DateKey )
Fig. 8.5 An example of the propagate and refresh algorithm with aggregate func-
tion MAX. (a) Original view DailySalesMax; (b) Summary-delta table
SD_DailySalesMax; (c) Updated view DailySalesMax
Finally, Fig. 8.5c shows the view after the update. Let us consider first
the insertions. The tuple for p1 in the summary-delta table does not have a
corresponding tuple in the view, and thus, it is inserted in the view. The tuple
for p2 in the summary-delta table has a maximum value smaller than that in
the view so the view is not modified. The tuple for p3 in the summary-delta
table has a quantity value equal to the maximum in the view so the maximum
value remains the same and the counter is increased to 7. The tuple for p4 in
the summary-delta table has a maximum value greater than the maximum
in the view, and thus, the view must be updated with the new maximum and
the new counter.
Now consider the deletions. The tuple for p5 in the summary-delta table
has a quantity value smaller than the maximum in the view so the view is not
modified. The tuple for p6 in the summary-delta table has a quantity value
equal to the maximum in the view, but the counter in the view is greater than
the number of deletions. In this case, we simply decrease the counter in the
view, which becomes 1. The tuple for p7 illustrates
why the MAX function is not self-maintainable with respect to deletions.
The maximum value and the counter in the summary-delta table are equal
to those values in the view. There are two possible cases. If there are other
tuples in the base table with the same combination (p7, t7) we must obtain
the new maximum value and the new count from the base tables. This case
is depicted in Fig. 8.5c. Otherwise, if there are no other tuples in the base
table with the same combination (p7, t7), we must simply delete the tuple
from the view.
The algorithm for refreshing the view DailySalesMax from the summary-
delta table SD_DailySalesMax is left as an exercise.
Fig. 8.6 The lattice of a data cube with four dimensions A, B, C, and D: level 0
contains the node All; level 1 the views A, B, C, and D; level 2 the views AB, AC,
AD, BC, BD, and CD; level 3 the views ABC, ABD, ACD, and BCD; and level 4
the base view ABCD
The PipeSort algorithm gives a global strategy for computing the data
cube, which includes the first four optimization methods specified above.
The algorithm includes cache-results and amortize-scans strategies by means
of computing nodes with common prefixes in a single scan. This is called
pipelined evaluation in database query optimization. In this way, we could
compute ABCD, ABC, AB, and A in a single scan because the attribute order
in the view is the sorting order in the file. For example, in the base table below,
with a single scan of the first five tuples we can compute the aggregations
(a1, b1, c1, 200), (a1, b1, c2, 500), (a1, b1, 700), (a1, b2, 400), and (a1, 1100).
A B C D
a1 b1 c1 d1 100
a1 b1 c1 d2 100
a1 b1 c2 d1 200
a1 b1 c2 d1 300
a1 b2 c1 d1 400
a2 b1 c1 d1 100
a2 b1 c2 d2 400
··· ··· ··· ··· ···
The input of the algorithm is a data cube lattice in which each edge eij ,
where node i is the parent of node j, is labeled with two costs, S(eij ) and
A(eij ). S(eij ) is the cost of computing j from i if i is not sorted. A(eij ) is
the cost of computing j from i if i is already sorted. Thus, A(eij ) ≤ S(eij ).
In addition, we consider the lattice organized into levels, where each level k
contains views with exactly k attributes, starting from All, where k = 0. This
data structure is called a search lattice.
The output of the algorithm is a subgraph of the search lattice such that
each node has exactly one parent from which it will be computed in a certain
mode, that is, sorted or not (note that in the search lattice, each node, except
All, has more than one parent). If the attribute order of a node j is a prefix of
the order of its parent i, then j can be computed from i without sorting the
latter, and in the resulting graph, the edge will have cost A(eij ). Otherwise,
i has to be sorted to compute j and the edge will have cost S(eij ). Note that
for any node i in an output graph there can be at most one outgoing edge
marked A, and many outgoing edges marked S. The goal of the algorithm is
to find an output graph representing an execution plan such that the sum of
the costs labeling the edges is minimum.
To obtain the minimum cost output graph, the algorithm proceeds level
by level, starting from level 0 until level N − 1, where N is the number of
levels in the search lattice. We find the best way of computing the nodes in
each level k from the nodes in level k + 1, reducing the problem to a weighted
bipartite matching problem as follows. Consider a pair (k, k +1) of levels. The
algorithm first transforms the level k + 1 by making k copies of each one of
its nodes. Thus, each node in level k + 1 will have k + 1 children, that is, k + 1
outgoing edges. All original edges have cost A(eij ) and all replicated edges
have cost S(eij ). Therefore, this transformed graph induces a bipartite graph
(because there are edges between nodes in different levels but not between
nodes in the same level). Finally, we compute the minimum cost matching
such that each node j in level k will be matched to some node i in level k + 1.
If j is connected to i by an A() edge, then j determines the attribute order
in which i will be sorted during its computation. If, instead, j is connected
to i by an S() edge, i will be sorted in order to compute j.
Fig. 8.7 Minimum bipartite matching between two levels in the cube lattice. In both
(a) and (b), level 1 contains the nodes A, B, C, and D, and level 2 contains the nodes
AB, AC, AD, BC, BD, and CD, each of them duplicated
Figure 8.8 shows an evaluation plan for computing the cube lattice of
Fig. 8.6 using the PipeSort algorithm. The minimum cost sort plan will first
sort the base fact table in CBAD order and compute CBA, CB, C, and All
aggregations in a pipelined fashion. Then, we sort the base fact table in the
BADC order and proceed as above to compute the aggregates BAD, BA, and B.
We continue in the same way with ACDB and DBCA. Note how the views in
level 1 (A, B, C, and D) are computed from the views in level 2 in the way
that was indicated by the bipartite graph matching in Fig. 8.7.
Fig. 8.8 Evaluation plan for computing the cube lattice in Fig. 8.6 (showing the nodes
All; C, B, A, and D; CB, BA, AC, DB, AD, and CD; and the base table CBAD)
We have already said that algorithms like PipeSort, and most algorithms com-
puting summary tables, require knowing the size of each aggregate. However,
in general this is not known in advance. Thus, we need to accurately predict
the sizes of the different aggregates. There are three classic methods for this,
although a wide array of statistical techniques could be used. The first of
these methods is purely analytical, the second is based on sampling, and the
last one on probabilistic counting.
The analytical algorithm is based on a result by Feller from 1957, stating
that choosing r elements (which we can assume are the tuples in a relation)
randomly from a set of n elements (which are all the different values a set
of attributes can take), the expected number of distinct elements obtained is
n − n × (1 − 1/n)^r. This assumes that data are uniformly distributed. If it turns
out not to be the case, and data present some skew, we will be overestimating
the size of the data cube. For instance, let us suppose a relation R(ProductKey,
CustomerKey, DateKey). If we want to estimate the size of the aggregation
over ProductKey and CustomerKey, we should know the number of different
values of each attribute. Then, n = |ProductKey|×|CustomerKey|, and r is the
number of tuples in R. The main advantage of this method is its simplicity
and performance. The obvious drawback of the algorithm is that it does not
consult the database, and the results can be used only when we know that
data are uniformly distributed.
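As a small numerical illustration (with made-up figures), suppose that
|ProductKey| = 100, |CustomerKey| = 1,000, and that R contains r = 50,000 tuples.
Then n = 100 × 1,000 = 100,000 and the expected size of the aggregation is
n − n × (1 − 1/n)^r ≈ n × (1 − e^(−r/n)) = 100,000 × (1 − e^(−0.5)) ≈ 39,350 tuples.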
The basic idea of the sampling-based algorithm is to take a random
subset of the database and compute the cube over this subset. Let D be the
database, S the sample, and Cube(S) the size of the cube computed from
S. The size of the cube will be estimated as Cube(S) × |D| / |S|. This method is
simple and fast, and it has been reported that it provides satisfactory results
over real-world data sets.
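For instance, again with made-up figures, if a sample S containing 1% of the
tuples of D yields a cube Cube(S) with 12,000 tuples, the size of the full cube
would be estimated as 12,000 × |D| / |S| = 12,000 × 100 = 1,200,000 tuples.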
The probabilistic counting algorithm is based on the following obser-
vation: suppose we want to compute the number of tuples of the aggregation
of sales by product category and shipper. We would first aggregate along the
dimension Product, to generate product categories, and count the number
of distinct shippers generated by this operation. For example, for the set of
product-shipper pairs {(p1, s1), (p2, s1), (p3, s2), (p4, s4), (p5, s4)}, if p1 and
p2 correspond to category c1, and the rest to category c2, the aggregation
will have three tuples: {(c1, s1), (c2, s2), (c2, s4)}. In other words, c1 yields
only one value of shipper, and c2 yields two distinct values of shipper. Thus,
by estimating the number of distinct values in each group (in this case, the
distinct shippers per category), we can estimate the number of tuples in the
aggregation. This idea is
used to estimate the size of a data cube by means of counting the number
of distinct elements in a multiset as proposed in a well-known algorithm by
Flajolet and Martin, performing this for all possible combinations of the hi-
erarchies in the cube. The algorithm estimates the sizes of the aggregations
in a cube at the cost of scanning the whole database once. However, this is
cheaper than actually computing the cube, and it is proved that the error
has a bound. Details of this algorithm fall beyond the scope of this book.
The algorithm above works as follows. Given a view w (not yet material-
ized), let us denote by u the (materialized) view of minimum cost from which w
can be computed. Given a candidate view v selected for materialization, for
each view w that depends on v, the benefit for w of materializing v (denoted
by Bw) is computed as the difference between the costs of u and v, that is,
Bw = C(u) − C(v). If computing w from v is more expensive than doing it
from u (C(v) > C(u)), materializing the candidate view does not benefit the
computation of w (Bw = 0). Finally, the benefit of materializing v, denoted
B(v, S), is the sum of the individual benefits Bw over all views w that depend on v.
The view selection algorithm computes, in each iteration, the view v whose
materialization gives the maximum benefit. This algorithm is given next.
View Selection Algorithm
INPUT: A lattice L, each view node v labeled with its expected number of rows
The number of views to materialize, k
OUTPUT: The set of views to materialize
BEGIN
S = {The bottom view in L}
FOR i = 1 TO k DO
Select a view v not in S such that B(v, S) is maximized
S = S ∪ {v}
END DO
S is the selection of views to materialize
END
The set S contains the views already materialized. In each one of the k itera-
tions, the algorithm computes the benefit produced by the materialization of
each of the views not yet in S. The one with the maximum benefit is added
to S, and a new iteration begins.
Let us apply the algorithm to the lattice in Fig. 8.9. In addition to the
node label, beside each node we indicate the cost of the view that the node
represents. Assume that we can materialize three views and that the bottom
view is already materialized.
Let us show how to select the first view to materialize. We need to compute
the benefit of materializing each view, knowing that S = {ABC}. We start
Fig. 8.9 Dependency lattice, where the cost of each view is shown beside its node
(All: 1; A: 20; B: 60; C: 40; ABC: 2,000). Initially, the only view materialized is ABC.
with node AB, which is a good candidate, since it offers a cost reduction of
1,600 units for each view that depends on it. For example, node A depends
on AB. Currently, computing A has cost 2,000, since this is performed from
ABC. If we materialize AB, the cost of computing A will drop to 400.
The benefit of materializing AB given S is given by B(AB, S), the sum of the
individual benefits Bw over all views w covered by AB.
Thus, for each view w covered by AB, we compute C(ABC) − C(AB), be-
cause ABC is the only materialized view when the algorithm begins. That is,
C(ABC) − C(AB) is the benefit of materializing AB for each view covered by
AB. For example, to compute B without materializing AB, we would need to
scan ABC at cost 2,000. With AB being materialized this reduces to 400. The
same occurs with all the views that have a path to All that passes through
AB, that is, A, B, All, and AB itself. For C, AC, and BC, the materialization
of AB is irrelevant. Then,
In an analogous way,
P
B(BC, S) = wBC Bw = 1, 300 × 4 = 5, 200,
It can be proved that the benefit of this greedy algorithm is at least 63%
of the benefit of the optimal selection. On the other hand, even though this is a
classic algorithm, pedagogically interesting for presenting the problem, a clear
drawback is that it does not consider the frequency of the queries over each
view. Thus, in our example, even though the sum of the benefits is maximal,
nothing is said about the frequency of the queries asking for A or B. This
drawback has been addressed in several research papers.
index over such an attribute, a single disk block access will do the job, since this
attribute is a key for the relation.
Although indexing provides advantages for fast data access, it has a draw-
back: almost every update on an indexed attribute also requires an index up-
date. This suggests that database administrators should be careful when defining
indexes, because their proliferation can lead to poor update performance.
The most popular indexing technique in relational databases is the B+ -
tree. All major vendors provide support for some variation of B+ -tree indexes.
A B+ -tree index is a multilevel structure containing a root node and pointers
to the next lower level in a tree. The lowest level is formed by the leaves of
the tree, which in general contain a record identifier for the corresponding
data. Often, the size of each node equals the size of a block, and each node
holds a large number of keys, so the resulting tree has a low number of levels
and the retrieval of a record can be very fast. This works well if the attribute
being indexed is a key of the file or if the number of duplicate values is low.
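In most relational systems, the index created by a plain CREATE INDEX statement is some B+-tree variant; for instance, an index over the key attribute of a dimension table could be declared as follows (the index name is illustrative):

CREATE UNIQUE INDEX Product_PK_Idx ON Product (ProductKey)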
We have seen that queries submitted to an OLAP system are of a very
different nature than those of an OLTP system. Therefore, new indexing
strategies are needed for OLAP systems. Some indexing requirements for a
data warehouse system are as follows:
• Symmetric partial match queries: Most OLAP queries involve partial
ranges. An example is the query “Total sales from January 2006 to De-
cember 2010.” As queries can ask for ranges over any dimension, all the
dimensions of the data cube should be symmetrically indexed so that
they can be searched simultaneously.
• Indexing at multiple levels of aggregation: Since summary tables can be
large or queries may ask for particular values of aggregate data, summary
tables must be indexed in the same way as base nonaggregated tables.
• Efficient batch update: As already said, updates are not so critical in
OLAP systems, which allows more columns to be indexed. However, the
refreshing time of a data warehouse must be taken into account when
designing the indexing schema. Indeed, the time needed for reconstructing
the indexes after the refreshing extends the downtime of the warehouse.
• Sparse data: Typically, only 20% of the cells in a data cube are nonempty.
The indexing schema must thus be able to deal efficiently with sparse and
nonsparse data.
Bitmap indexes and join indexes are commonly used in data warehouse
systems to cope with these requirements. We study these indexes next.
Consider the table Product in Fig. 8.10a. For clarity, we assume a simplified
example with only six products. We show next how to build bitmap indexes on its attributes QuantityPerUnit and UnitPrice, shown in Fig. 8.10b,c.
Fig. 8.10 An example of bitmap indexes for a Product dimension table. (a) Product
dimension table; (b) Bitmap index for attribute QuantityPerUnit; (c) Bitmap
index for attribute UnitPrice
Now, assume the query “Products with unit price equal to 75.” A query
processor will just need to know that there is a bitmap index over UnitPrice in
Product, and look for the bit vector with a value of 75. The vector positions
where a ‘1’ is found indicate the positions of the records that satisfy the
query, in this case, the third row in the table.
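Bitmap indexes are not part of standard SQL. In systems that support them, such as Oracle, an index like the one of Fig. 8.10c could be declared roughly as follows (the syntax is given only as an illustrative sketch), after which the above query is answered by scanning a single bit vector:

CREATE BITMAP INDEX Product_UnitPrice_BIdx ON Product (UnitPrice)

SELECT *
FROM Product
WHERE UnitPrice = 75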
For queries involving a search range, the process is a little bit more in-
volved. Consider the query “Products having between 45 and 55 pieces per
unit, and with a unit price between 100 and 200.” To compute this query, we
first look for the index over QuantityPerUnit, and the bit vectors with labels
between 45 and 55. There are two such vectors, with labels 45 and 50. The
products having between 45 and 55 pieces per unit are the ones correspond-
Fig. 8.11 Finding the products having between 45 and 55 pieces per unit and with
a unit price between 100 and 200. (a) OR for QuantityPerUnit; (b) OR for
UnitPrice; (c) AND operation
ing to an OR operation between these vectors. Then we look for the index
over UnitPrice and the bit vectors with labels between 100 and 200. There
are three such vectors, with labels 100, 110, and 120. The products having
unit price between 100 and 200 are, again, the ones corresponding to an OR
operation between these vectors. We obtain the two vectors labeled OR1 and
OR2 in Fig. 8.11a,b, respectively. Finally, an AND between these two vectors,
shown in Fig. 8.11c, gives the rows satisfying both conditions. The result is
that products p4 and p5 satisfy the query.
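In SQL, the range query just evaluated could be written as follows:

SELECT *
FROM Product
WHERE QuantityPerUnit BETWEEN 45 AND 55 AND UnitPrice BETWEEN 100 AND 200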
The operation just described is the main reason for the high performance
achieved by bitmap indexes during query processing. When performing
AND, OR, and NOT operations, the system just performs bitwise comparisons,
and the resulting bit vector is obtained at a very low CPU cost.
The above example suggests that the best opportunities for these in-
dexes are found where the cardinality of the attributes being indexed is
low. Otherwise, we will need to deal with large indexes composed of a
large number of sparse vectors, and the index can become space ineffi-
cient. Continuing with our example, assume that the Product table con-
tains 100,000 rows. A bitmapped index on the attribute UnitPrice will occupy
100,000 × 6/8 bytes = 0.075 MB. A traditional B-tree index would occupy
approximately 100,000 × 4 = 0.4 MB (assume 4 bytes are required to store
a record identifier). It follows that the space required by a bitmapped index
is proportional to the number of entries in the index and to the number of
rows, while the space required by traditional indexes depends strongly on the
number of records to be indexed. OLAP systems typically index attributes
with low cardinality. Therefore, one of the reasons for using bitmap indexes
is that they occupy less space than B+ -tree indexes, as shown above.
There are two main reasons that make bitmap indexes not adequate in
OLTP environments. First, these systems are subject to frequent updates,
which are not efficiently handled by bitmap indexes. Second, in database
systems locking occurs at page level and not at the record level. Thus, con-
currency can be heavily affected if bitmap indexes are used for operational
systems, given that a locked page would lock a large number of index entries.
As we have seen, bitmap indexes are typically sparse: the bit vectors have
a few ‘1’s among many ‘0’s. This characteristic makes them appropriate for
compression. We have also seen that, even without compression, for low-cardinality
attributes bitmap indexes outperform B+ -trees in terms of space. In addition,
bitmap compression allows indexes to support high-cardinality attributes.
The downside of this strategy is the overhead of decompression during query
evaluation. Given the many textbooks on data compression, and the high
number of compression strategies, we next just give the idea of a simple and
popular strategy, called run-length encoding (RLE). Many sophisticated tech-
niques are based on RLE, as we comment in the bibliographic notes section
of this chapter.
Run-length encoding is very popular for compressing black and white
and grayscale images, since it takes advantage of the fact that the value of a bit
is likely to be the same as that of its neighboring bits. There
are many variants of this technique, most of them differing in how they manage
decoding ambiguity. The basic idea is the following: if a bit of value v oc-
curs n consecutive times, replace these occurrences with the number n. This
sequence of bits is called a run of length n.
In the case of bitmap indexes, since the bit vectors have a few ‘1’s among
many ‘0’s, if a bit of value ‘0’ occurs n consecutive times, we replace these
occurrences with the number n. The ‘1’s are written as they come in the vec-
tor. Let us analyze the following sequence of bits: 0000000111000000000011.
We have two runs of lengths 7 and 10, respectively, three ‘1’s in between, and
two ‘1’s at the end. This vector can be trivially represented as the sequence
of integers 7,1,1,1,10,1,1. However, this encoding can be ambiguous since we
may not be able to distinguish if a ‘1’ is an actual bit or the length of a run.
Let us see how we can handle this problem. Let us call j the number of bits
needed to represent n, the length of a run. We can represent the run as a
sequence of j − 1 ‘1’ bits, followed by a ‘0’, followed by n in binary format.
In our example, the first run, 0000000, will be encoded as 110111, where the
first two ‘1’s correspond to the j − 1 part, the ‘0’ marks the end of this prefix,
and the last three bits, 111, represent the number 7 (the length of the run) in binary
format.
Finally, the bitmap vector above is encoded as 110111 111 11101010 11,
where the first and third groups are the encodings of the two runs (of lengths 7
and 10, respectively) and the second and fourth groups are the ‘1’ bits of the
vector, written as they appear. Note that, since we know the length of the vector,
we could get rid of the trailing ‘1’s to save even more space.
It is a well-known fact that join is one of the most expensive database oper-
ations. Join indexes are particularly efficient for join processing in decision-
support queries since they take advantage of the star schema design, where,
as we have seen, the fact table is related to the dimension tables by foreign
keys, and joins are typically performed on these foreign keys.
Fig. 8.12 An example of a join index and a bitmap join index: (a) Product dimension ta-
ble, (b) Sales fact table, (c) Join index, (d) Bitmap join index on attribute
Discontinued
The main idea of join indexes consists in precomputing the join as shown
in Fig. 8.12. Consider the dimension table Product and the fact table Sales
from the Northwind data warehouse. We can expect that many queries re-
quire a join between both tables using the foreign key. Figure 8.12a depicts
table Product, with an additional attribute RowIDProd, and Fig. 8.12b shows
table Sales extended with an additional attribute RowIDSales. Figure 8.12c
shows the corresponding join index, basically a table containing pointers to
the matching rows. This structure can be used to efficiently answer queries
requiring a join between tables Product and Sales.
A particular case of join index is the bitmap join index. Suppose now that a
usual query asks for total sales of discontinued products. In this case, a bitmap
join index can be created on table Sales over the attribute Discontinued, as shown in Fig. 8.12d.
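Bitmap join indexes cannot be expressed in standard SQL either. In Oracle-like syntax (given here only as an illustrative sketch), the index of Fig. 8.12d could be created roughly as follows:

CREATE BITMAP INDEX Sales_Discontinued_BJIdx
    ON Sales (Product.Discontinued)
    FROM Sales, Product
    WHERE Sales.ProductKey = Product.ProductKey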
8.6 Evaluation of Star Queries
Queries over star schemas are called star queries since they make use of
the star schema structure, joining the fact table with the dimension tables.
For example, a typical star query over our simplified Northwind example in
Sect. 8.3 would be “Total sales of discontinued products, by customer name
and product name.” This query reads in SQL:
SELECT C.CustomerName, P.ProductName, SUM(S.SalesAmount)
FROM Sales S, Customer C, Product P
WHERE S.CustomerKey = C.CustomerKey AND
S.ProductKey = P.ProductKey AND P.Discontinued = 'Yes'
GROUP BY C.CustomerName, P.ProductName
We will study now how this query is evaluated by an engine using the indexing
strategies studied above.
An efficient evaluation of our example query would require the definition of
a B+ -tree over the dimension keys CustomerKey and ProductKey, and bitmap
indexes on Discontinued in the Product dimension table and on the foreign
key columns in the fact table Sales. Figure 8.13a,c,d shows the Product and
Customer dimension tables and the Sales fact table, while the bitmap indexes
are depicted in Fig. 8.13b,e,f.
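Under the assumptions of the previous paragraph, these access structures could be declared roughly as follows, using standard CREATE INDEX statements for the B+-trees and Oracle-like syntax for the bitmap indexes (the index names are illustrative):

-- B+-tree indexes over the dimension keys
CREATE INDEX Customer_Key_Idx ON Customer (CustomerKey)
CREATE INDEX Product_Key_Idx ON Product (ProductKey)
-- Bitmap indexes on the dimension attribute and on the fact table foreign keys
CREATE BITMAP INDEX Product_Discontinued_BIdx ON Product (Discontinued)
CREATE BITMAP INDEX Sales_CustomerKey_BIdx ON Sales (CustomerKey)
CREATE BITMAP INDEX Sales_ProductKey_BIdx ON Sales (ProductKey)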
Let us describe how this query is evaluated by an OLAP engine. The
first step consists in obtaining the record numbers of the records that satisfy
the condition over the dimension, that is, Discontinued = 'Yes'. As shown in
the bitmap index (Fig. 8.13b), such records are the ones with ProductKey
values p2, p4, and p6. We then access the bitmap vectors with these labels
in Fig. 8.13f, thus performing a join between Product (Fig. 8.13a) and Sales.
Only the vectors labeled p2 and p4 match the search since there is no fact
record for product p6. The third, fourth, and sixth rows in the fact table are
the answer since they are the only ones with a ‘1’ in the corresponding vectors
in Fig. 8.13f. We then obtain the key values for the CustomerKey (c2 and c3)
Fig. 8.13 An example of evaluation of star queries with bitmap indexes: (a) Product
table, (b) Bitmap for Discontinued, (c) Customer table, (d) Sales fact table,
(e) Bitmap for CustomerKey, (f ) Bitmap for ProductKey, and (g) Bitmap join
index for Discontinued
using the bitmap index in Fig. 8.13e. With these values we search in the B+-tree
indexes over the keys of tables Product and Customer to find the names of
the products and the customers satisfying the query condition. Note that this
performs the join between the dimensions and the fact table. As we can see
in Figs. 8.10a and 8.13c, the records correspond to the names cust2, cust3,
prod2, and prod4, respectively. Finally, the query answer is (cust2, prod2, 200)
and (cust3, prod4, 100).
Note that the last join with Customer would not be needed if the query
had been of the following form:
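(The exact query is not reproduced here; an illustrative form, mentioning only fact-table attributes in the SELECT and GROUP BY clauses, is the following.)

SELECT S.ProductKey, SUM(S.SalesAmount)
FROM Sales S, Product P
WHERE S.ProductKey = P.ProductKey AND P.Discontinued = 'Yes'
GROUP BY S.ProductKey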
The query above only mentions attributes in the fact table Sales. Thus, the
only join that needs to be performed is the one between Product and Sales.
We illustrate now the evaluation of star queries using bitmap join indexes.
We have seen that the main idea is to create a bitmap index over the fact table
on an attribute belonging to a dimension table, which amounts to precomputing
the join between the two tables. Figure 8.13g
shows the bitmap join index between Sales and Product over the attribute
Discontinued. Finding the facts corresponding to sales of discontinued prod-
ucts, as required by the query under study, is now straightforward: we just
need to find the vector labeled ‘Yes’, and look for the bits set to ‘1’. During
query evaluation, this avoids the first step described above for the evaluation
with plain bitmap indexes. This is done at the expense of the cost of an
(off-line) precomputation.
Note that this strategy can reduce dramatically the evaluation cost if in
the SELECT clause there are no dimension attributes, and thus we do not
need to join back with the dimensions using the B+ -tree as explained above.
Thus, the answer for the alternative query above would just require a simple
scan of the Sales table, in the worst case.
8.7 Partitioning
in another partition. In this way, more records can be brought into main
memory with a single operation, reducing the processing time. Column-store
database systems (which are studied in Chap. 15) are based on this technique.
There are several horizontal partitioning strategies in database systems.
Given n partitions, round-robin partitioning assigns the i-th tuple in
insertion order to partition i mod n. This strategy is good for parallel se-
quential access. However, accessing an individual tuple based on a predicate
requires accessing the entire relation. Hash partitioning applies a hash
function to some attribute to assign tuples to partitions. The hash function
should distribute rows among partitions in a uniform fashion, yielding par-
titions of about the same size. This strategy allows exact-match queries on
the distributing attribute to be processed by exactly one partition. Range
partitioning assigns tuples to partitions based on ranges of values of some
attribute. The temporal dimension is a natural candidate for range parti-
tioning. For example, when partitioning a table by a date column, a January
2021 partition will only contain rows from that month. This strategy can
deal with non-uniform data distributions and can support both exact-match
queries and range queries. List partitioning assigns tuples to partitions by
specifying a list of values of some attribute. In this way, data can be organized
in an ad hoc fashion. Finally, composite partitioning combines the basic
data distribution methods above. For example, a table can be range parti-
tioned, and each partition can be further subdivided using hash partitioning.
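As an illustration of composite partitioning, the following sketch uses PostgreSQL syntax (which also appears later in this chapter) to range partition a fact table by date and to further hash partition one of the range partitions by customer; the table and column names are only illustrative:

CREATE TABLE SalesFacts (
    CustomerKey INT, OrderDate DATE, SalesAmount DECIMAL(10,2)
) PARTITION BY RANGE (OrderDate);
CREATE TABLE SalesFacts2021 PARTITION OF SalesFacts
    FOR VALUES FROM ('2021-01-01') TO ('2022-01-01')
    PARTITION BY HASH (CustomerKey);
CREATE TABLE SalesFacts2021_h0 PARTITION OF SalesFacts2021
    FOR VALUES WITH (MODULUS 2, REMAINDER 0);
CREATE TABLE SalesFacts2021_h1 PARTITION OF SalesFacts2021
    FOR VALUES WITH (MODULUS 2, REMAINDER 1);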
Partitioning database tables into smaller ones improves query perfor-
mance. This is especially true when parallel processing is applied (as we
will see in Sect. 8.8) or when a table is distributed across different servers
(as we will see in Sect. 15.3). There are two classic techniques for improving
query performance using partitioning. Partition pruning enables a smaller
subset of the data to be accessed when queries refer to data located in only
some of the partitions. For example, a Sales fact table in a warehouse can
be partitioned by month. A query requesting orders for a single month only
needs to access the partition corresponding to that month instead of access-
ing all the table, greatly reducing query response time. The performance of
joins can also be enhanced by using partitioning. This occurs when the two
tables to be joined are partitioned on the join attributes or, in the case of
foreign-key joins, when the referenced table is partitioned on its primary key.
In these cases, a large join is broken into smaller joins that occur between
each of the partitions, producing significant performance gains, which can be
improved by taking advantage of parallel execution.
Partitioning also facilitates administrative tasks, since tables and indexes
are partitioned into smaller, more manageable pieces. In this way, mainte-
nance operations can be performed on the partitions. For example, a data-
base administrator may back up just a single partition of a table instead of
the whole table. In the case of indexes, partitioning is advised, for example,
in order to perform maintenance on parts of the data without invalidating
the entire index. In addition, partitioned database tables and indexes in-
duce high data availability. For example, if some partitions of a table become
unavailable, it is possible that most of the other partitions of the table re-
main online and available, in particular if partitions are allocated to
different devices. In this way, applications can continue to execute queries
and transactions that do not need to access the unavailable partitions. Even
during normal operation, since each partition can be stored in a separate ta-
blespace, backup and recovery operations can be performed over individual
partitions, independently of each other. Thus, the active parts of the data-
base can be made available sooner than in the case of an unpartitioned table.
that the join between R and S is equal to the union of the joins between
Ri and Si . Each join between Ri and Si is performed in parallel. Of course,
the algorithms above are the basis for actual implementations which exploit
main memory and multicore processors.
We next explain parallel processing in a representative system, PostgreSQL.
PostgreSQL achieves interquery parallelism by serving multiple connections
concurrently, typically in a multiuser setting. It also supports intraquery
parallelism: a single query can be executed by several cooperating worker
processes, for example by scanning different partitions of a table in parallel,
which is essential for OLAP queries and takes advantage of multicore
processors.
Fig. 8.14 Parallel query processing: a Gather node coordinates the worker processes and collects the tuples they produce
A range partition holds values of the partition key within a range. Both
minimum and maximum values of the range need to be specified, where the
minimum value is inclusive and the maximum value is exclusive. Below we
show how three partitions on salary ranges are created.
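These statements presuppose a parent Employee table declared as partitioned by range on the salary column; a minimal sketch of such a declaration (the column list is an assumption, not given here) is:

CREATE TABLE Employee (
    EmployeeKey INT,
    EmployeeName VARCHAR(50),
    Salary NUMERIC(10,2)
) PARTITION BY RANGE (Salary);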
CREATE TABLE EmployeeSalaryLow PARTITION OF Employee
FOR VALUES FROM (MINVALUE) TO (50000);
CREATE TABLE EmployeeSalaryMiddle PARTITION OF Employee
FOR VALUES FROM (50000) TO (100000);
CREATE TABLE EmployeeSalaryHigh PARTITION OF Employee
FOR VALUES FROM (100000) TO (MAXVALUE);
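The plan shown next corresponds to a parallel aggregation over the partitioned table; a hypothetical query of this kind (the exact statement used to obtain the plan is not given here) is:

SELECT AVG(Salary) FROM Employee;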
The query plan devised by the query optimizer is as follows, according to the
architecture shown in Fig. 8.14.
Finalize Aggregate
Gather
Workers Planned: 4
Partial Aggregate
Parallel Append
Parallel Seq Scan on EmployeeSalaryHigh
Parallel Seq Scan on EmployeeSalaryMiddle
Parallel Seq Scan on EmployeeSalaryLow
The optimizer first defines how many Workers the query requires. In the
example above, we have eight cores available, although only four are needed,
since more may cause a decrease in performance. The Parallel Append causes
the Workers to spread out across the partitions. For example, one of the
partitions may end up being scanned by two Workers, and the other two
partitions are attended by one Worker each. When the engine has finished
with a given partition, the Worker(s) allocated to it are spread out across
the remaining partitions, so that the total number of Workers per partition
is kept as even as possible.
8.9 Physical Design in SQL Server and Analysis Services
In this section, we discuss how the theoretical concepts studied in this chapter
are applied in Microsoft SQL Server and Analysis Services. We start with the
study of how materialized views are supported in these tools. We then introduce a novel kind
of index provided by SQL Server called column-store index. Then, we study
partitioning, followed by a description of how the three types of multidimen-
sional data representation introduced in Chap. 5, namely, ROLAP, MOLAP,
and HOLAP, are implemented in Analysis Services.
thus precomputing and materializing such view. We have seen that this is a
mandatory optimization technique in data warehouse environments.
When we create an indexed view, it is essential to verify that the view and
the base tables satisfy the many conditions required by the tool. For exam-
ple, the definition of an indexed view must be deterministic, meaning that all
expressions in the SELECT, WHERE, and GROUP BY clauses are determin-
istic. For instance, the DATEADD function is deterministic because it always
returns the same result for any given set of argument values for its three pa-
rameters. On the contrary, GETDATE is not deterministic because it is always
invoked with the same argument, but the value it returns changes each time
it is executed. Also, indexed views must be created with the SCHEMABINDING
option. This indicates that the base tables cannot be modified in a way
that would affect the view definition. For example, the following statement
creates an indexed view computing the total sales by employee over the Sales
fact table in the Northwind data warehouse.
CREATE VIEW EmployeeSales WITH SCHEMABINDING AS (
SELECT EmployeeKey, SUM(UnitPrice * OrderQty * Discount)
AS TotalAmount, COUNT_BIG(*) AS SalesCount
FROM dbo.Sales
GROUP BY EmployeeKey )
CREATE UNIQUE CLUSTERED INDEX CI_EmployeeSales ON
EmployeeSales (EmployeeKey)
An indexed view can be used in two ways: when a query explicitly ref-
erences the indexed view and when the view is not referenced in a query
but the query optimizer determines that the view can be used to generate a
lower-cost query plan.
In the first case, when a query refers to a view, the definition of the view
is expanded until it refers only to base tables. This process is called view
expansion. If we do not want this to happen, we can use the NOEXPAND
hint, which forces the query optimizer to treat the view like an ordinary
table with a clustered index, preventing view expansion. The syntax is as
follows:
SELECT EmployeeKey, EmployeeName, ...
FROM Employee, EmployeeSales WITH (NOEXPAND)
WHERE ...
In the second case, when the view is not referenced in a query, the query
optimizer determines when an indexed view can be used in a given query
execution. Thus, existing applications can benefit from newly created indexed
views without changing those applications. Several conditions are checked to
determine if an indexed view can cover the entire query or a part of it, for
example, (1) the tables in the FROM clause of the query must be a superset
of the tables in the FROM clause of the indexed view; (2) the join conditions
in the query must be a superset of the join conditions in the view; and (3)
the aggregate columns in the query must be derivable from a subset of the
aggregate columns in the view.
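For instance, under these conditions the optimizer could, in principle, use the EmployeeSales view defined above to answer a query such as the following, even though the view is not referenced in it:

SELECT EmployeeKey, SUM(UnitPrice * OrderQty * Discount) AS TotalAmount
FROM dbo.Sales
GROUP BY EmployeeKey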
If a partitioned table is created in SQL Server and indexed views are built
on this table, SQL Server automatically partitions the indexed view by using
the same partition scheme as the table. An indexed view built in this way
is called a partition-aligned indexed view. The main feature of such a view
is that the database query processor automatically maintains it when a new
partition of the table is created, without the need of dropping and recreating
the view. This improves the manageability of indexed views.
We show next how we can create a partition-aligned indexed view on the
Sales fact table of the Northwind data warehouse. To facilitate maintenance,
and for efficiency reasons, we decide to partition this fact table by year. This
is done as follows.
To create a partition scheme, we need first to define the partition function.
We want to define a scheme that partitions the table by year, from 2016
through 2018. The partition function is called PartByYear and takes as input
an attribute of integer type, which represents the values of the surrogate keys
for the Date dimension.
CREATE PARTITION FUNCTION [PartByYear] (INT)
AS RANGE LEFT FOR VALUES (184, 549, 730);
Here, 184, 549, and 730 are, respectively, the surrogate keys representing the
dates 31/12/2016, 31/12/2017, and 31/12/2018. These dates are the bound-
aries of the partition intervals. RANGE LEFT means that each boundary value
belongs to the partition on its left: records with values less than or equal to 184
belong to the first partition, those greater than 184 and less than or equal to 549
to the second, those greater than 549 and less than or equal to 730 to the third,
and those greater than 730 to the fourth partition.
Once the partition function has been defined, the partition scheme is cre-
ated as follows:
CREATE PARTITION SCHEME [SalesPartScheme]
AS PARTITION [PartByYear] ALL to ( [PRIMARY] );
Here, PRIMARY means that the partitions will be stored in the primary
filegroup, that is, the filegroup that contains the startup information of the
database; one or more user-defined filegroup names could be used instead.
ALL indicates that all partitions will be stored in the single filegroup specified,
in this case the primary one.
The Sales fact table is created as a partitioned table as follows:
CREATE TABLE Sales (CustomerKey INT, EmployeeKey INT,
OrderDateKey INT, ...) ON SalesPartScheme(OrderDateKey)
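The sentence that follows refers to the clustered index of an indexed view built over this partitioned table. A plausible sketch of such a view and of its clustered index, created on the same partition scheme (the names and the aggregated measure are illustrative assumptions), is:

CREATE VIEW SalesByDateEmployee WITH SCHEMABINDING AS
SELECT OrderDateKey, EmployeeKey, SUM(SalesAmount) AS TotalAmount,
COUNT_BIG(*) AS SalesCount
FROM dbo.Sales
GROUP BY OrderDateKey, EmployeeKey

CREATE UNIQUE CLUSTERED INDEX CI_SalesByDateEmployee
ON SalesByDateEmployee (OrderDateKey, EmployeeKey)
ON SalesPartScheme (OrderDateKey)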
Since the clustered index was created using the same partition scheme, this
is a partition-aligned indexed view.
ROLAP Storage
In the ROLAP storage mode, the aggregations of a partition are stored
in indexed views in the relational database specified as the data source of
the partition. The indexed views in the data source are accessed to answer
queries. In the ROLAP storage, the query response time and the processing
time are generally slower than with the MOLAP or HOLAP storage modes
(see below). However, ROLAP enables users to view data in real time and can
save storage space when working with large data sets that are infrequently
queried, such as purely historical data.
If a ROLAP partition has its source data stored in SQL Server, Analysis
Services tries to create indexed views to store aggregations. When these views
cannot be created, aggregation tables are not created. Indexed views for ag-
gregations can be created if several conditions hold in the ROLAP partition
and the tables in it. The more relevant ones are as follows:
• The partition cannot contain measures that use the MIN or MAX aggre-
gate functions.
• Each table in the ROLAP partition must be used only once.
• All table names in the partition schema must be qualified with the owner
name, for example, [dbo].[Customer].
• All tables in the partition schema must have the same owner.
• The source columns of the partition measures must not be nullable.
MOLAP Storage
In the MOLAP storage mode, both the aggregations and a copy of the source
data are stored in a multidimensional structure. Such structures are highly
optimized to maximize query performance. Since a copy of the source data
resides in the multidimensional structure, queries can be processed without
accessing the source data of the partition.
Note however that data in a MOLAP partition reflect the most recently
processed state of a partition. Thus, when source data are updated, objects
in the MOLAP storage must be reprocessed to include the changes and make
them available to users. Changes can be processed from scratch or, if possi-
ble, incrementally, as explained in Sect. 8.2. This update can be performed
without taking the partition or cube offline. However, if structural changes to
OLAP objects are performed, the cube must be taken offline. In these cases,
it is recommended to update and process cubes on a staging server.
HOLAP Storage
The HOLAP storage mode combines features of MOLAP and ROLAP
modes. Like MOLAP, in HOLAP the aggregations of the partition are stored
in a multidimensional data structure. However, as in ROLAP, HOLAP does
not store a copy of the source data. Thus, if queries only access summary
data of a partition, HOLAP works as efficiently as MOLAP. Queries that
need to access unaggregated source data must retrieve them from the relational
database and therefore will not be as fast as if the source data were stored in a
MOLAP structure. However, this can be mitigated if the query can use cached
data, that is, data stored in main memory rather than on disk.
In summary, partitions stored as HOLAP are smaller than the equivalent
MOLAP partitions, since they do not contain source data, and they answer
queries involving summary data faster than ROLAP partitions. Thus, this
mode tries to capture the best of both worlds.
Figure 8.15 shows a unique initial partition for the Sales measure group in a
data cube created from the Northwind cube. As we can see, this is a MOLAP
partition. Assume now that, for efficiency reasons, we want to partition this
measure group by year. Since in the Northwind data warehouse we have data
from 2016, 2017, and 2018, we will create one partition for each year. We
decide that the first two will be MOLAP partitions and the last one a
ROLAP partition. To define the limits for the partitions, the Analysis Services
cube wizard creates an SQL query template, shown in Fig. 8.16, which must
be completed in the WHERE clause with the key range corresponding to each
partition. To obtain the first and last keys for each period, a query such as
the following one must be addressed to the data warehouse.
SELECT MIN(DateKey), MAX(DateKey)
FROM Date
WHERE Date >= '2017-01-01' AND Date <= '2017-12-31'
The values obtained from this query can then be entered in the wizard for
defining the partition for 2017.
Figure 8.17 shows the three final partitions of the measure group Sales.
Note that in that figure the second partition is highlighted. Figure 8.18 shows
the properties of that partition, in particular the ROLAP storage mode. This
dialog box can also be used to change the storage mode.
the intermediate measure group. Also, if possible, the size of the intermediate
fact table underlying the intermediate measure group must be reduced.
Aggregations are also used by Analysis Services to enhance query perfor-
mance. Thus, the most efficient aggregations for the query workload must
be selected to reduce the number of records that the storage engine needs
to scan on disk to evaluate a query. When designing aggregations, we must
evaluate the benefits that aggregations provide when querying, against the
time it takes to create and refresh such aggregations. Moreover, unnecessary
aggregations can worsen query performance. A typical example is the case
when a summary table matches an unusual query. This can cause the sum-
mary table to be moved into the cache to be accessed faster. Since this table
will rarely be used afterwards, it may evict a more useful table from the
cache (which has a limited size), with an obvious negative effect on query performance.
Summarizing, we must avoid designing a large number of aggregations since
they may reduce query performance.
The Analysis Services aggregation design algorithm does not automatically
consider every attribute for aggregation. Consequently, we must check the
attributes that are considered for aggregation and determine if we need to
suggest additional aggregation candidates, for example, because we detected
that most user queries not resolved from cache are resolved by partition reads
rather than aggregation reads. Analysis Services uses the Aggregation Usage
property to determine which attributes it should consider for aggregation.
This property can take one of four values: full (every aggregation for the
cube must include this attribute), none (no aggregation uses the attribute),
unrestricted (the aggregation design algorithm considers the attribute without
restrictions), and default (a default rule is applied to decide whether the attribute
should be considered). The administrator can set this property to influence
whether a given attribute is considered for aggregation.
As we have explained, partitions enable Analysis Services to access less
data to answer a query when it cannot be answered from the data cache
or from aggregations. Data should be partitioned so that partitions match common queries.
Analogously to the case of measure groups, we must avoid partitioning in
a way that requires most queries to be resolved from many partitions. It is
recommended by the vendor that partitions contain at least 2 million and at
most 20 million records. Also, each measure group should contain less than
2,000 partitions. A separate ROLAP partition must be selected for real-time
data and this partition must have its own measure group.
We can also optimize performance by writing efficient MDX queries and
expressions. For this, run-time checks in an MDX calculation must be avoided.
For example, using CASE and IF functions that must be repeatedly evaluated
during query resolution will result in slow execution. In such cases, it is
recommended to rewrite the calculations using the SCOPE statement. If possible,
Non_Empty_Behavior should be used to enable the query execution engine to
use the bulk evaluation mode. In addition, EXISTS rather than filtering on
member properties should be used, since this also enables the bulk evaluation mode.
Too many subqueries should be avoided if possible. Also, when possible, a set
should be filtered before using it in a cross join, in order to reduce the cube space
before performing such a cross join.
The cache of the query engine must be used efficiently. First, the server
must have enough memory to store query results in memory for reuse in
subsequent queries. We must also define calculations in MDX scripts because
these have a global scope that enables the cache related to these queries to be
shared across sessions with the same security permissions. Finally, the cache
must be warmed by executing a set of predefined queries using any tool.
Other techniques are similar to the ones used for tuning relational data-
bases, like tuning memory and processor usage. For details, we refer the reader
to the Analysis Services documentation.
8.11 Summary
is, how and when a view can be updated without recomputing it from scratch.
In addition, we presented algorithms that compute efficiently the data cube
when all possible views are materialized. Also, we showed that when full
materialization is not possible, we can estimate which is the best set to be
chosen for materialization given a set of constraints. We then studied two
typical indexing schemes used in data warehousing, namely, bitmap and join
indexes, and how they are used in query evaluation. Finally, we discussed
partitioning techniques and strategies, aimed at enhancing data warehouse
performance and management. The last three sections of the chapter were
devoted to studying physical design and query performance in SQL Server and
Analysis Services, showing how the theoretical concepts studied in the first part
of the chapter are applied in real-world tools.
A general book about physical database design is [140], while physical design
for SQL Server is covered, for instance, in [56, 227]. Most of the topics stud-
ied in this chapter have been presented in classic data warehousing papers.
Incremental view maintenance has been studied in [95, 96]. The summary
table algorithm is due to Mumick et al. [159]. The PipeSort algorithm, as
well as other data cube computation techniques, is discussed in detail in [3].
The view selection algorithm was proposed in a classic paper by Harinarayan
et al. [100]. Bitmap indexes were first introduced in [173] and bitmap join
indexes in [172]. A study of the joint usage of indexing, partitioning, and view
materialization in data warehouses is reported in [25]. A book on indexing
structures for data warehouses is [123]. A study on index selection for data
warehouses can be found in [85], while [223] surveys bitmap indices for data
warehouses. A popular bitmap compression technique, based on run-length
encoding, is WAH (Word Align Hybrid) [266]. The PLWAH (Position List
Word Align Hybrid) bitmap compression technique [223] was proposed as a
variation of the WAH scheme, and it is reported to be more efficient than the
former, particularly in terms of storage. Rizzi and Saltarelli [198] compare
view materialization against indexing for data warehouse design. Finally, a
survey of view selection methods is [146].
8.1 What is the objective of physical data warehouse design? Specify dif-
ferent techniques that are used to achieve such objective.
8.2 Discuss advantages and disadvantages of using materialized views.
8.3 What is view maintenance? What is incremental view maintenance?
8.14 Exercises
computed from the full outer join of tables Employee and Orders, where Name
is obtained by concatenating FirstName and LastName.
Define the view EmpOrders in SQL. Show an example of instances for
the relations and the corresponding view. By means of examples, show how
the view EmpOrders must be modified upon insertions and deletions in table
Employee. Give the SQL command to compute the delta relation of the view
from the delta relations of table Employee. Write an algorithm to update the
view from the delta relation.
Exercise 8.5. By means of examples, explain the propagate and refresh algo-
rithm for the aggregate functions AVG, MIN, and COUNT. For each aggregate
function, write the SQL command that creates the summary-delta table from
the tables containing the inserted and deleted tuples in the fact table, and
write the algorithm that refreshes the view from the summary-delta table.
Exercise 8.7. Consider the graph in Fig. 8.19, where each node represents a
view and the numbers are the costs of materializing the view. Assuming that
the bottom of the lattice is materialized, determine using the View Selection
Algorithm the five views to be materialized first.
Fig. 8.19 Dependency lattice for Ex. 8.7, where the cost of each view is indicated beside its node (All 1; A 20, B 40, C 60, D 80; the intermediate views; ABCD 10,000)
Exercise 8.12. Given the Sales table below and the Employee table from
Ex. 8.11, show how a join index on attribute EmployeeKey would look like.
RowIDSales ProductKey CustomerKey EmployeeKey DateKey SalesAmount
1 p1 c1 e1 d1 100
2 p1 c2 e3 d1 100
3 p2 c2 e4 d2 100
4 p2 c3 e5 d2 100
5 p3 c3 e1 d3 100
6 p4 c4 e2 d4 100
7 p5 c4 e2 d5 100
Exercise 8.13. Given the Department table below and the Employee table
from Ex. 8.11, show how a bitmap join index on attribute DeptKey would
look like.
Exercise 8.14. Consider the tables Sales in Ex. 8.12, Employee in Ex. 8.11,
and Department in Ex. 8.13.
a. Propose an indexing scheme for the tables, including any kind of index
you consider necessary. Discuss possible alternatives according to sev-
eral query scenarios. Discuss the advantages and disadvantages of creating
the indexes.
b. Consider the query:
SELECT E.EmployeeName, SUM(S.SalesAmount)
FROM Sales S, Employee E, Department D
WHERE S.EmployeeKey = E.EmployeeKey AND
E.DepartmentKey = D.DepartmentKey AND
( D.Location = 'Brussels' OR D.Location = 'Paris' )
GROUP BY E.EmployeeName
Explain a possible query plan that uses the indexes defined in (a).
Chapter 9
Extraction, Transformation, and Loading
Fig. 9.1 Activities: (a) Single task; (b) Collapsed and expanded subprocess
Fig. 9.2 (a) Different types of gateways; (b) Splitting and merging gateways
these conditions must be written; this is left to the modeler. Gateways are
represented by diamond shapes. BPMN defines several types of gateways,
shown in Fig. 9.2a, which are distinguished by the symbol used inside the di-
amond shape. All these types can be splitting or merging gateways, as shown
in Fig. 9.2b, depending on the number of ingoing and outgoing branches. An
exclusive gateway models an exclusive OR decision, that is, depending on
a condition, the gateway activates exactly one of its outgoing branches. It
can be represented as an empty diamond shape or a diamond shape with an
‘X’ inside. A default flow (see below) can be defined as one of the outgoing
flows, if no other condition is true. An inclusive gateway triggers or merges
one or more flows. In a splitting inclusive gateway, any combination of out-
going flows can be triggered. However, a default flow cannot be included in
such a combination. In a merging inclusive gateway, any combination can be
chosen to continue the flow. A parallel gateway allows the synchronization
between outgoing and incoming flows. A splitting parallel gateway is analo-
gous to an AND operation: the incoming flow triggers one or more outgoing
parallel flows. On the other hand, a merging parallel gateway synchronizes
the flow merging all the incoming flows into a single outgoing one. Finally,
complex gateways can represent complex conditions. For example, a merging
complex gateway can model that when three out of five flows are completed,
the process can continue without waiting for the completion of the other two.
Fig. 9.4 A canceled activity (which sends a message) and a compensated activity (which corrects an error)
Events (see Fig. 9.3) represent something that happens that affects the
sequence and timing of the workflow activities. Events may be internal or
external to the task into consideration. There are three types of events, which
can be distinguished depending on whether they are drawn with a single, a
double, or a thick line. Start and end events indicate the beginning and ending
of a process, respectively. Intermediate events include time, message, cancel,
and terminate events. Time events can be used to represent situations when a
task must wait for some period of time before continuing. Message events can
be used to represent communication, for example, to send an e-mail indicating
that an error has occurred. They can also be used for triggering a task, for
example, a message may indicate that an activity can start. Cancel events
listen to the errors in a process and notify them either by an explicit action
like sending a message, as in the canceled activity shown in Fig. 9.4, or by
an implicit action to be defined in the next steps of the process development.
Compensation events can be employed to recover from errors by launching specific
compensation activities, which are linked to the compensation event with
the association connecting object (Fig. 9.5), as shown in Fig. 9.4. Finally,
terminate events stop the entire process, including all parallel processes.
Connecting objects are used to represent how objects are connected.
There are three types of connecting objects, as illustrated in Fig. 9.5.
A sequence flow represents a sequencing constraint between flow objects.
It is the basic connecting object in a workflow. It states that if two activities
are linked by a sequence flow, the target activity will start only when the
source one has finished. If multiple sequence flows leave an activity, all
of them will be activated after its execution. In case there is a need to control
a sequence flow, it is possible to add a condition to the sequence flow by using
the conditional sequence flow. A sequence flow may be set as the default
flow in case of many outgoing flows. For example, as explained above, in an
exclusive or an inclusive gateway, if no other condition is true, then the default
flow is followed. Note that sequence flows can replace splitting and merging
gateways. For example, an exclusive gateway splitting into two paths could
be replaced by two conditional flows, provided the conditions are mutually
exclusive. Inclusive gateways could be replaced by conditional flows, even
when the former constraint does not apply.
A message flow represents the sending and receiving of messages between
organizational boundaries (i.e., pools, explained below). A message flow is the
only connecting object able to get through the boundary of a pool and may
also connect to a flow object within that pool.
An association relates artifacts (like annotations) to flow objects (like
activities). We give examples below. An association can indicate direction-
ality using an open arrowhead, for example, when linking the compensation
activity in case of error handling.
Fig. 9.6 A looping activity and a looping subprocess
A loop (see Fig. 9.6) represents the repeated execution of a process for
as long as the underlying looping condition is true. This condition must be
evaluated for every loop iteration and may be evaluated at the beginning or
at the end of the iteration. In the example of Fig. 9.7, we represent a task
of an ETL process that connects to a server. At a high abstraction level,
the subprocess activity hides the details. It has the loop symbol attached (a
curved arrow), indicating that the subprocess is executed repeatedly until
an ending condition is reached. When we expand the subprocess, we can see
what happens within it: the server waits for 3 minutes (this waiting task
is represented by the time event). If the connection is not established, the
Fig. 9.7 A looping subprocess Connect to Server (Establish Connection, condition Connection OK?, a 3-minute wait between attempts, a 15-minute timeout, and a Send error e-mail task)
request for the connection is launched again. After 15 minutes (another time
event), if the connection has not been established, the task is stopped and an error
e-mail is sent (a message event).
Fig. 9.8 Swimlanes in the Northwind ETL process: lanes Server 1 and Server 2 within the pool of data warehouse servers, and a Currency Server pool providing an Exchange Rate web service
Fig. 9.9 Annotation of a Lookup task, indicating a gateway condition (Found?) and the lookup metadata (Retrieve: CountryKey; Database: NorthwindDW; Table: Country; Where: Country; Matches: CountryName)
Artifacts are used to add information to a diagram. There are three types
of artifacts. A data object represents either data that are input to a process,
data resulting from a process, data that needs to be collected, or data that
needs to be stored. A group organizes tasks or processes that have some kind
of significance in the overall model. A group does not affect the flow in the
diagram. Annotations are used to add extra information to flow objects.
For example, an annotation for an activity in an ETL process can indicate
a gateway condition or the attributes involved in a lookup task, as shown in
Fig. 9.9. Annotations may be associated to both activities and subprocesses
in order to describe their semantics.
ferred through the arrows. Given the discussion above, designing ETL pro-
cesses using business process modeling tools appears natural. We present next
the conceptual model for ETL processes based on BPMN.
Control tasks represent the orchestration of an ETL process, independently
of the data flowing through such process. Such tasks are represented by means
of the constructs described in Sect. 9.1. For example, gateways are used to
control the sequence of activities in an ETL process. The most used types of
gateways in an ETL context are exclusive and parallel gateways. Events are
another type of objects often used in control tasks. For instance, a cancelation
event can be used to represent the situation when an error occurs and may
be followed by a message event that sends an e-mail to notify the failure.
Swimlanes can be used to organize ETL processes according to several
strategies, namely, by technical architecture (such as servers to which tasks
are assigned), by business entities (such as departments or branches), or by
user profiles (such as manager, analyst, or designer) that give special access
rights to users. For example, Fig. 9.8 illustrates the use of swimlanes for the
Northwind ETL process (we will explain in detail this process later in this
chapter). The figure shows some of the subprocesses that load the Product di-
mension table, the Date dimension table, and the Sales fact table (represented
as compound activities with subprocesses); it also assumes their distribution
between Server 1 and Server 2. Each one of these servers is considered as a
lane contained inside the pool of data warehouse servers. We can also see
that a swimlane called Currency Server contains a web service that receives
an input currency (like US dollars), an amount, and an output currency (like
euros) and returns the amount equivalent in the output currency. This could
be used in the loading of the Sales fact table. Thus, message flows are ex-
changed between the Sales Load activity and the Exchange Rate task, which
is performed by the web service. These messages go across both swimlanes.
Data tasks represent activities typically carried out to manipulate data,
such as input data, output data, and data transformation. Since such data
manipulation operations occur within an activity, data tasks can be consid-
ered as being at a lower abstraction level than control tasks. Recall that
arrows in a data task represent not only a precedence relationship between
its activities but also the flow of data records between them.
Data tasks can be classified into row and rowset operations. Row opera-
tions apply transformations to the data on a row-by-row basis. In contrast,
rowset operations deal with a set of rows. For example, updating the value
of a column is a row operation, while aggregation is a rowset operation. Data
tasks can also be classified (orthogonally to the previous classification) into
unary or n-ary data tasks, depending on the number of input flows.
Figure 9.10 shows examples of unary row operations: Input Data, Insert
Data, Add Columns, and Convert Columns. Note the annotations linked to the
tasks by means of association flows. The annotations contain metadata that
specify the parameters of the task. For example, in Fig. 9.10a, the annotation
tells that the data is read from an Excel file called Date.xls. Similarly, the
Fig. 9.10 Unary row operations: (a) Input Data; (b) Insert Data; (c) Add Columns; (d) Convert Columns
annotation in Fig. 9.10b tells that the task inserts tuples in the table Category
of the NorthwindDW database, where column CategoryKey is obtained from
the CategoryID in the flow. Further, new records will be appended to the table.
The task in Fig. 9.10c adds a column named SalesAmount to the flow whose
value is computed from the expression given. Here, it is supposed that the
values of the columns appearing in the expression are taken from the current
record. Finally, Fig. 9.10d converts the columns Date and DayNbWeek (e.g.,
read from an Excel file as strings) into a Date and a Smallint, respectively.
Fig. 9.11 Rowset operations: (a) Aggregate (unary); (b) Join (binary); (c) Union (n-
ary)
Figure 9.11 shows three rowset operations: Aggregate (unary), Join (bi-
nary), and Union (n-ary). These operations receive a set of rows to process
altogether, rather than operating row by row. Again, annotations comple-
ment the diagram information. For example, in the case of the Union task,
the annotation tells the name of the input and output columns and informs
if duplicates must be kept. Note that the case of the union is a particular
one: if duplicates are retained, then it becomes a row operation since it can
be done row by row. If duplicates are eliminated, then it becomes a rowset
operation because sorting is involved in the operation.
A common data task in an ETL process is the lookup, which checks if
some value is present in a file, based on a single or compound search key.
Fig. 9.12 A lookup task, with Found (Y) and NotFound (N) output flows
9.3 Conceptual Design of the Northwind ETL Process
In this section, using the concepts explained in the previous ones, we present
a conceptual model of the ETL process that loads the Northwind data ware-
house from the operational database and other sources. Later, in Sect. 9.5, we
show how this model can be implemented in Microsoft Integration Services.
The operational data reside in a relational database, whose logical schema
is shown in Fig. 2.4. These data must be mapped to a data warehouse, whose
schema is given in Fig. 5.4. In addition to the operational database, some
other files are needed for loading the data warehouse. We next describe these
files, as well as the requirements of the process.
First, an Excel file called Date.xls contains the data needed for loading the
Date dimension table. The time interval of this file covers the dates contained
in the table Orders of the Northwind operational database.
We can see in Fig. 5.4 that in the Northwind data warehouse the dimen-
sion tables Customer, Supplier, and Employee share the geographic hierarchy
starting at the City level. Data for the hierarchy State → Country → Continent
are loaded from an XML file called Territories.xml that begins as shown in
Fig. 9.13a. A graphical representation of the schema of the XML file is shown
in Fig. 9.13b, where rectangles represent elements and rounded rectangles
represent attributes. The cardinalities of the relationships are also indicated.
Insert Data: inserts tuples into a database corresponding to records in the flow
   Database       Name of the database
   Table          Name of the table
   [Columns]      List of Col or Col = Expr, where column Col in the database either takes the
                  values of the same column of the flow or the values of the expression Expr
   [Options]      Either Empty or Append, depending on whether the table is emptied before
                  inserting the new tuples; the latter is the default

Intersection: computes the intersection of two flows
   Input*         List of column names from the two input flows. Input* is used if the column
                  names are the same for both flows; otherwise Input1 and Input2 are used,
                  each flow defining the column names
   Output         List of column names of the output flow

Join: computes the join of two flows
   Condition      List of Col1 op Col2, where Col1 belongs to the first input flow, Col2 to the
                  second flow, and op is a comparison operator
   [Join Type]    Either Inner Join, Left Outer Join, Right Outer Join, or Full Outer Join; the
                  first one is the default

Lookup: adds columns to the flow obtained by looking up data from a database
   Retrieve       List of column names added to the output flow
   Database       Database name
   Table | Query  Name of the table or SQL query
   Where          List of column names from the input flow
   Matches        List of attribute names from the lookup table or SQL query

Lookup: replaces column values of the flow with values obtained by looking up data from a
database
   Replace        List of column names from the input flow whose values are replaced in the
                  output flow
   Database       Database name
   Table | Query  Name of the table or SQL query
   Where          List of column names from the input flow
   Matches        List of attribute names from the lookup table or description of an SQL query

Multicast: produces several output flows from an input flow
   Input          List of column names from the input flow
   Output*        List of column names of each output flow. Output* is used if the column
                  names are the same for all flows; otherwise Output1, ..., Outputn are used,
                  each flow defining the column names

Distinct: removes duplicate records from the flow
   (None)         This task does not have any annotation

Rename Columns: changes the name of columns of the flow
   Columns        List of Col -> NewCol, where column Col from the input flow is renamed
                  NewCol in the output flow

Sort: sorts the records of the flow
   Columns        List of column names from the input flow, where for each of them either ASC
                  or DESC is specified, the former being the default
Notice that type is an attribute of State that contains, for example, the value
state for Austria. However, for Belgium it contains the value province (not
shown in the figure). Notice also that EnglishStateName, RegionName, and
RegionCode are optional, as indicated by the cardinality 0..1.
It is worth noting that the attribute Region of tables Customers and Sup-
pliers in the Northwind database contains in fact a state or province name
(e.g., Québec), or a state code (e.g., CA). Similarly, the attribute Country
contains a country name (e.g., Canada) or a country code (e.g., USA). To
identify to which state or province a city belongs, a file called Cities.txt is
used. The file contains three fields separated by tabs and begins as shown
in Fig. 9.14a, where the first line contains the field names. For cities
located in countries that do not have states, as is the case for Singapore, a
null value is given in the second field. The above file is also used to identify
to which state the city in the attribute TerritoryDescription of table Territo-
ries in the Northwind database corresponds. A temporary table in the data
warehouse, called TempCities, will be used for storing the contents of this file.
The structure of the table is given in Fig. 9.14b.
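Although the figure gives the structure of TempCities only graphically, a possible relational definition is sketched below (the attribute names follow the text; the lengths and nullability of the columns are assumptions made for illustration):

-- Temporary table holding the contents of the file Cities.txt
CREATE TABLE TempCities (
  City    VARCHAR(40) NOT NULL,
  State   VARCHAR(40),          -- null for countries without states, e.g., Singapore
  Country VARCHAR(40) NOT NULL
);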
Fig. 9.13 (a) Beginning of the file Territories.xml; (b) XML Schema of the file
It is worth noting that the keys of the operational database are reused in
the data warehouse as surrogate keys for all dimensions except for dimension
Customer. In this dimension, the key of the operational database is kept in
the attribute CustomerID, while a new surrogate key is generated during the
ETL process.
Fig. 9.14 (a) Beginning of the file Cities.txt; and (b) Associated table TempCities
In addition, for the Sales table in the Northwind data warehouse, the
following transformations are needed:
• The attribute OrderLineNo must be generated in ascending order of Pro-
ductID (in the operational database there is no order line number).
• The attribute SalesAmount must be calculated taking into account the
unit price, the discount, and the quantity.
• The attribute Freight, which in the operational database is related to the
whole order, must be evenly distributed among the lines of the order.
Figure 9.15 provides a general overview of the whole ETL process. The
figure shows the control tasks needed to perform the transformation from
the operational database and the additional files presented above, and the
loading of the transformed data into the data warehouse. We can see that the
process starts with a start event, followed by activities (with subprocesses)
that can be performed in parallel (represented by a splitting parallel gateway)
which populate the dimension hierarchies. Finally, a parallel merging gate-
way synchronizes the flow, meaning that the loading of the Sales fact table
(activity Sales Load) can only start when all other tasks have been completed.
If the process fails, a cancelation event is triggered and an error message in
the form of an e-mail is dispatched.
Figure 9.16 shows the task that loads the Category table in the data ware-
house. It is composed of an input data task and an insertion data task. The
former reads the table Categories from the operational database. The latter
loads the table Category in the data warehouse, where the CategoryID at-
tribute in the Categories table is mapped to the CategoryKey attribute in the
Category table. Similarly, Fig. 9.17 shows the task that loads the Date table
from an Excel file. It is similar to the previously explained task, but includes
a conversion of columns, which defines the data types of the attributes of
the target table Date in the data warehouse, and the addition of a column
DateKey initialized with null values so the database can generate surrogate
keys for this attribute. We do not show the task that loads the TempCities
table, shown in Fig. 9.14b, since it is similar to the one that loads the Cate-
gories table just described, except that the data is input from a file instead
of a database.
Fig. 9.15 Overall view of the conceptual ETL process for the Northwind data warehouse
The control task that loads the tables composing the hierarchy State →
Country → Continent is depicted in Fig. 9.18a. As can be seen, this requires
a sequence of data tasks. Figure 9.18b shows the data task that loads the
Continent table. It reads the data from the XML file using the following
XPath expression:
<Continents>/<Continent>/<ContinentName>
Then, a new column is added to the flow in order to be able to generate the
surrogate key for the table in the data warehouse.
Figure 9.18c shows the task that loads the Country table. It reads the data
from the XML file using the following XPath expressions:
<Continents>/<Continent>/<Country>/*
<Continents>/<Continent>/<ContinentName>
In this case, we need to read from the XML file not only the attributes of
Country, but also the ContinentName to which a country belongs. For exam-
ple, when reading the Country element corresponding to Austria we must also
obtain the corresponding value of the element ContinentName, that is, Eu-
rope. Thus, the flow is now composed of the attributes CountryName, Coun-
tryCode, CountryCapital, Population, Subdivision, State, and ContinentName
(see Fig. 9.13b). The ContinentName value is then used in a lookup task for
obtaining the corresponding ContinentKey from the data warehouse. Finally,
the data in the flow is loaded into the warehouse. We do not show the task
that loads the State table since it is similar to the one that loads the Country
table just described.
The process that loads the City table is depicted in Fig. 9.19. The first
task is an input data task over the table TempCities. Note that the final goal
is to populate a table with a state key and a country key, one of which is
null depending on whether the country is divided into states or not. Thus,
the first exclusive gateway tests whether State is null or not (recall that this
is the optional attribute). In the first case, a lookup obtains the CountryKey.
In the second case, we must match (State, Country) pairs in TempCities to
values in the State and Country tables. However, as we have explained, states
and countries can come in many forms, thus we need three lookup tasks, as
shown in the annotations in Fig. 9.19. The three lookups are as follows:
• The first lookup processes records where State and Country correspond,
respectively, to StateName and CountryName. An example is state Loire
and country France.
Fig. 9.18 Load of the tables for the State → Country → Continent hierarchy. (a) Asso-
ciated control task; (b) Load of the Continent table; (c) Load of the Country
table
• The second lookup processes records where State and Country correspond,
respectively, to EnglishStateName and CountryName. An example is state
Lower Saxony, whose German name is Niedersachsen, together with coun-
try Germany.
• Finally, the third lookup processes records where State and Country corre-
spond, respectively, to StateName and CountryCode. An example is state
Florida and country USA.
The SQL query associated to these lookups is as follows:
SELECT S.*, CountryName, CountryCode
FROM State S JOIN Country C ON S.CountryKey = C.CountryKey
Finally, a union is performed with the results of the four flows, and the table
is populated with an insert data task. Note that in the City table, if a state
was not found in the initial lookup (Input1 in Fig. 9.19), the attribute State
will be null; on the other hand, if a state was found it means that the city will
have an associated state; therefore, the Country attribute will be null (Input2,
Input3, and Input4 in the figure). Records for which the state and/or country
are not found are stored in the BadCities.txt file.
Figure 9.20 shows the conceptual ETL process for loading the Customer
table in the data warehouse. The input table Customers is read from the
operational database using an input data task. Recall that the Region at-
tribute in this table corresponds actually to a state name or a state code.
Since this attribute is optional, the first exclusive gateway checks whether
this attribute is null or not. If Region is null, a lookup checks if the cor-
responding (City, Country) pair matches a pair in TempCities, and retrieves
the State attribute from the latter, creating a new column. Since the value
State just obtained may be null for countries without states, another exclu-
sive gateway tests whether State is null, in which case a lookup obtains the
CityKey by matching values of City and Country in a lookup table defined by
the following SQL query:
SELECT CityKey, CityName, CountryName
FROM City C JOIN Country T ON C.CountryKey = T.CountryKey
Then, we send the obtained records to a union task in order to load them in
the data warehouse.
Fig. 9.20 Conceptual ETL process for loading the Customer dimension table
Two additional cases are needed with respect to the City Load task:
• The fourth lookup processes records where State and Country correspond,
respectively, to StateCode and CountryName. An example is state BC and
country Canada.
• The fifth lookup processes records where State and Country correspond,
respectively, to StateCode and CountryCode. An example is state AK and
country USA.
Finally, we perform the union of all flows, add the column CustomerKey for
the surrogate key initialized to null, and write to the target table by means
of an insert data task. We omit the description of the ETL process that loads
the Supplier table, since it is similar to the one that loads the Customer table
just described.
Fig. 9.21 Conceptual ETL process for loading the Territories bridge table
Figure 9.21 depicts the process for loading the Territories bridge table. The
input data are obtained with the following SQL query:
SELECT E.*, TerritoryDescription
FROM EmployeeTerritories E JOIN Territories T ON E.TerritoryID = T.TerritoryID
Then, an update column task is applied to remove the leading spaces (with
operation trim) from the attribute TerritoryDescription. The city key is then
obtained with a lookup over the table City in the data warehouse, which adds
the attribute CityKey to the flow. The data flow continues with a task that
removes duplicates in the assignment of employees to cities. Indeed, in the
Northwind operational database New York appears twice in the Territories
table with different identifiers, and employee number 5 is assigned to both
of these versions of New York in the EmployeeTerritories table. Finally, after
removing duplicates, we populate the Territories table with an insert data
task, where the column EmployeeID in the flow is associated with the attribute
EmployeeKey in the data warehouse.
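The essence of this flow can be sketched in SQL as follows. This is only an illustration: the qualifiers Northwind and NorthwindDW are assumed here so that the operational tables and the warehouse tables are reachable from a single connection, and error handling is omitted.

-- Assign employees to the cities of their territories; DISTINCT removes the
-- duplicate (employee, city) pairs caused by the two New York territories
INSERT INTO NorthwindDW.Territories (EmployeeKey, CityKey)
SELECT DISTINCT ET.EmployeeID, C.CityKey
FROM Northwind.EmployeeTerritories ET
     JOIN Northwind.Territories T ON ET.TerritoryID = T.TerritoryID
     JOIN NorthwindDW.City C ON C.CityName = TRIM(T.TerritoryDescription);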
Fig. 9.22 Conceptual ETL process for loading the Sales fact table
Figure 9.22 shows the conceptual ETL process for loading the Sales fact
table. This task is performed once all the other tasks loading the dimension
tables have been done. The process starts with an input data task that obtains
data from the operational database by means of the SQL query below:
SELECT O.CustomerID, EmployeeID AS EmployeeKey, O.OrderDate,
O.RequiredDate AS DueDate, O.ShippedDate,
ShipVia AS ShipperKey, P.ProductID AS ProductKey,
P.SupplierID AS SupplierKey, O.OrderID AS OrderNo,
ROW_NUMBER() OVER (PARTITION BY D.OrderID
ORDER BY D.ProductID) AS OrderLineNo,
D.UnitPrice, Quantity, Discount,
D.UnitPrice * (1-Discount) * Quantity AS SalesAmount,
O.Freight/COUNT(*) OVER (PARTITION BY D.OrderID) AS Freight
FROM Orders O, OrderDetails D, Products P
WHERE O.OrderID = D.OrderID AND D.ProductID = P.ProductID
A sequence of lookups follows, which obtains the missing foreign keys for the
dimension tables. Finally, the fact table is loaded with the data retrieved.
9.5 The Northwind ETL Process in Integration Services

Fig. 9.24 Load of the Category (a) and the Date (b) dimension tables
The addition of the surrogate key column is implicit in the Date Load task. We
further explain this next.
As explained in Sect. 9.3, in some tables, the keys of the operational data-
base are reused as surrogate keys in the data warehouse, while in other tables
a surrogate key must be generated in the data warehouse. Therefore, the map-
ping of columns in the OLE DB Destination tasks should be done in one of
the ways shown in Fig. 9.25. For example, for table Category (Fig. 9.25a)
we reuse the key in the operational database (CategoryID) as key in the data
warehouse (CategoryKey). On the other hand, for table Customer (Fig. 9.25b),
the CustomerID key in the operational database is kept in the CustomerID col-
umn in the data warehouse, and a new value for CustomerKey is generated
during the insertion in the data warehouse.
Fig. 9.25 Mappings of the source and destination columns, depending on whether the
key in the operational database is reused in the data warehouse
Figure 9.26 shows the data flow used for loading the table TempCities from
the text file Cities.txt. A data conversion transformation is needed to convert
the default types obtained from the text file into the database types.
Figure 9.27a shows the data flow that loads the hierarchy composed of
the Continent, Country, and State tables. This is the Integration Services
equivalent to the conceptual control flow defined in Fig. 9.18a. A Sequence
Container is used for the three data flows that load the tables of the hierarchy.
Since Continent is the highest level in the hierarchy, we first need to produce
a key for it, so it can be later referenced from the Country level. The data
flow for loading the table Continent is given in Fig. 9.27b. With respect to the
conceptual model given in Fig. 9.18b, a data conversion is needed to convert
the data types from the XML file into the data types of the database. For
example, the ContinentName read from the XML file is by default of length
255 and it must be converted into a string of length 20. Finally, the Continent
table is loaded, and a ContinentKey is automatically generated.
The data flow that loads the table Country is given in Fig. 9.27c. With
respect to the conceptual model given in Fig. 9.18c, a merge join transforma-
tion is needed to obtain for a given Country the corresponding ContinentName.
A data conversion transformation is needed to convert the data types from
the XML file into the data types of the database. Then, a lookup transformation
obtains the corresponding ContinentKey from the Continent table previously loaded
in the data warehouse, and the Country table is finally loaded.
Fig. 9.27 (a) Load of the tables for the Continent → Country → State hierarchy; (b)
Load of the Continent dimension table; (c) Load of the Country dimension
table
Figure 9.28 shows the data flow for loading the City table. It corresponds
to the conceptual model in Fig. 9.19. As in the conceptual design, the data flow
needs to associate to each city either a state key or a country key, one of which
will be null.
The task that loads the Customer table is shown in Fig. 9.29, while its
conceptual specification is given in Fig. 9.20. It starts with a conditional
split since some customers have a null value in Region (which actually holds
state values). In this case, a lookup adds a column State by matching City
and Country from Customers with City and Country from TempCities. Since the
value State just obtained may be null for countries without states, another
conditional split (called Conditional Split 1) is needed. If State is null, then
a lookup obtains the CityKey by matching the City and Country values, as in the
conceptual design.
On the other hand, for customers that have a nonnull Region, the values of
this column are copied into a new column State. Analogously to the loading
of City, we must perform five lookup tasks in order to retrieve the city key.
Since this process is analogous to the one in the conceptual design given in
Sect. 9.3, we do not repeat the details here. Finally, we perform the union
of all flows and load the data into the warehouse. A similar data flow task is
used for loading the Supplier dimension table.
The data flow that loads the Territories bridge table is shown in Fig. 9.30.
This data flow is similar to the conceptual design of Fig. 9.21. It starts with an
OLE DB Source task consisting of an SQL query that joins the EmployeeTer-
ritories and the Territories table as in Sect. 9.3. It continues with a derived
column transformation that removes the trailing spaces in the values of Ter-
ritoryDescription. Then, a lookup transformation searches the corresponding
values of CityKey in City. The data flow continues with a sort transforma-
tion that removes duplicates in the assignment of employees to territories,
as described in Sect. 9.3. Finally, the data flow finishes by loading the data
warehouse table.
Finally, the data flow that loads the Sales table is shown in Fig. 9.31. The
first OLE DB Source task includes an SQL query that combines data from
the operational database and the data warehouse, as follows:
SELECT ( SELECT CustomerKey FROM dbo.Customer C
WHERE C.CustomerID = O.CustomerID) AS CustomerKey,
EmployeeID AS EmployeeKey,
( SELECT DateKey FROM dbo.Date D
WHERE D.Date = O.OrderDate) AS OrderDateKey,
( SELECT DateKey FROM dbo.Date D
WHERE D.Date = O.RequiredDate) AS DueDateKey,
( SELECT DateKey FROM dbo.Date D
WHERE D.Date = O.ShippedDate) AS ShippedDateKey,
ShipVia AS ShipperKey, P.ProductID AS ProductKey,
SupplierID AS SupplierKey, O.OrderID AS OrderNo,
CONVERT(INT, ROW_NUMBER() OVER (PARTITION BY D.OrderID
ORDER BY D.ProductID)) AS OrderLineNo,
D.UnitPrice, Quantity, Discount,
CONVERT(MONEY, D.UnitPrice * (1-Discount) * Quantity) AS SalesAmount,
CONVERT(MONEY, O.Freight/COUNT(*) OVER
(PARTITION BY D.OrderID)) AS Freight
FROM Northwind.dbo.Orders O, Northwind.dbo.OrderDetails D,
Northwind.dbo.Products P
WHERE O.OrderID = D.OrderID AND D.ProductID = P.ProductID
The above query obtains data from both the Northwind operational database
and the Northwind data warehouse in a single query. This is possible to do
in Integration Services but not in other platforms such as PostgreSQL.
Thus, the above query performs the lookups of surrogate keys from the data
warehouse in the inner queries of the SELECT clause. However, if these sur-
rogate keys are not found, null values are returned in the result. Therefore,
a conditional split transformation task selects the records obtained in the
previous query with a null value in the lookup columns and stores them in a
flat file. The correct records are loaded in the data warehouse.
Compare the above query with the corresponding one in the conceptual
specification in Fig. 9.22. While the above query implements all the necessary
lookups, in the conceptual design we have chosen to implement the lookups in
individual tasks, which conveys the information in a clearer way. Therefore, the
conceptual design not only communicates the process steps in a more intuitive
way but also gives us the flexibility to choose the implementation that best fits
the application needs.
9.6 Implementing ETL Processes in SQL

The overall ETL process is divided into several procedures that correspond
to the tasks in the BPMN specification.
The loading of the State → Country → Continent hierarchy in SQL (cor-
responding to Fig. 9.18) is given next.
DROP FUNCTION IF EXISTS ContinentCountryStateLoad();
CREATE FUNCTION ContinentCountryStateLoad()
RETURNS VOID LANGUAGE plpgsql AS $$
BEGIN
-- Input XML file
DROP TABLE IF EXISTS TerritoriesInput;
CREATE TEMPORARY TABLE TerritoriesInput AS
SELECT xml_import('D:/InputFiles/Territories.xml') AS data;
-- Load Continent
INSERT INTO Continent(ContinentName)
SELECT unnest(xpath('/Continents/Continent/ContinentName/text()', data)::text[])
FROM TerritoriesInput;
-- Load Country
INSERT INTO Country(CountryName, CountryCode, CountryCapital, Population,
Subdivision, ContinentKey)
WITH CountriesInput AS (
SELECT xmltable.*
FROM TerritoriesInput, XMLTABLE(
'//Continents/Continent/Country' PASSING data COLUMNS
CountryName varchar(40) PATH 'CountryName',
CountryCode varchar(3) PATH 'CountryCode',
CountryCapital varchar(40) PATH 'CountryCapital',
Population integer PATH 'Population',
Subdivision text PATH 'Subdivision',
ContinentName varchar(20) PATH '../ContinentName') )
-- Main query: lookup of the ContinentKey with a nested SELECT
SELECT CountryName, CountryCode, CountryCapital, Population, Subdivision,
    ( SELECT ContinentKey FROM Continent C
      WHERE C.ContinentName = I.ContinentName )
FROM CountriesInput I;
-- The State table is loaded similarly to the Country table (not reproduced here)
END;
$$;
We start by calling the procedure xml_import, which parses the XML file into
the temporary table TerritoriesInput. We then import the Continent, Country,
and State elements from the XML file into the corresponding tables. For the
Continent table, the path expression returns an array of text values, which is
converted into a set of rows with the function unnest in order to be inserted
into the table. Recall that the primary key ContinentKey is generated auto-
matically. Loading the Country table requires all attributes of Country and its
associated ContinentName to be input into the temporary table CountriesIn-
put. The main query then performs a lookup of the ContinentKey with the
nested SELECT clause. Loading the State table is done similarly as for the
Country table.
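For instance, a table declaration along the following lines lets the database generate the key automatically upon insertion (a minimal sketch assuming PostgreSQL identity columns; the length of ContinentName follows the conversion mentioned in Sect. 9.5):

-- Possible declaration of Continent with an automatically generated surrogate key
CREATE TABLE Continent (
  ContinentKey  integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  ContinentName varchar(20) NOT NULL
);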
The loading of the City table (corresponding to Fig. 9.19) is given next.
DROP FUNCTION IF EXISTS CityLoad();
CREATE FUNCTION CityLoad()
RETURNS VOID AS $$
BEGIN
-- Drop temporary tables
DROP TABLE IF EXISTS StateCountry;
-- Insert cities without state
INSERT INTO City (CityName, CountryKey)
SELECT City, CountryKey
FROM TempCities, Country
WHERE State IS NULL AND Country = CountryName;
-- Insert non-matching tuples into error table
INSERT INTO BadCities (City, State, Country)
Recall that the Region column in the Customers table corresponds to one of
the columns StateName, EnglishStateName, or StateCode in the State table.
Also, the Country column in the Customers table corresponds to one of the
columns CountryName or CountryCode in the Country table. We begin by reading
customer data from the operational database and looking up missing State
values from the TempCities table. We continue by loading the tuples without
State. A natural join of City and Country is stored in table CC, which is used
as a lookup table for finding the CityKey of these tuples; the tuples are then
inserted into the Customer table. Recall that CustomerKey is a surrogate key in table
Customer while the CustomerID column is also kept. Then, we load the tuples
with State. A natural join of City, State, and Country is stored in table CSC.
The next INSERT performs, in a single lookup for the CityKey, the five lookups
of Fig. 9.20. Each disjunct in the join condition tries to match the values of
the City, State, and Country columns in CustomerInput to their corresponding
values in columns of CSC. Finally, we insert in the BadCustomers table the
tuples from CustomerInput for which a match was not found.
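For illustration, this single INSERT can be sketched as follows, where the column lists of CustomerInput, CSC, and Customer are assumptions made only to show the shape of the five-way disjunction:

-- Each disjunct matches one combination of state and country representations
INSERT INTO Customer (CustomerID, CompanyName, Address, PostalCode, CityKey)
SELECT I.CustomerID, I.CompanyName, I.Address, I.PostalCode, CSC.CityKey
FROM CustomerInput I JOIN CSC ON I.City = CSC.CityName AND (
     (I.State = CSC.StateName        AND I.Country = CSC.CountryName) OR
     (I.State = CSC.EnglishStateName AND I.Country = CSC.CountryName) OR
     (I.State = CSC.StateName        AND I.Country = CSC.CountryCode) OR
     (I.State = CSC.StateCode        AND I.Country = CSC.CountryName) OR
     (I.State = CSC.StateCode        AND I.Country = CSC.CountryCode) )
WHERE I.State IS NOT NULL;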
Finally, the loading of the Sales fact table (see Fig. 9.22) is shown next.
Recall that the keys of the operational database are also keys in the data
warehouse for all dimensions except for Customer and Date, whose keys are,
respectively, CustomerKey and DateKey. We start with a query to the tables
Orders, OrderDetails, and Products of the operational database. Then, we
perform lookups to retrieve the CustomerKey, OrderDateKey, DueDateKey,
and ShippedDateKey columns as well as compute new columns SalesAmount,
OrderLineNo, and FreightPP. Finally, the tuples with all required columns are
inserted into the Sales fact table after a renaming operation and the erroneous
ones are inserted into the BadSales table.
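The essence of this load can be sketched with a single INSERT statement, as shown below. This is a simplification that assumes the operational tables are reachable from the warehouse database; it omits the renaming step and the diversion of tuples with null keys to the BadSales table.

INSERT INTO Sales (CustomerKey, EmployeeKey, OrderDateKey, DueDateKey, ShippedDateKey,
    ShipperKey, ProductKey, SupplierKey, OrderNo, OrderLineNo,
    UnitPrice, Quantity, Discount, SalesAmount, Freight)
SELECT C.CustomerKey, O.EmployeeID, OD.DateKey, DD.DateKey, SD.DateKey,
       O.ShipVia, D.ProductID, P.SupplierID, O.OrderID,
       ROW_NUMBER() OVER (PARTITION BY D.OrderID ORDER BY D.ProductID),
       D.UnitPrice, D.Quantity, D.Discount,
       D.UnitPrice * (1 - D.Discount) * D.Quantity,
       O.Freight / COUNT(*) OVER (PARTITION BY D.OrderID)
FROM Orders O
     JOIN OrderDetails D ON O.OrderID = D.OrderID
     JOIN Products P ON D.ProductID = P.ProductID
     JOIN Customer C ON C.CustomerID = O.CustomerID
     -- The three roles of the Date dimension; a missing date yields a null key
     LEFT JOIN Date OD ON OD.Date = O.OrderDate
     LEFT JOIN Date DD ON DD.Date = O.RequiredDate
     LEFT JOIN Date SD ON SD.Date = O.ShippedDate;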
9.7 Summary
9.8 Bibliographic Notes

A classic reference for ETL is the book by Kimball and Caserta [128]. Various
approaches for designing, optimizing, and automating ETL processes have
been proposed in the last few years. A survey of ETL technology can be
found in [253]. Simitsis et al. [254] represent ETL processes as a graph where
nodes correspond to transformations, constraints, attributes, and data stores.
9.10 Exercises
Exercise 9.1. Design the conceptual ETL schema for loading the Product
dimension of the Northwind data warehouse.
Exercise 9.2. In the Northwind data warehouse, suppose that surrogate
keys CategoryKey and ProductKey are added, respectively, to the Category
and the Product tables, while the operational keys are kept in attributes
CategoryID and ProductID. Modify the conceptual ETL schema obtained in
Ex. 9.1 to take into account this situation.
Exercise 9.3. Modify the conceptual ETL schema obtained in Ex. 9.2 to
take into account a refresh scenario in which products obtained from the
operational database may be already in the Product dimension in the data
warehouse. Use a type 1 solution for the slowly changing dimension by which
the attribute values obtained from the operational database are updated in
the data warehouse.
Exercise 9.4. Modify the conceptual ETL schema obtained in Ex. 9.3 by
using a type 2 solution for the slowly changing dimension Product by which
for the products that have changed the value of the To attribute for the
current record is set to the current date and a new record is inserted in the
dimension with a null value in that attribute.
Exercise 9.5. Design the conceptual ETL schema for loading the Supplier
dimension of the Northwind data warehouse.
Exercise 9.6. Implement in Integration Services the ETL processes defined
in Exs. 9.1 and 9.5.
Exercise 9.7. Implement in SQL the ETL processes defined in Exs. 9.1
and 9.5.
Exercise 9.8. Implement in SQL the ETL process loading the Territories
bridge table of the Northwind data warehouse, whose conceptual schema is
given in Fig. 9.21.
Exercise 9.9. Implement in SQL the conceptual schemas of Exs. 9.1 and 9.5.
Exercise 9.10. Given the operational database of the French horse race ap-
plication obtained in Ex. 2.1 and the associated data warehouse obtained in
Ex. 4.5, design the conceptual ETL schema that loads the data warehouse
from the operational database.
Exercise 9.11. Given the operational database of the Formula One appli-
cation obtained in Ex. 2.2 and the associated data warehouse obtained in
Ex. 4.7, design the conceptual ETL schema that loads the data warehouse
from the operational database.
Exercise 9.12. Given the source database in Ex. 5.11 and the schema of
your solution, implement in Integration Services the ETL schema that loads
the AirCarrier data warehouse from the sources.
Chapter 10
A Method for Data Warehouse Design
A wide variety of approaches have been proposed for designing data ware-
houses. They differ in several aspects, such as whether they target data ware-
houses or data marts, the various phases that make up the design process, and
the methods used for performing requirements specification and data ware-
house design. This section highlights some of the essential characteristics of
the current approaches according to these aspects.
A data warehouse includes data about an entire organization that help
users at high management levels to take strategic decisions. However, these
decisions may also be taken at lower organizational levels related to specific
business areas, in which case only a subset of the data contained in a data
warehouse is required. This subset is typically contained in a data mart (see
Sect. 3.4), which has a similar structure to a data warehouse but is smaller
in size. Data marts can be physically collocated with the data warehouse or
they can have their own separate platform.
Like in operational databases (see Sect. 2.1), there are two major methods
for the design of a data warehouse and its related data marts:
• Top-down design: The requirements of users at different organizational
levels are merged before the design process begins, and one schema for
the entire data warehouse is built. Then, separate data marts are tailored
according to the characteristics of each business area or process.
• Bottom-up design: A separate schema is built for each data mart, tak-
ing into account the requirements of the decision-making users responsi-
ble for the corresponding specific business area or process. Later, these
schemas are merged in a global schema for the entire data warehouse.
The choice between the top-down and the bottom-up approach depends
on many factors, such as the professional skills of the development team, the
size of the data warehouse, the users’ motivation for having a data ware-
house, and the financial support, among other things. The development of an
enterprise-wide data warehouse using the top-down approach may be over-
whelming for many organizations in terms of cost and duration. It is also a
challenging activity for designers because of its size and complexity. On the
other hand, the smaller size of data marts allows the return of the investment
to be obtained in a shorter time period and facilitates the development pro-
cesses. Further, if the user motivation is low, the bottom-up approach may
deliver a data mart faster and at less cost, allowing users to quickly interact
with OLAP tools and create new reports; this may lead to an increase in
users’ acceptance level and improve the motivation for having a data ware-
house. Nevertheless, the development of these data marts requires a global
data warehouse framework to be established so that the data marts are built
considering their future integration into a whole data warehouse. Without
this global framework, such integration will be difficult and costly in the long term.
There is no consensus on the phases that should be followed for data ware-
house design. Some authors consider that the traditional phases of developing
operational databases described in Chap. 2, that is, requirements specifica-
tion, conceptual design, logical design, and physical design, can also be used
in developing data warehouses. Other authors ignore some of these phases, es-
pecially the conceptual design phase. Several approaches for data warehouse
design have been proposed based on whether the business goals, the source
systems, or a combination of these are used as the driving force. We next
present these approaches, which we study in detail in the next sections.
The business-driven approach requires the identification of key users
that can provide useful input about the organizational goals. Users play a fun-
damental role during requirements analysis and must be actively involved in
the process of discovering relevant facts and dimensions. Users from different
levels of the organization must be selected. Then, various techniques, such as
interviews or facilitated sessions, are used to specify the information require-
ments. Consequently, the specification obtained will include the requirements
of users at all organizational levels, aligned with the overall business goals.
This is also called analysis- or requirements-driven approach.
In the data-driven approach, the data warehouse schema is obtained by
analyzing the underlying source systems. Some of the proposed techniques
require conceptual representations of the operational source systems, most of
them based on the entity-relationship model, which we studied in Chap. 2.
Other techniques use a relational schema to represent the source systems.
These schemas should be normalized, to facilitate the extraction of facts,
measures, dimensions, and hierarchies. In general, the participation of users
is only required to confirm the correctness of the derived structures, or to
identify some facts and measures as a starting point for the design of multi-
dimensional schemas. After creating an initial schema, users can specify their
information requirements.
The business/data-driven approach is a combination of the business-
and data-driven approaches, which takes into account both the business needs
of the users and what the source systems can provide. Ideally, these
two components should match, that is, all information that the users require
for business purposes should be supplied by the data included in the source
systems. This approach is also called top-down/bottom-up analysis.
These approaches, originally proposed for the requirements specification
phase, are adapted to the other data warehouse design phases in the method
that we explain in the next section.
10.2 General Overview of the Method

We next describe a general method for data warehouse design that encom-
passes various existing approaches from both research and practitioners. The
method is based on the assumption that data warehouses are a particular
type of databases dedicated to analytical purposes. Therefore, their design
should follow the traditional database design phases, as shown in Fig. 10.1.
Nevertheless, there are significant differences between the design phases for
databases and data warehouses, which stem from their different nature, as
explained in Chap. 3. Note that although the various phases in Fig. 10.1
are depicted consecutively, actually there are multiple interactions between
them, especially if an iterative development process is adopted in which the
system is developed in incremental versions with increased functionality.
The phases in Fig. 10.1 may be applied to define either the overall data
warehouse schema or the schemas of the individual data marts. From now
on, we shall use the term “data warehouse” to mean that the concepts that
we are discussing apply also to data marts if not stated otherwise.
For all the phases in Fig. 10.1, the specification of business and techni-
cal metadata is in continuous development. These include information about
the data warehouse and data source schemas, and the ETL processes. For
example, the metadata for a data warehouse schema may provide informa-
tion such as aliases used for various elements, abbreviations, currencies for
monetary attributes or measures, and metric systems. The elements of the
source systems should also be documented similarly. This could be a difficult
task if conceptual schemas for these systems do not exist. The metadata for
the ETL processes should consider several elements, such as the frequency of
data refreshment. Data in a fact table may be required on a daily or monthly
basis, or after some specific event (e.g., after finishing a project). Therefore,
users should specify a data refreshment strategy that corresponds to their
business needs.
To illustrate the proposed method, we will use a hypothetical scenario
concerning the design of the Northwind data warehouse we have been using
as example throughout this book. We assume that the company wants to
analyze its sales along dimensions like customers, products, geography, and
so on in order to optimize the marketing strategy, for example, detecting
customers that potentially could increase their orders, or sales regions that
are underperforming. To be able to conduct the analytical process, Northwind
decided to implement a data warehouse system.
10.3 Requirements Specification

Not much attention has been paid to the requirements analysis phase in
data warehouse development, and many projects skip this phase; instead,
they concentrate on technical issues, such as database modeling or query
performance. As a consequence, many data warehouse projects fail to meet
user needs and do not deliver the expected support for decision making.
Requirements specification determines which data should be available and
how these data should be organized. In this phase, the queries of interest
for the users are also determined. This phase should lead the designer to
discover the essential elements of a multidimensional schema, like the facts
and their associated dimensions, which are required to facilitate future data
manipulation and calculations. We will see that requirements specification
for decision support and operational systems differ significantly from each
other. The requirements specification phase establishes a foundation for all
future activities in data warehouse development; in addition, it has a major
impact on the success of data warehouse projects since it directly affects the
technical aspects, as well as the data warehouse structures and applications.
We present next a general framework for the requirements specification
phase. Although we separate the phases of requirements specification and
conceptual design for readability purposes, these phases often overlap. In
many cases, as soon as initial requirements have been documented, an initial
conceptual schema starts to be sketched. As the requirements become more
complete, so does the conceptual schema. For each one of the three approaches
above, we first give a general description and then explain in more detail the
various steps; finally, we apply each approach to the Northwind case study.
We do not indicate the various iterations that may occur between steps. Our
purpose is to provide a general framework to which details can be added and
that can be tailored to the particularities of a specific data warehouse project.
In the business-driven approach, the business needs of the users are the driving
force for developing the conceptual schema. These requirements express
the organizational goals and needs that the data warehouse is expected to
address to support the decision-making process.
The steps in the business-driven approach to requirements specification
are shown in Fig. 10.2 and described next.
Identify Users
Users at various hierarchical levels in the organization should be considered
when analyzing requirements. Executive users at the top organizational level
typically require global, summarized information. They help in understanding
high-level objectives and goals, and the overall business vision. Management
Fig. 10.2 Steps in the business-driven approach to requirements specification
Operationalize Goals Once the goals have been defined and prioritized,
we need to make them concrete. Thus, for each goal identified in the previous
step, a collection of representative queries must be defined through interviews
with the users. These queries capture functional requirements, which de-
fine the operations and activities that a system must be able to perform.
Each user is requested to provide, in natural language, a list of queries
needed for her daily task. Initially, the vocabulary can be unrestricted. How-
ever, certain terms may have different meanings for different users. The an-
alyst must identify and disambiguate them. For example, a term like “the
best customer” should be expressed as “the customer with the highest total
sales amount.” A document is then produced, where for each goal there is a
collection of queries, and each query is associated with a user.
The process continues with query analysis and integration. Here, users
review and consolidate the queries in the document above, to avoid mis-
understandings or redundancies. The frequency of the queries must also be
estimated. Finally, a prioritization process is carried out. Since we work with
different areas of the organization, we must unify all requirements from these
areas and define priorities between them. A possible priority hierarchy is
areas → users → queries of the same user. Intuitively, the idea is that the
requirement with the lowest priority in an area prevails over the requirement
with the highest priority in the area that immediately follows it in importance.
Obviously, other criteria could also be used. This
is a cyclic process, which results in a final document containing consistent,
nonredundant queries.
In addition to the above, nonfunctional requirements should also be
elicited and specified. These are criteria that can be used to judge the oper-
ation of a system rather than specific behavior. Thus, a list of nonfunctional
requirements may be associated to each query, for example, required response
time and accuracy.
Define Facts, Measures, and Dimensions In this step, the analyst tries
to identify the underlying facts and dimensions from the queries above. This
is typically a manual process. For example, in the documentation of this
step, we can find a query “Name of top five customers with monthly average
sales higher than $1,500.” This query includes the following data elements:
customer name, month, and sales. We should also include information about
which data elements will be aggregated and the functions that must be used.
If possible, this step should also specify the granularities required for the
measures, and information about whether they are additive, semiadditive, or
nonadditive (see Sect. 3.1).
all elements required by the designers and also a dictionary of the terminol-
ogy, organizational structure, policies, and constraints of the business, among
other things. For example, the document could express in business terms what
the candidate measures or dimensions actually represent, who has access to
them, and what operations can be done. Note that this document will not be
final since additional interactions could be necessary during the conceptual
design phase in order to refine or clarify some aspects.
Table 10.1 Multidimensional elements for the analysis scenarios of the Northwind case
study obtained using the business-driven approach
Table 10.1 shows not only the dimensions but also candidate hierarchies,
inferred from the queries above and from the company documentation. For
example, in dimension Employee, we can see that there are two candidate hi-
erarchies: Supervision and Territories. The former can be inferred, among other
sources of information, from Requirement 4c, which suggests that users are
interested in analyzing together employees and their supervisors as a sales
force. The Territories hierarchy is derived from the documentation of the com-
pany processes, which state that employees are assigned to a given number of
cities and a city may have many employees assigned to it. In addition, the users
indicated that they are interested in analyzing total sales along a geographic
dimension. Note that following the previous description, the hierarchy will be
nonstrict. Requirements 1a–c suggest that customers are organized geograph-
ically and that this organization is relevant for analysis. Thus, Geography is a
candidate hierarchy to be associated with customers. The same occurs with
suppliers. The hierarchy Categories follows straightforwardly from Require-
ment 3a. The remaining hierarchies are obtained analogously.
Finally, the last step is Document Requirements Specification. The
information compiled is included in the specification of the user requirements.
For example, it can contain summarized information as presented in Ta-
ble 10.1 and also more descriptive parts that explain each element. The re-
quirements specification document also contains the business metadata. For
the Northwind case study, there are various ways to obtain these metadata,
for example, by interviewing users or administrative staff, or accessing the
existing company documentation. We do not detail this document here.
The data-driven approach is based on the data available at the source sys-
tems. It aims at identifying all multidimensional schemas that can be imple-
mented starting from the available operational databases. These databases
are analyzed exhaustively in order to discover the elements that can repre-
sent facts with associated dimensions, hierarchies, and measures leading to
an initial data warehouse schema.
It is important not only to identify the data sources but also to assess
their quality. Moreover, it is often the case where the same data are available
from more than one source. Reliability, availability, and update frequency
of these sources may differ from each other. Thus, data sources must be
analyzed to assess their suitability to satisfy nonfunctional requirements. For
this, meetings with data producers are carried out, where the set of data
sources, the quality of their data, and their availability must be documented.
At the end of the whole requirements specification process, ideally we will
have for each data element the best data source for obtaining it.
Year. We call this hierarchy Calendar. We mentioned that each sales fact is
associated with three dates, thus yielding three roles for the Date dimension,
namely, OrderDate and ShippedDate (for the attribute with that name in the
operational database) and DueDate (for the RequiredDate attribute).
In addition to the Date dimension, we have seen that a sales fact is associ-
ated with three other potential dimensions: Employee, Customer, and Supplier,
derived from the respective many-to-one relationship types with the Orders
table. A careful inspection of these geographic data showed that the data
sources were incomplete. Thus, external data sources need to be checked
(like Wikipedia1 and GeoNames2 ) to complete the data. This analysis also
shows that we need several different kinds of hierarchies to account for all
possible political organizations of countries. Further, the term territories in
the database actually refers to cities, and this is the name we will adopt in
the requirements process. Also, in the one-to-many relationship type Belongs,
between Territories and Regions, we consider the latter as a candidate to be
a dimension level, yielding a candidate hierarchy City → Region.
In light of the above, we define a hierarchy, called Geography, composed
of the levels City → State → Region → Country. But this hierarchy should
also allow other paths to be followed, like City → Country (for cities that do
not belong to any state) and State → Country (for states that do not belong
to any region). This will be discussed in the conceptual design phase. This
hierarchy will be shared by the Customer and Supplier dimensions and will
also be a part of the Employee dimension via the Territories hierarchy. Note
that the latter is a nonstrict hierarchy, while Geography is a strict one.
The Products entity type induces the Product dimension, mentioned before.
The Categories entity type and the HasCategory relationship type allow us to
derive a hierarchy Product → Category, which we call Categories.
Finally, the entity type Employees is involved in a one-to-many recursive re-
lationship type called ReportsTo. This is an obvious candidate to be a parent-
child hierarchy, which we call Supervision.
Table 10.3 summarizes the result of applying the derivation process. We
included the cardinalities of the relationship between the dimensions and the
fact Sales. The term Employee City indicates a many-to-many relationship
between the Employee and City levels. All other relationships are many-to-one.
We conclude with the last step, which is Document Requirements
Specification. Similarly to the business-driven approach, all information
specified in the previous steps is documented here. This documentation in-
cludes a detailed description of the source schemas that serve as a basis for
identifying the elements in the multidimensional schema. It may also contain
elements in the source schema for which it is not clear whether they can
be used as attributes or hierarchies in a dimension. For example, we con-
sidered that the address of employees will not be used as a hierarchy. If the
1 http://www.wikipedia.org
2 http://www.geonames.org
source schemas use attributes or relation names with unclear semantics, the
corresponding elements of the multidimensional schema must be renamed,
specifying clearly the correspondences between the old and new names.
Table 10.3 Multidimensional elements in the Northwind case study obtained using the
data-driven approach
Facts
• Sales
– Measures: UnitPrice, Quantity, Discount
– Dimensions and cardinalities: Product (1:n), Supplier (1:n), Customer (1:n),
Employee (1:n), OrderDate (1:n), DueDate (1:n), ShippedDate (1:n), Order (1:1)
Table 10.4 Data transformation between sources and the data warehouse
Source table   Source attribute   DW level   DW attribute      Transformation
Products       ProductName        Product    ProductName       —
Products       QuantityPerUnit    Product    QuantityPerUnit   —
Products       UnitPrice          Product    UnitPrice         —
...            ...                ...        ...               ...
Customers      CustomerID         Customer   CustomerID        X
Customers      CompanyName        Customer   CompanyName       —
...            ...                ...        ...               ...
Orders         OrderID            Order      OrderNo           X
Orders         —                  Order      OrderLineNo       X
Orders         OrderDate          Date       —                 X
...            ...                ...        ...               ...
In this approach, the initial data warehouse is developed once the operational
schemas have been analyzed. Since not all facts will be of interest for the purpose
of decision support, users must identify which facts are important.
Users can also refine the existing hierarchies since some of these are sometimes
“hidden” in an entity type or a table. Thus, the initial data warehouse schema
is modified until it becomes the final version accepted by the users.
hierarchies inferred from the analysis of the operational database (Fig. 2.1).
This led to the multidimensional schema shown in Fig. 4.1.
We start with the second step, which is Validate Conceptual Schema
with Users. The initial data warehouse schema as presented in Fig. 4.1
should be delivered to the users. In this way, they can assess its appropriate-
ness for the analysis needs. This can lead to the modification of the schema,
either by removing schema elements that are not needed for analysis or by
specifying missing elements. Recall that in the data-driven approach the users
have not participated during requirements elicitation; thus, changes to the
initial conceptual schema will likely be needed.
The next step is Develop Final Conceptual Schema and Mappings.
The modified schema is finally delivered to the users. Given that the opera-
tional schema of the Northwind database is very simple, the mapping between
the source schema and the final data warehouse schema is almost straightfor-
ward. The implementation of such a mapping was described in Chap. 9, thus
we do not repeat it here. Further, since we already have the schemas for the
source system and the data warehouse, we can specify metadata in a similar
way to that described for the business-driven approach above.
In this approach, two activities are performed, targeting both the business
requirements of the data warehouse and the exploration of the source sys-
tems feeding the warehouse. This leads to the creation of two data warehouse
schemas (Fig. 10.7). The schema obtained from the business-driven approach
identifies the structure of the data warehouse as it emerges from the business
requirements. The data-driven approach results in a data warehouse schema
that can be extracted from the existing operational databases. After both
initial schemas have been developed, they must be matched. Several aspects
should be considered in this matching process, such as the terminology used
and the degree of similarity between the two solutions for each multidimen-
sional element, for example, between dimensions, levels, attributes, or hier-
archies. Some solutions for this have been proposed in academic literature,
although they are highly technical and complex to implement.
An ideal situation arises when both schemas cover the same analysis as-
pects, that is, the users’ needs are covered by the data in the operational
systems and no other data are needed to expand the analysis. In this case,
the schema is accepted, and mappings between elements of the source sys-
tems and the data warehouse are specified. Additionally, documentation is
developed following the guidelines studied for the business- and data-driven
approaches. This documentation contains metadata about the data ware-
house, the source systems, and the ETL process. Nevertheless, in real-world
situations, it is seldom the case that both schemas will cover the same aspects
of analysis. Two situations may occur:
1. Users require less information than what the operational databases can
provide. In this case, we must determine whether users may consider new
aspects of analysis or whether to eliminate from the schema those facts
that are not of their interest. Therefore, another iteration of the business-
and data-driven approaches is required, where either new users will be
involved or a new initial schema will be developed.
2. Users require more information than what the operational databases can
provide. In this case, the users may reconsider their needs and limit them
to those proposed by the business-driven solution. Alternatively, users
may require the inclusion of external sources or legacy systems not con-
sidered in the previous iteration. Thus, new iterations of the business-
and data-driven approaches may again be needed.
10.5 Logical Design

Figure 10.8 illustrates the steps of the logical design phase. First, the trans-
formation of the conceptual multidimensional schema into a logical schema is
developed. Then, the ETL processes are specified, considering the mappings
and transformations indicated in the previous phase. We shall refer next to
these two steps.
In this approach, the participation of the users is not explicitly required. They
are involved only sporadically, either to confirm the correctness of the struc-
tures derived or to identify facts and measures as a starting point for creating
multidimensional schemas. Typically, users come from the professional or the
administrative organizational level since data are represented at a low level
of detail. Also, this approach requires highly skilled and experienced design-
ers. Besides the usual modeling abilities, they should have enough business
knowledge to understand the business context and its needs. They should also
have the capacity to understand the structure of the underlying operational
databases.
The data-driven method has several advantages:
• It ensures that the data warehouse reflects the underlying relationships
in the data.
• It ensures that the data warehouse contains all necessary data from the
beginning.
• It simplifies the ETL processes since data warehouses are developed on
the basis of existing operational databases.
• It reduces the user involvement required to start the project.
• It facilitates a fast and straightforward development process, provided
that well-structured and normalized operational systems exist.
• It allows automatic or semiautomatic techniques to be applied if the
operational databases are represented using the entity-relationship model
or normalized relational tables.
However, it is important to consider the following disadvantages before
choosing this approach:
• Only business needs reflected in the underlying source data models can
be captured.
• The system may not meet users’ expectations since the company’s goals
and the user requirements are not reflected at all.
• The method may not be applied when the logical schemas of the underly-
ing operational systems are hard to understand or the data sources reside
on legacy systems.
• Since it relies on existing data, this approach cannot be used to address
long-term strategic goals.
• The inclusion of hierarchies may be difficult since they may be hidden in
various structures, for example, in generalization relationships.
• It is difficult to motivate end users to work with large schemas developed
for and by specialists.
• The derivation process can be difficult without knowledge of the users’
needs since, for instance, the same data can be considered as a measure
or as a dimension attribute.
10.8 Summary
In this chapter, we have presented a general method for the design of data
warehouses. Our proposal is close to the classic database design method and
is composed of the following phases: requirements specification, conceptual
design, logical design, and physical design. For the requirements specification
and conceptual design phases we have proposed three different approaches:
(1) the business-driven approach, which focuses on business needs; (2) the
data-driven approach, which develops the data warehouse schema on the basis
of the data of the underlying operational databases, typically represented us-
ing the entity-relationship or the relational model; and (3) the business/data-
driven approach, which combines the first two approaches, matching the
users’ business needs with the availability of data. The next phases of the
method presented correspond to those of classic database design. Therefore,
a mapping from the conceptual model to a logical model is specified, fol-
lowed by the definition of physical structures. The design of these structures
should consider the specific features of the target DBMS with respect to the
particularities of data warehouse applications.
10.9 Bibliographic Notes
There are several methods for requirements analysis based on the data-
driven approach: Böhnlein and Ulbrich-vom Ende [34] propose a method
to derive logical data warehouse structures from the conceptual schema of
operational systems. Golfarelli et al. [83] present a graphical conceptual
model for data warehouses called the Dimensional Fact Model and propose
a semiautomatic process for building conceptual schemas from operational
entity-relationship (ER) schemas. Cabibbo and Torlone [40] present a design
method that starts from an existing ER schema and derives a multidimen-
sional schema. They also provide an implementation in terms of relational
tables and multidimensional arrays. Paim et al. [178] propose a method for
requirements specification which consists of three phases: requirements plan-
ning, specification, and validation. Paim and Castro [179] later extended this
method by including nonfunctional requirements, such as performance and
accessibility. Vaisman [244] proposes a method for the specification of func-
tional and nonfunctional requirements that integrates the concepts of require-
ments engineering and data quality. This method refers to the mechanisms for
collecting, analyzing, and integrating requirements. Users are also involved
in order to determine the expected quality of the source data. Then, data
sources are selected using quantitative measures to ensure data quality. The
outcome of this method is a set of documents and a ranking of the operational
data sources that should satisfy the users' requirements according to various
quality parameters.
As for the combination of approaches, Bonifati et al. [35] introduce a
method for the identification and design of data marts, which consists of
three general parts: top-down analysis, bottom-up analysis, and integration.
The top-down analysis emphasizes the user requirements and requires precise
identification and formulation of goals. On the basis of these goals, a set of
ideal star schemas is created. On the other hand, the bottom-up analysis aims
at identifying all the star schemas that can be implemented using the avail-
able source systems. This analysis requires the source systems to be repre-
sented using the ER model. The final integration phase is used to match the ideal
star schemas with realistic ones based on the existing data. Elamin et al. [68]
introduce a method, denoted SSReq, to generate star schemas from business
users' requirements. Like the previous one, this is a mixed approach that
produces star schemas from both data sources and business requirements.
The method has three steps: (a) business requirements elicitation; (b) re-
quirements normalization, which transforms the requirements into a format
that eliminates redundancy; and (c) generation of the star schemas.
We now comment on some miscellaneous proposals that do not fit in the
classification above. In [161], the authors propose a method covering the
full requirements engineering cycle for data warehouses, based on the Goal-
Oriented Requirements Engineering (GORE) approach. A book that covers
the overall requirements engineering for data warehouses is [192]. The ra-
tionale is that requirements engineering should be integrated into agile de-
velopment. Therefore, rather than restricting incremental and iterative de-
10.10 Review Questions
10.1 What are the similarities and the differences between designing a data-
base and designing a data warehouse?
10.2 Compare the top-down and the bottom-up approaches for data ware-
house design. Which of the two approaches is more often used? How
does the design of a data warehouse and of a data mart differ?
10.3 Discuss the various phases in data warehouse design, emphasizing the
objective of each phase.
10.4 Summarize the main characteristics of the business-driven, data-
driven, and business/data-driven approaches for requirements spec-
ification. How do they differ from each other? What are their respec-
tive advantages and disadvantages? Identify in which situations one
approach would be preferred over the others.
10.5 Using an application domain that you are familiar with, illustrate
the various steps in the business-driven approach for requirements
specification. Identify at least two different users, each one with a
particular business goal.
10.6 Using the application domain of Question 10.5, illustrate the various
steps in the data-driven approach for requirements specification. De-
fine an excerpt of an ER schema from which some multidimensional
elements can be derived.
10.7 Compare the steps for conceptual design in the business-driven, data-
driven, and business/data-driven approaches.
10.8 Develop a conceptual multidimensional schema for the application
domain of Question 10.5 using among the three approaches the one
that you know best.
10.9 Illustrate the different aspects of the logical design phase by translat-
ing the conceptual schema developed in Question 10.8 into the rela-
tional model.
10.10 Describe several aspects that are important to consider in the physical
design phase of data warehouses.
10.11 Exercises
Exercise 10.1. Consider the train application described in Ex. 3.2. Using the
business-driven approach, write the requirements specifications that would
result in the MultiDim schema obtained in Ex. 4.3.
Exercise 10.2. Consider the French horse race application described in
Ex. 2.1. Using the data-driven approach, write the requirements specifica-
tions in order to produce the MultiDim schema obtained in Ex. 4.5.
Exercise 10.3. Consider the Formula One application described in Ex. 2.2.
Using the business/data-driven approach, write the requirements specifica-
tions in order to produce the MultiDim schema obtained in Ex. 4.7.
Exercise 10.4. The ranking of universities has become an important factor
in establishing the reputation of a university at the international level. Our
university wants to determine what actions it should take to improve its posi-
tion in the rankings. To simplify the discussion, we consider only the ranking
by The Times. The evaluation criteria in this ranking refer to the two main
areas of activities of universities, namely, research and education. However, a
closer analysis shows that 60% of the criteria are related to research activities
(peer review and citation/faculty scores) and 40% to the university’s com-
mitment to teaching. Therefore, we suppose that the decision-making users
way, not only can new strategic contacts be established (which may lead to
international projects), but also the quality of the university’s research can
be improved.
Further, international projects promote the interaction of the university
staff with peers from other universities working in the same area and thus
could help to improve the peer review score. There are several sources of
funding for research projects: the university, industry, and regional, national,
and international institutions. Independently of the funding scheme, a project
may be considered as being international when it involves participants from
institutions in other countries.
Finally, knowledge about the international publications produced by the
university’s staff is essential for assessing the citation per faculty criterion.
Publications can be of several types, namely, articles in conference proceed-
ings or in journals, and books.
Based on the description above, we ask you to
a. Produce a requirements specification for the design of the data warehouse
using the business-driven approach. For this, you must
• Identify users.
• For each goal and subgoal, write a set of queries that these users
would require. Refine and prioritize these queries.
• Define facts, measures, and dimensions based on these queries.
• Infer dimension hierarchies.
• Build a table summarizing the information obtained.
b. Produce a conceptual schema, using the business-driven approach and
the top-down design. Discuss data availability conditions and how they
impact the design. Identify and specify the necessary mappings.
c. Produce a conceptual schema, using the business-driven approach, and
the bottom-up design. For this, you must build three data marts: one for
the analysis of conferences, another one for the analysis of publications,
and the third one for the analysis of research projects. Then, merge the
three of them, and compare the schema produced with the one obtained
through the top-down approach above.
d. Produce a requirements specification for the design of the data warehouse
using the data-driven approach, given the entity-relationship schema of
the operational database in Fig. 10.10. For this, you must
• Explain how the facts, measures, dimensions, and hierarchies are de-
rived.
• Summarize in a table the information obtained.
[Figure excerpt (placeholder): entity types AcademicStaff and ResearchCenter related
through the InvestigatesIn relationship, with cardinalities (0,n) and (3,n)]
Fig. 10.10 Excerpt from the ER schema of the operational database in the university
application
Part III
Advanced Topics
Chapter 11
Temporal and Multiversion Data
Warehouses
The data warehouses studied so far in this book assumed that only facts and
their measures evolve in time. However, dimension data may also vary in time;
for instance, a product may change its price or its category. The most popular
approach for tackling this issue in the context of relational databases is the
so-called slowly changing dimensions approach, which we have studied in Chap. 5.
An alternative approach is based on the field of temporal databases. Such
databases provide mechanisms for managing information that varies over
time. The combination of temporal databases and data warehouses leads to
temporal data warehouses, which are studied in the first part of this chapter.
While temporal data warehouses address the evolution of data, another
situation that may arise is that the schema of a data warehouse evolves. To
cope with this issue, the data in the warehouse can be modified to comply
with the new version of the schema. However, this is not always possible
or desirable in many real-world situations, and therefore, it is necessary to
maintain several versions of a data warehouse. This leads to multiversion data
warehouses, which are studied in the second part of this chapter.
Section 11.1 introduces some concepts related to temporal databases, and
how to query such databases using standard SQL. In Sect. 11.2, we present a
temporal extension of the conceptual multidimensional model we use in this
book. Section 11.3 presents the mapping of our temporal multidimensional
model into the relational model, while Sect. 11.4 discusses various implemen-
tation considerations in this respect. In Sect. 11.5, we address the issue of
querying a temporal data warehouse in SQL. Section 11.6 compares temporal
data warehouses with slowly changing dimensions and provides an alterna-
tive implementation of the temporal Northwind data warehouse using slowly
changing dimensions. In Sect. 11.7, we study conceptual modeling of multi-
version data warehouses, while logical design of multiversion data warehouses
is presented in Sect. 11.8. We conclude in Sect. 11.9 by illustrating how to query
a multiversion data warehouse.
[Fig. 11.1 (placeholder): tables of the temporal employee database used in this section:
Employee(SSN, FirstName, LastName, BirthDate, Address),
Salary(SSN, Amount, FromDate, ToDate),
Affiliation(SSN, DNumber, FromDate, ToDate),
WorksOn(SSN, PNumber, FromDate, ToDate),
Controls(DNumber, PNumber, FromDate, ToDate)]
Temporal Projection
Given the table Affiliation of Fig. 11.2, suppose that we want to obtain the
periods of time in which an employee has worked for the company, indepen-
dently of the department. Figure 11.3a shows the result of projecting out
attribute DNumber from the table of Fig. 11.2. As can be seen, the resulting
table is redundant. The first two rows are value equivalent, that is, they
are equal on all their columns except for FromDate and ToDate, and in ad-
dition, the time periods of these rows meet. The situation is similar for the
last two rows. Therefore, the result of the projection should be as given in
Fig. 11.3b. This process of combining several value-equivalent rows into one,
provided that their time periods overlap or meet, is called coalescing.
Coalescing is a complex and costly operation in SQL, as shown next.
Fig. 11.3 Projecting out attribute DNumber from the table in Fig. 11.2. (a) Initial
result; (b) Coalescing the result
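A common way of expressing coalescing in SQL is sketched next. This is a sketch
rather than a query taken from this book: Temp stands for the non-coalesced
projection of Affiliation on SSN, FromDate, and ToDate shown in Fig. 11.3a. The
idea is to pair a first row F and a last row L of the same employee such that every
row starting between them is chained to an earlier row, and such that neither
endpoint can be extended.
SELECT DISTINCT F.SSN, F.FromDate, L.ToDate
FROM Temp F, Temp L
WHERE F.SSN = L.SSN AND F.FromDate < L.ToDate AND
  NOT EXISTS (
    -- No row starting strictly inside [F.FromDate, L.ToDate) is disconnected
    -- from the rows that precede it
    SELECT *
    FROM Temp M
    WHERE M.SSN = F.SSN AND
          F.FromDate < M.FromDate AND M.FromDate < L.ToDate AND
          NOT EXISTS (
            SELECT *
            FROM Temp T1
            WHERE T1.SSN = F.SSN AND
                  T1.FromDate < M.FromDate AND M.FromDate <= T1.ToDate ) ) AND
  NOT EXISTS (
    -- The coalesced period cannot be extended to the left or to the right
    SELECT *
    FROM Temp T2
    WHERE T2.SSN = F.SSN AND
          ( ( T2.FromDate < F.FromDate AND F.FromDate <= T2.ToDate ) OR
            ( T2.FromDate <= L.ToDate AND L.ToDate < T2.ToDate ) ) )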
Temporal Join
Recall from Fig. 11.1 that tables Salary and Affiliation keep, respectively, the
evolution of the salary and the affiliation of employees. To determine the joint
evolution across time of salary and affiliation, a temporal join of these tables
is needed. Given one row from each table whose validity periods intersect,
a temporal join returns the salary and affiliation values together with the
intersection of the two validity periods. As illustrated in Fig. 11.5, there are
four cases in which two periods may intersect.
[Fig. 11.5 (placeholder): the four cases in which the validity period of a Salary row
(value 50) and an Affiliation row (value D1) may intersect, together with the resulting
intersection period carrying the values (50, D1)]
The above query assumes that there are no duplicate rows in the tables:
at each point in time an employee has one salary and one affiliation. The
UNION ALL is used since the query does not generate duplicates and this is
more efficient than using UNION. It is worth noting that the result of this
query must be coalesced.
A temporal join can be written in a single statement using either a CASE
statement or functions. Suppose that in SQL Server we write two functions
MinDate and MaxDate as follows.
CREATE FUNCTION MinDate(@one DATE, @two DATE)
RETURNS DATE
BEGIN
RETURN CASE WHEN @one < @two THEN @one ELSE @two END
END
CREATE FUNCTION MaxDate(@one DATE, @two DATE)
RETURNS DATE
BEGIN
RETURN CASE WHEN @one > @two THEN @one ELSE @two END
END
These functions return, respectively, the minimum and the maximum of their
two arguments. They allow us to write the temporal join as follows.
SELECT S.SSN, Amount, DNumber, dbo.MaxDate(S.FromDate, A.FromDate) AS FromDate,
dbo.MinDate(S.ToDate, A.ToDate) AS ToDate
FROM Salary S, Affiliation A
WHERE S.SSN = A.SSN AND
dbo.MaxDate(S.FromDate, A.FromDate) < dbo.MinDate(S.ToDate, A.ToDate)
The two functions are used in the SELECT clause for constructing the inter-
section of the corresponding validity periods. The condition in the WHERE
clause ensures that the two validity periods overlap.
Temporal Difference
The difference is expressed in SQL using either EXCEPT or NOT EXISTS.
Suppose that we want to list the employees who work on a single project. If
the table WorksOn is nontemporal, this can be expressed as follows
SELECT W1.EmployeeKey
FROM WorksOn W1
WHERE NOT EXISTS (
SELECT *
FROM WorksOn W2
WHERE W1.EmployeeKey = W2.EmployeeKey AND
W1.PNumber <> W2.PNumber )
Now suppose that we want to find out the periods of time when employees
worked on a single project. Expressing a temporal difference in SQL requires
the four possible cases shown in Fig. 11.6 to be considered. These cases show
[Fig. 11.6 (placeholder): the four cases of project involvements W1, W2, and W3 to
consider when computing the periods during which an employee works on a single
project, together with the resulting periods]
an employee working on one, two, or three projects W1, W2, and W3 such
that she is not involved in any other project.
The temporal difference can be expressed in SQL as follows.
-- Case 1
SELECT W1.EmployeeKey, W1.FromDate, W2.FromDate
FROM WorksOn W1, WorksOn W2
WHERE W1.EmployeeKey = W2.EmployeeKey AND
W1.FromDate < W2.FromDate AND W2.FromDate < W1.ToDate AND
NOT EXISTS (
SELECT *
FROM WorksOn W3
-- The overlapping involvement must concern a project other than W1's;
-- otherwise W1 itself would always satisfy the subquery
WHERE W1.EmployeeKey = W3.EmployeeKey AND W1.PNumber <> W3.PNumber AND
W1.FromDate < W3.ToDate AND W3.FromDate < W2.FromDate )
UNION
-- Case 2
SELECT W1.EmployeeKey, W1.ToDate, W2.ToDate
FROM WorksOn W1, WorksOn W2
WHERE W1.EmployeeKey = W2.EmployeeKey AND
W2.FromDate < W1.ToDate AND W1.ToDate < W2.ToDate AND
NOT EXISTS (
SELECT *
FROM WorksOn W3
WHERE W1.EmployeeKey = W3.EmployeeKey AND W2.PNumber <> W3.PNumber AND
W1.ToDate < W3.ToDate AND W3.FromDate < W2.ToDate )
UNION
-- Case 3
SELECT W1.EmployeeKey, W2.ToDate, W3.FromDate
FROM WorksOn W1, WorksOn W2, WorksOn W3
WHERE W1.EmployeeKey = W2.EmployeeKey AND
W1.EmployeeKey = W3.EmployeeKey AND
W2.ToDate < W3.FromDate AND W1.FromDate < W2.ToDate AND
W3.FromDate < W1.ToDate AND NOT EXISTS (
SELECT *
-- A distinct alias is needed here since W3 is already used in the outer query
FROM WorksOn W4
WHERE W1.EmployeeKey = W4.EmployeeKey AND W1.PNumber <> W4.PNumber AND
W2.ToDate < W4.ToDate AND W4.FromDate < W3.FromDate )
UNION
-- Case 4
SELECT EmployeeKey, FromDate, ToDate
FROM WorksOn W1
WHERE NOT EXISTS (
SELECT *
FROM WorksOn W3
WHERE W1.EmployeeKey = W3.EmployeeKey AND W1.PNumber <> W3.PNumber AND
W1.FromDate < W3.ToDate AND W3.FromDate < W1.ToDate )
The above query assumes that the table WorksOn is coalesced. Furthermore,
its result must be coalesced.
Temporal Aggregation
SQL provides aggregate functions such as SUM, COUNT, MIN, MAX, and
AVG. These are used for answering queries such as “compute the maximum
salary” or “compute the maximum salary by department.” The temporal
version of these queries requires a three-step process: (1) split the time line
into periods of time during which all values are constant, (2) compute the
aggregation over these periods, and (3) coalesce the result.
[Fig. 11.7 (placeholder): evolution of the salaries of employees E1, E2, and E3 over
time, and the resulting temporal maximum]
Suppose that we want to compute the evolution over time of the maximum
salary. Figure 11.7 shows a diagram where table Salary has three employees
E1, E2, and E3, as well as the result of the temporal maximum. Computing
the temporal maximum in SQL is done as follows.
WITH SalChanges(Day) AS (
SELECT FromDate
FROM Salary
UNION
SELECT ToDate
FROM Salary ),
SalPeriods(FromDate, ToDate) AS (
SELECT P1.Day, P2.Day
FROM SalChanges P1, SalChanges P2
WHERE P1.Day < P2.Day AND NOT EXISTS (
SELECT *
FROM SalChanges P3
WHERE P1.Day < P3.Day AND P3.Day < P2.Day ) ),
TempMax(MaxSalary, FromDate, ToDate) AS (
SELECT MAX(Amount), P.FromDate, P.ToDate
-- Reconstruction sketch: each salary row spans entire elementary periods,
-- so a containment test suffices to associate salaries with periods
FROM Salary S, SalPeriods P
WHERE S.FromDate <= P.FromDate AND P.ToDate <= S.ToDate
GROUP BY P.FromDate, P.ToDate )
-- Main query coalescing the table TempMax above
...
Table SalChanges gathers the dates at which a salary change occurred, while
table SalPeriods constructs the periods from such dates. Table TempMax com-
putes the temporal maximum on these periods. Finally, table TempMax is
coalesced in the main query as explained for the temporal projection above.
We refer to the bibliographic section at the end of the chapter for references
that show how to compute more complex temporal aggregations.
Temporal Division
The division is needed to answer queries such as “List the employees who work
on all projects controlled by the department to which they are affiliated.” As
we have seen in Sect. 2.4.3, since SQL does not provide the division operation,
the above query should be rephrased as follows: “List the employees such that
there is no project controlled by the department to which they are affiliated
on which they do not work.” The latter query can be expressed in SQL with
two nested NOT EXISTS predicates as follows.
SELECT SSN
FROM Affiliation A
WHERE NOT EXISTS (
SELECT *
FROM Controls C
WHERE A.DNumber = C.DNumber AND NOT EXISTS (
SELECT *
FROM WorksOn W
WHERE C.PNumber = W.PNumber AND A.SSN = W.SSN ) )
Consider now the temporal version of the above query. As in the case of
temporal aggregation, a three-step process is needed: (1) split the time line
into periods of time during which all values are constant, (2) compute the
division over these periods, and (3) coalesce the result.
Four cases arise depending on whether the tables Affiliation, Controls, and
WorksOn are temporal or not. Here we consider the case when Controls and
WorksOn are temporal but Affiliation is not. The bibliographic section at the
end of the chapter provides further references that study the other cases.
Figure 11.8 shows possible values in the three tables and the result of the
temporal division. At the top it is shown that employee E is affiliated to de-
partment D; C1 and C2 represent two rows of Controls relating department D
with projects P1 and P2; and W1, W2 represent two rows of WorksOn relating
employee E with projects P1 and P2. Finally, Result shows the periods for
which the division must be calculated and the corresponding result (where ✓
represents true and ✗ represents false).
[Fig. 11.8 (placeholder): Affiliation relates employee E to department D; Controls
rows C1 = (D, P1) and C2 = (D, P2); WorksOn rows W1 = (E, P1) and W2 = (E, P2);
Result: ✓ ✓ ✗ ✓ ✓ ✗]
Fig. 11.8 Periods in which employee E works on all projects controlled by the depart-
ment D to which she is affiliated
Notice that, in this example, employ-
ees may work on projects controlled by departments which are not the one
to which they are affiliated, as shown in W1 when employee E starts to work
on project P1. The query computing the temporal division is as follows.
WITH ProjChanges(SSN, Day) AS (
SELECT SSN, FromDate
FROM Affiliation A, Controls C
WHERE A.DNumber = C.DNumber
UNION
SELECT SSN, ToDate
FROM Affiliation A, Controls C
WHERE A.DNumber = C.DNumber
UNION
SELECT SSN, FromDate
FROM WorksOn
UNION
SELECT SSN, ToDate
FROM WorksOn ),
ProjPeriods(SSN, FromDate, ToDate) AS (
SELECT P1.SSN, P1.Day, P2.Day
FROM ProjChanges P1, ProjChanges P2
WHERE P1.SSN = P2.SSN AND P1.Day < P2.Day AND NOT EXISTS (
SELECT *
FROM ProjChanges P3
WHERE P1.SSN = P3.SSN AND P1.Day < P3.Day AND
P3.Day < P2.Day ) ),
TempDiv(SSN, FromDate, ToDate) AS (
SELECT DISTINCT P.SSN, P.FromDate, P.ToDate
FROM ProjPeriods P, Affiliation A
WHERE P.SSN = A.SSN AND NOT EXISTS (
SELECT *
FROM Controls C
WHERE A.DNumber = C.DNumber AND
C.FromDate <= P.FromDate AND P.ToDate <= C.ToDate
AND NOT EXISTS (
SELECT *
FROM WorksOn W
WHERE C.PNumber = W.PNumber AND A.SSN = W.SSN AND
-- Reconstruction sketch: the work period must cover the elementary period
W.FromDate <= P.FromDate AND P.ToDate <= W.ToDate ) ) )
-- Main query coalescing the table TempDiv above
...
Table ProjChanges extracts for each employee the day when her department
starts or finishes controlling a project, as well as the day when she starts or
finishes working on a project. Table ProjPeriods constructs the periods from
the above days, while table TempDiv computes the division on these periods.
Notice how the query computing the latter table generalizes the query given
at the beginning of this section computing the nontemporal version of the
division. Finally, the main query coalesces the TempDiv table.
[Figure (placeholder): hierarchy of the time data types — Time specializes, in a total
and exclusive way, into SimpleTime and ComplexTime; SimpleTime specializes, in a
total and exclusive way, into Instant and Period; ComplexTime comprises InstantSet
and PeriodSet]
specified. A SimpleTime can represent, for example, the time (with a granu-
larity of day) at which an event such as a conference occurs, where one-day
conferences are represented by an Instant and other conferences, spanning
two or more days, are represented by a Period.
An InstantSet is a set of instants, which can represent, for example, the
instants at which car accidents have occurred at a particular location.
A PeriodSet is a set of periods that represent discontinuous durations, as
in the case of a project that was interrupted during a lapse of time.
A ComplexTime denotes any heterogeneous set of temporal values that
may include instants and periods.
Finally, Time is the most generic time data type, meaning “this element
has a temporal extent” without any commitment to a specific time data type.
It is an abstract type that can be used, for example, to represent the lifespan
of projects, where it may be either a Period or a PeriodSet.
We allow empty temporal values, that is, values that represent an empty
set of instants. This is needed, in particular, to express the fact that the
intersection of two temporal values is empty.
The interior of a temporal value is composed of all its instants that do not belong
to the boundary. The boundary of a temporal value is the interface between
its interior and exterior and it is defined for the various time data types as
follows. An instant has an empty boundary. The boundary of a period con-
sists of its start and end instants. The boundary of a ComplexTime value is
(recursively) defined as the union of the boundaries of its components that
do not intersect with other components.
We describe next the most commonly used synchronization relationships;
the associated icons are given in Fig. 11.10.
[Fig. 11.10 (placeholder): icons of the synchronization relationships Meets,
Overlaps/Intersects, Contains/Inside, Covers/CoveredBy, Equals, Disjoint, Starts,
Finishes, Precedes, and Succeeds]
Meets: Two temporal values meet if they intersect in an instant but their
interiors do not. Note that two temporal values may intersect in an instant
but not meet.
Overlaps: Two temporal values overlap if their interiors intersect and their
intersection is not equal to either of them.
Contains/Inside: Contains and Inside are symmetric predicates: a Contains b
if and only if b Inside a. A temporal value contains another one if the
interior of the former includes all instants of the latter.
Covers/CoveredBy: These are symmetric predicates: a Covers b if and only
if b CoveredBy a. A temporal value covers another one if the former in-
cludes all instants of the latter. Thus, the former contains the latter but
without the restriction that the boundaries of the temporal extents do
not intersect. As a particular case, the two temporal values may be equal.
Disjoint/Intersects: Disjoint and Intersects are inverse predicates: when one
applies, the other does not. Two temporal values are disjoint if they do
not share any instant.
Equals: Two temporal values are equal if every instant of the first value be-
longs also to the second and conversely.
Starts/Finishes: A temporal value starts another one if the first instants of
the two values are equal. Similarly, a temporal value finishes another one
if the last instants of the two values are equal.
Precedes/Succeeds: A temporal value precedes another one if the last instant
of the former is before the first instant of the latter. Similarly, a temporal
value succeeds another one if the first instant of the former is later than
the last instant of the latter.
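At the logical level, where temporal values are stored as closed-open [FromDate,
ToDate) periods (see Sect. 11.3), several of these predicates reduce to simple com-
parisons. The following sketch is ours and uses the Salary and Affiliation tables of
Fig. 11.1 only for illustration.
SELECT S.SSN
FROM Salary S, Affiliation A
WHERE S.SSN = A.SSN AND
      -- Intersects: the two periods share at least one instant
      S.FromDate < A.ToDate AND A.FromDate < S.ToDate
-- Under the same closed-open encoding:
--   Covers (A covers S):      A.FromDate <= S.FromDate AND S.ToDate <= A.ToDate
--   Equals:                   A.FromDate = S.FromDate AND A.ToDate = S.ToDate
--   Meets (adjacent periods): S.ToDate = A.FromDate OR A.ToDate = S.FromDate
--   Precedes (S before A):    S.ToDate <= A.FromDate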
We now extend the MultiDim model that we have seen in Chap. 4 with
temporal features. The model provides support for the temporality types we
have seen in Sect. 11.1, namely, valid time (VT) and transaction time
(TT). Furthermore, it adds the notion of lifespan (LS), which captures the
time during which an object or relationship exists. For example, a lifespan
can be used to represent the duration of a project or the duration during
which an employee has worked on a project. The lifespan of an object or
relationship may be seen as the valid time of the related fact stating that the
object or relationship exists. Therefore, we use valid time for attributes and
lifespan for objects and relationships as a whole.
These temporality types can be combined for a single schema element,
for example, a level can have an associated lifespan and a transaction time.
Valid time and lifespan are typically used to analyze measures taking into
account changes in dimension data. On the other hand, transaction time is
used in traceability applications, for example for fraud detection, where the
times at which changes to data occurred are required for analysis purposes.
Even though most real-world phenomena vary over time, storing their
evolution in time may not be necessary for an application. Therefore, the
choice of the data elements for which the data warehouse keeps track of their
evolution in time depends on application requirements and the availability
of this information in source systems. The MultiDim model allows users to
specify which schema elements require temporal support.
Figure 11.11 shows the conceptual schema of the temporal Northwind data
warehouse. It assumes that changes in data related to products and employ-
ees are kept for analysis purposes, as indicated by the various pictograms
included in the schema. On the other hand, it assumes that we are not in-
terested in keeping track of changes to customer data, since this dimension
does not include any temporal support. As can be seen in the figure, the
MultiDim model allows both temporal and conventional (nontemporal) lev-
els, attributes, parent-child relationships, hierarchies, and dimensions. The
definitions of the conventional elements of the model remain the same as
those presented in Chap. 4.
A temporal level keeps track of the lifespan of its members. The schema
in Fig. 11.11 includes three temporal levels (indicated by the temporal sym-
bols to the left of the level name): Product, Category, and Employee. These
levels are used to track changes in a member as a whole, for example insert-
ing or deleting a product or a category. Notice that the temporal symbols
for the Product and Category levels are different. The symbol represents
a noncontinuous lifespan; for example, a product may cease to be sold, and
return to the catalog later. On the other hand, the symbol represents a
continuous lifespan; for example, a closed category cannot be sold again.
[Fig. 11.11 (placeholder): conceptual schema of the temporal Northwind data warehouse,
relating the Sales fact (measures Quantity, UnitPrice, Discount, SalesAmount, Freight,
and /NetAmount) to the Date (with roles OrderDate, DueDate, and ShippedDate),
Customer, Supplier, Shipper, Order, Product, and Employee dimensions; the Product,
Category, and Employee levels and the UnitPrice attribute of Product carry temporal
support, the Calendar hierarchy comprises Month, Quarter, Semester, and Year, and
the Geography hierarchy comprises City, State, Region, Country, and Continent]
The
usual levels without temporality are called conventional levels.
A temporal attribute keeps track of the changes in its values and the
time when they occur. For instance, the temporality for UnitPrice in the
Product level indicates that the history of changes in this attribute is kept.
As shown in Fig. 11.12, a level may be temporal independently of the fact
that it has temporal attributes.
Fig. 11.12 Types of temporal support for a level. (a) Temporal level; (b) Temporal level
with a temporal attribute; (c) Conventional level with a temporal attribute
any time instant, but may belong to many categories over its lifespan, that is,
its lifespan cardinality will be many-to-many. On the other hand, if the life-
span cardinality is many-to-one, that would mean that a product is assigned
to a single category throughout its lifespan.
A temporal hierarchy is a hierarchy that has at least one temporal level
or one temporal parent-child relationship. Thus, temporal hierarchies can
combine temporal and conventional levels. Similarly, a temporal dimension
is a dimension that has at least one temporal hierarchy. The usual dimensions
and hierarchies without temporality are called conventional dimensions
and conventional hierarchies.
Two temporal levels related in a hierarchy define a synchronization con-
straint, which we studied in Sect. 11.2.2. To represent them, we use the
pictograms shown in Fig. 11.10. By default we assume the overlaps synchro-
nization constraint, which indicates that the lifespans of a child member and
its parent member overlap. For example, in Fig. 11.11 the lifespan of each
product overlaps the lifespan of its corresponding category, that is, at each
point in time a valid product belongs to a valid category. As we shall see in
Sect. 11.2.4, other synchronization relationships may exist.
A temporal fact links one or more temporal levels and may involve a
synchronization relationship: this is indicated by an icon in the fact. For
example, the temporal fact Sales in Fig. 11.11 relates two temporal levels:
Product and Employee. The overlaps synchronization icon in the fact indicates
that a product is related to an employee in an instance of the fact provided
that their lifespans overlap. If a synchronization icon is not included in a fact,
there is no particular synchronization constraint on its instances.
A temporal measure keeps track of the changes in its values and the
time when they occur. For instance, in Fig. 11.18a, the temporality for the
measure InterestRate indicates that the history of changes in this measure is
kept. As seen in the figure, a fact may be temporal independently of whether
it has temporal measures.
Notice that in the MultiDim model, the temporal support for dimensions,
hierarchies, levels, and measures differs from the traditional temporal support
for facts, where the latter are related explicitly to a time dimension.
In the next section we elaborate on the temporal components of the Multi-
Dim model. For simplicity, we only discuss valid time and lifespan; the results
may be generalized to transaction time.
Given two related levels in a hierarchy, the levels, the relationship between
them, or both may be temporal. We study these situations next.
Figure 11.13a shows an example of temporal levels associated with a con-
ventional relationship. In this case, the relationship only keeps current links
Fig. 11.13 Various cases of temporal support for relationships, depending on whether
the levels and/or the relationship are temporal or not.
parent member, and the lifespan of a parent member is equal to the union of
the lifespans of all its links to a child member.
Fig. 11.15 Relational schema of the temporal Northwind data warehouse in Fig. 11.11
parent level (TP), respectively, that is, there is a foreign key in the
fact or child table pointing to the other table.
Rule 3c: If the relationship is many-to-many, a new bridge table TB
is created, containing as attributes the surrogate keys of the tables
corresponding to the fact (TF) and the dimension level (TL), or the
parent (TP) and child levels (TC), respectively. The key of the table
is the combination of the two surrogate keys. If the relationship has a
distributing attribute, an additional attribute is added to the table.
If the relationship is temporal, we can only use Rule 3c above indepen-
dently of the cardinalities. In addition, two attributes FromDate and To-
Date must be added to the table TB, and the key of the table is composed
of the combination of the surrogate keys and FromDate.
The above rules suppose that the temporal features have a day granular-
ity and thus, the columns AtDate, FromDate, and ToDate are of type Date.
If the granularity were timestamp instead, then the corresponding columns
are called AtTime, FromTime, and ToTime, and they are of type Timestamp.
Notice also that the above rules have not considered the synchronization con-
straints present in the schema. As we will see in Sect. 11.4.3, a set of triggers
must ensure that these constraints are satisfied.
For example, as shown in Fig. 11.15, the level Product is mapped to the
table with the same name, its temporal attribute UnitPrice is mapped to the
table ProductUnitPrice, and its lifespan is stored in table ProductLifespan,
since, as indicated by the corresponding icon in Fig. 11.11, the lifespan of
products is not continuous. On the other hand, the lifespan of categories is
stored in the attributes FromDate and ToDate of table Category, since, as indi-
cated by the corresponding icon, the lifespan of categories is continuous. Re-
garding temporal parent-child relationships, table ProductCategory represents
the relationship between products and categories, while EmployeeSupervision
represents the relationship between employees and their supervisors.
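As an illustration of these rules, the following DDL sketch (ours; the column types
are assumptions, and Fig. 11.15 remains the authoritative reference) shows how the
lifespan, temporal attribute, and temporal parent-child tables just mentioned could
be declared, with keys combining the surrogate key(s) and FromDate.
CREATE TABLE ProductLifespan (
    ProductKey  INT NOT NULL REFERENCES Product (ProductKey),
    FromDate    DATE NOT NULL,
    ToDate      DATE NOT NULL,
    PRIMARY KEY (ProductKey, FromDate),
    UNIQUE (ProductKey, ToDate) )  -- no two lifespan periods of a product share a ToDate

CREATE TABLE ProductUnitPrice (
    ProductKey  INT NOT NULL REFERENCES Product (ProductKey),
    UnitPrice   DECIMAL(10,2) NOT NULL,
    FromDate    DATE NOT NULL,
    ToDate      DATE NOT NULL,
    PRIMARY KEY (ProductKey, FromDate) )

-- Bridge table for the temporal many-to-many parent-child relationship (Rule 3c)
CREATE TABLE ProductCategory (
    ProductKey  INT NOT NULL REFERENCES Product (ProductKey),
    CategoryKey INT NOT NULL REFERENCES Category (CategoryKey),
    FromDate    DATE NOT NULL,
    ToDate      DATE NOT NULL,
    PRIMARY KEY (ProductKey, CategoryKey, FromDate) )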
As we have seen in this section, the relational representation of temporal
data warehouses produces a significant number of tables, which may cause
performance issues owing to the multiple join operations required for re-
assembling this information. In addition, as we will show in the next section,
the resulting relational schema has to be supplemented with many integrity
constraints that encode the underlying semantics of time-varying data.
These tables compute, for each granularity, the beginning and end dates in a
closed-open representation as well as the number of days in the period. Such
tables are needed to roll up measures through the time dimension. Note that
these tables can be defined as views to ensure that, when the Date table is updated,
e.g., to expand the underlying time frame of the data warehouse, they contain
up-to-date information. However, declaring these tables as views complicates the
work of the query optimizer when executing OLAP queries.
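For instance, such a table at the month granularity could be defined as a view
over the Date dimension, as sketched below. This is our own sketch; it assumes
that Date contains one row per day with columns DateKey, Date, Year, and
MonthNumber, and it follows the closed-open representation for ToDate.
CREATE VIEW Month AS
SELECT Year, MonthNumber,
       MIN(Date) AS FromDate,                  -- first day of the month
       DateAdd(day, 1, MAX(Date)) AS ToDate,   -- closed-open: day after the last day
       COUNT(*) AS NoDays                      -- number of days in the period
FROM Date
GROUP BY Year, MonthNumber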
The first constraint complements the primary key of the table to ensure
that in the lifespan of a product there are no two tuples with the same value of
ToDate. The second constraint ensures that the periods are well constructed.
In addition, we need the following trigger to ensure that all periods defining
the lifespan of a product are disjoint.
CREATE TRIGGER ProductLifespan_OverlappingPeriods ON ProductLifespan
AFTER INSERT, UPDATE AS
IF EXISTS (
SELECT *
FROM INSERTED P1
WHERE 1 < (
SELECT COUNT(*)
FROM ProductLifespan P2
WHERE P1.ProductKey = P2.ProductKey AND
P1.FromDate < P2.ToDate AND P2.FromDate < P1.ToDate ) )
BEGIN
RAISERROR ('Overlapping periods in lifespan of product', 1, 1)
ROLLBACK TRANSACTION
END
The condition in the WHERE clause ensures that two periods denoted by P1
and P2 for the same product overlap. Since this condition is satisfied if P1
and P2 are the same tuple, the value of COUNT(*) in the inner query must
be greater than 1 to ensure that two distinct periods overlap.
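As a usage illustration (our own example with made-up values), suppose that
product 1 already has the lifespan period [2015-01-01, 2017-01-01); the following
insertion overlaps that period and is therefore rejected by the trigger.
INSERT INTO ProductLifespan (ProductKey, FromDate, ToDate)
VALUES (1, '2016-06-01', '2018-01-01')
-- The trigger raises 'Overlapping periods in lifespan of product' and rolls back
-- the transaction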
Suppose now that a constraint states that the time frame of the temporal
attribute UnitPrice must be included in the lifespan of Product, or in other
words, the evolution of the attribute values is only kept while a product is in
the catalog. The following trigger ensures this constraint.
CREATE TRIGGER ProductUnitPrice_PeriodInLifespan ON ProductUnitPrice
AFTER INSERT, UPDATE AS
IF EXISTS (
SELECT *
FROM INSERTED P1
WHERE NOT EXISTS (
SELECT *
FROM ProductLifespan P2
WHERE P1.ProductKey = P2.ProductKey AND
P2.FromDate <= P1.FromDate AND
P1.ToDate <= P2.ToDate ) )
BEGIN
RAISERROR ('Period of unit price is not contained in lifespan of product', 1, 1)
ROLLBACK TRANSACTION
END
The condition in the WHERE clause ensures that the period of the attribute
value denoted by P1 is included in the lifespan of its product denoted by P2.
As said in Sect. 11.2.4, in temporal parent-child relationships the lifespan
of a relationship instance must be covered by the intersection of the lifespans
of the participating members. The following triggers ensure this constraint
for the relationship between products and categories.
CREATE TRIGGER ProductCategory_LifespanInProduct_1 ON ProductCategory
AFTER INSERT, UPDATE AS
IF EXISTS (
SELECT *
FROM INSERTED PC
WHERE NOT EXISTS (
SELECT *
FROM ProductLifespan P
WHERE PC.ProductKey = P.ProductKey AND
P.FromDate <= PC.FromDate AND PC.ToDate <= P.ToDate ) )
BEGIN
RAISERROR ('Lifespan of relationship is not contained in lifespan of product', 1, 1)
ROLLBACK TRANSACTION
END
CREATE TRIGGER ProductCategory_LifespanInProduct_2 ON ProductLifespan
AFTER UPDATE, DELETE AS
IF EXISTS (
SELECT *
FROM ProductCategory PC
WHERE PC.ProductKey IN ( SELECT ProductKey FROM DELETED ) AND
-- Reconstruction sketch: after the change, some link period is no longer
-- contained in any period of the product lifespan
NOT EXISTS (
SELECT *
FROM ProductLifespan P
WHERE PC.ProductKey = P.ProductKey AND
P.FromDate <= PC.FromDate AND PC.ToDate <= P.ToDate ) )
BEGIN
RAISERROR ('Lifespan of relationship is not contained in lifespan of product', 1, 1)
ROLLBACK TRANSACTION
END
Since the lifespan of products is discontinuous, the above query ensures that
every period composing the lifespan is found in the links and vice versa.
Synchronization constraints restrict the lifespan of related objects. Con-
sider in Fig. 11.11 the synchronization constraint between Product and Cate-
gory, which checks that the lifespans of a product and its associated category
overlap. A trigger that ensures this constraint is given next.
CREATE TRIGGER ProductOverlapsCategory ON ProductCategory
AFTER INSERT, UPDATE AS
IF EXISTS (
SELECT *
FROM INSERTED PC
WHERE NOT EXISTS (
SELECT *
FROM ProductLifespan P, CategoryLifespan C
WHERE P.ProductKey = PC.ProductKey AND
C.CategoryKey = PC.CategoryKey AND
-- Reconstruction sketch: the lifespans of the product and the category overlap
P.FromDate < C.ToDate AND C.FromDate < P.ToDate ) )
BEGIN
RAISERROR ('Lifespan of product does not overlap lifespan of category', 1, 1)
ROLLBACK TRANSACTION
END
In the above trigger, the condition in the WHERE clause ensures that the two
periods denoted by the facts B1 and B2 for the same account overlap, which
constitutes a violation of the constraint.
Enforcing temporal integrity constraints is a complex and costly operation.
We have given examples of triggers that monitor sometimes the insert and
update operations, sometimes the update and delete operations. Additional
triggers must be introduced to monitor all potential insert, update, and delete
operations in all involved tables.
[Fig. 11.16 (placeholder): (a) the Balance fact with measure Amount, FromDate and
ToDate roles to the Date dimension, and the Account and Branch levels related by
Organization; (b) tables Balance(FromDateKey, ToDateKey, AccountNo, Amount, ...),
Date(DateKey, Date, ...), and Account(AccountNo, ..., BranchNo); (c) sample Amount
values for accounts A1, A2, and A3 and their sum by branch (B1, B2) and month
(M1, M2, M3)]
Fig. 11.16 Aggregation of an atelic state measure. (a) Excerpt of a schema for analyzing
bank accounts; (b) Three tables of the corresponding relational schema; (c)
Aggregation of Amount by branch and month
[Fig. 11.17 (placeholder): (a) the Delivery fact with measures NoKm and FuelCons,
FromDate and ToDate roles to the Date dimension, and the Vehicle and Fleet levels
related by Fleets; (b) tables Delivery(FromDateKey, ToDateKey, VehicleNo, NoKm,
FuelCons), Date(DateKey, Date, ...), and Vehicle(VehicleNo, ..., FleetNo); (c) sample
NoKm values for vehicles V1, V2, and V3 and their sum by fleet (F1, F2) and month
(M1, M2, M3)]
Fig. 11.17 Aggregation of a telic state measure. (a) Excerpt of a schema for analyzing
deliveries; (b) Three tables of the corresponding relational schema; (c) Ag-
gregation of NoKm by fleet and month
The above query performs a temporal join of tables Delivery and Month. This
allows us to split the periods in the fact table by month and to aggregate
the measure by month and fleet. Then, the SELECT clause computes the
percentage of measure NoKm pertaining to a particular fleet and month.
This is obtained by multiplying the value of the measure by the duration
in days of the intersection between the periods associated with the measure
and the month, divided by the duration of the period associated with the
measure.
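A sketch of such a query is the following. It is our own formulation and assumes
the tables of Fig. 11.17b, a Month table with FromDate and ToDate in a closed-open
representation as above, and the MinDate and MaxDate functions of Sect. 11.1.
SELECT V.FleetNo, M.Year, M.MonthNumber,
       SUM(D.NoKm *
           DateDiff(day, dbo.MaxDate(M.FromDate, D1.Date),
                         dbo.MinDate(M.ToDate, D2.Date)) * 1.0 /
           DateDiff(day, D1.Date, D2.Date)) AS NoKm
FROM Delivery D, Date D1, Date D2, Vehicle V, Month M
WHERE D.FromDateKey = D1.DateKey AND D.ToDateKey = D2.DateKey AND
      D.VehicleNo = V.VehicleNo AND
      dbo.MaxDate(M.FromDate, D1.Date) < dbo.MinDate(M.ToDate, D2.Date)
GROUP BY V.FleetNo, M.Year, M.MonthNumber
ORDER BY V.FleetNo, M.Year, M.MonthNumber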
Obviously, the discussion above also applies for temporal data warehouses.
For example, suppose the schemas in Figs. 11.16a and 11.17a are temporal,
where the parent-child relationships that associate accounts to branches and
vehicles to fleets are temporal. In this case, the above SQL queries that
aggregate measures Amount and NoKm must be extended to take into account
the temporal parent-child relationships. These queries are left as an exercise.
Figure 11.18a shows a schema for analyzing bank loans. The loan fact has
measures Amount, RateType, and InterestRate, where the latter is a slowly
changing measure, that is, the value of the measure may change during
the duration of the fact. Indeed, depending on the value of the RateType,
the InterestRate may be fixed throughout the whole duration of the loan, or
can be variable and thus re-evaluated periodically (for example, every year)
according to market conditions. As shown in Fig. 11.18b, the evolution of the
interest rate for loans is kept in table LoanInterestRate.
If the interest rate of loans only changes by month, the following query
computes the time-weighted average of the measure InterestRate by branch.
WITH LoanRateDuration AS (
SELECT *, DateDiff(mm, StartDate, EndDate) AS Duration
FROM LoanInterestRate ),
LoanDuration AS (
SELECT LoanKey, DateDiff(mm, D1.Date, D2.Date) AS Duration
FROM Loan L, Date D1, Date D2
WHERE L.StartDateKey = D1.DateKey AND L.EndDateKey = D2.DateKey ),
LoanAvgInterestRate AS (
SELECT R.LoanKey, SUM(R.InterestRate * R.Duration) / MAX(L.Duration) AS InterestRate
FROM LoanRateDuration R, LoanDuration L
WHERE R.LoanKey = L.LoanKey
GROUP BY R.LoanKey )
SELECT B.BranchID, AVG(InterestRate) AS AvgInterestRate
FROM Loan L, LoanAvgInterestRate A, Branch B
WHERE L.LoanKey = A.LoanKey AND L.BranchKey = B.BranchKey
GROUP BY B.BranchID
ORDER BY B.BranchID
[Fig. 11.18 (placeholder): (a) the Loan fact with measures Amount, RateType, and
InterestRate (slowly changing), StartDate and EndDate roles to the Date dimension,
and the Branch and Client levels; (b) tables Loan(LoanKey, BranchKey, StartDateKey,
EndDateKey, Amount, RateType), LoanInterestRate(LoanKey, InterestRate, StartDate,
EndDate), and Branch(BranchKey, Address, City)]
Fig. 11.18 An example of a slowly changing measure. (a) Excerpt of a schema for
analyzing bank loans; (b) Tables of the corresponding relational schema
varies over time. These assignments are stored in table ProductCategory. The
last line of the WHERE clause ensures that products are rolled up to the
category they were associated to at the order date. Queries with this inter-
pretation are called temporally consistent queries. If, instead, we want
products to be rolled up to their current category, in the last line of the
WHERE clause, D.Date should be replaced by GetDate(). The situation is
similar if we want products to be rolled up to the category they were associ-
ated to at a particular date (e.g., 1/1/2017). Queries with this interpretation
are called time-slice queries because they extract the state of the data
warehouse at a particular point in time. As a result, the sales of products
that are not valid at that time instant will not be aggregated in the result. For exam-
ple, if we want products to be rolled up to their current category, sales of
discontinued products will not satisfy the above condition in the WHERE
clause.
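A sketch of such a temporally consistent roll-up over the schema of Fig. 11.15 is
given next. It is our own reconstruction, following the grouping of Query 11.2 as
discussed in Sect. 11.6; the last line of the WHERE clause selects the category valid
at the order date.
SELECT C.CompanyName, D.Year, A.CategoryName,
       SUM(SalesAmount) AS SalesAmount
FROM Sales S, Customer C, Date D, Product P, ProductCategory PC, Category A
WHERE S.CustomerKey = C.CustomerKey AND S.OrderDateKey = D.DateKey AND
      S.ProductKey = P.ProductKey AND P.ProductKey = PC.ProductKey AND
      PC.CategoryKey = A.CategoryKey AND
      PC.FromDate <= D.Date AND D.Date < PC.ToDate
GROUP BY C.CompanyName, D.Year, A.CategoryName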
Query 11.3. For each employee, total sales amount of products she sold with
unit price greater than $30 at the time of the sale.
SELECT FirstName + ' ' + LastName AS EmployeeName,
SUM(SalesAmount) AS SalesAmount
FROM Sales S, Date D, Employee E, Product P, ProductUnitPrice U
WHERE S.OrderDateKey = D.DateKey AND S.EmployeeKey = E.EmployeeKey AND
S.ProductKey = P.ProductKey AND P.ProductKey = U.ProductKey AND
U.FromDate <= D.Date AND D.Date < U.ToDate AND U.UnitPrice > 30
GROUP BY FirstName + ' ' + LastName
ORDER BY FirstName + ' ' + LastName
This query implements a temporal selection taking into account that the unit
price of products varies over time. These values are stored in table Produc-
tUnitPrice. The query selects the price of a product at the order date and
verifies that it is greater than $30. Then, the query aggregates the result by
employee to compute the total sales amount.
Query 11.4. For each product and month, list the unit price and the total
sales amount at that price.
SELECT P.ProductName, M.Year, M.MonthNumber, U.UnitPrice,
SUM(SalesAmount) AS SalesAmount,
dbo.MaxDate(M.FromDate, U.FromDate) AS FromDate,
dbo.MinDate(M.ToDate, U.ToDate) AS ToDate
FROM Sales S, Date D, Product P, ProductUnitPrice U, Month M
WHERE S.OrderDateKey = D.DateKey AND S.ProductKey = P.ProductKey AND
P.ProductKey = U.ProductKey AND
dbo.MaxDate(M.FromDate, U.FromDate) <
dbo.MinDate(M.ToDate, U.ToDate) AND
dbo.MaxDate(M.FromDate, U.FromDate) <= D.Date AND
D.Date < dbo.MinDate(M.ToDate, U.ToDate)
GROUP BY P.ProductName, M.Year, M.MonthNumber, U.UnitPrice, M.FromDate, U.FromDate, M.ToDate, U.ToDate
ORDER BY P.ProductName, dbo.MaxDate(M.FromDate, U.FromDate)
This query performs a temporal join of the tables ProductUnitPrice and Month
to split the months according to the variation of the unit price of products.
Combining this with a traditional join of the other tables we compute the
total sales amount in the periods obtained.
Query 11.6. Personal sales amount made by an employee compared with the
total sales amount made by herself and her subordinates during 2017.
WITH Supervision AS (
SELECT EmployeeKey, SupervisorKey, FromDate, ToDate
FROM EmployeeSupervision
WHERE SupervisorKey IS NOT NULL
UNION ALL
SELECT E.EmployeeKey, S.SupervisorKey,
dbo.MaxDate(S.FromDate, E.FromDate) AS FromDate,
dbo.MinDate(S.ToDate, E.ToDate) AS ToDate
FROM Supervision S, EmployeeSupervision E
WHERE S.EmployeeKey = E.SupervisorKey AND
dbo.MaxDate(S.FromDate, E.FromDate) <
dbo.MinDate(S.ToDate, E.ToDate) ),
SalesEmp2017 AS (
SELECT EmployeeKey, SUM(S.SalesAmount) AS PersonalSales
FROM Sales S, Date D
WHERE S.OrderDateKey = D.DateKey AND D.Year = 2017
GROUP BY EmployeeKey ),
SalesSubord2017 AS (
SELECT SupervisorKey AS EmployeeKey,
SUM(S.SalesAmount) AS SubordinateSales
FROM Sales S, Supervision U, Date D
WHERE S.EmployeeKey = U.EmployeeKey AND
S.OrderDateKey = D.DateKey AND D.Year = 2017 AND
U.FromDate <= D.Date AND D.Date < U.ToDate
GROUP BY SupervisorKey )
SELECT FirstName + ' ' + LastName AS EmployeeName, S1.PersonalSales,
COALESCE(S1.PersonalSales + S2.SubordinateSales, S1.PersonalSales)
AS PersSubordSales
FROM Employee E JOIN SalesEmp2017 S1 ON E.EmployeeKey = S1.EmployeeKey
LEFT OUTER JOIN SalesSubord2017 S2 ON
S1.EmployeeKey = S2.EmployeeKey
ORDER BY EmployeeName
Query 11.8. For each employee and supervisor pair, list the number of days
that the supervision lasted.
WITH EmpSup AS (
SELECT E.FirstName + ' ' + E.LastName AS EmployeeName,
U.FirstName + ' ' + U.LastName AS SupervisorName,
DateDiff(day, S.FromDate, dbo.MinDate(GetDate(), S.ToDate))
AS NoDays
FROM EmployeeSupervision S, Employee E, Employee U
WHERE S.EmployeeKey = E.EmployeeKey AND
S.SupervisorKey = U.EmployeeKey )
SELECT EmployeeName, SupervisorName, SUM(NoDays) AS NoDays
FROM EmpSup
GROUP BY EmployeeName, SupervisorName
ORDER BY EmployeeName, SupervisorName
This query uses the DateDiff function to compute the number of days in a
period. The function MinDate obtains the minimum date between the current
date (obtained with the function GetDate) and the ToDate of a period. In
this way, if the period associated with a supervision ends in the future (as is
the case when the supervision is current), then the number of days will be
computed from the beginning of the period until the current date.
Query 11.9. Total sales amount for employees during the time they were
assigned to only one city.
WITH EmpOneCity(EmployeeKey, FromDate, ToDate) AS (
-- Case 1
SELECT T1.EmployeeKey, T1.FromDate, T2.FromDate
This query performs a temporal difference to obtain the periods during which
an employee is assigned to a single city. In table EmpOneCity, the four inner
queries after the NOT EXISTS predicate verify that no other assignment
overlaps the period during which an employee is assigned to a single city.
Table EmpOneCityCoalesced coalesces the previous table. Finally, the main
query selects the sales whose order date belongs to the period when the
employee is assigned to a single city. Then, it groups the results by employee
to compute the total sales amount.
Query 11.10. For each employee, compute the total sales amount and num-
ber of cities to which she is assigned.
WITH CityChanges(EmployeeKey, Day) AS (
SELECT EmployeeKey, FromDate
FROM Territories
UNION
SELECT EmployeeKey, ToDate
FROM Territories ),
CityPeriods(EmployeeKey, FromDate, ToDate) AS (
SELECT T1.EmployeeKey, T1.Day, T2.Day
FROM CityChanges T1, CityChanges T2
WHERE T1.EmployeeKey = T2.EmployeeKey AND
T1.Day < T2.Day AND NOT EXISTS (
SELECT *
FROM CityChanges T3
WHERE T1.EmployeeKey = T3.EmployeeKey AND
T1.Day < T3.Day AND T3.Day < T2.Day ) ),
CityCount(EmployeeKey, NoCities, FromDate, ToDate) AS (
SELECT P.EmployeeKey, COUNT(CityKey), P.FromDate, P.ToDate
FROM Territories T, CityPeriods P
WHERE T.EmployeeKey = P.EmployeeKey AND
T.FromDate <= P.FromDate AND P.ToDate <= T.ToDate
GROUP BY P.EmployeeKey, P.FromDate, P.ToDate ),
CityCountCoalesced(EmployeeKey, NoCities, FromDate, ToDate) AS (
-- Coalescing the table CityCount above
... )
SELECT FirstName + ' ' + LastName AS EmployeeName,
SUM(SalesAmount) AS TotalSales, NoCities,
dbo.MaxDate(C.FromDate, S.FromDate) AS FromDate,
dbo.MinDate(C.ToDate, S.ToDate) AS ToDate
FROM Sales F, Date D, Employee E, CityCountCoalesced C, StateCountCoalesced S
WHERE F.OrderDateKey = D.DateKey AND F.EmployeeKey = E.EmployeeKey AND
F.EmployeeKey = C.EmployeeKey AND F.EmployeeKey = S.EmployeeKey AND
dbo.MaxDate(C.FromDate, S.FromDate) < dbo.MinDate(C.ToDate, S.ToDate)
AND dbo.MaxDate(C.FromDate, S.FromDate) <= D.Date AND
D.Date < dbo.MinDate(C.ToDate, S.ToDate)
GROUP BY FirstName + ' ' + LastName, dbo.MaxDate(C.FromDate, S.FromDate),
dbo.MinDate(C.ToDate, S.ToDate), NoCities
ORDER BY FirstName + ' ' + LastName, dbo.MaxDate(C.FromDate, S.FromDate)
Therefore, the result of the query must include the period during which an
employee is assigned to a given number of cities. Table CityChanges computes
for each employee the days when her assignment to a city starts or ends,
and thus her number of cities could change on those days. Table CityPeriods
computes the periods from the days in the table CityChanges. Table CityCount
counts the number of cities assigned to an employee for each of the periods of
the table CityPeriods. Table CityCountCoalesced coalesces the table CityCount
since there may be adjacent periods in the latter table that have the same
number of cities. Finally, the main query performs a temporal join of the
tables, keeping the city and state count, and a traditional join of the other
three tables and verifies that the order date is included in the period to
be output. Then, the query groups the result by employee and period to
obtain the total sales amount. Notice that in this query the double-counting
problem (see Sect. 4.2.6) does not arise since a sales fact is joined to a single
period in the table resulting from the temporal join of CityCountCoalesced
and StateCountCoalesced; this is verified in the last two lines of the WHERE
clause.
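Since the coalescing steps are only indicated with '...' in the queries of this section, a possible formulation of the table CityCountCoalesced is sketched next. It follows the classical coalescing pattern based on nested NOT EXISTS subqueries and assumes that periods are closed-open intervals [FromDate, ToDate); an actual implementation may express it differently.

CityCountCoalesced(EmployeeKey, NoCities, FromDate, ToDate) AS (
SELECT DISTINCT F.EmployeeKey, F.NoCities, F.FromDate, L.ToDate
FROM CityCount F, CityCount L
WHERE F.EmployeeKey = L.EmployeeKey AND F.NoCities = L.NoCities AND
    F.FromDate < L.ToDate AND NOT EXISTS (
    -- There is no gap between F.FromDate and L.ToDate: every period starting
    -- in between with the same value is covered by another such period
    SELECT *
    FROM CityCount M
    WHERE M.EmployeeKey = F.EmployeeKey AND M.NoCities = F.NoCities AND
        F.FromDate < M.FromDate AND M.FromDate < L.ToDate AND NOT EXISTS (
        SELECT *
        FROM CityCount T1
        WHERE T1.EmployeeKey = F.EmployeeKey AND T1.NoCities = F.NoCities AND
            T1.FromDate < M.FromDate AND M.FromDate <= T1.ToDate ) ) AND
    NOT EXISTS (
    -- The coalesced period cannot be extended to the left or to the right
    SELECT *
    FROM CityCount T2
    WHERE T2.EmployeeKey = F.EmployeeKey AND T2.NoCities = F.NoCities AND
        ( ( T2.FromDate < F.FromDate AND F.FromDate <= T2.ToDate ) OR
          ( T2.FromDate <= L.ToDate AND L.ToDate < T2.ToDate ) ) ) )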
Query 11.11. Total sales per category, for categories in which all products
have a price greater than $7.
WITH CatUnitPrice(CategoryKey, UnitPrice, FromDate, ToDate) AS (
SELECT P.CategoryKey, U.UnitPrice,
dbo.MaxDate(P.FromDate, U.FromDate),
dbo.MinDate(P.ToDate, U.ToDate)
FROM ProductCategory P, ProductUnitPrice U
WHERE P.ProductKey = U.ProductKey AND
dbo.MaxDate(P.FromDate, U.FromDate) <
dbo.MinDate(P.ToDate, U.ToDate) ),
CatChanges(CategoryKey, Day) AS (
SELECT CategoryKey, FromDate
FROM CatUnitPrice
UNION
SELECT CategoryKey, ToDate
FROM CatUnitPrice
UNION
SELECT CategoryKey, FromDate
FROM Category
UNION
SELECT CategoryKey, ToDate
FROM Category ),
CatPeriods(CategoryKey, FromDate, ToDate) AS (
SELECT P1.CategoryKey, P1.Day, P2.Day
FROM CatChanges P1, CatChanges P2
WHERE P1.CategoryKey = P2.CategoryKey AND
P1.Day < P2.Day AND NOT EXISTS (
SELECT *
FROM CatChanges P3
WHERE P1.CategoryKey = P3.CategoryKey AND
P1.Day < P3.Day AND P3.Day < P2.Day ) ),
TempDiv(CategoryKey, FromDate, ToDate) AS (
... )
SELECT ...

11.6 Temporal Data Warehouses versus Slowly Changing Dimensions
We have seen in Chap. 5 that slowly changing dimensions have been proposed
as an approach to cope with dimensions that evolve over time. In this section,
we compare the approach for temporal data warehouses we have presented in
this chapter with slowly changing dimensions. We only consider type 2 slowly
changing dimensions, since this is the most-used approach.
Figure 11.19 shows the relational representation of the temporal North-
wind conceptual schema given in Fig. 11.11 using type 2 slowly changing di-
mensions. In the schema, tables Product, Category, Territories, and Employee
are the only temporal tables. Thus, for example, information about products,
their lifespan, unit price, and category are kept in table Product, while these
are kept in four separate tables in the schema of Fig. 11.15.
Fig. 11.19 Relational schema of the temporal Northwind data warehouse in Fig. 11.11
using type 2 slowly changing dimensions
We revisit next some queries of the previous section with this version of
the temporal Northwind data warehouse. To facilitate our discussion, the
schema in Fig. 11.15 is referred to as the temporal version while the schema
in Fig. 11.19 is referred to as the SCD (slowly changing dimension) version.
Query 11.2. Total sales amount per customer, year, and product category.
SELECT C.CompanyName, D.Year, A.CategoryName,
SUM(SalesAmount) AS SalesAmount
FROM Sales S, Customer C, Date D, Product P, Category A
WHERE S.CustomerKey = C.CustomerKey AND S.OrderDateKey = D.DateKey AND
S.ProductKey = P.ProductKey AND P.CategoryKey = A.CategoryKey AND
P.FromDate <= D.Date AND D.Date < P.ToDate
GROUP BY C.CompanyName, D.Year, A.CategoryName
The previous version of the query had an additional join with table Product-
Category to select the category of a product valid at the order date.
Query 11.9. For each product, list the name, unit price, and total sales
amount by month.
WITH ProdUnitPrice(ProductID, UnitPrice, FromDate, ToDate) AS (
SELECT ProductID, UnitPrice, FromDate, ToDate
FROM Product ),
ProdUnitPriceCoalesced AS (
-- Coalescing the table ProdUnitPrice above
... )
SELECT P.ProductName, U.UnitPrice, SUM(SalesAmount)
AS SalesAmount,
dbo.MaxDate(M.FromDate, U.FromDate) AS FromDate,
dbo.MinDate(M.ToDate, U.ToDate) AS ToDate
FROM Sales S, Date D, Product P, ProdUnitPriceCoalesced U, Month M
WHERE S.OrderDateKey = D.DateKey AND S.ProductKey = P.ProductKey AND
P.ProductID = U.ProductID AND dbo.MaxDate(M.FromDate, U.FromDate) <
dbo.MinDate(M.ToDate, U.ToDate) AND
dbo.MaxDate(M.FromDate, U.FromDate) <= D.Date AND
D.Date < dbo.MinDate(M.ToDate, U.ToDate)
GROUP BY P.ProductName, U.UnitPrice, M.FromDate, U.FromDate, M.ToDate, U.ToDate
ORDER BY P.ProductName, dbo.MaxDate(M.FromDate, U.FromDate)
Query 11.11. Total sales per category, for categories in which all products
have a price greater than $7.
WITH CatUnitPrice(CategoryID, UnitPrice, FromDate, ToDate) AS (
SELECT DISTINCT CategoryID, UnitPrice, P.FromDate, P.ToDate
FROM Product P, Category C
WHERE P.CategoryKey = C.CategoryKey),
CatUnitPriceCoalesced(CategoryID, UnitPrice, FromDate, ToDate) AS (
-- Coalescing the table CatUnitPrice above
... ),
CatLifespan(CategoryID, FromDate, ToDate) AS (
SELECT DISTINCT CategoryID, FromDate, ToDate
FROM Category C ),
CatLifespanCoalesced AS (
-- Coalescing the table CatLifespan above
... ),
CatChanges(CategoryID, Day) AS (
SELECT CategoryID, FromDate
FROM CatUnitPriceCoalesced
UNION
SELECT CategoryID, ToDate
FROM CatUnitPriceCoalesced
UNION
SELECT CategoryID, FromDate
FROM CatLifespanCoalesced
UNION
SELECT CategoryID, ToDate
FROM CatLifespanCoalesced ),
CatPeriods(CategoryID, FromDate, ToDate) AS (
-- As in the previous version of this query
... ),
TempDiv(CategoryID, FromDate, ToDate) AS (
SELECT P.CategoryID, P.FromDate, P.ToDate
FROM CatPeriods P
WHERE NOT EXISTS (
SELECT *
FROM CatUnitPriceCoalesced U
WHERE P.CategoryID = U.CategoryID AND
U.UnitPrice <= 7 AND
U.FromDate <= P.FromDate AND
P.ToDate <= U.ToDate ) ),
TempDivCoalesced(CategoryID, FromDate, ToDate) AS (
-- Coalescing the table TempDiv above
... )
SELECT ... -- As in the previous version of this query
In the previous version of this query, a temporal join between tables Pro-
ductCategory and ProductUnitPrice is needed for computing the CatUnitPrice
temporary table while in this version, this information is available in table
Product. Nevertheless, if other attributes of products were temporal (e.g.,
QuantityPerUnit), the SCD version would require a temporal projection of
the Product table to obtain the joint evolution of the unit price and category
independently of the evolution of the other temporal attributes.
We compare the temporal (Fig. 11.15) and the SCD (Fig. 11.19) versions
of the schema with respect to querying using the Product dimension. As we
have seen, information about products, their lifespan, unit price, and cate-
gory are kept in a single table Product in the SCD version, while these are
kept in four separate tables in the temporal version. Therefore, to obtain the
lifespan, the unit price, or the category of products, a temporal projection
(with coalescing) of table Product is required in the SCD version, while this
information is already present in the temporal version. As has been said, co-
alescing is a complex and costly operation. On the other hand, for operations
such as temporal slice or temporal roll-up, the SCD version requires fewer
joins, since in the temporal version we have to join the Product table with the
ProductUnitPrice or the ProductCategory tables. As has been said, temporal
join is a relatively simple operation that can be efficiently implemented.
In conclusion, regardless of the version of the logical schema, querying tem-
poral data warehouses requires complex SQL queries to express temporal
algebraic operations. Further, such queries are very inefficient. The only way
to solve this problem is for the DBMS to enable such temporal operations to
be expressed natively in SQL, as has been suggested by many researchers in
the temporal database community. Similarly, MDX and DAX should also be
extended with temporal versions of the OLAP operators.
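As an illustration of the direction such native support can take, the SQL:2011 standard already defines application-time period tables; the sketch below follows the standard syntax, although support and exact syntax vary across DBMSs and the table shown is only illustrative.

CREATE TABLE ProductCategory (
    ProductKey  INTEGER NOT NULL,
    CategoryKey INTEGER NOT NULL,
    FromDate    DATE NOT NULL,
    ToDate      DATE NOT NULL,
    PERIOD FOR ValidPeriod (FromDate, ToDate),
    -- At most one category per product at any instant
    PRIMARY KEY (ProductKey, ValidPeriod WITHOUT OVERLAPS) )

-- Changing the category of a product for only part of a period:
-- the DBMS splits the affected rows automatically
UPDATE ProductCategory
FOR PORTION OF ValidPeriod FROM DATE '2017-01-01' TO DATE '2017-07-01'
SET CategoryKey = 3
WHERE ProductKey = 1

With such support, constraints like nonoverlapping periods and partial-period updates are handled by the DBMS instead of being simulated with handcrafted SQL as in this chapter.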
11.7 Conceptual Design of Multiversion Data Warehouses

Fig. 11.20 Conceptual schemas of the multiversion Northwind data warehouse. (a) Initial version V1
Fig. 11.20 Conceptual schemas of the multiversion Northwind data warehouse (cont.). (b) Second version V2 (excerpt); (c) Third version V3 (excerpt)
Fig. 11.21 Time line for the creation of the data warehouse versions in Fig. 11.20
Fig. 11.22 Contents of the versions of the Store dimension. (a) Store in V1 at time t1 ;
(b) Store in V2 at time t2 ; (c) Store in V1 at time t2
Fig. 11.23 Contents of the versions of the Sales fact. (a) Sales in V1 at time t1 ; (b)
Sales in V2 at time t2 ; (c) Sales in V3 at time t3 ; (d) Sales in V1 at time t3
Consider now Fig. 11.23a, which shows the state of the Sales fact in version
V1 at time t1 . After the first schema change, Fig. 11.23b shows the state of
the fact in V2 at time t2 where the last fact member is added to the current
version. The figure shows the new dimension Shipper and the new measures
Freight and NetAmount. As the shipper information is not available for the
previous facts, they roll up to an unknown shipper. Consider now Fig. 11.23c,
which shows the state of the fact in V3 at time t3 where the last member is
added to the current version. The figure shows a new dimension Supplier.
Since in the previous versions level Product rolled up to level Supplier, it is
possible to obtain the SupplierID values for the existing fact members. For
this, the fact members related to the same supplier should be aggregated
provided that the values of the other dimensions are the same. For example, as
the first two fact members are related to the same supplier u1 and the values
of the other dimensions are the same, these fact members are combined and
represented as a single member in the new version of the fact. If the user
accesses fact Sales in V1 at time t3 , the value of StoreID for the last fact
member will not be available as this attribute is not present in the new
version of the fact. This situation is depicted in Fig. 11.23d.
Notice that in some cases default values can be used instead of null values
for information that was not captured in a particular version. Suppose that
in version V2 the Sales fact has an additional dimension Promotion and that
the measure Discount is not in V1 . If there were no promotions or discounts in
version V1 , instead of having null values for the facts introduced in V1 , these
facts can be related to a member ’No Promotion’ in the Promotion dimension
and have a value 0 for Discount. This will allow the data warehouse to better
capture the application semantics.
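A sketch of this idea in SQL is given below; the Promotion dimension, its surrogate key, and the key value 0 are purely illustrative, since this dimension exists only in the hypothetical scenario described above.

-- Create a default member for facts loaded from versions without promotions
INSERT INTO Promotion (PromotionKey, PromotionName)
VALUES (0, 'No Promotion')

-- Relate the facts coming from version V1 to the default member and set a zero discount
UPDATE Sales
SET PromotionKey = 0, Discount = 0
WHERE PromotionKey IS NULL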
11.8 Logical Design of Multiversion Data Warehouses

Figure 11.24 shows the logical schemas of the three versions of the Northwind
data warehouse given in Fig. 11.20. In the figures, the tables that changed
from the previous version are shown with all their attributes, while the tables
that did not change from the previous version are shown with only their
primary key. Since any of these schemas can be used to address the data
warehouse, a user must choose one of them prior to any interaction with it.
After that, she can use the data warehouse like a traditional one. We describe
next two approaches to implement such a multiversion data warehouse.
In the single-table version (STV) approach, the newly added attributes
are appended to the existing ones and the deleted attributes are not dropped
from the table. A default or null value is stored for unavailable attributes.
This approach is preferred for dimension tables since they usually have fewer
records as compared to the number of records in the fact tables. Figure 11.25
shows the state of the Store table after the first schema change where attribute
Manager is added and StoreSqft is deleted. Records s1, s2, and s3 have null
values for attribute Manager because its value is unknown for these records.
Since attribute StoreSqft has been deleted, all newly added records such as
s4 will have null values for it, which may result in a space overhead in the
presence of a huge amount of dimension data.
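For instance, the schema change from V1 to V2 could be applied to the Store table along the following lines (a sketch; the data type of Manager is illustrative).

-- Append the attribute added in V2; existing members keep a null value for it
ALTER TABLE Store ADD Manager VARCHAR(40) NULL
-- The attribute StoreSqft deleted in V2 is not dropped from the table;
-- members inserted from V2 onward simply leave it null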
The members of level Store in versions V1 and V2 can be accessed using
the following views.
CREATE VIEW StoreV1 AS
SELECT StoreKey, StoreName, StoreType, StoreSqft, Address, PostalCode, CityKey
FROM Store
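The view for version V2 is analogous; a sketch, assuming the attributes of Store in V2 (Manager added, StoreSqft deleted), is as follows.

CREATE VIEW StoreV2 AS
SELECT StoreKey, StoreName, StoreType, Manager, Address, PostalCode, CityKey
FROM Store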
Fig. 11.24 Logical schemas of the multiversion Northwind data warehouse. (a) Initial version V1
These views select the columns pertaining to each version of the Store level.
On the other hand, in the multiple-table version (MTV) approach, each
change in the schema of a table produces a new version. This approach is
preferred for fact tables since they typically contain many more records than
dimension tables and records are added to them more frequently. Figure 11.26
illustrates this approach for the Sales fact table. As can be seen, records are
Fig. 11.24 Logical schemas of the multiversion Northwind data warehouse (cont.). (b) Second version V2 (excerpt); (c) Third version V3 (excerpt)
added to the version of the fact table that is current at the time of the
insertion. An advantage of the MTV approach over the STV one is that
it does not require null values in the new or deleted columns, and thus it
prevents the storage space overhead. On the other hand, a disadvantage of
Fig. 11.25 Table storing all versions of the Store level in the STV approach.
Fig. 11.26 Tables storing the versions of the Sales fact in the MTV approach. (a) Table
SalesT1 for V1 ; (b) Table SalesT2 for V2 ; (c) Table SalesT3 for V3
the MTV approach is that the views needed for accessing the data of a fact
table require gathering the records stored in several tables. Depending on the
size of the fact table and the number of existing versions, these operations
may negatively impact query performance.
For example, the following view returns the fact members in version V1 .
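A possible definition of this view is sketched below, assuming the tables SalesT1, SalesT2, and SalesT3 of Fig. 11.26; the actual definition may differ in details such as explicit column aliases.

CREATE VIEW SalesV1 AS
-- Facts stored in the table of V1 already have the structure of V1
SELECT CustomerKey, EmployeeKey, OrderDateKey, DueDateKey, ProductKey, StoreKey,
    OrderNo, OrderLineNo, UnitPrice, Quantity, Discount, SalesAmount
FROM SalesT1
UNION ALL
-- Facts stored in the table of V2: the attributes added in V2 are simply not selected
SELECT CustomerKey, EmployeeKey, OrderDateKey, DueDateKey, ProductKey, StoreKey,
    OrderNo, OrderLineNo, UnitPrice, Quantity, Discount, SalesAmount
FROM SalesT2
UNION ALL
-- Facts stored in the table of V3: the Store dimension no longer exists,
-- so a null value is returned for StoreKey
SELECT CustomerKey, EmployeeKey, OrderDateKey, DueDateKey, ProductKey, NULL,
    OrderNo, OrderLineNo, UnitPrice, Quantity, Discount, SalesAmount
FROM SalesT3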
This view selects the columns pertaining to version V1 of the Sales fact. Notice
that all facts stored in table SalesT3 will have a null value for StoreKey. The
view that returns the fact members in version V2 is similar.
Finally, the view that returns the fact members in version V3 is as follows.
CREATE VIEW SalesV3 AS
SELECT CustomerKey, EmployeeKey, OrderDateKey, DueDateKey, NULL, NULL,
S.ProductKey, P.SupplierKey,
CASE WHEN COUNT(*) = 1 THEN MAX(OrderNo) ELSE NULL END,
CASE WHEN COUNT(*) = 1 THEN MAX(OrderLineNo) ELSE NULL
END, AVG(S.UnitPrice), SUM(Quantity), AVG(Discount),
SUM(SalesAmount), NULL
FROM SalesT1 S JOIN Store T ON S.StoreKey = T.StoreKey JOIN
Product P ON S.ProductKey = P.ProductKey
GROUP BY CustomerKey, EmployeeKey, OrderDateKey, DueDateKey,
S.ProductKey, P.SupplierKey
UNION ALL
SELECT CustomerKey, EmployeeKey, OrderDateKey, DueDateKey,
ShippedDateKey, ShipperKey, S.ProductKey, P.SupplierKey,
CASE WHEN COUNT(*) = 1 THEN MAX(OrderNo) ELSE NULL END,
CASE WHEN COUNT(*) = 1 THEN MAX(OrderLineNo) ELSE NULL
END, AVG(S.UnitPrice), SUM(Quantity), AVG(Discount),
SUM(SalesAmount), SUM(Freight)
FROM SalesT2 S JOIN Store T ON S.StoreKey = T.StoreKey JOIN
Product P ON S.ProductKey = P.ProductKey
GROUP BY CustomerKey, EmployeeKey, OrderDateKey, DueDateKey,
ShippedDateKey, ShipperKey, S.ProductKey, P.SupplierKey
UNION ALL
SELECT CustomerKey, EmployeeKey, OrderDateKey, DueDateKey,
ShippedDateKey, ShipperKey, ProductKey, SupplierKey, OrderNo,
OrderLineNo, UnitPrice, Quantity, Discount, SalesAmount, Freight
FROM SalesT3
Recall that dimension Supplier was added to the fact in V3 while in versions
V1 and V2 it was a level in dimension Product. Therefore, the fact members
stored in tables SalesT1 and SalesT2 must be aggregated with respect to
the Supplier level of the Product dimension. In addition to aggregating the
measures, we must take care of how to display the OrderNo and OrderLineNo
values. The CASE statement in the SELECT clause does this. For facts that
are not aggregated, that is, the count of the group is equal to 1, the values
of OrderNo and OrderLineNo are displayed (although a MAX function must
be used to comply with the SQL syntax). Aggregated facts will have several
values for OrderNo and OrderLineNo and therefore, a null value is displayed.
11.9 Querying the Multiversion Northwind Data Warehouse in SQL

As already said, to query a multiversion data warehouse, the user must first
specify which version of the warehouse she wants to use. Then, thanks to
the views defined on the warehouse, querying is done in a similar way to
traditional data warehouses. However, the SQL code for a user query may
vary across versions. Further, a user query may not be valid in all versions.
Also, the result of a query may only be partial since it requires information
which is not available throughout the overall lifespan of the data warehouse.
We discuss these issues with the help of examples.
Query 11.12. Compute the yearly sales amount per supplier country.
This query is valid in all schema versions. The query for V1 is as follows.
SELECT SO.CountryName AS SupplierCountry, D.Year,
SUM(SalesAmount) AS SalesAmount
FROM SalesV1 F, Product P, Supplier S, City SC, State SS, Country SO, Date D
WHERE F.ProductKey = P.ProductKey AND P.SupplierKey = S.SupplierKey AND
S.CityKey = SC.CityKey AND SC.StateKey = SS.StateKey AND
SS.CountryKey = SO.CountryKey AND F.OrderDateKey = D.DateKey
GROUP BY SO.CountryName, D.Year
ORDER BY SO.CountryName, D.Year
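The query for V3 is similar; a sketch, which assumes the SalesV3 view defined in Sect. 11.8 and may differ from an actual formulation in minor details, is as follows.

SELECT SO.CountryName AS SupplierCountry, D.Year,
    SUM(SalesAmount) AS SalesAmount
FROM SalesV3 F, Supplier S, City SC, State SS, Country SO, Date D
WHERE F.SupplierKey = S.SupplierKey AND S.CityKey = SC.CityKey AND
    SC.StateKey = SS.StateKey AND SS.CountryKey = SO.CountryKey AND
    F.OrderDateKey = D.DateKey
GROUP BY SO.CountryName, D.Year
ORDER BY SO.CountryName, D.Year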
The differences between the above queries come
from the fact that in V1 suppliers are represented as a hierarchy in the Product
dimension, while in V3 they are represented as a dimension in the Sales fact.
Therefore, the joins must be implemented differently in the two queries above.
Query 11.13. Total sales amount and total freight by category and year.
Since the Freight measure was introduced in version V2 of the schema, this
query cannot be answered in V1 . The query for V3 is as follows.
SELECT CategoryName, D.Year, SUM(SalesAmount) AS SalesAmount,
SUM(Freight) AS Freight
FROM SalesV3 F, Product P, Category C, Date D
WHERE F.ProductKey = P.ProductKey AND P.CategoryKey = C.CategoryKey AND
F.OrderDateKey = D.DateKey
GROUP BY CategoryName, D.Year
ORDER BY CategoryName, D.Year
Suppose that version V2 was implemented on January 1st, 2017. Then, the
sales facts previous to that date will have a null value for Freight.
Query 11.14. Total sales amount by category for stores that have more than
30,000 square feet.
This query can only be answered in V1 since the attribute StoreSqft only
exists in that version. The query is as follows.
SELECT CategoryName, SUM(SalesAmount) AS SalesAmount
FROM SalesV1 F, Product P, Category C, Store S
WHERE F.ProductKey = P.ProductKey AND P.CategoryKey = C.CategoryKey AND
F.StoreKey = S.StoreKey AND StoreSqft > 30000
GROUP BY CategoryName
ORDER BY CategoryName
This query illustrates that data introduced in later versions is also available
in V1 . However, stores introduced in V2 or V3 have a null value for StoreSqft
and sales introduced in V3 have no associated Store. Due to these reasons,
the result of this query will be partial since the information required by the
query is not available throughout the lifespan of the data warehouse.
11.10 Summary
solutions for each of them. As we have stated, the problem of dealing with
both multiversion data warehouses and temporal data warehouses can be
solved by combining these solutions. We proposed a conceptual multiversion
data warehouse model and provided a logical implementation of it using the
single-table approach for dimension tables and the multiple-table approach
for fact tables. We showed how views can be used as an efficient mechanism
for accommodating warehouse data across versions, and finally, we addressed
several issues that arise when querying multiversion data warehouses in SQL.
11.18 Give an example scenario that motivates the need to maintain several
versions of a data warehouse.
11.19 Describe with the scenario of the previous question the model for
multiversion data warehouses presented in this chapter.
11.20 How is a time frame associated to the versions of a data warehouse?
11.21 What possible changes may occur in a multidimensional schema?
11.22 Compare the two approaches for implementing a multiversion data
warehouse, identifying their advantages and disadvantages.
11.23 Why are views needed for multiversion data warehouses? Give two
examples of views that differ significantly in their complexity.
11.24 Describe the various issues that must be taken into account when
querying a multiversion data warehouse.
11.13 Exercises
Exercise 11.1. Using the example schema in Fig. 11.1, write an SQL query
expressing the maximum salary by department.
Exercise 11.4. Consider the conceptual schema for the university applica-
tion obtained in the solution of Ex. 4.4. Extend the conceptual schema to
keep the following information.
• The lifecycles of courses, professors, departments, and projects.
• The evolution of dean and number of students for departments.
• The evolution of status for professors.
• The evolution of the affiliation of professors to departments.
• The evolution of teaching.
Exercise 11.5. Translate the temporal MultiDim schema obtained for the
university application in the previous exercise into the relational model.
Exercise 11.6. In the relational schema obtained for Ex. 11.5, enforce the
following integrity constraints.
a. The intervals defining the lifespan of a professor are disjoint.
b. The time frame of the temporal attribute Status is included in the lifespan
of Professor.
c. The lifespan of the relationship between professors and departments is
covered by the intersection of the lifespans of the participating members.
d. The lifespan of a professor is equal to the union of the lifespans of her
links to all departments.
Exercise 11.7. Translate the schema in Fig. 11.27 into the relational model
and write the SQL query that computes the average quantity by category,
branch, and month.
Fig. 11.27 Schema for Ex. 11.7: a fact with measure Quantity and dimensions Date, Product (with hierarchy Categories to level Category), and Store (with hierarchy Branches to level Branch)
Exercise 11.8. Translate the schema in Fig. 11.28 into the relational model
and write the SQL query that computes the average dosage of drugs by
diagnosis and age group.
Fig. 11.28 Schema for Ex. 11.8: a Prescription fact with measure Dosage and dimensions Drug, Patient, AgeGroup (with hierarchy AgeGroups), Diagnosis (with hierarchy Categories), and Date (roles From and To)
Exercise 11.9. Translate the schema in Fig. 11.29 into the relational model
and write the SQL query that computes the total cost by department and
month from the measures TotalDays and DailyRate.
Fig. 11.29 Schema for Ex. 11.9: a fact with measures TotalDays and DailyRate and dimensions Project, Employee (with hierarchy Organization to level Department), and Date (roles From and To)
Exercise 11.10. Consider the conceptual schema given in Fig. 11.30, which
is used for analyzing car insurance policies. Translate the conceptual schema
into a relational one and write the following SQL queries.
a. Number of policies by coverage as of January 1st, 2015.
b. Total policy amount per coverage and month.
c. Monthly year-to-date policy amount for each coverage.
d. Number of policies sold by an employee in 2014 compared with the num-
ber of policies sold by herself and her subordinates in 2014.
e. For each policy type, number of days of its lifespan compared with the
number of days that it contained the coverage “Personal Injury”.
f. For each customer, total policy amount for current policies covering ve-
hicles with appraised value at the beginning of the policy greater than
$10,000.
g. For each vehicle, give the periods during which it was covered by a policy.
h. Total policy amount for policies with coverages “Collision” or “Personal
Injury”.
i. For each employee, number of policies sold by month and department.
j. For each vehicle, total policy amount in the periods during which the
vehicle was assigned to only one driver.
k. Monthly number of policies by customer.
l. Total amount for policy types during the periods in which all their cov-
erages have a limit greater than $7,000.
Fig. 11.30 Conceptual schema for the analysis of car insurance policies
consider adding and removing the following concepts across versions: dimen-
sions, measures, level attributes, and hierarchies.
Exercise 11.12. Translate into the relational model the conceptual schemas
of the multiversion Foodmart cube obtained in Ex. 11.11.
Exercise 11.13. Given the data warehouse obtained as answer in Ex. 11.12,
create the views for accessing the content of the warehouse in the various
versions.
Exercise 11.14. Write in SQL the following queries for the data warehouse
obtained as answer in Ex. 11.12. For each query, determine in which versions
it can be answered.

12 Spatial and Mobility Data Warehouses
Spatial databases have been used for several decades for storing and ma-
nipulating the spatial properties of real-world phenomena. There are two
complementary ways of considering these phenomena. Discrete phenomena
correspond to recognizable objects with an associated spatial extent. Exam-
ples are bus stops, roads, and states, whose spatial extents are represented,
respectively, by a point, a line, and a surface. Continuous phenomena
vary over space (and possibly time) and associate to each point within their
spatial and/or temporal extent a value that characterizes the phenomenon at
that point. Examples include temperature, soil composition, and elevation.
These concepts are not mutually exclusive. In fact, many phenomena may be
viewed alternatively as discrete or continuous. For example, while a road is
a discrete entity, the speed limit and the number of lanes may vary from one
position of the road to another.
12.1 Conceptual Design of Spatial Data Warehouses

In this section, we present the spatial extension of the MultiDim model.
The graphical notation of the model is described in Appendix A. We start by
describing next the data types for representing, at a conceptual level, discrete
and continuous phenomena.
Spatial data types are used to represent the spatial extent of real-world
objects. At the conceptual level, we use the spatial data types shown in
Fig. 12.1. These data types provide support for two-dimensional features.
Point represents zero-dimensional geometries denoting a single location in
space. A point can be used to represent, for instance, a village in a country.
Line represents one-dimensional geometries denoting a set of connected
points defined by a continuous curve in the plane. A line can be used to
represent, for instance, a road in a road network. A line is closed if it has no
identifiable extremities (i.e., its start point is equal to its end point).
OrientedLine represents lines whose extremities have the semantics of a
start point and an end point (the line has a given direction from the start
point to the end point). It can be used to represent, for instance, a river in a
hydrographic network.
Surface represents two-dimensional geometries denoting a set of connected
points that lie inside a boundary formed by one or more disjoint closed lines.
If the boundary consists of more than one closed line, one of the closed lines
contains all the others, and the latter represent holes in the surface defined
by the former line. In simpler words, a surface may have holes but no islands
(no exterior islands and no islands within a hole).
SimpleSurface represents surfaces without holes. For example, the extent
of a lake without islands may be represented by a simple surface.
Fig. 12.1 Spatial data types: the generic type Geo generalizes SimpleGeo (with subtypes Point, Line, and Surface) and ComplexGeo (with subtypes PointSet, LineSet, and SurfaceSet)
SimpleGeo is a generic spatial data type that generalizes the types Point,
Line, and Surface. SimpleGeo is an abstract type, that is, it is never instanti-
ated as such: Upon creation of a SimpleGeo value, it is necessary to specify
which of its subtypes characterizes the new element. A SimpleGeo value can
be used, for instance, to represent cities, where a small city may be repre-
sented by a point and a bigger city by a simple surface.
Several spatial data types are used to describe spatially homogeneous sets.
PointSet represents sets of points, for instance, tourist points of interest.
LineSet represents sets of lines, for example, a road network. OrientedLineSet
represents a set of oriented lines, for example, a river and its tributaries.
SurfaceSet and SimpleSurfaceSet represent sets of surfaces with or without
holes, respectively, for example, administrative regions.
ComplexGeo represents any heterogeneous set of geometries that may in-
clude sets of points, sets of lines, and sets of surfaces. ComplexGeo may be used
to represent a water system consisting of rivers (oriented lines), lakes (sur-
faces), and reservoirs (points). ComplexGeo has PointSet, LineSet, Oriented-
LineSet, SurfaceSet, and SimpleSurfaceSet as subtypes.
Finally, Geo is the most generic spatial data type, generalizing the types
SimpleGeo and ComplexGeo; its semantics is “this element has a spatial ex-
tent” without any commitment to a specific spatial data type. Like SimpleGeo,
Geo is an abstract type. It can be used, for instance, to represent the admin-
istrative regions of a country, which may be either a Surface or a SurfaceSet.
It is worth noting that empty geometries are allowed, that is, geometries
representing an empty set of points. This is needed in particular to express
the fact that the intersection of two disjoint geometries is also a geometry,
although it may be an empty one.
Topological relationships: Intersects/Overlaps, Disjoint, Equals, Contains/Within, Covers/CoveredBy, Touches, and Crosses.
Crosses: One geometry crosses another if they intersect and the dimension of
this intersection is less than the greatest dimension of the geometries.
We explain next the spatial extension of the MultiDim model. For this, we use
as an example the GeoNorthwind data warehouse, which is the Northwind data
warehouse extended with spatial types. As shown in the schema in Fig. 12.3,
pictograms are used to represent spatial information.
A spatial level is a level for which the application needs to store spatial
characteristics. A spatial level is represented using the icon of its associated
geometry to the right of the level name, which is represented using one of
the spatial data types described in Sect. 12.1.1. For example, in Fig. 12.3,
City and State are spatial levels, while Product and Date are nonspatial levels.
A spatial attribute is an attribute whose domain is a spatial data type.
For example, CapitalGeo is a spatial attribute of type point, while Elevation
is a spatial attribute of type field of reals. Attributes representing continuous
fields are identified by the ‘f()’ pictogram.
A level may be spatial independently of the fact that it has spatial at-
tributes. For example, as shown in Fig. 12.4, depending on application re-
quirements, a level such as State may be spatial or not and may have spatial
attributes such as CapitalGeo.
Fig. 12.4 Examples of levels with spatial characteristics. (a) Spatial level; (b) Spatial
level with a spatial attribute; (c) Nonspatial level with a spatial attribute
Fig. 12.5 A spatial data warehouse for analyzing the maintenance of highways
functions are the center of n points and the center of gravity. Finally, exam-
ples of spatial holistic functions are equipartition and nearest-neighbor. By
default, the MultiDim model uses sum for aggregating numerical measures
and spatial union for aggregating spatial measures. For example, in Fig. 12.5,
when users roll up from the County to the State level, for each state the mea-
sure Length of the corresponding counties will be summed, while the Com-
monArea measure will be a LineSet resulting from the spatial union of the
lines representing highway segments for the corresponding counties.
Spatial measures allow richer analysis than nonspatial measures do. Con-
sider Fig. 12.6, which is used for analyzing the locations of road accidents
according to insurance categories (e.g., full vs. partial coverage). The schema
includes a spatial measure representing the locations of accidents. Spatial
union can be used to roll up to the InsuranceCategory level to display the ac-
cident locations corresponding to each category represented as a set of points.
Other aggregation functions can also be used, such as the center of n points.
On the other hand, Fig. 12.7 shows an alternative schema for the analysis of
road accidents. This schema has no spatial measure: the focus of analysis has
been changed to the amount of insurance payments according to the various
geographic locations as reflected by the spatial dimension Location.
We compare Figs. 12.6 and 12.7 with respect to the different analyses that
can be performed when a location is represented as a spatial measure or as
a spatial hierarchy. In Fig. 12.6, the locations of accidents can be aggregated
(by using spatial union) when a roll-up operation over the Date or Insur-
ance hierarchies is executed. However, this aggregation cannot be done with
the schema in Fig. 12.7. The dimensions are independent, and traversing a
hierarchy along one of them does not aggregate data in another hierarchy.
Further, an analysis of the amounts of insurance payments made in different
geographic zones is not supported in Fig. 12.6 since in this case only the exact
locations of the accidents are known.
12.2 Implementation Considerations for Spatial Data

The discrete and continuous approaches that we presented in Sect. 12.1 are
used to represent spatial data at a conceptual level. Two common implementa-
tions of these models are, respectively, the vector model and the raster model.
In this section, we study how these models are implemented in PostGIS, a
spatial extension of the open-source DBMS PostgreSQL.
The Earth is a complex surface whose shape and dimension cannot be de-
scribed with mathematical formulas. There are two main reference surfaces
to approximate the shape of the Earth: the geoid and the ellipsoid.
The geoid is a reference model for the surface of the Earth that coincides
with the mean sea level and its imaginary extension through the continents.
The spatial data types given in Sect. 12.1.1 describe spatial features at a con-
ceptual level, without taking into account their implementation. The vector
model provides a collection of data types for representing spatial objects in
the computer. Thus, for example, while at a conceptual level a linear object
is defined as an infinite collection of points, at the implementation level such
a line must be approximated using points, lines, and curves as primitives.
The standard ISO/IEC 13249 SQL/MM is an extension of SQL for man-
aging multimedia and application-specific packages. Part 3 of the standard
is devoted to spatial data. It defines how zero-, one-, or two-dimensional spa-
tial data values are represented on a two-dimensional (R2 ), three-dimensional
(R3 ) or four-dimensional (R4 ) coordinate space. We describe next the spatial
data types defined by SQL/MM, which are used and extended in PostGIS.
Figure 12.8 shows the type hierarchy defined in the SQL/MM standard
for geometric features. ST_Geometry is the root of the hierarchy, which is an
abstract type. ST_Point represents zero-dimensional geometries. ST_Curve is
an abstract type representing one-dimensional geometries. Several subtypes
of ST_Curve are defined according to the type of interpolation function used.
For example, ST_LineString and ST_CircularString represent line segments
defined by a sequence of points using linear and circular interpolation, respectively,
while ST_Circle represents circles defined by three noncollinear points.
ST_Surface is an abstract type representing two-dimensional geometries
composed of simple surfaces consisting of a single patch whose boundary
is specified by one exterior ring and zero or more interior rings if the sur-
face has holes. In ST_CurvePolygon, the boundaries are any curve, while in
ST_Polygon, the boundaries are linear strings. ST_Triangle represents polygons
with exactly three vertices (i.e., triangles), and ST_TIN represents triangulated
irregular networks composed of such triangles.
There are also methods that generate new geometries from other ones. The
newly generated geometry can be the result of a set operation on two geome-
tries (e.g., ST_Difference, ST_Intersection, ST_Union) or can be calculated
by some algorithm applied to a single geometry (e.g., ST_Buffer).
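For instance, the following PostGIS query intersects a buffer around a point with a polygon and returns the result as well-known text; the literal geometries are illustrative.

-- Intersect a buffer of one unit around a point with a square polygon
SELECT ST_AsText(
    ST_Intersection(
        ST_Buffer(ST_GeomFromText('POINT(0 0)'), 1),
        ST_GeomFromText('POLYGON((0 0, 2 0, 2 2, 0 2, 0 0))')))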
To conclude this section, it is important to remark that several DBMSs,
such as SQL Server and PostgreSQL (through PostGIS), provide two kinds of data types: the geom-
etry data type and the geography data type. The former is the most used one.
It represents a feature in the Euclidean space. All spatial operations over this
type use units of the Spatial Reference System the geometry is in. The geog-
raphy data type uses geodetic (i.e., spherical) coordinates instead of Cartesian
(i.e., planar) coordinates. This type allows storing data in longitude/latitude
coordinates. However, the functions for manipulating geography values (e.g.,
distance and area) are more complex and take more time to execute. Further-
more, current systems typically provide fewer functions defined on geography
than there are on geometry.
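The difference can be illustrated with the following query, where the coordinates are illustrative longitude/latitude values of two European cities: on the geography type, ST_Distance returns a geodetic distance in meters, while on the geometry type with SRID 4326 it returns a planar distance in degrees.

SELECT
    -- Geodetic distance in meters
    ST_Distance(ST_GeographyFromText('POINT(4.35 50.85)'),
                ST_GeographyFromText('POINT(2.35 48.86)')) AS DistMeters,
    -- Planar distance in the units of the spatial reference system (degrees for SRID 4326)
    ST_Distance(ST_GeomFromText('POINT(4.35 50.85)', 4326),
                ST_GeomFromText('POINT(2.35 48.86)', 4326)) AS DistDegrees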
The continuous fields presented in Sect. 12.1.3 are used to represent spatio-
temporal phenomena at a conceptual level. They are represented at a log-
ical level by coverages. The ISO 19123:2005 standard provides an abstract
model of coverages which is concretized by the OGC Coverage Implementa-
tion Schema in various implementations such as spatio-temporal regular and
irregular grids, point clouds, and general meshes. A coverage consists of four
components, described next.
The domain set defines the spatial and/or temporal extent for the cov-
erage, where the coordinates are expressed with respect to a Coordinate Ref-
erence System (CRS). In the case of spatio-temporal coverages, as in a time-
series of satellite images, the CRS combines a spatial reference system such
as WGS 84 and a temporal coordinate system such as ISO 8601, which repre-
sents dates and times in the Gregorian calendar. The spatial extent is defined
by a set of geometric objects, which may be extended to the convex hull of
these objects. Commonly used domains include point sets, grids, and collec-
tions of closed rectangles. The geometric objects may exhaustively partition
the domain, and thereby form a tessellation as in the case of a grid or a
Triangulated Irregular Network (TIN). Coverage subtypes may be defined
in terms of their domains. The range set is the set of stored values of the
coverage. The range type describes the type of the range set values. In the
case of images, it corresponds to the pixel data type, which often consists
of one or more fields (also referred to as bands or channels). However, since
coverages often model many associated functions sharing the same domain,
the range type can be any record type. As an example, a coverage that assigns
to each point in a city at a particular date and time the temperature, pres-
sure, humidity, and wind velocity will define the range type as a record of
four fields. Finally, the metadata component represents an extensible slot
for storing any kind of application-specific metadata structures.
Coverages can be of two types. A discrete coverage has a domain defined
by a finite collection of geometric objects. A discrete coverage maps each geo-
metric object to a single record of attribute values. An example is a coverage
that maps a set of polygons to the soil type found within each polygon. On
the other hand, in a continuous coverage both the domain and the range
may take infinitely many different values. In most cases, a continuous cover-
age is associated with a discrete coverage that provides a set of control values
to be used as a basis for evaluating the continuous coverage. Evaluation of
the continuous coverage at other positions is done by interpolating between
the geometry and value pairs of the control set. For example, a continuous
coverage representing the temperature in a city at a particular date and time
would be associated with a discrete coverage that holds the temperature val-
ues observed at a set of weather stations, from which the temperature at any
point within the city would be calculated. As another example, a triangulated
irregular network involves interpolation of values within a triangle composed
of three neighbouring point and value pairs.
The most popular coverage type is the regular grid, which supports the
raster model. This model is structured as an array of cells, where each cell
represents the value of an attribute for a real-world location. In PostGIS,
the raster data type can be used for storing raster data in a binary format.
Rasters are composed of bands, also called channels. Although rasters can
have many bands, they are normally limited to four, each one storing integers.
For example, a picture such as a JPEG, PNG, or TIFF is generally composed
of one to four bands, expressed as the typical red green blue alpha (RGBA)
channels. A pixel in raster data is generally modeled as a rectangle with a
value for each of its bands. Each rectangle in a raster has a width and a
height, both representing units of measure (such as meters or degrees) of the
geographic space in the corresponding SRS. We describe next some of the
functions provided by PostGIS to manipulate raster data. Please refer to the
PostGIS documentation for the full set of functions.
Several functions allow querying the properties of a raster. For example, the
function ST_BandNoDataValue returns the value used in a band to represent
cells whose values are unknown, referred to as no data. Function ST_Value
returns the value in a location of the raster for a given band.
Other functions convert between rasters and external data formats. For
example, the function ST_AsJPEG returns selected bands of the raster as a
single JPEG image. Analogously, the function ST_AsBinary returns the binary
representation of the raster.
Another group of functions converts between rasters and vector formats.
The function ST_AsRaster converts a geometry to a RASTER. To con-
vert a raster to a polygon, the function ST_Polygon is used. The function
ST_Envelope returns the minimum bounding box of the extent of the raster,
represented as a polygon.
Other functions compare two rasters or a raster and a geometry with
respect to their spatial relation. For example, function ST_Intersects returns
true if two raster bands intersect or if a raster intersects a geometry.
Finally, another group of functions generates new rasters or geometries
from other ones. For example, ST_Intersection takes two rasters as arguments
and returns another raster. Also, the ST_Union function returns the union
of a set of raster tiles into a single raster composed of one band. The extent
of the resulting raster is the extent of the whole set.
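As an illustration, the following query is a sketch that reads the elevation value of band 1 of the raster covering a given point; it assumes the table CountryElevation defined in the next section, which stores one elevation raster per country, and the point coordinates are illustrative.

SELECT ST_Value(E.Elevation, 1, ST_SetSRID(ST_MakePoint(4.35, 50.85), 4326)) AS Elevation
FROM CountryElevation E
WHERE ST_Intersects(E.Elevation, ST_SetSRID(ST_MakePoint(4.35, 50.85), 4326))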
12.3 Logical Design of Spatial Data Warehouses

Fig. 12.9 Logical representation of the GeoNorthwind data warehouse in Fig. 12.3
the spatial extent of its members. Note that additional attributes will be
added to this table when mapping relationships using Rule 3 below.
Rule 2: A fact F is mapped to a table TF that includes as attributes all
measures of the fact. Further, a surrogate key may be added to the table.
Spatial measures must be mapped to attributes having a spatial type. In
addition, if the fact has an associated topological constraint, a trigger may
be added to ensure that the constraint is satisfied for all fact members.
Note that additional attributes will be added to this table when mapping
relationships using Rule 3 below.
queries involving the Country level. Moreover, most of those queries will not
require the elevation information. Notice that this approach can be used for
all spatial attributes. The table CountryElevation can be created as follows:
CREATE TABLE CountryElevation (
CountryKey INTEGER, Elevation RASTER,
FOREIGN KEY (CountryKey) REFERENCES Country(CountryKey));
The table contains a foreign key to the Country dimension table and an at-
tribute of the RASTER data type. This attribute will store a raster that covers
the spatial extent of each country.
As another example, applying the above rules to the spatial fact Main-
tenance given in Fig. 12.5 will result in a table that contains the surrogate
keys of the four dimensions Segment, RoadCoating, County, and Date, as well
as the corresponding referential integrity constraints. Further, the table con-
tains attributes for the measures Length and CommonArea, where the latter
is a spatial attribute. The table can be created as follows:
CREATE TABLE Maintenance (
SegmentKey INTEGER NOT NULL,
RoadCoatingKey INTEGER NOT NULL,
CountyKey INTEGER NOT NULL,
DateKey INTEGER NOT NULL,
Length INTEGER NOT NULL,
CommonArea geometry(LINESTRING, 4326),
FOREIGN KEY (SegmentKey) REFERENCES Segment(SegmentKey),
/* Other foreign key constraints */ );
Notice that in the above example, the topological constraint involves only two
spatial levels. It is somewhat more complex to enforce a topological constraint
that involves more than two spatial dimensions.
A topological constraint between spatial levels can be enforced either at
each insertion of a child member or after the insertion of all children members.
The choice among these two solutions depends on the kind of topological con-
straint. For example, a topological constraint stating that a region is located
inside the geometry of its country can be enforced each time a region is inserted,
while a topological constraint stating that the geometry of a country is the
spatial union of all its composing regions must be enforced after all of the
regions and the corresponding country have been inserted.
As an example of the first solution, a trigger can be used to enforce the
CoveredBy topological constraint between the Region and Country levels in
Fig. 12.3. This trigger should raise an error if the geometry of a region member
is not covered by the geometry of its related country member. Otherwise, it
should insert the new data into the Region table. The trigger is as follows:
CREATE OR REPLACE FUNCTION RegionInCountry()
RETURNS TRIGGER AS $$
DECLARE
CountryGeo geometry;
BEGIN
/* Retrieve the geometry of the associated country */
CountryGeo = (SELECT C.CountryGeo FROM Country C
WHERE NEW.CountryKey = C.CountryKey);
/* Raise error if the topological constraint is violated */
IF NOT ST_COVERS(CountryGeo, NEW.RegionGeo) THEN
RAISE EXCEPTION 'A region cannot be outside its country';
END IF;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
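The function above must still be attached to the Region table with a CREATE TRIGGER statement; a sketch of this statement and of the second solution, namely a verification query run after all members have been loaded, follows, using the same attribute names as the trigger.

-- Fire the check before each insertion of a region member
-- (EXECUTE PROCEDURE in older PostgreSQL versions)
CREATE TRIGGER Region_InCountry
BEFORE INSERT ON Region
FOR EACH ROW EXECUTE FUNCTION RegionInCountry();

-- Second solution: after loading, list the countries whose geometry is not the
-- spatial union of the geometries of their composing regions
SELECT C.CountryKey
FROM Country C
WHERE NOT ST_Equals(C.CountryGeo::Geometry,
    ( SELECT ST_Union(R.RegionGeo::Geometry)
      FROM Region R
      WHERE R.CountryKey = C.CountryKey ));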
12.5 Querying the GeoNorthwind Data Warehouse in SQL

Query 12.1. Total sales in 2017 to customers located in cities that are within
an area whose extent is a polygon drawn by the user.
SELECT C.CompanyName AS CustomerName, SUM(S.SalesAmount) AS SalesAmount
FROM Sales S, Customer C, City Y, Date D
WHERE S.CustomerKey = C.CustomerKey AND C.CityKey = Y.CityKey AND
S.OrderDateKey = D.DateKey AND D.Year = 2017 AND
ST_Covers(ST_GeographyFromText('POLYGON(
(200.0 50.0,300.0 50.0, 300.0 80.0, 200.0 80.0, 200.0 50.0))'), Y.CityGeo)
GROUP BY C.CompanyName
The above query uses the spatial predicate ST_Covers to filter customer
cities according to their location. The polygon given as argument to the
ST_GeographyFromText function will be defined by the user with the mouse in a
graphical interface showing a map.
Query 12.2. Total sales to customers located in a state that contains the
capital city of the country.
SELECT C.CompanyName AS CustomerName, SUM(S.SalesAmount) AS SalesAmount
FROM Sales S, Customer C, City Y, State A, Country O
WHERE S.CustomerKey = C.CustomerKey AND C.CityKey = Y.CityKey AND
Y.StateKey = A.StateKey AND A.CountryKey = O.CountryKey AND
ST_Covers(A.StateGeo, O.CapitalGeo)
GROUP BY C.CompanyName
The above query uses the function ST_Covers to verify that the geometry of
a state covers the geometry of the capital of its country.
Query 12.3. Spatial union of the states in the USA where at least one cus-
tomer placed an order in 2017.
SELECT ST_Union(DISTINCT A.StateGeo::Geometry)
FROM Sales S, Customer C, City Y, State A, Country O, Date D
WHERE S.CustomerKey = C.CustomerKey AND C.CityKey = Y.CityKey AND
Y.StateKey = A.StateKey AND A.CountryKey = O.CountryKey AND
O.CountryName = 'United States' AND S.OrderDateKey = D.DateKey AND
D.Year = 2017
Here, the function ST_Union performs the spatial union of all the states in the
USA that satisfy the query condition. The argument of the function is the
StateGeo attribute of the State level, which contains the geometries
that will be aggregated. Notice that we use StateGeo::Geometry to convert
the geographies to geometries prior to applying the spatial union.
Query 12.4. Distance between the customers’ locations and the capital of the
state in which they are located.
SELECT C.CompanyName AS CustomerName,
ST_Distance(C.CustomerGeo, A.CapitalGeo) AS Distance
FROM Customer C, City Y, State A
WHERE C.CityKey = Y.CityKey AND Y.StateKey = A.StateKey
ORDER BY C.CompanyName
The above query computes, using the function ST_Distance, the distance
between the geometry of the customer and the capital of its state.
Query 12.5. Total sales amount of each customer to its closest supplier.
SELECT C.CompanyName AS CustomerName, SUM(S.SalesAmount) AS SalesAmount
FROM Sales S, Customer C, Supplier U
WHERE S.CustomerKey = C.CustomerKey AND S.SupplierKey = U.SupplierKey AND
ST_Distance(C.CustomerGeo, U.SupplierGeo) <= (
SELECT MIN(ST_Distance(C.CustomerGeo, U1.SupplierGeo))
FROM Supplier U1 )
GROUP BY C.CompanyName
In the above query, we use the inner query to compute the minimum distance
from a given customer to any supplier.
Query 12.6. Total sales amount for customers that have orders delivered by
suppliers such that their locations are less than 200 km from each other.
SELECT C.CompanyName AS CustomerName, SUM(S.SalesAmount) AS SalesAmount
FROM Sales S, Customer C, Supplier U
WHERE S.CustomerKey = C.CustomerKey AND S.SupplierKey = U.SupplierKey AND
ST_Distance(C.CustomerGeo, U.SupplierGeo) < 200000
GROUP BY C.CompanyName
This query selects, for each customer, the suppliers related to the customer
through at least one order and located less than 200 km from the customer.
Query 12.7. Distance between the customer and supplier for customers that
have orders delivered by suppliers of the same country.
SELECT DISTINCT C.CompanyName AS CustomerName,
U.CompanyName AS SupplierName,
ST_Distance(C.CustomerGeo, U.SupplierGeo) AS Distance
FROM Sales S, Customer C, City CC, State CS, Supplier U, City SC, State SS
WHERE S.CustomerKey = C.CustomerKey AND
C.CityKey = CC.CityKey AND CC.StateKey = CS.StateKey AND
S.SupplierKey = U.SupplierKey AND U.CityKey = SC.CityKey AND
SC.StateKey = SS.StateKey AND SS.CountryKey = CS.CountryKey
ORDER BY C.CompanyName, U.CompanyName
In the above query we obtain for each customer the suppliers located in the
same country, provided that they are both involved in at least one order.
In this query, the function ST_Area computes the area of a country, which is
then expressed in square kilometers. Then, we count the number of customers
located in each of the European countries whose area is greater than 50,000.
Query 12.9. For each supplier, number of customers located at more than
100 km from the supplier.
SELECT U.CompanyName, COUNT(DISTINCT C.CustomerKey) AS NbCustomers
FROM Sales S, Supplier U, Customer C
WHERE S.SupplierKey = U.SupplierKey AND S.CustomerKey = C.CustomerKey AND
ST_Distance(U.SupplierGeo, C.CustomerGeo) > 100000
GROUP BY U.CompanyName
Query 12.10. For each supplier, distance between the location of the supplier
and the centroid of the locations of all its customers.
SELECT U.CompanyName, ST_Distance(U.SupplierGeo, ST_Centroid(
ST_Union(DISTINCT C.CustomerGeo::Geometry))::Geography) AS Distance
FROM Sales S, Supplier U, Customer C
WHERE S.SupplierKey = U.SupplierKey AND S.CustomerKey = C.CustomerKey
GROUP BY U.CompanyName, U.SupplierGeo
This query uses the ST_Value function to obtain the elevation value of cus-
tomers.
Query 12.13. Total sales of customers located in states of Belgium that have
at least 70% of their extent at an altitude smaller than 100 m.
SELECT C.CompanyName AS CustomerName, SUM(F.SalesAmount) AS SalesAmount
FROM Sales F, Customer C, City Y, State S, Country O
WHERE F.CustomerKey = C.CustomerKey AND C.CityKey = Y.CityKey AND
Y.StateKey = S.StateKey AND S.CountryKey = O.CountryKey AND
O.CountryName = 'Belgium' AND ST_Area(S.StateGeo) * 0.7 < (
SELECT ST_Area(ST_Union(SE.Geom))
FROM ( SELECT (ST_Intersection(Elevation, S.StateGeo::Geometry)).*
FROM CountryElevation E
WHERE O.CountryKey = E.CountryKey ) AS SE
WHERE SE.Val < 100 )
GROUP BY C.CompanyName
This query uses the ST_Area function to compare the area of a state with
the area of the state at an altitude less than 100 m. For obtaining the latter,
in the innermost query the ST_Intersection function is used to intersect the
geometry of the state with the raster containing the elevation of Belgium.
This results in a set of (Geom, Val) pairs. Then, the pairs with value less
than 100 are selected in the outer query, and the ST_Union and ST_Area
functions are applied to the corresponding geometries.
The previous sections of this chapter focused on the analysis of the spatial
features of static objects, that is, objects whose spatial features do not
change (or change exceptionally) over time. For example, the location of a
store can change at a certain instant. Similarly, the borders of a state or
a country can change over time. In the remainder of this chapter we focus
on moving objects, that is, objects whose spatial extent changes over time.
We only consider moving points, which are typically used to represent the
location of cars and persons. However, many applications must also deal with
moving geometries of arbitrary type such as lines or regions. These moving
geometries can be rigid, as in the case of autonomous vehicles, or deforming,
as in the case of polluting clouds or spills in bodies of water. The analysis of
data generated by moving objects is called mobility data analysis.
The interest in mobility data analysis has expanded dramatically with
the availability of embedded positioning devices such as GPS. With these
devices, traffic data, for example, can be captured as a collection of sequences
of positioning signals transmitted by the vehicles’ GPS along their itineraries.
Since such sequences can be very long, they are often processed by dividing
them into segments. For instance, the movement of a car can be segmented
with respect to the duration of the time intervals in which it stops at a certain
location. These segments of movement are called trajectories, and they are
the unit of interest in the analysis of movement data. Mobility analysis can be
applied, for example, in traffic management, which requires traffic flows to be
monitored and analyzed to capture their characteristics. Other applications
aim at tracking the position of persons recorded by the electronic devices
they carry, such as smartphones, in order to analyze their behavior.
Trajectories can be represented in two possible ways. A continuous tra-
jectory represents the movement of an object by a sequence of spatiotempo-
ral points, together with interpolation functions that allow the computation,
with a reasonable degree of confidence, of the position of the object at any
instant in the period of observation. On the other hand, a discrete tra-
jectory contains only a sequence of spatiotemporal points but there is no
plausible interpolation function. As a typical example, consider the case of
check-in services in social networks. A user checks in at a place at 2 p.m. and
the next day she checks in at another place at 4 p.m. Interpolation between
these spatiotemporal points will most likely be useless for any application
that wants to analyze the movement of this user. However, an application
aimed at analyzing the presence of people in a given area may find this in-
formation useful. Note that the difference between discrete and continuous
trajectories has to do with the application semantics rather than with the
time between two consecutive trajectory points.
Spatiotemporal or moving-object databases are databases that ma-
nipulate data pertaining to moving objects. For example, a query to a moving-
object database would be “When will the next train from Rome arrive?”.
However, these databases do not enable analytical queries such as “Number
of deliveries that started in Brussels in the last quarter of 2012” or “Av-
erage duration of deliveries by city.” Spatiotemporal or mobility data
warehouses are data warehouses that contain data about the trajectories of
moving objects. Such trajectories are typically analyzed in conjunction with
other spatial data, for instance, a road network or continuous field data such
as elevation.
To represent spatiotemporal data we use a collection of data types that
capture the evolution over time of base types. These types are referred to as
temporal types, and we study them in detail in the next section.
12.7 Temporal Types
Temporal types represent values that change over time, for instance, to keep
track of the evolution of the salaries of employees. Conceptually, temporal
types are functions that assign to each instant a value of a particular domain.
They are obtained by applying a constructor t(·), where t stands for temporal.
Hence, a value of type t(integer), e.g., representing the evolution of the salary
of an employee, is a partial function f : instant → integer. Temporal types
are partial functions since they may be undefined for certain periods of time.
In what follows, we only consider valid time. As we have seen in Sect. 11.1,
valid time represents the time in the application domain, independently of
when this information is stored in the database.
Fig. 12.10 Two temporal values of type t(integer) representing the evolution of the salaries of two employees
For example, Fig. 12.10 shows two values of type t(integer), which rep-
resent the evolution of the salary of two employees. For instance, John has
a salary of 20 in the period [2012-01-01, 2012-07-01) and a salary of 30 in
the period [2012-10-01, 2013-01-01), while his salary remains undefined
between 2012-07-01 and 2012-09-30. We denote by ‘⊥’ this undefined value.
As a convention, we use closed-open periods for representing the evolution of
values of a temporal type.
Table 12.1 shows the classes of operations on temporal types.

Table 12.1 Classes of operations on temporal types

Class                           Operations
Projection to domain/range      getTime, getValues, trajectory
Interaction with domain/range   atInstant, atInstantSet, atPeriod, atPeriodSet,
                                atValue, atRange, atGeometry, atMin, atMax,
                                startInstant, startValue, endInstant, endValue,
                                minValue, maxValue
Rate of change                  derivative, speed, direction, turn
Temporal aggregation            integral, duration, length, TMin, TMax, TSum,
                                TAvg, TVariance, TStDev
Lifting                         All new operations inferred
The operations atInstant, atInstantSet, atPeriod, and atPeriodSet restrict the
function to a given instant (set) or period (set), while the operations atValue
and atRange restrict the temporal type to a value or to a range of values
in the range of the function. The operations atMin and atMax restrict the
function to the instants when its value is minimal or maximal, respectively.
Operations startInstant and startValue return, respectively, the first instant
at which the function is defined and the corresponding value. The operations
endInstant and endValue are analogous.
For example, atInstant(SalaryJohn, 2012-03-15) and atInstant(SalaryJohn,
2012-07-15) return, respectively, the value 20 and ‘⊥’, because John’s salary
is undefined at the latter date. Similarly, atPeriod(SalaryJohn, [2012-04-01,
2012-11-01)) results in a temporal real with value 20 at [2012-04-01, 2012-07-
01) and 30 at [2012-10-01, 2012-11-01), where the periods have been projected
to the period specified in the operation. Further, startInstant(SalaryJohn) and
startValue(SalaryJohn) return 2012-01-01 and 20, which are, respectively, the
start time and value of the temporal value. Moreover, atValue(SalaryJohn, 20)
and atValue(SalaryJohn, 25) return, respectively, a temporal real with value
20 at [2012-01-01, 2012-07-01) and ‘⊥’, because there is no temporal real
with value 25 whatsoever. Finally, atMin(SalaryJohn) and atMax(SalaryJohn)
return, respectively, a temporal real with value 20 at [2012-01-01, 2012-07-01)
and a temporal real with value 30 at [2012-10-01, 2013-01-01).
There are three basic temporal aggregation operations that take as ar-
gument a temporal integer or real and return a real value. Operation integral
returns the area under the curve defined by the function, duration returns the
duration of the temporal extent on which the function is defined, and length
returns the length of the curve defined by the function. From these opera-
tions, other derived operations such as TAvg, TVariance, or TStDev can be
defined. These are prefixed with a ‘T’ (temporal) in order to distinguish them
from the usual aggregation operations generalized to temporal types, which
we discuss below. For example, the operation TAvg computes the weighted
average of a temporal value, taking into account the duration in which the
function takes a value. In our example, TAvg(SalaryJohn) yields 23.36, given
that John had a salary of 20 during 182 days and a salary of 30 during 92
days. Further, TVariance and TStDev compute the variance and the standard
deviation of a temporal type. Finally, TMin and TMax return, respectively,
the minimum and maximum value taken by the function. These are obtained
by min(getValues(·)) and max(getValues(·)), where min and max are the classic
operations over numeric values.
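For instance, the value of TAvg(SalaryJohn) given above follows from weighting each salary by the number of days during which it holds:
TAvg(SalaryJohn) = (20 × 182 + 30 × 92) / (182 + 92) = 6,400 / 274 ≈ 23.36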
The generalization of operations on nontemporal types to temporal types
is called lifting. An operation for nontemporal types is lifted to allow any
of the arguments to be replaced by a temporal type and returns a temporal
type. As an example, the less than (<) operation has lifted versions where one
or both of its arguments can be temporal types and the result is a temporal
Boolean. Intuitively, the semantics of such lifted operations is that the result
is computed at each instant using the nonlifted operation.
When two temporal values are defined on different temporal extents, the
result of a lifted operation can be defined in two possible ways. On the one
hand, the result is defined on the intersection of both extents and undefined
elsewhere. On the other, the result is defined on the union of the two extents,
and a default value (such as 0, for the addition) is used for extents that belong
to only one temporal type. For lifted operations, we assume that the result
is defined on the intersection of the two extents. For example, in Fig. 12.10,
the comparison SalaryJohn < SalaryMary results in a temporal Boolean with
value true during [2012-04-01, 2012-07-01) and [2012-10-01, 2013-01-01).
Fig. 12.11 The lifted average of the two temporal values of Fig. 12.10
Aggregation operations can also be lifted. For example, a lifted Avg opera-
tion combines a set of temporal reals and results in a new temporal real where
the average is computed at each instant. For example, Fig. 12.11 shows the
average of the temporal values in Fig. 12.10. Notice that for temporal aggre-
gation, we assume that the result is defined on the union of all the extents.
The definition of temporal types discussed so far is also valid for spa-
tial types, leading to spatiotemporal types. For example, a value of type
t(point), which can represent the trajectory of a vehicle, is a partial function
f : instant → point. We present next some of the operations of Table 12.1 for
the spatial case using the example in Fig. 12.12a, which depicts two temporal
points RouteV1 and RouteV2 that represent the delivery routes followed by
two vehicles V1 and V2 on a particular day. We can see, for instance, that
vehicle V1 took 15 min to go from point (0,0) to point (3,3), and then it
stopped for 10 min at that point. Thus, vehicle V1 traveled a distance of
√18 ≈ 4.24 in 15 min, while vehicle V2 traveled a distance of √5 ≈ 2.24
in the first 10 min and a distance of 1 in the following 5 min. We assume a
constant speed between consecutive pairs of points.
The operation trajectory (see Table 12.1) projects temporal geometries
into the spatial plane. The projection of a temporal point into the plane may
consist of points and lines, the projection of a temporal line into the plane
may consist of lines and regions, and the projection of a temporal region into
the plane consists of a region. In our example, trajectory(RouteV1) results in
the leftmost line in Fig. 12.12a, without any temporal information.
Fig. 12.12 Graphical representation of (a) the trajectories of two vehicles, and (b) their
temporal distance.
All operations over nontemporal spatial types are lifted to allow any of
the arguments to be a temporal type and return a temporal type. Intuitively,
the semantics of such lifted operations is that the result is computed at each
instant using the nonlifted operation. As an example, the distance function,
which returns the minimum distance between two geometries, has lifted ver-
sions where one or both of its arguments can be temporal points and the
result is a temporal real. In our example, distance(RouteV1, RouteV2) returns
a temporal real shown in Fig. 12.12b, where, for instance, the function has a
value 1.5 at 8:10 and 2 at 8:15.1
Topological operations can also be lifted. In this case, the semantics is that
the operation returns a temporal Boolean that computes the topological re-
lationship at each instant. For example, Intersects(RouteV1, RouteV2) returns
a temporal Boolean with value false during [8:05, 8:20] since the two vehicles
were never at the same point at any instant of their route.
Several operations compute the rate of change for points. Operation speed
yields the speed of a temporal point at any instant as a temporal real. Op-
eration direction returns the direction of the movement, that is, the angle
between the x-axis and the tangent to the trajectory of the moving point.
Operation turn yields the change of direction at any instant. Finally, deriva-
tive returns the derivative of the movement as a temporal real. For example,
speed(RouteV1) yields a temporal real with values 16.9 at [8:00, 8:15) and
0 at [8:15, 8:25], direction(RouteV1) yields a temporal real with value 45 at
[8:00, 8:15), turn(RouteV1) yields a temporal real with value 0 at [8:00, 8:15),
and derivative(RouteV1) yields a temporal real with value 1 at [8:00, 8:15).
Notice that while the vehicle is stopped, the direction and turn
are undefined.
1
Notice that the distance is a quadratic function and we have approximated the distance
with a linear function.
12.8 Temporal Types in MobilityDB
Current DBMSs do not provide support for temporal types. In this section,
we show how the temporal types presented in Sect. 12.7 are implemented in
MobilityDB [278], an open-source mobility database based on PostgreSQL
and PostGIS. As explained in Chap. 11, temporal support has been intro-
duced in the SQL standard and some of its features have been implemented in
SQL Server, DB2, Oracle, and Teradata. Such functionality adds temporality
to tables, thus associating a period to each row. However, to cope with the
needs of mobility applications, we need an alternative approach that allows
the representation of the temporal evolution of individual attribute values.
In order to manipulate temporal types we need a set of time data types
corresponding to those presented in Sect. 11.2.1. MobilityDB uses the times-
tamptz (short for timestamp with time zone) type provided by PostgreSQL
and three new types, namely period, timestampset, and periodset.
The period type represents the instants between two bounds, the lower and
the upper bounds, which are timestamptz values. The bounds can be inclusive
(represented by “[” and “]”) or exclusive (represented by “(” and “)”). A period
value with equal and inclusive bounds corresponds to a timestamptz value.
An example of a period value is as follows
SELECT period '[2012-01-01 08:00:00, 2012-01-03 09:30:00)';
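The timestampset and periodset types can be written analogously; the following values are sketches of their literal syntax, with dates chosen for illustration only.
SELECT timestampset '{2012-01-01 08:00:00, 2012-01-01 08:10:00}';
SELECT periodset '{[2012-01-01 08:00:00, 2012-01-01 08:30:00),
    [2012-01-01 09:00:00, 2012-01-01 09:30:00)}';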
Temporal types whose base type is continuous (i.e., float, geometry, or geography) may
evolve in a continuous or stepwise manner.
The subtype of a temporal value states the temporal extent at which
the evolution of values is recorded. Temporal values come in four subtypes,
namely, instant, instant set, sequence, and sequence set.
A temporal instant value represents the value at a time instant, such as
SELECT tfloat '17@2018-01-01 08:00:00';
As can be seen, a value of a sequence type has a lower and an upper bound
that can be inclusive (represented by ‘[’ and ‘]’) or exclusive (represented by
‘(’ and ‘)’). The value of a temporal sequence is interpreted by assuming that
the period of time defined by every pair of consecutive values v1@t1 and v2@t2
is lower inclusive and upper exclusive, unless they are the first or the last in-
stants of the sequence and in that case the bounds of the whole sequence
apply. Furthermore, the value taken by the temporal sequence between two
consecutive instants depends on whether the subtype is discrete or continu-
ous. For example, the temporal sequence above represents that the value is
10 during (2018-01-01 08:00:00, 2018-01-01 08:05:00), 20 during [2018-01-01
08:05:00, 2018-01-01 08:10:00), and 15 at the end instant 2018-01-01 08:10:00.
On the other hand, the following temporal sequence
SELECT tfloat '(10@2018-01-01 08:00:00, 20@2018-01-01 08:05:00,
15@2018-01-01 08:10:00]';
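The examples that follow assume a table of employees with a temporal attribute recording the evolution of their salaries. A possible definition, populated with the two salaries of Fig. 12.10, is sketched next; the column names and the use of the tint (temporal integer) type are our assumptions.
CREATE TABLE Employee (
    FirstName varchar(30),
    Salary tint );
INSERT INTO Employee VALUES
    ('John', tint '{[20@2012-01-01, 20@2012-07-01), [30@2012-10-01, 30@2013-01-01)}'),
    ('Mary', tint '[50@2012-04-01, 60@2012-10-01, 60@2013-04-01]');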
The values for the Salary attribute above correspond to those in Fig. 12.10.
We show next how some of the operations for temporal types defined in
Table 12.1 can be expressed in MobilityDB. For example, given the above
table with the two tuples inserted, the query
SELECT getTime(E.Salary), getValues(E.Salary)
FROM Employee E
The first column of the result above is of type periodset, while the second col-
umn is of type integer[] (array of integers) provided by PostgreSQL. Similarly,
the query
SELECT valueAtTimestamp(E.Salary, '2012-04-15'),
valueAtTimestamp(E.Salary, '2012-07-15')
FROM Employee E
returns, for John, the values 20 and NULL, where the NULL value represents the fact that the salary of John is
undefined on 2012-07-15. The following query
SELECT atPeriod(E.Salary, '[2012-04-01,2012-11-01)')
FROM Employee E
returns
{[20@2012-04-01, 20@2012-07-01), [30@2012-10-01, 30@2012-11-01)}
{[50@2012-04-01, 60@2012-10-01, 60@2012-11-01)}
where the temporal attribute has been restricted to the period given in the
query. Furthermore, a query that restricts the salaries to their minimum and
maximum values, such as the following sketch using the atMin and atMax operations,
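SELECT atMin(E.Salary), atMax(E.Salary)
FROM Employee E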
gives as result the minimum and maximum values and the periods when they
occurred, as follows.
{[20@2012-01-01, 20@2012-07-01)} {[30@2012-10-01, 30@2013-01-01]}
{[50@2012-04-01, 50@2012-10-01)} {[60@2012-10-01, 60@2013-04-01]}
We show next the usage of lifted operations. Recall that the semantics of
such operations is such that the nonlifted operation is applied at each instant.
Then, the query
SELECT E1.Salary #< E2.Salary
FROM Employee E1, Employee E2
WHERE E1.FirstName = 'John' and E2.FirstName = 'Mary'
Notice that the comparison is performed only on the time instants that are
shared by the two temporal values. Similarly, the query
SELECT AVG(E.Salary)
FROM Employee E
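The remaining examples use deliveries stored in a table whose Route attribute is a temporal geometry point. A possible definition is the following sketch; the column names and types, in particular DeliveryDate, are our assumptions based on the INSERT statement below.
CREATE TABLE Delivery (
    VehicleId varchar(10),
    DeliveryDate date,
    Route tgeompoint );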
We insert now two tuples into this table, containing information about two
deliveries performed by two vehicles V1 and V2 shown in Fig. 12.12a:
INSERT INTO Delivery VALUES
( 'V1', '2012-01-10', TGEOMPOINT '[Point(0 0)@2012-01-10 08:00,
Point(3 3)@2012-01-10 08:15, Point(3 3)@2012-01-10 08:25]' ),
( 'V2', '2012-01-10', TGEOMPOINT '[Point(1 0)@2012-01-10 08:05,
Point(3 1)@2012-01-10 08:15, Point(3 2)@2012-01-10 08:20]' );
Recall that these are continuous trajectories and thus, we assume a constant
speed between any two consecutive points and use linear interpolation to
determine the position of the vehicles at any instant.
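For instance, the position of vehicle V1 at 8:05 can be obtained with a query such as the following sketch; by linear interpolation, valueAtTimestamp returns the point (1 1), one third of the way from (0 0) to (3 3).
SELECT valueAtTimestamp(D.Route, '2012-01-10 08:05')
FROM Delivery D
WHERE D.VehicleId = 'V1';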
We show next examples of lifted spatial operations. The following query
computes the distance between the two vehicles whose graphical representa-
tion was given in Fig. 12.12b:
SELECT distance(D1.Route, D2.Route)
FROM Delivery D1, Delivery D2
WHERE D1.VehicleId = 'V1' AND D2.VehicleId = 'V2'
Finally, the following query uses tintersects, the lifted version of the ST_Intersects
topological operation, to test whether the routes of the two vehicles ever intersect:
SELECT tintersects(D1.Route, D2.Route)
FROM Delivery D1, Delivery D2
WHERE D1.VehicleId = 'V1' AND D2.VehicleId = 'V2'
12.9 Mobility Data Warehouses
We study next how data warehouses can be extended with temporal types in
order to support the analysis of mobility data. We use the Northwind case
study in order to introduce the main concepts.
The Northwind company wants to build a mobility data warehouse that
keeps track of the deliveries of goods to their customers in order to opti-
mize the shipping costs. Nonspatial data include the characteristics of the
vehicles performing the deliveries. Spatial data include the road network, the
warehouses that store the goods to be delivered, the customers, and the ge-
ographical information related to these locations (city, state, and country).
Spatiotemporal data include the trajectories followed by the vehicles. In our
scenario, vehicles load the goods in a warehouse, perform a delivery serving
several customers, and then return to the warehouse. We will analyze the
deliveries by vehicles, days, warehouses, and delivery locations.
Figure 12.13 shows the conceptual schema depicting the above scenario
using the MultiDim model extended to support spatial data and temporal
types. As shown in the figure, the fact DeliverySegment is related to five
dimensions: Vehicle, Delivery, Date, From, and To, where the latter two link
to either a Customer or a Warehouse. Consider for example a delivery that
serves two clients. This delivery will have three segments: the first one going
from the warehouse to the first client, the second one going from the first
to the second client, and the third one going from the second client back to the warehouse.
Fig. 12.13 Conceptual schema of the Northwind mobility data warehouse
Fig. 12.14 Relational schema of the Northwind mobility data warehouse in Fig. 12.13
Fig. 12.15 Data generated for the Northwind mobility data warehouse. The road net-
work is shown with blue lines, the warehouses are shown with a red star, the
routes taken by the deliveries are shown with black lines, and the location
of the customers with black points.
Fig. 12.16 A trajectory of a single vehicle during one day serving four clients.
4
http://www.qgis.org
This query does not use the fact table. Since a delivery starts and ends at
a warehouse, we use the function startValue to get the starting point of the
deliveries, which is the location of a warehouse.
This query leverages the fact that a delivery segment which is not the first
one starts from a customer. This query uses the function ST_Covers to obtain
the municipality of the client. Notice that we could alternatively obtain the
municipality of a client from the foreign key in table Customer.
5
https://github.com/MobilityDB/MobilityDB
This query uses the same approach as the previous one, but we need
to roll up to the province level to know the province of a municipality.
Query 12.18. Average duration per day of deliveries starting with a cus-
tomer located in the municipality of Uccle.
WITH UccleDeliveries AS (
SELECT DISTINCT DeliveryKey
FROM DeliverySegment S, Customer C, Municipality M
WHERE S.SegNo = 1 AND S.ToKey = C.CustomerKey AND
ST_Covers(M.MunicipalityGeo, C.CustomerGeo) AND
M.MunicipalityName LIKE 'Uccle%' )
SELECT D.Day, SUM(timespan(D.Route)) / COUNT(*) AS AvgDuration
FROM Delivery D
WHERE D.DeliveryKey IN (SELECT DeliveryKey FROM UccleDeliveries)
GROUP BY D.Day
ORDER BY D.Day;
This query selects in table UccleDeliveries the deliveries whose first customer
is located in the requested municipality. Then, in the main query, it uses
function timespan to compute the duration of the delivery taking into account
the time gaps, and computes the average of all the durations per day.
This query computes in table DelivMunic the pairs of delivery and municipal-
ity such that the delivery traversed the municipality. For this, the function
ST_Intersects is used. Table DelivNoMunic computes from the previous table
the number of municipalities traversed by a delivery. Table DelivWareh asso-
ciates to each delivery its warehouse. Finally, the main query joins the last
two tables to compute the requested answer.
Query 12.20. Total distance traveled by vehicles per brand.
SELECT B.BrandName, SUM(length(S.Route) / 1000)
FROM DeliverySegment S, Vehicle V, VehicleBrand B
WHERE S.VehicleKey = V.VehicleKey AND V.BrandKey = B.BrandKey
GROUP BY B.BrandName
ORDER BY B.BrandName;
This query uses the function length to obtain the distance traveled in a
delivery segment, which is divided by 1,000 to express it in kilometers, and
then aggregates the distances per brand.
Query 12.21. Number of deliveries per day such that their route intersects
the municipality of Ixelles for more than 20 minutes.
SELECT Day, COUNT(DISTINCT DeliveryKey) AS NoDeliveries
FROM Delivery D, Municipality M
WHERE M.MunicipalityName LIKE 'Ixelles%' AND
D.Route && M.MunicipalityGeo AND
duration(atGeometry(D.Route, M.MunicipalityGeo)) >= interval '20 min'
GROUP BY Day
ORDER BY Day;
This query uses the function atGeometry to restrict the route of the delivery to
the geometry of the requested municipality. Then, function duration computes
the time spent within the municipality and verifies that this duration is at least
20 minutes. Remark that this duration of 20 minutes may not be continuous.
Finally, the count of the selected deliveries is performed as usual.
The term D.Route && M.MunicipalityGeo is optional and it is included
to enhance query performance. It verifies, using an index, that the spatio-
temporal bounding box of the delivery projected into the spatial dimension
intersects with the bounding box of the municipality. In this way, the at-
Geometry function is only applied to deliveries satisfying the bounding box
condition.
Query 12.22. Same as the previous query but with the condition that the
route intersects the municipality of Ixelles for more than 20 consecutive min-
utes.
SELECT Day, COUNT(DISTINCT DeliveryKey) AS NoDeliveries
FROM Delivery D, Municipality M,
unnest(periods(getTime(atGeometry(D.Route, M.MunicipalityGeo)))) P
WHERE M.MunicipalityName LIKE 'Ixelles%' AND
D.Route && M.MunicipalityGeo AND duration(P) >= interval '20 min'
GROUP BY Day
ORDER BY Day;
As in the previous query, the function atGeometry restricts the route of the
delivery to the requested municipality. This results in a route that may be
discontinuous in time, because the route may enter and leave the munici-
pality. The query uses the function getTime to obtain the set of periods of
the restricted delivery, represented as a periodSet value. The function periods
converts the latter value into an array of periods, and the PostgreSQL func-
tion unnest expands this array to a set of rows, each row containing a single
period. Then, it is possible to verify that the duration of one of the periods
is at least 20 minutes.
This query computes in table Days the minimal and maximal day of the
deliveries. For this, we compute the start and end timestamp of each delivery
with functions startTimestamp and endTimestamp, cast these values to dates
using the PostgreSQL operator ::, and apply the MIN and MAX traditional
aggregate functions. Table TimeSplit splits the range of days found in the
previous table into one-hour periods. Notice that in table Days we added one
day to the maximal date to also split the maximal date in table TimeSplit.
In the main query, with function atPeriod we restrict each delivery to a given
one-hour period, compute the speed of the restricted trip as a temporal float,
and apply the function twAvg to compute the time-weighted average of the
speed as a single float value. Finally, the traditional AVG aggregate function
is used to obtain the average of these values per period.
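A possible formulation along the lines just described is the following sketch, which assumes the Delivery table of the relational schema and the MobilityDB functions mentioned above.
WITH Days AS (
    SELECT MIN(startTimestamp(Route))::date AS MinDay,
        MAX(endTimestamp(Route))::date + 1 AS MaxDay
    FROM Delivery ),
TimeSplit AS (
    SELECT period(H, H + interval '1 hour') AS Period
    FROM Days D, generate_series(D.MinDay::timestamptz,
        D.MaxDay::timestamptz - interval '1 hour', interval '1 hour') AS H )
SELECT T.Period, AVG(twAvg(speed(atPeriod(D.Route, T.Period)))) AS AvgSpeed
FROM TimeSplit T, Delivery D
WHERE atPeriod(D.Route, T.Period) IS NOT NULL
GROUP BY T.Period
ORDER BY T.Period;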
Query 12.24. For each speed range of 20 km/h, give the total distance trav-
eled by all deliveries within that range.
WITH Ranges AS (
SELECT I AS RangeKey, floatrange(((I - 1) * 20), I * 20) AS Range
FROM generate_series(1, 10) I )
SELECT R.Range, SUM(length(atPeriodSet(D.Route,
getTime(atRange(speed(D.Route) * 3.6, R.Range)))) / 1000)
FROM Delivery D, Ranges R
WHERE atRange(speed(D.Route) * 3.6, R.Range) IS NOT NULL
GROUP BY R.RangeKey, R.Range
ORDER BY R.RangeKey;
The idea of this query is to have an overall view of the speed behavior of
the entire fleet of delivery vehicles. Table Ranges computes the speed ranges
[0, 20), [20, 40), . . . [180, 200). In the main query, the speed of the trips, ob-
tained in meters per second, is multiplied by 3.6 to convert it to km/h. Then,
function atRange restricts the speed to the portions having a given range
and function getTime obtains the time when the speed was within the range.
The overall route is then restricted to the obtained time periods with the
function atPeriodSet. Function length computes the distance traveled by the
vehicle within this speed range and this value is divided by 1,000 to express
the distance in kilometers. Finally, all the distances for all vehicles at the
given speed range are obtained with the SUM aggregate function.
Query 12.25. Number of deliveries per month that traveled at least 20 km
at a speed higher than 100 km/h.
SELECT DT.Year, DT.MonthNumber, COUNT(DISTINCT DeliveryKey)
FROM Delivery D, Date DT
WHERE D.DateKey = DT.DateKey AND length(atPeriodSet(D.Route,
getTime(atRange(speed(Route) * 3.6, floatrange(100,200))))) / 1000 >= 20
GROUP BY DT.Year, DT.MonthNumber
ORDER BY DT.Year, DT.MonthNumber
The query uses the function speed to obtain the speed of the vehicle at each
instant. The function atRange restricts the speed, expressed in kilometers
per hour, to the range [100,200). The getTime function computes the set of
periods during which the delivery travels within that range. The function
atPeriodSet restricts the overall delivery to these periods and the length func-
tion computes the distance traveled by the restricted deliveries. Finally, we
verify that this distance, converted to kilometers, is at least 20 km.
Query 12.26. Pairs of deliveries that traveled at less than one kilometer
from each other during more than 20 minutes.
SELECT D1.DeliveryKey, D2.DeliveryKey
FROM Delivery D1, Delivery D2
WHERE D1.DeliveryKey < D2.DeliveryKey AND duration(getTime(
atValue(tdwithin(D1.Route, D2.Route, 1000), true))) > '20 min'
The lifted tdwithin operation computes a temporal Boolean stating, at each
instant, whether the two routes are within 1,000 m of each other. The atValue
function restricts this temporal Boolean to the instants when its value is true,
getTime obtains the corresponding periods, and duration verifies that altogether
they amount to more than 20 minutes.
The CREATE TABLE above uses the ST_Intersects function to compute the
number of deliveries that used each road segment. Then, the INSERT state-
ment adds to the table those segments that were not used by any delivery.
We need some statistics about the attribute noDels to define the gradient.
SELECT MIN(noDels), MAX(noDels), round(AVG(noDels),3), round(STDDEV(noDels),3)
FROM HeatMap;
-- 0 299 11.194 26.803
The scale_linear function above transforms the value of the attribute noDels
from a value in [0, 20] into a value in [0,1]. Therefore, we decided to assign
a full blue color to a road segment as soon as there are at least 20 deliveries
that traverse it. The ramp_color function states the gradient to be used for
the display, in our case shades of blue.
Fig. 12.17 Visualizing how often the road segments are taken by the deliveries.
12.11 Summary
In this chapter, we first studied how data warehouses can be extended with
spatial data. For this, we presented a spatial extension of the MultiDim con-
ceptual model with spatial types, field types, and topological relationships.
We used as example the GeoNorthwind data warehouse, which extends the
Northwind data warehouse with spatial data. We also generalized the rules
for translating conceptual to logical models to account for spatial data. Then,
we addressed the vector and raster models to explain how spatial data types
at the conceptual level can be represented at the logical level. We imple-
mented the above concepts using the PostgreSQL database with its spatial
extension PostGIS. We also showed how the GeoNorthwind data warehouse
can be queried with SQL extended with spatial functions.
We then discussed how data warehousing techniques can be applied to
mobility data. For this, we first defined temporal types, which capture the
variation of a value across time. Applying temporal types to spatial data leads
to spatiotemporal types, which provide a conceptual view of trajectories.
We then presented how temporal types are implemented in MobilityDB, an
open-source moving-object database based on PostgreSQL and PostGIS. We
illustrated these concepts by extending the Northwind data warehouse with
mobility data and showed how to query this data warehouse in MobilityDB.
12.11 What is the difference between the vector and the raster data models
for representing spatial data?
12.12 Describe the spatial data types implemented in SQL/MM.
12.13 Discuss the mapping rules for translating a spatial MultiDim schema
into a relational schema.
12.14 How are topological relationships represented in a logical schema?
How can these relationships be checked in a logical schema?
12.15 What are moving objects? How are they different from spatial objects?
12.16 Give examples of different types of moving objects and illustrate with
scenarios the importance of their analysis.
12.17 Discuss various criteria that can be used to segment movement data
taking into account different analysis requirements.
12.18 What is a trajectory? What is the difference between continuous and
discrete trajectories?
12.19 Define the terms mobility databases and mobility data warehouses.
Mention the main differences between the two concepts.
12.20 What are temporal types? How are they constructed?
12.21 Give examples of a temporal base type and a spatiotemporal type.
Give examples of operations associated to each of these types.
12.22 Explain why traditional operations must be lifted for temporal types.
Illustrate this with examples.
12.23 Give a hint about how temporal types can be implemented in a plat-
form such as PostGIS. How does this implementation differ from the
abstract definition of temporal types?
12.24 Discuss how temporal types can be added to a multidimensional
schema.
12.25 Discuss the implications of including trajectories as dimensions or
measures in a data warehouse.
12.14 Exercises
Exercise 12.2. Given the relational schema obtained in the previous exer-
cise, write in SQL the following queries.
a. Display all measures summarized for the stores located in California or
Washington, considering only stores in California that are less than 200
km from Los Angeles, and stores in Washington that are less than 200
km from Seattle.
b. For each store, give the total sales to customers from the same city as
the store.
(Figure: schema of the data warehouse for this exercise, with a Sales fact having measures Store Sales, Store Cost, Unit Sales, and the derived measures Sales Average and Profit, and dimensions Product (with Brand, Subcategory, and Category), Store, Promotion, Customer, and Calendar, where stores and customers are linked to a spatial Geography hierarchy composed of City, State, and Country.)
c. For each store, obtain the ratio between the sales to customers from the
same state against its total sales, in 2013.
d. Total sales of stores located at less than 5 km from the city center against
total sales for all stores in their state.
e. Display the unit sales by product brand, only considering sales to cus-
tomers from a country different than the country of the store.
f. For each store, list the total sales to customers living closer than 10 km to the
store, against total sales for the store.
g. For each city, give the store closest to the city center and its best-selling
brand name.
h. Give the spatial union of all the cities that have more than one store with
a surface of more than 10,000 square feet.
i. Give the spatial union of the states such that the average of the total
sales by customer in 2017 is greater than $60 per month.
j. Give the spatial union of all the cities where all customers have purchased
for more than $100.
k. Display the spatial union of the cities whose sales count accounts for more
than 5% of all the sales.
Exercise 12.3. Add spatial data to the data warehouse schema you created
as a solution of Ex. 5.11 for the AirCarrier application. You must analyze the
dimensions, facts, and measures and define which of them can be extended
with spatial features. You should also consider adding continuous field data
representing altitude so you can enhance the analysis by trying to find a corre-
lation between the results and the elevation of the geographic sites.
Exercise 12.4. Use a reverse engineering technique to produce a multidi-
mensional schema from the logical schema obtained as a solution of Ex. 12.3.
Exercise 12.5. Given the logical schema obtained in Ex. 12.3, write in SQL
the following queries.
a. For each carrier and year, give the number of scheduled and performed
flights.
b. For each airport, give the number of scheduled and performed flights in
the last two years.
c. For each carrier and distance group, give the total number of seats sold
in 2012.
d. Display for each city the three closest airports and their distance to the
city independently of the country in which the city and the airport are
located.
e. Give the total number of persons arriving at or departing from airports
closer than 15 km from the city center in 2012.
f. Give for 2012 the ratio between the number of persons arriving at or
departing from airports closer than 15 km from the city center and the
number of persons arriving at or departing from airports located between
15 and 40 km from the city center.
g. Display the spatial union of all airports with more than 5,000 departures
in 2012.
h. Display the spatial union of all airports where more than 100 carriers
operate.
i. For cities served by more than one airport, give the total number of
arriving and departing passengers.
j. For cities served by more than one airport, give the total number of
arriving and departing passengers at the airport closest to the city center,
and the ratio between this value and the city total.
k. Display the spatial union of all airports located at more than 1,000 m
above sea level.
l. Compare the number of departed and scheduled flights for airports lo-
cated above and below 1,000 m above sea level in 2012.
Exercise 12.6. Consider the train company application described in Ex. 3.2
and whose conceptual multidimensional schema was obtained in Ex. 4.3. Add
spatiotemporal data to this schema to transform it into a mobility data ware-
house. You must analyze the dimensions, facts, and measures, and define
which of them can be extended with spatiotemporal features.
Exercise 12.8. Write in SQL the following queries on the relational schema
obtained in Ex. 12.7:
a. Give the trip number, origin, and destination of trips that contain seg-
ments with a duration of more than 3 h and whose length is shorter than
200 km.
b. Give the trip number, origin, and destination of trips that contain at least
two segments served by trains from different constructors.
c. Give the trip number of trips that cross at least three cities in less than
2 h.
d. Give the total number of trips that cross at least two country borders in
less than 4 h.
e. Give the average speed by train constructor. This should be computed
taking the sum of the durations and lengths of all segments with the
same constructor and obtaining the average. The result must be ordered
by average speed.
f. For each possible number of total segments, give the number of trips in
each group and the average length, ordered by number of segments. Each
answer should look like (5, 50, 85), meaning that there are 50 trips that
have 5 segments with an average length of 85 km, and so on.
g. Give the trip number, origin, and destination stations for trips in which
at least one segment runs for at least 100 km within Germany.
(Figure: conceptual schema with an AirQuality fact (measures Value and Index) and dimensions Date (with a Calendar hierarchy of Month, Quarter, and Year), Station, Location, Road, and Pollutant (with Categories).)
Graphs like the one in Fig. 13.1 do not suffice when more expressiveness
is required. For example, the lack of edge labels prevents indication of the
kind of relationship that each edge represents. Adding labels to the edges
in the graph leads to the notion of edge-labeled graphs. In general, more
sophisticated graph data models are needed to address the problems that
appear in real-world situations. Graph data models have been extensively
studied in the literature. In practice, the three models mainly used are based
on the Resource Description Framework (RDF), property graphs, or
hypergraphs.
Models based on RDF represent data as sets of triples where, as we will
study in Chap. 14, each triple consists of three elements that are referred to
as the subject, the predicate, and the object of the triple. These triples al-
low the description of arbitrary objects in terms of their attributes and their
relationships with other objects. Intuitively, a collection of RDF triples is an
RDF graph. An important feature of RDF-based graph models is that they
follow a standard, which is not yet the case for the other graph data models,
although at the time of writing this book, standardization efforts are under
way. Therefore, RDF graphs are mainly used to represent meta-
data. However, since RDF graphs are typically stored using so-called triple
stores, which normally rely on underlying relational databases with special-
ized index structures, they are not the best choice when complex path com-
putations are required. The standard query language associated with RDF
graphs is SPARQL. RDF and the Semantic Web are extensively studied in
Chap. 14 of this book, and are thus not covered in this chapter. An example
of a commercial graph database built upon this model is AllegroGraph.1
In models based on property graphs, both nodes and edges can be la-
beled and can be associated with a collection of (attribute, value) pairs.
Property graphs are the usual choice in practical implementations of modern
graph databases. Unlike the case of RDF, there is no standard language for
property graphs, although there is an emerging standard called GQL2 (Graph
Query Language). Examples of graph databases based on the property graph
data model are Neo4j3 and Sparksee.4 Neo4j will be described in detail in
Section 13.2, since it is one of the most-used graph database systems.
Figure 13.2 shows a property graph version of the graph in Fig. 13.1. In
this graph, nodes are labeled Person, with attributes Name and Gender, and
edges are labeled Follows, with no attribute associated. In this example, the
Neo4j notation has been used.
Fig. 13.2 A property graph version of the graph in Fig. 13.1, where Person nodes (with properties Name and Gender) are connected by Follows edges, in Neo4j notation
It has been proved that there is a formal way of reconciling the property
graph and RDF models through a collection of well-defined transformations
between the two models.
In models based on hypergraphs, any number of nodes can be connected
through a relationship, which is represented as a hyperedge. In contrast,
the property graph model allows only binary relationships. Hypergraphs can
model many-to-many relationships in a natural way. As an example, Fig. 13.3
represents group calls where many users are involved. In this figure, for ex-
ample, there is a hyperedge involving nodes 1, 2, and 3, indicating that Mario
1
http://franz.com/agraph/allegrograph/
2
http://www.gqlstandards.org/
3
https://neo4j.com/
4
http://www.sparsity-technologies.com/
(represented by node 3) initiated a group call with Mary and Julia (nodes 1
and 2, respectively). Here, again, Neo4j notation has been used. There are
nodes of type Phone, with attributes Nbr and Name, representing the phone
numbers and the person’s name, respectively. There are also nodes of type
Call, representing a phone call, with attributes Date and Duration, represent-
ing the date and duration of the call, respectively. We will come back to the
phone call example in more detail in Sect. 13.3.
Fig. 13.3 A hypergraph representing group calls among phone users
Having presented the main graph data models, we are now ready to discuss
graph database systems based on the property graph model.
13.2 Property Graph Database Systems
In what follows, we will focus on native graph database systems based on
on the property graph data model. We call them property graph database
systems.
We explained above that a property graph is composed of nodes, relation-
ships, and properties. Nodes have a label (i.e., the name of the node), and are
annotated with properties (attributes). These annotations are represented as
(property, value) pairs, where property represents the name of the property
(i.e., a string) and value its actual value. Relationships connect and structure
nodes, and have a label, a direction, a start node, and an end node. Like the
nodes, relationships may also have properties. Figure 13.4 shows a simpli-
fied portion of the Northwind operational database represented as a property
graph. Here, a node with name Employee has properties EmployeeID, First-
Name, and LastName. There is also a node with name Order, and properties
OrderID and OrderDate, and a node with name Product, and with proper-
ties ProductID and ProductName. A relationship Contains is defined from the
Order node to the Product node, with properties Quantity and UnitPrice, in-
dicating that the product is contained in the order.
Fig. 13.4 A portion of the Northwind operational database represented as a property graph
13.2.1 Neo4j
A node is created in Cypher with a statement of the form CREATE (n:Label {property: value}), where n is a variable representing the node. Labels (or types) are optional and
a node can have more than one of them. There are also optional (property,
value) pairs. In the example of Fig. 13.4, the statement in Cypher for creating
a Product node is:
CREATE (:Product {ProductID:'1', Name:'Chai'});
Each edge has exactly one label (or type). As for nodes, properties are op-
tional and there can be many of them. The Cypher statement for creating
the edge of type Contains between the order and the product of Fig. 13.4,
indicating that the product is contained in the order, is written as follows:
MATCH (p:Product {ProductID:1}), (o:Order {OrderID:'11006'})
CREATE (o)-[:Contains {Quantity:18, UnitPrice:8}]->(p)
First, MATCH is used to find the product and the order. The results are stored
in the variables p and o, and an edge labeled Contains is created between the
order and the product. In this case, the edge also includes two attributes,
namely Quantity and UnitPrice. Note that, unlike the case of relational data-
bases, there is no formal database schema in Neo4j, and this is, in general,
the case in most graph database systems. This means that instances do not
have to comply with a predefined structure and thus nodes and edges of the
same type may have different components at the instance level. Therefore, in
the example above, we may create another Contains edge with no attributes,
or another Product node without a value for ProductName. All of these nodes
and edges will coexist in the database.
In the following sections we will show how data warehouses can be repre-
sented at the logical and physical levels using Neo4j. As an example, we will
use the Northwind data cube.
We start this section by showing how Cypher can be used for creating graphs.
Then, we show the basics of Cypher as a query language, and finally we briefly
discuss Cypher’s semantics.
The Geography hierarchy is represented by the nodes City, State, Country, and
Continent, and the edges between them.
Fig. 13.5 The schema of the Northwind data warehouse graph in Neo4j
(Figure: an excerpt of the instance of the Northwind data warehouse graph in Neo4j.)
The LOAD statement above retrieves each row in the CSV file, and instan-
tiates the variable row, one row at a time. Then, the CREATE statement
creates the customer node using the values in the row variable. Creating a
relationship is analogous. For example, assuming that the Employee and City
nodes have been created, the IsAssignedTo relationship, indicating that an
employee is assigned to a city, can be created as follows:
LOAD CSV WITH HEADERS FROM "file:/NWdata/territories.csv" AS row
MATCH (employee:Employee {EmployeeKey:row.EmployeeKey})
MATCH (city:City {CityKey:row.CityKey})
MERGE (employee)-[:IsAssignedTo]->(city);
Here, the file territories.csv contains the relationship between the cities and
the employees. The row variable contains the values in each row of the CSV
file. We need two matches then, one for the city nodes (i.e., matching the
CityKey property in the City nodes in the graph with the CityKey value in the
CSV file), and another one for the employee nodes. Finally, the MERGE clause
creates the edge. This clause avoids duplicates: if an edge of the same type
between the same nodes already exists, it is not created again, whereas
CREATE would add a duplicate edge.
Another way to import data from the relational database makes use of
one of the popular libraries developed for Neo4j, namely the APOC library. 5
This library contains many useful functions, among them, functions allowing
connection to a database and retrieval of tuples that are used for populating
a graph. For example, to load the Product nodes we may write:
WITH "jdbc:postgresql://localhost:5433/NorthwindDW?
user=postgres&password=postgres" AS url
CALL apoc.load.jdbc(url, 'SELECT * FROM Product') YIELD row
CREATE (:Product {ProductKey:row.ProductKey, ProductName:row.ProductName,
UnitPrice:row.UnitPrice, Discontinued:row.Discontinued});
The connection string in the WITH clause has the credentials of the database
(in this case, NorthwindDW in a PostgreSQL DBMS), and it is passed on
using the variable url. Then, the apoc.load.jdbc function is called. The YIELD
clause retrieves the data in a variable, in this case, named row, which is used
as above to populate the Product nodes.
We could also update node attributes similarly. The following statement:
5
https://neo4j.com/developer/neo4j-apoc/
WITH "jdbc:postgresql://localhost:5433/NorthwindDW?
user=postgres&password=postgres" AS url
CALL apoc.load.jdbc(url, 'SELECT * FROM Sales') YIELD row
MATCH (s:Sales) WHERE s.OrderNo = row.OrderNo AND
s.OrderLineNo = row.OrderLineNo
SET s += {Discount:row.Discount, Freight:row.Freight}
will add the Discount and Freight attributes to the existing Sales nodes. All
other attributes are kept unchanged. If we remove the + in the above state-
ment, the node will end up with only these two attributes.
The MATCH clause retrieves all product nodes in the variable p. The type
of this variable is node, which means that all attributes of the nodes will be
retrieved. Since we are only interested in two of them, the RETURN clause
returns only the product names and unit prices, and finally, the ORDER BY
clause sorts the result in descending order according to the unit price. If we
only want the products with name Chocolade, the first line above should be
replaced with
MATCH (p:Product) WHERE p.ProductName = "Chocolade"
To ask for the name of the products belonging to the Beverages category,
we write:
MATCH (p:Product)-[:HasCategory]->(c:Category {CategoryName:'Beverages'})
RETURN p.ProductName
The MATCH clause matches the paths in the Product hierarchy such that a
Beverages node is at the top and the product node is at the bottom of the
hierarchy. Finally, the name is returned. Note that if instead of the graph
nodes we want the whole paths that match the pattern, we should write
MATCH path = (p:Product)-[:HasCategory]->(c:Category {CategoryName:'Beverages'})
RETURN path
The COLLECT clause produces, for each employee in the result of the MATCH
clause, a list of the associated city names. To flatten this collection, and obtain
a table as a result, the UNWIND clause is used.
MATCH (c:City)<-[:IsAssignedTo]-(e:Employee)
WITH e, COLLECT(c.CityName) AS collection
UNWIND collection AS col
RETURN e, col
The WITH clause in Cypher acts like a “pipe”, passing variables on to the
following step. In this case, the pairs (e, collection) are passed on to the
UNWIND clause and a table is then returned.
We explain now how the window functions introduced in Sect. 5.8 can
be expressed in Cypher. Consider the following query over the Northwind
data warehouse: “For each sale of a product to a customer, compare the sales
amount against the overall sales of the product.” This query is written in SQL
using window functions as follows:
SELECT ProductName, CompanyName, SalesAmount,
SUM(SalesAmount) OVER (PARTITION BY ProductName) AS TotAmount
FROM Sales
In the query above, the MATCH clause computes all the sales of products to
customers. With such data, two computations are performed in the WITH
clauses. First, the total sales by customer and product are computed in the
variable salesPC. In the second WITH clause, the total sales for each product
Cypher Semantics
We have explained that query evaluation in Cypher is based on pattern
matching. The underlying semantics of Cypher is based on the notions of
graph isomorphism and homomorphism, which we explain next. Let V (G)
and E(G) denote, respectively, the vertices and the edges of a graph G.
An isomorphism from a graph G to a graph H is a bijective mapping
α : V (G) → V (H) such that (u, v) ∈ E(G) if and only if (α(u), α(v)) ∈ E(H).
The above definition implies that α maps edges to edges and vertices to
vertices. On the other hand, a homomorphism from a graph G to a graph
H is a mapping (not necessarily bijective) α : V (G) → V (H) such that (u, v) ∈ E(G) implies (α(u), α(v)) ∈ E(H).
Notice that the difference between the two definitions is that the if and only
if condition in the first one is replaced with an implication in the second one.
As a consequence, a homomorphism α maps edges to edges but need not be
injective, so two distinct vertices of G may be mapped to the same vertex
of H. Note that an isomorphism is a special
type of homomorphism.
Cypher semantics is formalized through the graph isomorphism problem,
defined as follows: Given a property graph G and a pattern graph P , match-
ing P to G results in all subgraphs of G that are isomorphic to P . It is well
known that this is a hard problem, in general NP-complete. For property
graphs, the problem is based on basic graph patterns (BGPs), which is
equivalent to the problem of conjunctive queries, well studied by the database
community. A BGP for querying property graphs is a property graph where
variables can appear in place of any constant (either labels or properties). A
match for a BGP is a mapping from variables to constants such that when
the mapping is applied to the BGP, the result is contained in the original
graph. The results for a BGP are then all the mappings from variables in the
query to constants that comprise a match.
Given the definitions above, evaluating a BGP Q against a graph database
G corresponds to listing all possible matches of Q with respect to G. There
are several semantics to produce this matching: the homomorphism-based semantics
(used, e.g., by SPARQL), where different variables may be matched to the same nodes
or edges; the no-repeated-node semantics, where distinct node variables must be
matched to distinct nodes; and the no-repeated-edge semantics (adopted by Cypher),
where distinct edge variables in a pattern must be matched to distinct edges.
(Figure: a social-network graph with Friend edges in which Chris and Ted are friends, and Mary and Louis are each friends of both of them.)
Splitting such a pattern into two MATCH clauses will enable the edge between Chris
and Ted to be used in the two clauses. In fact, the query returns duplicated results,
since p1 and p2 are matched indistinctly to Louis and Mary.
We next show how to query the Northwind data warehouse using Cypher.
Query 13.1. Total sales amount per customer, year, and product category.
MATCH (c:Category)<-[:HasCategory]-(:Product)<-[:Contains]-(s:Sales)-
[:PurchasedBy]->(u:Customer)
MATCH (s)-[:HasOrderDate]->(d:Date)
RETURN u.CompanyName AS Customer, d.Year AS Year, c.CategoryName AS Category,
sum(s.SalesAmount) AS SalesAmount
ORDER BY Customer, Year, Category
In this query, the roll-up to the Category level along the product dimension is
performed through the first MATCH clause. The second MATCH clause per-
forms the roll-up to the Year level, although this is straightforward, since all
the date information is stored as properties of the Date nodes. The aggrega-
tion is performed in the RETURN clause, and finally the results are ordered
by the company name of the customer, the year, and the category.
Query 13.2. Yearly sales amount for each pair of customer and supplier
countries.
MATCH (cc:Country)<-[:OfCountry]-(:State)<-[:OfState]-(:City)<-[:IsLocatedIn]-
(:Customer)<-[:PurchasedBy]-(s:Sales)
MATCH (sc:Country)<-[:OfCountry]-(:State)<-[:OfState]-(:City)<-[:IsFrom]-
(:Supplier)<-[:SuppliedBy]-(s:Sales)-[:HasOrderDate]->(d:Date)
RETURN cc.CountryName AS CustomerCountry, sc.CountryName AS SupplierCountry,
d.Year AS Year, sum(s.SalesAmount) AS SalesAmount
ORDER BY CustomerCountry, SupplierCountry, Year
The first MATCH performs a roll-up to the customer country level, while
the second MATCH performs a roll-up to the supplier country and the Year
levels. The aggregated result is computed in the RETURN clause. Here we
can see the impact of the Cypher semantics that we studied in Sect. 13.2.2.
If, instead of writing the two roll-ups to the Country level using two different
statements we had used only one statement, some tuples would have been
missed because of the no-repeated-edge semantics.
Query 13.3. Monthly sales by customer state compared to those of the pre-
vious year.
// All months and customer states
MATCH (d:Date)
MATCH (:Customer)-[:IsLocatedIn]->(:City)-[:OfState]->(t:State)
WITH DISTINCT t.StateName AS State, d.Year AS Year, d.MonthNumber AS Month
// Sales amount by month and state including months without sales
OPTIONAL MATCH (d:Date)<-[:HasOrderDate]-(s:Sales)-[:PurchasedBy]->
(:Customer)-[:IsLocatedIn]->(:City)-[:OfState]->(t:State)
WHERE t.StateName = State AND d.Year = Year AND d.MonthNumber = Month
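// A possible continuation of this query, sketched from the explanation below
// and the analogous Query 13.4 (the CASE guards the first twelve months):
WITH State, Year, Month, sum(s.SalesAmount) AS SalesAmount
ORDER BY State, Year, Month
// Add row number
WITH State, collect({y:Year, m:Month, s:SalesAmount}) AS rows
UNWIND range(0, size(rows) - 1) AS i
// Same month of the previous year using row number
WITH State, rows[i].y AS Year, rows[i].m AS Month, rows[i].s AS SalesAmount,
    CASE WHEN i >= 12 THEN rows[i-12].s ELSE null END AS SalesAmountPY
RETURN State, Year, Month, SalesAmount, SalesAmountPY
ORDER BY State, Year, Month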
This query is more involved. One of the issues to solve here is that we need to
produce the aggregation of all possible combinations of customer state and
month, including those combinations for which there have been no sales. This
is done through the Cartesian product in the first two MATCH clauses and
the WITH clause, which passes all possible combinations of state, year, and
month in the database. The OPTIONAL MATCH clause, which behaves like
an outer join, returns a null value for salesAmount when there are no sales for
a combination. Over the result, the total sales per state, year, and month are
computed in the WITH clause. If there were no sales for a combination, zero
is returned (i.e., the operation sum(s.SalesAmount) AS SalesAmount returns
0 if there was no match in the outer join). The result is sorted by state, year,
and month. Then, for each state, a list is produced by the COLLECT function,
composed of triples containing the sales for each month and year. This list
is stored in the variable rows. For each state, the list has the same length.
The UNWIND clause produces an index, which is used to iterate over the
elements in the list. This index is used to compute the sales amount value of
the same month of the previous year in the last WITH clause. For example,
for each state, rows[i].y returns the year in position [i] in the list. Since the
list is sorted by year and month, the operation rows[i-12] returns the sales
of the same month in the previous year, and the value is stored in variable
SalesAmountPY.
Query 13.4. Monthly sales growth per product, that is, total sales per product
compared to those of the previous month.
// All months and products
MATCH (d:Date)
MATCH (p:Product)
WITH DISTINCT p.ProductName AS Product, d.Year AS Year, d.MonthNumber AS Month
// Sales amount by month and product including months without sales
OPTIONAL MATCH (p:Product)<-[:Contains]-(s:Sales)-[:HasOrderDate]->(d:Date)
WHERE p.ProductName = Product AND d.Year = Year AND d.MonthNumber = Month
WITH Product, Year, Month, sum(s.SalesAmount) AS SalesAmount
ORDER BY Product, Year, Month
// Add row number
WITH Product, collect({y:Year, m:Month, s:SalesAmount}) AS rows
UNWIND range(0, size(rows) - 1) AS i
// Previous month using row number
WITH Product, rows[i].y AS Year, rows[i].m AS Month, rows[i].s AS SalesAmount,
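    // A possible continuation of this query, sketched from the explanation below:
    // the previous month's value and the resulting growth
    CASE WHEN i > 0 THEN rows[i-1].s ELSE null END AS SalesAmountPM
RETURN Product, Year, Month, SalesAmount, SalesAmountPM,
    SalesAmount - SalesAmountPM AS SalesGrowth
ORDER BY Product, Year, Month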
This query uses a similar strategy to the previous one to compute aggre-
gate values for all combinations of products and months. The OPTIONAL
MATCH and the WITH clauses perform the aggregation, also producing a
zero as aggregated value for missing combinations. The call to the COLLECT
function and the subsequent UNWIND generates a row number per product
in the variable i. This row number is used to access the sales amount value
of the previous month in the last WITH clause to compute the growth in the
RETURN clause.
This query is simple to express in Cypher. Once the matching of the employ-
ees and their sales is performed, we have all we need for the aggregation. In
the RETURN clause, the first and last name of each employee are concate-
nated, separated by a blank space. Finally, the LIMIT clause retains the first
three employees and their total sales.
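A Cypher query consistent with this description, retrieving the three employees with
the highest total sales, can be sketched as follows (it relies on the HandledBy edges
and the SalesAmount measure used throughout this chapter):
MATCH (e:Employee)<-[:HandledBy]-(s:Sales)
RETURN e.FirstName + ' ' + e.LastName AS Employee, sum(s.SalesAmount) AS SalesAmount
ORDER BY SalesAmount DESC
LIMIT 3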
Here, the first two MATCH clauses perform the roll-up along the employee,
product, and date dimensions, to collect the data needed later. The first WITH
clause aggregates sales by product and year. The second WITH clause uses
COLLECT to build, for each (product, year) pair, a list with each employee
and her sales of that product that year, and passes these sets in the variable
rows. Finally, the first element of each list is extracted using head(rows).
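Under the same assumptions, the query just described, which looks for the best-selling
employee of each product and year, can be sketched as:
MATCH (e:Employee)<-[:HandledBy]-(s:Sales)-[:Contains]->(p:Product)
MATCH (s)-[:HasOrderDate]->(d:Date)
WITH p.ProductName AS Product, d.Year AS Year,
    e.FirstName + ' ' + e.LastName AS Employee, sum(s.SalesAmount) AS SalesAmount
ORDER BY Product, Year, SalesAmount DESC
WITH Product, Year, collect({e:Employee, s:SalesAmount}) AS rows
RETURN Product, Year, head(rows).e AS Employee, head(rows).s AS SalesAmount
ORDER BY Product, Year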
Query 13.7. Countries that account for top 50% of the sales amount.
MATCH (c:Country)<-[:OfCountry]-(:State)<-[:OfState]-(:City)<-[:IsLocatedIn]-
(:Customer)<-[:PurchasedBy]-(s:Sales)
WITH c.CountryName AS CountryName, sum(s.SalesAmount) AS SalesAmount
ORDER BY SalesAmount DESC
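// A possible continuation of this query, sketched from the explanation below;
// it relies on the apoc.coll functions of the APOC library:
WITH collect({Country:CountryName, Sales:SalesAmount}) AS Countries,
    sum(SalesAmount) * 0.5 AS HalfTotalSales
UNWIND Countries AS c
WITH HalfTotalSales, c.Country AS CountryName, c.Sales AS SalesAmount,
    apoc.coll.sum([x IN Countries[0..apoc.coll.indexOf(Countries, c) + 1] | x.Sales]) AS CumulSales
WITH HalfTotalSales, collect({c:CountryName, s:SalesAmount, cum:CumulSales}) AS rows
UNWIND ([x IN rows WHERE x.cum < HalfTotalSales] +
    ([x IN rows WHERE x.cum >= HalfTotalSales])[0..1]) AS r
RETURN r.c AS CountryName, r.s AS SalesAmount, r.cum AS CumulSales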
This query requires the computation of a cumulative sum of the sales. The
solution requires the use of some advanced Cypher features. First, the sales
per country are computed and passed, sorted in descending order of the sales
amount, to the first COLLECT function. This builds the collection Countries,
as a list of (Country, Sales) pairs. The value accounting for the top 50% of
the overall sales amount is attached to the Countries list. Then, the UNWIND
clause transforms the collection back to a table c, where each row consists of a
(Country, Sales) pair. For each tuple in c, the expression with the apoc.coll.sum
function of the APOC library computes the cumulative sum in the collection
Countries. As a result, we get a table with the country name, the cumulative
sum for the country, the total sales for it, and the 50 percent value of the
total sales (this is repeated for every tuple). At this point we have everything
we need to produce a result. For this, a collection is built again, containing
triples of the form (CountryName, SalesAmount, CumulSales). The trick to
compute what we need is performed by the next UNWIND clause. The first
line of this clause produces a table with the countries whose cumulative sales
are less than 50% of the total sales. Since we need to add a last country
to exceed the 50%, we must add the first element of the remainder of the
collection. This is done in the second line of the UNWIND clause. The +
operation performs the union of the two tables.
Query 13.8. Total sales and average monthly sales by employee and year.
MATCH (e:Employee)<-[:HandledBy]-(s:Sales)-[:HasOrderDate]->(d:Date)
WITH e.FirstName + ' ' + e.LastName AS Employee, d.Year AS Year,
d.MonthNumber AS Month, sum(s.SalesAmount) AS MonthSales
WITH Employee, Year, sum(MonthSales) AS YearSales, COUNT(*) AS count
RETURN Employee, Year, YearSales, YearSales / count AS avgMonthlySales
ORDER BY Employee, Year
In this query, the MATCH clause performs the roll-up along the employee
and date dimensions. The first WITH clause computes the sales amount per
year and month, while the second one computes the sales amount per year
together with the number of tuples composing each year’s sales in the variable
count. Finally, in the RETURN clause, the average is computed.
Query 13.9. Total sales amount and total discount amount per product and
month.
MATCH (d:Date)<-[:HasOrderDate]-(s:Sales)-[:Contains]->(p:Product)
WITH p.ProductName AS Product, d.Year AS Year, d.MonthNumber AS Month,
sum(s.SalesAmount) AS SalesAmount,
sum(s.UnitPrice * s.Quantity * s.Discount) AS TotalDiscount
RETURN Product, Year, Month, SalesAmount, TotalDiscount
ORDER BY Product, Year, Month
Here, to compute the total discount, we need to use the quantity sold of each
product, along with its unit price, instead of using the SalesAmount attribute.
The rest of the query is straightforward.
Query 13.10. Monthly year-to-date sales for each product category.
MATCH (c:Category)<-[:HasCategory]-(:Product)<-[:Contains]-
(s:Sales)-[:HasOrderDate]->(d:Date)
WITH c.CategoryName AS Category, d.Year AS Year, d.MonthNumber AS Month,
sum(s.SalesAmount) AS SalesAmount
MATCH (c1:Category)<-[:HasCategory]-(:Product)<-[:Contains]-
(s1:Sales)-[:HasOrderDate]->(d1:Date)
WHERE c1.CategoryName = Category AND d1.MonthNumber <= Month AND
d1.Year = Year
RETURN Category, Year, Month, SalesAmount, SUM(s1.SalesAmount) AS YTDSalesAmount
ORDER BY Category, Year, Month
Recall that this query is written in SQL using window functions. Here, the
MATCH clause collects all sales with their associated category and date. The
WITH clause aggregates the sales by category, month, and year. Then, a new
MATCH allows these results to be compared with the sales of the same year
and previous months, performing the running sum.
Query 13.11. Moving average over the last 3 months of the sales amount by
product category.
// Sales amount by month and category including months without sales
MATCH (d:Date)
MATCH (c:Category)
WITH DISTINCT c.CategoryName AS Category, d.Year AS Year, d.MonthNumber AS Month
OPTIONAL MATCH (d:Date)<-[:HasOrderDate]-(s:Sales)-[:Contains]->(p:Product)-
[:HasCategory]->(c:Category)
WHERE c.CategoryName = Category AND d.Year = Year AND d.MonthNumber = Month
WITH Category, Year, Month, sum(s.SalesAmount) AS SalesAmount
ORDER BY Category, Year, Month
// Add row number
WITH Category, COLLECT({y:Year, m:Month, s:SalesAmount}) AS rows
UNWIND range(0, size(rows) - 1) AS i
// Moving average using row number
WITH Category, rows[i].y AS Year, rows[i].m AS Month, rows[i].s AS SalesAmount,
rows[CASE WHEN i-2 < 0 THEN 0 ELSE i-2 END..i+1] AS Values
UNWIND Values AS Value
RETURN Category, Year, Month, SalesAmount, collect(Value.s) AS Values,
avg(Value.s) AS MovAvg3M
ORDER BY Category, Year, Month
This is a recursive query, which makes use of one of the main features of
graph databases in general, and Neo4j in particular: the easy and efficient
computation of the transitive closure of a graph. At the start of
the query, the sales by employee in 2017 are computed and passed forward.
For each employee in this result, the first OPTIONAL MATCH is used to
compute recursively all her subordinates with the construct [:ReportsTo*].
The second OPTIONAL MATCH computes the sales of these subordinates,
which are aggregated in the RETURN clause.
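A query along these lines can be sketched as follows; the direction of the ReportsTo
edges (pointing from a subordinate to her supervisor) is an assumption:
MATCH (e:Employee)<-[:HandledBy]-(s:Sales)-[:HasOrderDate]->(d:Date)
WHERE d.Year = 2017
WITH e, sum(s.SalesAmount) AS PersonalSales
OPTIONAL MATCH (e)<-[:ReportsTo*]-(sub:Employee)
OPTIONAL MATCH (sub)<-[:HandledBy]-(s1:Sales)-[:HasOrderDate]->(d1:Date)
WHERE d1.Year = 2017
RETURN e.FirstName + ' ' + e.LastName AS Employee, PersonalSales,
    PersonalSales + sum(s1.SalesAmount) AS TotalSales
ORDER BY Employee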
Query 13.13. Total sales amount, number of products, and sum of the quan-
tities sold for each order.
MATCH (p:Product)<-[:Contains]-(s:Sales)-[:HasOrderDate]->(d:Date)
WITH s.OrderNo AS Order, sum(s.UnitPrice * s.Quantity) AS OrderAmount,
sum(s.Quantity) AS TotalQty, count(*) AS NbrOfProducts
RETURN Order, OrderAmount, TotalQty, NbrOfProducts
ORDER BY Order
Here, the product sales and their dates are computed first. The WITH clause
computes, for each order, the total amount by multiplying the unit price and
the quantity sold, as well as the total quantity and the number of products in the order.
Query 13.14. For each month, total number of orders, total sales amount,
and average sales amount by order.
MATCH (s:Sales)-[:HasOrderDate]->(d:Date)
WITH d.Year AS Year, d.MonthNumber AS Month, s.OrderNo AS Order,
sum(s.SalesAmount) AS OrderAmount
RETURN Year, Month, count(*) AS NoOrders, sum(OrderAmount) AS SalesAmount,
avg(OrderAmount) AS AvgOrderAmount
ORDER BY Year, Month
The first MATCH associates to each order its date to find all dates at which
an order was placed. The WITH clause computes the total amount of each
order and associates to it its year and month. In the RETURN clause, the
required aggregations are computed.
Query 13.15. For each employee, total sales amount, and number of cities and states to which she is assigned.
MATCH (e:Employee)<-[:HandledBy]-(s:Sales)
WITH e, sum(s.SalesAmount) AS SalesAmount
MATCH (e)-[:IsAssignedTo]->(c:City)-[:OfState]->(t:State)
RETURN e.FirstName + ' ' + e.LastName AS Employee,
    SalesAmount, count(DISTINCT c) AS NoCities, count(DISTINCT t) AS NoStates
ORDER BY Employee
The first MATCH retrieves the sales of each employee, which are aggregated in the
WITH clause so that they are not counted once per assigned city. The second MATCH
retrieves the employee's cities and states, and the RETURN clause outputs the total
sales together with the number of distinct cities and states.
13.3 OLAP on Hypergraphs
Traditional data warehousing and OLAP operations on cubes are not suf-
ficient to address the data analysis requirements in modern big data sys-
tems, where data are highly connected. Given the extensive use of graphs
to represent practical problems, multidimensional analysis of graph data and
graph data warehouses is increasingly being studied. There is a need to
perform graph analysis from different perspectives and at multiple granular-
ities. Although OLAP operations and models can expand the possibilities of
graph analysis beyond the traditional graph-based computation, this poses
new challenges to traditional OLAP technology.
Consider a social network represented as a graph where nodes are used to
represent persons. We can enrich this graph with additional nodes, edges, and
properties describing multidimensional information associated to the nodes.
For persons, such information may include Gender, Profession, City, State,
etc. Therefore, while in traditional OLAP as studied in this book queries are
of the kind “Average income by location and profession?”, in our multidi-
mensional network the natural queries would be of the kind “What is the
network structure between the various professions?” To answer such a query
we need to aggregate all user nodes of a given profession into a single node
and aggregate all edges between persons into edges between professions.
Most proposals to perform OLAP over graphs only address graphs whose
nodes are of the same kind, which are referred to as homogeneous. Con-
trary to this, heterogeneous graphs can have nodes of different kinds. Work
on OLAP over heterogeneous graphs is still at a preliminary stage. This sec-
tion presents Hypergraph GOLAP (HGOLAP), an extension of OLAP
concepts to hypergraphs, which are heterogeneous graphs that allow many-
to-many relationships between nodes.
The hypergraph model is appropriate for representing facts having a vari-
able number of dimensions, which can even be of different types. A typical
example is the analysis of phone calls that we will use in this section. Here, we
represent point-to-point calls between two participants, but also group calls
between any number of participants, initiated by one of them. As discussed
in Sect. 5.6.2, a relational representation of this scenario typically requires
a bridge table to represent the many-to-many relationship between the fact
representing the calls (with measures such as call duration) and the dimension
representing the participants. The hypergraph model described next allows
a more natural representation of this scenario.
A hypergraph is composed of nodes and hyperedges. Nodes have a type
and are described by attributes. Attributes, which correspond to dimension
levels, have an associated domain. For formal reasons (not covered in this
chapter) the first attribute in a node type corresponds to a distinguished
identifier attribute. Hyperedges are defined analogously to nodes but without
an identifier attribute. Figure 13.8 illustrates this model, based on the ex-
ample of the group calls mentioned above. The figure depicts a node of type
Person, having an identifier attribute and three dimensions representing the
name, the city, and the phone number. There is also a hyperedge of type
Call with dimension Date and a measure denoted Duration (recall that OLAP
measures can be considered as dimensions). This hyperedge says that the
person represented by node 1 initiates a call where persons represented by
nodes 2 and 3 participate. We keep the Neo4j notation for nodes and edges.
As mentioned in Sect. 13.2.2, although conceptually the direction of the ar-
rows indicates the role of each node in the hyperedge, this is not an issue in
a Neo4j implementation. Therefore, in Fig. 13.8 we follow the semantics of
the problem and define an edge going from a Person node to the Call node to
indicate the initiator of the call. Analogously, an edge going from a Call node
to a Person node indicates the receiver of the call.
We can now define the model as follows. Given n dimensions D1, ..., Dn
defined at granularity levels l1, ..., ln, a hypergraph by (D1.l1, ..., Dn.ln) is a
multi-hypergraph (i.e., there can be several hyperedges between the same
pair of nodes) where all attributes in nodes and edges are defined at the
granularity Di.li. The base hypergraph is the graph where all information
in the data cube is at the finest granularity level (i.e., the bottom level).
(Figure 13.8 shows a node such as (:Person {ID:1, Name:Ann, City:Brussels, Phone:+32-2-234-5634}) and a hyperedge such as (:Call {Date:10/10/2016, Duration:8}) connecting the initiator of a call with its receivers.)
The graph depicted in Fig. 13.10 is actually the base graph. Note that
several hypergraphs at the same granularity can exist for a given base graph.
Suppose that the phone 1 corresponds to the operator Orange, phones 2
and 4 to Vodafone, and phones 3 and 5 to Movistar. A hypergraph by
(Date.Day, Phone.Operator), shown in Fig. 13.11, is obtained from the base
graph in Fig. 13.10 by replacing all phone numbers in the graph with their
corresponding operators, keeping the rest of the graph unchanged. As can be
seen in the figure, the hypergraph has the same number of nodes, but at a
coarser granularity level.
Fig. 13.11 Hypergraph by (Date.Day, Phone.Operator) for the hypergraph of Fig. 13.10
Fig. 13.13 Aggregation operation over the minimal hypergraph of Fig. 13.12
Roll-up and drill-down The roll-up operation can be defined from the
climb and aggregation operations given above. Given a hypergraph G, a di-
mension D with a level L, a measure M , and an aggregate function F , the
roll-up over M using F along the dimension D up to level L is defined as
the result of three operations: (1) Compute the climb operation along D up
to L, yielding a hypergraph G′; (2) Compute the minimal hypergraph G′′ of G′;
(3) Perform an aggregation of G′′ over the measure M using the function F.
The drill-down operation does the opposite of roll-up, taking a hypergraph
to a finer granularity level, along the dimension D.
To illustrate this concept, consider the hypergraph in Fig. 13.10. A roll-
up operation over this hypergraph along the Phone dimension up to level
Operator, and along the Date dimension up to level Year, using the function
Sum, produces the hypergraph of Fig. 13.16. First, a climb up to the Year
and Operator levels is produced. The result is depicted in Fig. 13.14. Then,
a minimal hypergraph of this last result is built, as shown in Fig. 13.15.
Finally, an aggregation over this hypergraph applying the Sum function to
the measures is performed as follows. The two Call hyperedges corresponding
to calls from an Orange to a Vodafone line in 2016 are aggregated using
the Sum function over the measure Duration. The same occurs with the calls
between the nodes 2 and 3.
Fig. 13.15 Minimal hypergraph produced by climbing up to Year and Operator levels
Fig. 13.16 Roll-up to levels Year and Operator for the base hypergraph in Fig. 13.10
Fig. 13.17 Dicing the graph for operators other than “Orange” for data in Fig. 13.10
Fig. 13.18 Slicing the Date dimension for the graph in Fig. 13.10
This query shows a roll-up along the Phone dimension up to the level User.
The climb is done by the MATCH clause (the climbing path is explicit in this
clause), while the aggregation is performed in the RETURN clause. Note that
a single user can have several phones. The condition in the WHERE clause
above only considers calls between phones belonging to different users.
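Such a query can be sketched as follows; the User label, its Name property, and the
OfUser edges that link phones to their owners are assumed names, while the Creates and
Receives edges follow the conventions introduced above:
MATCH (u1:User)<-[:OfUser]-(p1:Phone)-[:Creates]->(c:Call),
    (c)-[:Receives]->(p2:Phone)-[:OfUser]->(u2:User)
WHERE u1 <> u2
RETURN u1.Name AS Caller, u2.Name AS Callee, sum(c.Duration) AS TotalDuration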
Query 13.18. Compute the shortest path between each pair of phones.
This query aims at analyzing the connections between phone users, and has
many real-world applications (for example, to investigate calls made between
two persons who use a third one as an intermediary). From a technical point
of view, this is an aggregation over the whole graph, using as a metric the
shortest path between every pair of nodes. We next show the Cypher query.
MATCH (m:Phone), (n:Phone)
WITH m, n WHERE m < n
MATCH p = shortestPath((m)-[:Receives|:Creates*]-(n))
RETURN p, length(p)
over other geospatial objects that are semantically related to the trajectories.
These objects are typically referred to as places of interest (PoIs), and they
depend on the application domain. For example, in a tourist application,
usual examples of PoIs are hotels, restaurants, and historical buildings, while
for traffic analysis, examples of PoIs could be road junctions or parking lots.
A PoI is considered a stop in a trajectory when the moving object remains
in it for a duration longer than some threshold. Thus, each trajectory, be-
ing a sequence of points, can be transformed into a sequence of stops and
moves, and trajectory analysis can be applied to these transformed trajecto-
ries, which are called semantic trajectories. An example is given next.
Assume we have a tourist application for New York where the PoIs include
hotels, restaurants, and buildings. There is also information about moving
objects telling how tourists move in the city. For example, a moving object
can go from Hotel 1 to St. Patrick’s Cathedral, then to the Empire State
Building, and finally return to the hotel. In this setting, a data scientist
may pose queries like “How many persons went from a hotel to St. Patrick’s
Cathedral and then to the Empire State Building (stopping to visit both
places) in the same day.” An analyst may also want to identify interesting
patterns in the trajectory data with queries like “Give the percentage of
trajectories visiting two restaurants in the same day.”
As shown in this section, graph databases can be used to manipulate se-
mantic trajectory data. We focus on analytical queries that typically require
the trajectory graph to be aggregated at various granularities, as shown in
Sect. 13.3. For this, we use a real-world dataset containing about ten months
of check-in data in New York City, collected by the Foursquare social net-
work.6 The dataset has been enriched with geographic data about New York
City.7 In addition, we define Time and Stop dimensions used as contextual
information to perform OLAP on this trajectory graph.
(Fig. 13.20: the schema of the semantic trajectory graph, where a Stop fact {SeqNo, Instant} is linked through OfVenue to a Venue {VenueName, Latitude, Longitude}, which is linked through HasSubcategory to a Subcategory {SubcategoryName}, in turn linked through HasCategory to a Category {CategoryName}.)
6 https://www.kaggle.com/chetanism/foursquare-nyc-and-tokyo-checkin-dataset
7 Maps were downloaded from http://www.mapcruzin.com
Figure 13.20 depicts the schema of the semantic trajectory graph using
the hypergraph OLAP model described in Sect. 13.3. The base hypergraph
is composed of the trajectories themselves. Each step in a trajectory is rep-
resented by the fact Stop, which has a self-referencing edge labeled TrajStep.
Stop has properties UserId (the user identifier), Position (the relative position
of the stop in the trajectory), and Instant (the time instant when reaching the
stop). Stop is associated with the contextual dimensions Venue and Date. The
Date dimension aggregates data from the instants represented in the stops up
to the Year level. The Venue dimension is associated with level Subcategory,
which is further associated with level Category. For example, a node Afghan
Restaurant at the Subcategory level is associated with the node Restaurant
at the Category level. Venue has properties VenueId, Latitude, and Longitude.
Thus, through the Venue dimension, a trajectory consisting of sequences of
stops becomes a sequence of the form ⟨Home, Station, Restaurant, . . .⟩. Exam-
ples of OLAP queries defined over this dataset are:
• Users that travel to an airport by taxi during the night.
• Trajectories in which users go from their homes to an airport after 5 p.m.
• Trajectories where users go from a restaurant to a sports event and then
to a coffee shop.
• Number of users moving between two or more boroughs in the same day.
• Average distance traveled per user and per day.
We show next some analytical queries over the trajectory graph, expressed
in Cypher using the Graph OLAP operations explained in Sect. 13.3.
Query 13.19. Find the trajectories that go from a private home to a station
and then to an airport, without intermediate stops.
MATCH (c1:Category {CategoryType:'Home'})<-[*3..3]-(s1:Stop)-[:TrajStep]->
(s2:Stop)-[:TrajStep]->(s3:Stop)-[*3..3]-(c3:Category {CategoryType:'Airport'})
WHERE s2.Position = s1.Position + 1 AND s3.Position = s2.Position + 1
MATCH (s2)-[*3..3]->(c2:Category {CategoryType:'Station'})
WITH s1 ORDER BY s1.Position
RETURN s1.UserId, COLLECT(DISTINCT s1.Position) ORDER BY s1.UserId
Query 13.20. For each trajectory, compute the distance traveled between
each pair of consecutive stops.
MATCH (s1:Stop)-[:TrajStep]->(s2:Stop)
WITH point({Longitude: s1.Longitude, Latitude:s1.Latitude}) AS p1,
point({Longitude:s2.Longitude, Latitude:s2.Latitude}) AS p2,
s1, s2, s1.UserId AS User
RETURN User, s1.Position, s2.Position, round(distance(p1, p2)) AS TravelDistance
ORDER BY s1.UserId ASC, TravelDistance DESC
In this query, all consecutive pairs of stops are computed first (by pattern
matching, rather than joins, which would be the case in the relational model).
Then, the (Latitude, Longitude) pairs of each stop are obtained. Finally, for
each trajectory, the distance between two consecutive stops is computed.
As a last example, we show a query requiring the computation of the
transitive closure of the trajectory graph.
Query 13.21. For each trajectory, find the paths that go from a private home
to an airport on the same day.
MATCH (cat1:Category {CategoryType:'Home'})<-[*3..3]-(s1:Stop)
MATCH (cat2:Category {CategoryType:'Airport'})<-[*3..3]-(s2:Stop)
WHERE s1.UserId = s2.UserId AND s1.Position < s2.Position AND
apoc.date.fields(s1.Instant, 'yyyy-MM-dd HH:mm:ss').years =
apoc.date.fields(s2.Instant, 'yyyy-MM-dd HH:mm:ss').years AND
apoc.date.fields(s1.Instant, 'yyyy-MM-dd HH:mm:ss').months =
apoc.date.fields(s2.Instant, 'yyyy-MM-dd HH:mm:ss').months AND
apoc.date.fields(s2.Instant, 'yyyy-MM-dd HH:mm:ss').days =
apoc.date.fields(s1.Instant, 'yyyy-MM-dd HH:mm:ss').days
WITH s1, apoc.coll.sort(collect(s2.Position)) AS FirstAirports
WITH s1, head(FirstAirports) AS s2pos
MATCH path = (s1)-[:TrajStep*]->(s2:Stop {Position:s2pos})
RETURN s1, path
This query requires some explanation. Several climb operations are required
along both background dimensions: (1) along the Stop dimension up to the
Category level, to find the stops corresponding to homes and airports; (2)
along the Time dimension up to the Day level. Further, dice operations are
used to filter out the subtrajectories not occurring during the same day, and
to keep only the trajectories going from a home to an airport. The climbs and
dices are computed in the first two MATCH clauses. The transitive closure of
the resulting subgraph is finally computed.
The climb along the Time dimension is performed through a conjunction
of Boolean conditions over the Instant property of the Stop nodes. Note that
to operate with dates, we could have written:
MATCH (s1)-[:IsInstantOf]->()-[:IsMinuteOf]->()-[:IsHourOf]->(d:Day),
(s2)-[:IsInstantOf]->()-[:IsMinuteOf]->()-[:IsHourOf]->(d:Day)
Instead, we used the APOC library, which is more efficient than performing
the two matchings along the Time hierarchy.
13.4 Graph Processing Frameworks
Graph database systems like Neo4j are in general adequate for medium-sized
graphs since, typically, they do not partition the graph for parallel process-
ing. The so-called graph processing frameworks are used when there is
ing. The so-called graph processing frameworks are used when there is
a need to handle very large graphs. These frameworks provide specific APIs
for parallel execution of complex graph algorithms over very large data vol-
umes. Most of these frameworks use a graph traversal query language called
Gremlin. In the first part of this section we introduce this language, and then
briefly comment on JanusGraph, a graph processing framework that uses
Gremlin as query language. In this section, we use the term vertex instead of
node, to follow the terminology used in Gremlin and JanusGraph.
13.4.1 Gremlin
(Fig. 13.21: a simple graph with three vertices x, y, and z, an edge e1 going from x to y, and an edge e2 going from y to z.)
For moving from vertices to edges, the basic operations are given next.
The operations outE() and inE() navigate from a vertex to the outgoing and
incoming edges, respectively. For example, in Fig. 13.21, if while navigating
the graph we are at vertex y, outE() moves us to the edge e2, while inE() takes
us to e1. The operation bothE() combines the two operations. For example,
standing at y, the operation bothE() moves us to e1 and e2.
For moving from edges to vertices, the basic operations are as follows.
The operations outV() and inV() navigate from edges to their tail and head
vertices, respectively. For example, standing at e1, outV() takes us to x, while
inV() takes us to y. The operator otherV() navigates from an edge and arrives
at the vertex from which it did not come. In Fig. 13.21, standing at e1, if we
arrived at that edge from x, otherV() takes us to y.
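The following Gremlin expressions, numbered 1 to 8, are a sketch consistent with the
descriptions below; the property keys Score, Time, and Country are taken from those
descriptions:
// 1. All outgoing edges in the graph
g.V().outE()
// 2. Vertices at the other end of the outgoing edges
g.V().out()
// 3. The Score and Time properties of the edges
g.E().values('Score', 'Time')
// 4. Two-hop vertices from the starting ones
g.V().out().out()
// 5. Two-hop vertices from the vertices whose Country key has value 'CA'
g.V().has('Country', 'CA').out().out()
// 6. The same, written with the repeat operator
g.V().has('Country', 'CA').repeat(out()).times(2)
// 7. The same, returning the paths
g.V().has('Country', 'CA').repeat(out()).times(2).path()
// 8. Edges with a Score property less than 3
g.E().has('Score', lt(3))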
The first expression above returns all the outgoing edges in the graph. The
second expression returns the vertices at the other end of the outgoing edges
of the graph. Expression 3 returns a list of properties of the edges containing
properties Score and Time. Expression 4 returns the two-hop vertices from
the starting ones. Expression 5 returns the two-hop vertices from the vertices
with value ‘CA’ in the Country key. Expression 6 shows an equivalent way
of writing Expression 5 with a repeat operator. This allows more concise
expressions, to avoid long sequences of out operations. Expression 7 does
the same as the previous two, but returning the paths. Finally, Expression 8
returns the edges with a property Score with value less than 3.
Gremlin can also define traversals using pattern matching, where the prop-
erties of the vertices and edges are declared, leaving to Gremlin the decision
of computing the most appropriate traversals to match the pattern. This is
called declarative traversal, since the order of the traversals is not defined
a priori. Each traverser will select a pattern to execute from a collection of
patterns, which may yield several different query plans. A pattern-matching
traversal uses the match() step to evaluate a graph pattern. Each graph pat-
tern in a match() is interpreted individually, allowing construction of com-
posite graph patterns by joining each of the traversals. These graph patterns
are represented using the as() step (called step modulators), which mark the
start and end of particular graph patterns or path traversals (note that all
patterns must start with _.as()).
For example, the query over the Northwind graph “List the products sup-
plied by suppliers located in London” can be expressed as follows:
g.V().match(
    _.as('s').hasLabel('Sales').outE('SuppliedBy').inV().hasLabel('Supplier').
        outE('IsFrom').inV().hasLabel('City').has('CityName', 'London'),
    _.as('s').hasLabel('Sales').outE('Contains').inV().hasLabel('Product').as('p1') ).
    select('p1').values('ProductName').dedup()
The query starts by defining a variable g for the graph traversal. Then, two
traversals are defined (separated by a comma), one for the supplier and an-
other one for the products. Variable ’s’ binds the patterns, and variable ’p1’
is used to find the product names. The result is listed with the select() step,
and duplicates are removed with the dedup() step.
We now compare Cypher and Gremlin using the Northwind data ware-
house graph. In Cypher, the query “Find the customers who purchased a
product with name Pavlova and another product” reads:
MATCH (c:Customer)<-[:PurchasedBy]-(s1:Sales)-[:Contains]->(p1:Product)
WHERE p1.ProductName = 'Pavlova'
MATCH (c:Customer)<-[:PurchasedBy]-(s2:Sales)-[:Contains]->(p2:Product)
WHERE p1 <> p2
RETURN DISTINCT c.CustomerKey
g.V().hasLabel('Customer').match(
    _.as('c').inE('PurchasedBy').outV().hasLabel('Sales').outE('Contains').inV().
        hasLabel('Product').has('ProductName', 'Pavlova').as('p1'),
    _.as('c').inE('PurchasedBy').outV().hasLabel('Sales').outE('Contains').inV().
        hasLabel('Product').as('p2').where('p1', neq('p2')) ).
    select('c').values('CustomerKey').dedup()
The Cypher query performs two matches to find the customers buying at
least two products, one of them being Pavlova and the other being different
from the first one. On the other hand, the Gremlin query starts by defining a
variable g for the graph traversal. Then two graph patterns are defined, with
traversals that define variables p1 and p2 using an alias (the as() step). The
final dedup() step after the select step is analogous to the DISTINCT clause
in Cypher.
13.4.2 JanusGraph
// Properties
CategoryKey = m.makePropertyKey('CategoryKey').dataType(Integer.class).make();
CategoryName = m.makePropertyKey('CategoryName').dataType(String.class).make();
...
m.commit()
Category defines the path to the CSV file that contains the records for the
categories. The code iterates over every line in the file. A vertex is created
with the addVertex method. Then, the values for the properties of the vertex
are set with the property method using the various columns in the file.
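The vertex-loading code being described can be sketched as follows, in the style of the
edge-loading snippet shown next; the variables graph and field, as well as the column
positions used for the properties, are assumptions:
// Load category vertices iterating over the lines of the CSV file.
new File(Category).eachLine {
    ...
    v = graph.addVertex('Category');                  // create the vertex
    v.property('CategoryKey', field[0] as Integer);   // set its properties
    v.property('CategoryName', field[1]);
    ... }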
Finally, we show next how the edges are loaded.
// Load category edges iterating over the lines of the CSV file.
new File(HasCategory).eachLine {
...
src = g.V().has('id', field[2]).next(); // source
dst = g.V().has('id', field[3]).next(); // destination
e = src.addEdge('HasCategory', dst); // add the edge
... }
As in the case of vertices, first the path to the CSV file should be defined in
HasCategory, and then the code iterates over the lines of this file. Here, the
source and the destination of an edge are taken, respectively, from the second
and third fields of the file. Then, the edge can be created.
13.1 What are the main characteristics of graph databases? How do they
differ from relational databases?
13.2 What are graph data warehouses?
8 https://s3.amazonaws.com/artifacts.opencypher.org/website/ocim1/slides/cypher_implementers_day_2017_pattern_matching_semantics.pdf
13.3 Discuss at least three scenarios where graph data warehouses can be
more convenient than relational-based data warehousing. Elaborate
on the reasons for your choices.
13.4 Give five typical queries that can exploit a graph data warehouse for
the scenarios discussed in your previous answer.
13.5 What is a property graph? How do property graphs differ from tradi-
tional graphs?
13.6 Discuss the main data models underlying graph databases.
13.7 Explain the main characteristics of Neo4j.
13.8 Discuss different ways of populating a Neo4j database.
13.9 How does Neo4j qualify with respect to the CAP theorem?
13.10 Explain the differences between graph databases and graph processing
frameworks. In which scenarios would you use each one?
13.11 Compare Cypher and Gremlin as query languages for graphs. When
would you use each one of them?
13.12 Discuss different relational data warehouse representations for the
phone calls example given in this chapter. Compare them against the
HGOLAP solution.
13.13 Give real-world examples where the hypergraph data model can be
naturally used.
13.7 Exercises
Exercise 13.1. Given the Northwind operational database in Fig. 2.4, pro-
pose a graph representation using property graphs. Export the data into a
Neo4j database, and express the following queries in Cypher. The schema of
the Northwind operational database in Neo4j is given in Fig. ??.
a. List products and their unit price.
b. List information about products ‘Chocolade’ and ‘Pavlova’.
c. List information about products with names starting with a ‘C’, whose
unit price is greater than 50.
d. Same as the previous query, but considering the sales price, not the prod-
uct’s price.
e. Total amount purchased by customer and product.
f. Top 10 employees, considering the number of orders sold.
g. For each territory, list the assigned employees.
h. For each city, list the number of customers and number of suppliers lo-
cated in that city.
i. For employees who have subordinates, list the number of direct or indirect
subordinates.
j. Direct or indirect supervisors of employees whose first name is “Robert”.
k. List the employees who do not have supervisors.
l. List the suppliers, number of categories they supply, and a list of such
categories.
m. List the suppliers who supply beverages.
n. List the customer who purchased the largest amount of beverages.
o. List the five most popular products considering number of orders.
p. Products ordered by customers from the same country as their suppliers.
Exercise 13.2. Consider the relational schema of the Foodmart data ware-
house given in Fig. 5.41 and its implementation in Neo4j given in Fig. 13.23.
Write in Cypher the queries given in Ex. 7.3.
Exercise 13.3. Consider the schema of the graph trajectory database given
in Fig. 13.20, obtained from the Foursquare check-ins database from Kag-
gle.com. Write the following queries in Cypher.
a. List the trajectories that go from a bar (or similar) to a restaurant, again
to a bar (or similar) and finish at a restaurant, without intermediate
stops.
b. For all trajectories that go directly from a private home to an airport,
list the user identifier, together with the distance traveled between these
two places each time that this pattern occurs.
c. Compute the number of different categories of venues visited by month.
d. Compute the number of stops per day per user, along with the starting
position of each subtrajectory for each day.
e. For each trajectory, compute its total length, as the sum of the distances
between each pair of stops.
f. For each day, and for each trajectory, find the longest subtrajectory.
g. Write the previous query in SQL, and compare against the Cypher equiv-
alent. For this, use the PostgreSQL database dump provided.
Fig. 13.25 An instance of the conference database using the HGOLAP model
Fig. 13.26 An instance of the conference database using the HGOLAP model
(Figure: an RDF graph for the Northwind data warehouse, showing a resource linked through http://example.org/NWDW#hasEmployee to http://example.org/NWDW#employee1, which in turn has the properties FirstName, LastName, and HireDate in the same namespace.)
The first line is the typical XML heading line, and the document starts
with the RDF element. The xmlns attribute is used to define XML namespaces
composed of a prefix and an IRI, making the text less verbose. The subject
and object of the triple representing the company and its employee are within
Description elements, where the attribute rdf:about indicates the IRIs of the
resources. The ex prefix refers to the Northwind data warehouse.
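The RDF/XML serialization being described might look as follows; this is a sketch,
where the IRI ex:Northwind used for the company is an assumption:
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ex="http://example.org/NWDW#">
  <rdf:Description rdf:about="http://example.org/NWDW#Northwind">
    <ex:hasEmployee rdf:resource="http://example.org/NWDW#employee1"/>
  </rdf:Description>
  <rdf:Description rdf:about="http://example.org/NWDW#employee1">
    <ex:FirstName>Nancy</ex:FirstName>
    <ex:LastName>Davolio</ex:LastName>
  </rdf:Description>
</rdf:RDF>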
The same triple will be written as follows using Turtle.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix ex: <http://example.org/NWDW#> .
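# The triple itself then reads (ex:Northwind, denoting the company, is an assumption):
ex:Northwind ex:hasEmployee ex:employee1 .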
Note that Turtle provides a much simpler, less verbose syntax, compared to
RDF/XML, so we use Turtle in the remainder of the chapter.
Data types are supported in RDF through the XML data type system.
For example, by default ex:HireDate would be interpreted as a string value
rather than a date value. To explicitly define the data types for the example
above, we would write in Turtle:
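# A sketch of such type annotations; the hire date value shown is illustrative.
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
ex:employee1 ex:FirstName "Nancy" ; ex:LastName "Davolio" ;
    ex:HireDate "2012-05-01"^^xsd:date .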
To further simplify the notation, Turtle allows rdf:type to be replaced with ‘a’.
Thus, instead of
ex:employee1 rdf:type ex:Employee ;
we could write
ex:employee1 a ex:Employee ;
Also, the xml:lang attribute allows us to indicate the language of the text
in the triple. For example, to indicate that the name of the employee is an
English name, we may write in Turtle:
ex:employee1 ex:FirstName "Nancy"@en ; ex:LastName "Davolio"@en .
Finally, blank nodes are represented either explicitly with a blank node
identifier of the form _:name, or with no name using square brackets. The
latter is used if the identifier is not needed elsewhere in the document. For ex-
ample, the following triples state that the employee identified by ex:employee1,
who corresponds to Nancy Davolio in the triples above, has a supervisor who
is an employee called Andrew Fuller:
ex:employee1 a ex:Employee ;
ex:Supervisor [ a ex:Employee ; ex:FirstName "Andrew" ; ex:LastName "Fuller" ] .
In this case, the blank node is used as object, and this object is an anonymous
resource; we are not interested in who this person is.
A blank node can be used as subject in triples. If we need to use the blank
node in other part of the document, we may use the following Turtle notation:
ex:employee1 a ex:Employee ; ex:Supervisor _:employee2 .
_:employee2 a ex:Employee ; ex:FirstName "Andrew"; ex:LastName "Fuller" .
The expression above states that there is an employee who has a supervisor, who is
in turn an employee whose first name is Andrew and whose last name is Fuller.
Two standard approaches exist for translating relational data into RDF: the direct mapping and the R2RML mapping. We next explain both of them
using as an example a portion of the Northwind data warehouse, which is
stored in a relational database. Suppose that the Northwind company wants
to share their warehouse data on the web, for example, to be accessible to all
their branches. We want to publish the Sales fact table and the Product di-
mension table of Fig. 14.2, which are simplified versions of the corresponding
data warehouse tables. Note that we added an identifier SalesKey for each
tuple in the Sales fact table.
Fig. 14.2 An excerpt of a simplified version of the Northwind data warehouse. (a) Sales
fact table; (b) Product dimension table
Direct Mapping
The direct mapping7 defines an RDF graph representation of the data in
a relational database. This mapping takes as input the schema and instance
of a relational database, and produces an RDF graph called the direct graph,
whose triples are formed concatenating column names and values with a base
IRI. In the examples below, the base IRI is <http://example.org/>. The
mapping also accounts for the foreign keys in the databases being mapped.
The direct mapping for the Sales fact table and the Product dimension table
in Fig. 14.2 results in an RDF graph, from which we show below some triples.
@base <http://example.org/>
@prefix rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
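# A sketch of some of the generated triples; the literal values shown are
# illustrative, the actual ones come from the tables in Fig. 14.2.
<Product/ProductKey="p1"> rdf:type <Product> ;
    <Product#ProductName> "Chai" ;
    <Product#UnitPrice> "18.00" .
<Sales/SalesKey="s1"> rdf:type <Sales> ;
    <Sales#SalesKey> "s1" ;
    <Sales#ref-ProductKey> <Product/ProductKey="p1"> .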
7 http://www.w3.org/TR/rdb-direct-mapping/
Each row in Sales produces a set of triples with a common subject. The
subject is an IRI formed from the concatenation of the base IRI, the table
name, the primary key column name (SalesKey), and the primary key value
(s1 for the first tuple). The predicate for each column is an IRI formed as
the concatenation of the base IRI, the table name, and the column name.
The values are RDF literals taken from the column values. Each foreign
key produces a triple with a predicate composed of the foreign key column
names, the referenced table, and the referenced column names. The object of
these triples is the row identifier for the referenced triple. The reference row
identifiers must coincide with the subject used for the triples generated from
the referenced row. For example, the triple
<Sales/SalesKey="s1"> <Sales#ref-ProductKey> <Product/ProductKey="p1">
tells that the subject (the first row in Sales) contains a foreign key in the col-
umn ProductKey (the predicate in the triple) which refers to the triple identi-
fied in the object (the triple whose subject is <Product/ProductKey="p1">).
As can be seen above, the direct mapping is straightforward, but it does
not allow any kind of customization. Indeed, the structure of the resulting
RDF graph directly reflects the structure of the database, the target RDF vo-
cabulary directly reflects the names of database schema elements, and neither
the structure nor the vocabulary can be changed.
R2RML Mapping
RDB to RDF Mapping Language (R2RML)8 is a language for ex-
pressing mappings from relational databases to RDF data sets. Such map-
pings provide the ability to view relational data in RDF using a customized
structure and vocabulary. As with the direct mapping, an R2RML mapping
results in an RDF graph.
An R2RML mapping is an RDF graph written in Turtle syntax, called the
mapping document. The main object of an R2RML mapping is the so-called triples map, which specifies how the rows of a logical table (typically a database table or view) are translated into RDF triples by means of a subject map and a set of predicate-object maps.
8 http://www.w3.org/TR/r2rml
<#TriplesMap_Product>
a rr:TriplesMap ;
rr:logicalTable [ rr:tableName "Product" ] ;
rr:subjectMap [
rr:template "http://example.org/product/{ProductKey}" ;
rr:class ex:product ] ;
rr:predicateObjectMap [
rr:predicate ex:productName ;
rr:objectMap [ rr:column "ProductName" ; rr:language "en" ] ; ] ;
rr:predicateObjectMap [
rr:predicate ex:unitPrice ;
rr:objectMap [ rr:column "UnitPrice" ; rr:datatype xsd:decimal ] ; ] .
Foreign keys are handled through referencing object maps, which use the
subjects of another triples map as the objects generated by a predicate-object
map.
<#TriplesMap_Sales>
rr:predicateObjectMap [
rr:predicate ex:product ;
rr:objectMap [
rr:parentTriplesMap <#TriplesMap_Product> ;
rr:joinCondition [
rr:child "ProductKey" ;
rr:parent "ProductKey" ] ; ] ; ] .
14.2 Introduction to SPARQL
SPARQL queries are built using variables, which are denoted by using either
‘?’ or ‘$’ as a prefix, although the former is normally used. The query eval-
uation mechanism of SPARQL is based on subgraph matching, where the
selection criteria are expressed as a graph pattern. This pattern is matched
against an RDF graph instantiating the variables in the query.
In what follows, we will work with the Northwind data warehouse repre-
sented as an RDF graph, as studied in Sect. 14.1.3. To get started, consider
the following SPARQL query, which asks for names and hire date of employ-
ees.
PREFIX ex:<http://example.org/NWDW#>
PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
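# A sketch of the query body described below, written as four separate triple patterns.
SELECT ?firstName ?lastName ?hireDate
WHERE { ?emp rdf:type ex:Employee .
        ?emp ex:Employee#FirstName ?firstName .
        ?emp ex:Employee#LastName ?lastName .
        ?emp ex:Employee#HireDate ?hireDate . }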
There are three parts in the query. A sequence of PREFIX clauses declares
the namespaces. The SELECT clause indicates the format of the result. The
WHERE clause contains a graph pattern composed of four triples in Turtle
notation. The triples in the query are matched against the triples in an RDF
graph that instantiates the variables in the pattern. In our case, this is the
default RDF graph that represents the Northwind data warehouse. If we want
to include other graphs, a FROM clause must be added, followed by a list of
named graphs. The query can be more succinctly written as follows (we omit
the prefix part):
SELECT ?firstName ?lastName ?hireDate
WHERE { ?emp a ex:Employee ; ex:Employee#FirstName ?firstName ;
ex:Employee#LastName ?lastName ; ex:Employee#HireDate ?hireDate . }
To evaluate this query, we instantiate the variable ?emp with an IRI whose
type is http://example.org/NWDW#Employee. Then, we look if there is a
triple with the same subject and property ex:Employee#FirstName, and, if
so, we instantiate the variable ?firstName. We proceed similarly to instanti-
ate the other variables in the query and return the result. Note that in this
case the result of the query is not an RDF graph, but a set of literals. Alter-
natively, the CONSTRUCT clause can be used to return an RDF graph built
by substituting variables in a set of triple templates.
From now on, we omit the prefix clauses in queries for brevity. The key-
word DISTINCT is used to remove duplicates in the result. For example,
the following query returns the cities of the Northwind customers, without
duplicates.
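# A sketch of such a query, using the City property of customers and the Name
# property of cities as in the other queries of this section.
SELECT DISTINCT ?cityName
WHERE { ?cust a ex:Customer ; ex:Customer#City ?city .
        ?city a ex:City ; ex:City#Name ?cityName . }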
The FILTER keyword selects patterns that meet a certain condition. For
example, the query “First name and last name of the employees hired between
2012 and 2014” reads in SPARQL as follows:
SELECT ?firstName ?lastName
WHERE { ?emp a ex:Employee ; ex:Employee#FirstName ?firstName ;
ex:Employee#LastName ?lastName ; ex:Employee#HireDate ?hireDate .
FILTER( ?hireDate >= "2012-01-01"^^xsd:date &&
?hireDate <= "2014-12-31"^^xsd:date) }
Filter conditions are Boolean expressions constructed using the logical con-
nectives && (and), || (or), and ! (not).
The FILTER keyword can be combined with the NOT EXISTS keyword to
test the absence of a pattern. For example, the query “First name and last
name of employees without supervisor” reads in SPARQL as follows:
SELECT ?firstName ?lastName
WHERE { ?emp a ex:Employee ; ex:Employee#FirstName ?firstName ;
ex:Employee#LastName ?lastName .
FILTER NOT EXISTS { ?emp ex:Employee#Supervisor ?sup . } }
The OPTIONAL keyword is used to specify a graph pattern for which the
values will be shown if they are found. It behaves in a way similar to an outer
join in SQL. For example, the query “First and last name of employees, along
with the first and last name of her supervisor, if she has one” can be written
in SPARQL as follows:
SELECT ?empFirstName ?empLastName ?supFirstName ?supLastName
WHERE { ?emp a ex:Employee ; ex:Employee#FirstName ?empFirstName ;
ex:Employee#LastName ?empLastName .
OPTIONAL { ?emp ex:Employee#Supervisor ?sup .
?sup a ex:Employee ; ex:Employee#FirstName ?supFirstName ;
ex:Employee#LastName ?supLastName . } }
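Aggregation is expressed much as in SQL, with the GROUP BY, HAVING, and ORDER BY
clauses. For example, a query retrieving the employees having more than 100 distinct
orders, sorted by decreasing number of orders, can be sketched as follows; the property
ex:Sales#Employee is an assumed name, by analogy with ex:Sales#Customer:
SELECT ?emp (COUNT(DISTINCT ?orderNo) AS ?nbOrders)
WHERE { ?sales a ex:Sales ; ex:Sales#Employee ?emp ; ex:Sales#OrderNo ?orderNo . }
GROUP BY ?emp
HAVING (COUNT(DISTINCT ?orderNo) > 100)
ORDER BY DESC(?nbOrders)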
The GROUP BY clause collects the orders associated to each employee, the
HAVING clause keeps only the employees who have more than 100 distinct
orders, and the ORDER BY clause orders the result in descending order ac-
cording to the number of orders.
Consider now the query “For customers from San Francisco, list the to-
tal quantity of each product ordered. Order the result by customer key, in
ascending order, and by quantity of products ordered, in descending order.”
SELECT ?cust ?prod (SUM(?qty) AS ?totalQty)
WHERE { ?sales a ex:Sales ; ex:Sales#Customer ?cust ;
ex:Sales#Product ?prod ; ex:Sales#Quantity ?qty .
?cust a ex:Customer ; ex:Customer#City ?city .
?city a ex:City ; ex:City#Name ?cityName .
FILTER(?cityName = "San Francisco") }
GROUP BY ?cust ?prod
ORDER BY ASC(?cust) DESC(?totalQty)
This query defines a graph pattern linking sales to customers and cities.
Prior to grouping, we need to find the triples satisfying the graph pattern,
and select the customers from San Francisco. We then group by pairs of ?cust
and ?prod and, for each group, take the sum of the attribute ?qty. Finally,
the resulting triples are ordered.
In SPARQL, a subquery is used to look for a certain value in a database,
and then use this value in a comparison condition. A subquery is a query
enclosed into curly braces used within a WHERE clause. As an example, the
query “For each customer compute the maximum sales amount among all her
orders” is written as follows:
SELECT ?cust (MAX(?totalSales) AS ?maxSales)
WHERE {{
SELECT ?cust ?orderNo (SUM(?sales) AS ?totalSales)
WHERE { ?sales a ex:Sales ; ex:Sales#Customer ?cust ;
ex:Sales#OrderNo ?orderNo ; ex:Sales#SalesAmount ?sales .
?cust a ex:Customer . }
GROUP BY ?cust ?orderNo } }
GROUP BY ?cust
The inner query computes the total sales amount for each customer and
order. Then, in the outer query, for each customer we select the maximum
sales amount among all its orders.
Subqueries are commonly used with the UNION and MINUS keywords.
The UNION combines graph patterns so that one of several alternative graph
patterns may match. For example, the query “Products that have been or-
dered by customers from San Francisco or supplied by suppliers from San
Jose” can be written as follows:
SELECT DISTINCT ?prodName
WHERE { {
SELECT ?prod
WHERE { ?sales a ex:Sales ; ex:Sales#Product ?prod ;
ex:Sales#Customer ?cust .
?cust a ex:Customer ; ex:Customer#City ?custCity .
?custCity a ex:City ; ex:City#Name ?custCityName .
FILTER(?custCityName = "San Francisco") } }
UNION {
SELECT ?prod
WHERE { ?sales a ex:Sales ; ex:Sales#Product ?prod ;
ex:Sales#Supplier ?sup .
?sup a ex:Supplier ; ex:Supplier#City ?supCity .
?supCity a ex:City ; ex:City#Name ?supCityName .
FILTER(?supCityName = "San Jose") } } }
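The MINUS keyword, in contrast, removes from the solutions of one pattern those that
are compatible with the solutions of another one. For instance, a query retrieving the
products that have been ordered, but never by customers from San Francisco, can be
sketched as follows:
SELECT DISTINCT ?prod
WHERE {
   { ?sales a ex:Sales ; ex:Sales#Product ?prod . }
   MINUS {
      SELECT ?prod
      WHERE { ?sales1 a ex:Sales ; ex:Sales#Product ?prod ; ex:Sales#Customer ?cust .
              ?cust a ex:Customer ; ex:Customer#City ?custCity .
              ?custCity a ex:City ; ex:City#Name ?custCityName .
              FILTER(?custCityName = "San Francisco") } } }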
The inner query computes the products ordered by customers from San
Francisco. The outer query obtains all products that have been ordered and
subtracts from them the products obtained in the inner query.
Consider the example depicted in Fig. 13.7. In the case of Cypher, the
query returned the empty set, because of its no-repeated-edge semantics. In
SPARQL’s semantics, the result is the correct one (except for duplicated
results).
SELECT ?x4
FROM <SocialNetwork>
WHERE {
?x1 a <http://xmlns.com/foaf/0.1/Person> .
?x2 a <http://xmlns.com/foaf/0.1/Person> .
?x3 a <http://xmlns.com/foaf/0.1/Person> .
?x4 a <http://xmlns.com/foaf/0.1/Person> .
?x1 <http://www.lib.org/schema#friendOf> ?x3 .
?x2 <http://www.lib.org/schema#friendOf> ?x3 .
?x3 <http://www.lib.org/schema#friendOf> ?x4 .
}
The answer to this query will be obtained by performing all the possible
assignments according to SPARQL’s homomorphism-based semantics. For
example, the basic graph pattern in the WHERE clause will be true if we
instantiate variable x1 to the constant John, x2 to Ann, x3 to Klara and x4
to Amber. Then, x4 will be projected as the result in the SELECT clause.
The following table indicates all the matchings that satisfy the homomorphic
conditions in this query.
x1 x2 x3 x4
John Ann Klara Amber
John John Klara Amber
Ann John Klara Amber
Ann Ann Klara Amber
Thus, the query returns four tuples, corresponding to the four homomorphic
matchings that project Amber. Adding the condition FILTER(?x1 < ?x2) we would obtain
just the answer we need.
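As a sketch of the kind of RDF data published on the web, a GeoNames description of
the city of Paris may contain triples like the following, written in Turtle; gn: denotes
the GeoNames ontology namespace, and the resource IRI shown is illustrative:
<http://sws.geonames.org/2988507/>
    gn:name "Paris" ;
    gn:countryCode "FR" .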
The IRI in the first line represents the resource (Paris), and it is the subject
of the triples formed with the predicate-object pairs below it (telling, e.g.,
that the country code of Paris is FR).
For representing multidimensional data in RDF we use the QB4OLAP
vocabulary.11 QB4OLAP is an extension of the RDF data cube vocab-
ulary12 or QB. The QB vocabulary is compatible with the cube model un-
derlying the Statistical Data and Metadata eXchange (SDMX)13 standard,
an ISO standard for exchanging and sharing statistical data and metadata
among organizations. QB is also compatible with the Simple Knowledge Or-
ganization System (SKOS)14 vocabulary.
Figure 14.4 depicts the QB4OLAP vocabulary. Capitalized terms represent
RDF classes and noncapitalized terms represent RDF properties. Classes and
properties in QB have a prefix qb. Classes and properties added to QB (with
prefix qb4o) are depicted with light gray background and black font. Classes
in external vocabularies are depicted in light gray font.
A data structure definition or DSD, defined as an instance of the class
qb:DataStructureDefinition, specifies the schema of a data set (i.e., a cube),
the latter defined as an instance of the class qb:DataSet. This structure can
be shared among different data sets. The DSD of a data set is defined by
means of the qb:structure property. The DSD has component properties for
representing dimensions, dimension levels, measures, and attributes, called
qb:dimension, qb4o:level, qb:measure, and qb:attribute, respectively. Compo-
nent specifications are linked to DSDs via the property qb:component.
Observations (i.e., facts), which are instances of qb:Observation, represent
points in a multidimensional data space. They are grouped into data sets by
means of the qb:dataSet property. An observation is linked to a member of each
dimension level and to a value for each measure defined in the DSD of its data set.
10 http://www.geonames.org/ontology/
11 http://purl.org/qb4olap/cubes
12 http://www.w3.org/TR/vocab-data-cube/
13 http://sdmx.org/
14 http://www.w3.org/2009/08/skos-reference/skos.html
[Fig. 14.4 The QB4OLAP vocabulary: the QB classes and properties (e.g., qb:DataStructureDefinition, qb:DataSet, qb:Observation, qb:ComponentSpecification, qb:DimensionProperty, qb:MeasureProperty, qb:AttributeProperty) together with the classes and properties added by QB4OLAP (e.g., qb4o:LevelProperty, qb4o:LevelMember, qb4o:LevelAttribute, qb4o:Hierarchy, qb4o:HierarchyStep, qb4o:AggregateFunction, qb4o:Cardinality)]
In this section, we show how the Northwind data cube in Fig. 4.1 can be
represented in RDF using the QB4OLAP vocabulary.
We start by defining the namespace prefixes as follows:
@prefix qb: <http://purl.org/linked-data/cube#> .
@prefix qb4o: <http://purl.org/qb4olap/cubes#> .
@prefix nw: <http://dwbook.org/cubes/schemas/northwind#> .
@prefix nwi: <http://dwbook.org/cubes/instances/northwind#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix sdmx-concept: <http://purl.org/linked-data/sdmx/2009/concept#> .
@prefix sdmx-dimension: <http://purl.org/linked-data/sdmx/2009/dimension#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix db: <http://dbpedia.org/resource/> .
For example, the definition of the Product level and its attributes is shown
next.
nw:product a qb4o:LevelProperty ; rdfs:label "Product Level"@en ;
qb4o:hasAttribute nw:productKey ; qb4o:hasAttribute nw:productName ;
qb4o:hasAttribute nw:quantityPerUnit ; qb4o:hasAttribute nw:unitPrice ;
qb4o:hasAttribute nw:discontinued .
nw:productKey a qb4o:LevelAttribute ; rdfs:label "Product Key"@en .
nw:productName a qb4o:LevelAttribute ; rdfs:label "Product Name"@en .
nw:quantityPerUnit a qb4o:LevelAttribute ; rdfs:label "Quantity per Unit"@en .
nw:unitPrice a qb4o:LevelAttribute ; rdfs:label "Unit Price"@en .
nw:discontinued a qb4o:LevelAttribute ; rdfs:label "Discontinued"@en .
14.5 Querying the Northwind Cube in SPARQL

Given the schema of the Northwind cube in Fig. 4.1 expressed in QB4OLAP,
we revisit the queries of Sect. 4.4 in SPARQL.
Query 14.1. Total sales amount per customer, year, and product category.
SELECT ?custName ?catName ?yearNo (SUM(?sales) AS ?totalSales)
WHERE { ?o qb:dataSet nwi:dataset1 ; nw:Customer ?cust ;
nw:OrderDate ?odate ; nw:Product ?prod ; nw:SalesAmount ?sales .
?cust nw:companyName ?custName . ?odate skos:broader ?month .
?month skos:broader ?quarter . ?quarter skos:broader ?sem .
?sem skos:broader ?year . ?year nw:year ?yearNo .
?prod skos:broader ?cat . ?cat nw:categoryName ?catName . }
GROUP BY ?custName ?catName ?yearNo
ORDER BY ?custName ?catName ?yearNo
In this query, we select the customer, order date, product, and sales amount of all sales, roll up the date to the year level, roll up the product to the category level, and aggregate the sales amount measure.
Query 14.2. Yearly sales amount for each pair of customer country and supplier country.
SELECT ?custCountryName ?supCountryName ?yearNo (SUM(?sales) AS ?totalSales)
WHERE { ?o qb:dataSet nwi:dataset1 ; nw:Customer ?cust ; nw:Supplier ?sup ;
nw:OrderDate ?odate ; nw:SalesAmount ?sales .
?cust skos:broader ?custCity . ?custCity skos:broader ?custState .
{ ?custState skos:broader ?custRegion .
?custRegion skos:broader ?custCountry . }
UNION { ?custState skos:broader ?custCountry . }
?custCountry nw:countryName ?custCountryName .
?sup skos:broader ?supCity . ?supCity skos:broader ?supState .
{ ?supState skos:broader ?supRegion .
?supRegion skos:broader ?supCountry . }
UNION { ?supState skos:broader ?supCountry . }
?supCountry nw:countryName ?supCountryName .
?odate skos:broader ?month . ?month skos:broader ?quarter .
?quarter skos:broader ?sem . ?sem skos:broader ?year . ?year nw:year ?yearNo . }
GROUP BY ?custCountryName ?supCountryName ?yearNo
The above query performs a roll-up of the customer and supplier dimensions to the country level and of the order date to the year level, and then aggregates the sales amount measure. Since a state rolls up to either a region or a country, the patterns enclosed in curly braces before and after the UNION operator take into account both alternative aggregation paths.
Query 14.3. Monthly sales by customer state compared to those of the pre-
vious year.
SELECT ?stateName ?yearNo ?monthNo ?totalSales ?salesPrevYear
WHERE {
# Monthly sales by state
{ SELECT ?stateName ?yearNo ?monthNo (SUM(?sales) AS ?totalSales)
WHERE { ?o qb:dataSet nwi:dataset1 ; nw:Customer ?cust ;
nw:OrderDate ?odate ; nw:SalesAmount ?sales .
?cust skos:broader ?city . ?city skos:broader ?state .
?state nw:stateName ?stateName . ?odate skos:broader ?month .
?month nw:monthNumber ?monthNo ; skos:broader ?quarter .
?quarter skos:broader ?sem . ?sem skos:broader ?year .
?year nw:year ?yearNo . }
GROUP BY ?stateName ?yearNo ?monthNo }
# Monthly sales by state for the previous year
OPTIONAL {
{ SELECT ?stateName ?yearNo1 ?monthNo
(SUM(?sales1) AS ?salesPrevYear)
WHERE { ?o1 qb:dataSet nwi:dataset1 ; nw:Customer ?cust1 ;
nw:OrderDate ?odate1 ; nw:SalesAmount ?sales1 .
?cust1 skos:broader ?city1 . ?city1 skos:broader ?state .
?state nw:stateName ?stateName . ?odate1 skos:broader ?month1 .
?month1 nw:monthNumber ?monthNo ; skos:broader ?quarter1 .
?quarter1 skos:broader ?sem1 . ?sem1 skos:broader ?year1 .
?year1 nw:year ?yearNo1 . }
GROUP BY ?stateName ?yearNo1 ?monthNo }
FILTER ( ?yearNo = ?yearNo1 + 1) } }
ORDER BY ?stateName ?yearNo ?monthNo
The first inner query computes the monthly sales by state by rolling up the customer dimension to the state level and the order date dimension to the month level. Then, after the OPTIONAL keyword, the second inner query computes again the monthly sales by state. The FILTER condition joins the two inner queries, relating the sales amount of a month to that of the corresponding month of the previous year.
Query 14.4. Monthly sales growth per product, that is, total sales per product
compared to those of the previous month.
SELECT ?prodName ?yearNo ?monthNo ?totalSales ?prevMonthSales
(?totalSales - ?prevMonthSales AS ?salesGrowth)
WHERE {
# Monthly sales by product
{ SELECT ?prodName ?yearNo ?monthNo (SUM(?sales) AS ?totalSales)
WHERE { ?o qb:dataSet nwi:dataset1 ; nw:Product ?prod ;
nw:OrderDate ?odate ; nw:SalesAmount ?sales .
?prod nw:productName ?prodName . ?odate skos:broader ?month .
?month nw:monthNumber ?monthNo ; skos:broader ?quarter .
?quarter skos:broader ?sem . ?sem skos:broader ?year .
?year nw:year ?yearNo . }
GROUP BY ?prodName ?yearNo ?monthNo }
# Monthly sales by product for the previous month
OPTIONAL {
{ SELECT ?prodName ?yearNo1 ?monthNo1
(SUM(?sales1) AS ?prevMonthSales)
WHERE { ?o1 qb:dataSet nwi:dataset1 ; nw:Product ?prod ;
nw:OrderDate ?odate1 ; nw:SalesAmount ?sales1 .
?prod nw:productName ?prodName . ?odate1 skos:broader ?month1 .
?month1 nw:monthNumber ?monthNo1 ; skos:broader ?quarter1 .
?quarter1 skos:broader ?sem1 . ?sem1 skos:broader ?year1 .
?year1 nw:year ?yearNo1 . }
GROUP BY ?prodName ?yearNo1 ?monthNo1 }
FILTER( ( (?monthNo = ?monthNo1 + 1) && (?yearNo = ?yearNo1) ) ||
( (?monthNo = 1) && (?monthNo1 = 12) &&
(?yearNo = ?yearNo1+1) ) ) } }
ORDER BY ?prodName ?yearNo ?monthNo
The first inner query computes the monthly sales by product. Then, after the OPTIONAL keyword, the second inner query computes again the monthly sales by product. The FILTER condition joins the two inner queries, relating the sales amount of a month to that of the previous month. The condition must take into account whether the previous month is in the same year or in the previous year.
This query computes the total sales by employee, sorts them in descending
order of total sales, and keeps the first three results.
The first inner query computes the maximum employee sales by product
and year. Then, the second inner query computes the sales per product, year,
and employee. The FILTER condition makes the join of the two inner queries
relating the maximum sales with the employee that realized those sales.
Query 14.7. Countries that account for top 50% of the sales amount.
For simplicity, in this query we compute the top 50% of the sales amount by state instead of by country. In this way, we do not need to take into account the fact that states roll up to either regions or countries; this could be handled with a UNION operator, as we did in Query 14.2.
SELECT ?stateName ?totalSales ?cumSales
WHERE { ?state nw:stateName ?stateName .
# Total sales and cumulative sales by state
{ SELECT ?state ?totalSales (SUM(?totalSales1) AS ?cumSales)
WHERE {
# Total sales by state
{ SELECT ?state (SUM(?sales) AS ?totalSales)
WHERE { ?o qb:dataSet nwi:dataset1 ; nw:Customer ?cust ;
nw:SalesAmount ?sales . ?cust skos:broader ?city .
?city skos:broader ?state . }
GROUP BY ?state }
# Total sales by state
The first inner query computes, for each state, the total sales and the cumulative sales of all states having total sales greater than or equal to those of the state. The second inner query computes the threshold value, which represents the minimum cumulative sales greater than or equal to 50% of the overall sales. Finally, the FILTER selects all states whose cumulative sales are less than or equal to the threshold value.
Query 14.8. Total sales and average monthly sales by employee and year.
SELECT ?fName ?lName ?yearNo (SUM(?monthlySales) AS ?totalSales)
(AVG(?monthlySales) AS ?avgMonthlySales)
WHERE {
# Monthly sales by employee
{ SELECT ?fName ?lName ?month (SUM(?sales) AS ?monthlySales)
WHERE { ?o qb:dataSet nwi:dataset1 ; nw:Employee ?emp ;
nw:OrderDate ?odate ; nw:SalesAmount ?sales .
?emp nw:firstName ?fName ; nw:lastName ?lName .
?odate skos:broader ?month . }
GROUP BY ?fName ?lName ?month }
?month skos:broader ?quarter . ?quarter skos:broader ?sem .
?sem skos:broader ?year . ?year nw:year ?yearNo . }
GROUP BY ?fName ?lName ?yearNo
ORDER BY ?fName ?lName ?yearNo
In the query above, the inner query computes the total sales amount by employee and month. The outer query rolls up the previous result to the year level while computing the total yearly sales and the average monthly sales.
Query 14.9. Total sales amount and total discount amount per product and
month.
SELECT ?prodName ?yearNo ?monthNo (SUM(?sales) AS ?totalSales)
(SUM(?unitPrice * ?qty * ?disc) AS ?totalDiscAmount)
WHERE { ?o qb:dataSet nwi:dataset1 ; nw:Product ?prod ;
nw:OrderDate ?odate ; nw:SalesAmount ?sales ;
nw:Quantity ?qty ; nw:Discount ?disc ; nw:UnitPrice ?unitPrice .
?prod nw:productName ?prodName . ?odate skos:broader ?month .
?month nw:monthNumber ?monthNo ; skos:broader ?quarter .
?quarter skos:broader ?sem . ?sem skos:broader ?year . ?year nw:year ?yearNo . }
GROUP BY ?prodName ?yearNo ?monthNo
ORDER BY ?prodName ?yearNo ?monthNo
Here, we roll up to the month level and compute the requested measures.
This query starts by selecting the category, month, and year levels. Then,
for each category, month, and year, it selects all facts whose order date is in
the same year but whose month is less than or equal to the current month.
Query 14.11. Moving average over the last 3 months of the sales amount by
product category.
SELECT ?catName ?yearNo ?monthNo (AVG(?totalSales1) AS ?MovAvg3MSales)
WHERE { ?cat nw:categoryName ?catName .
?month nw:monthNumber ?monthNo ; skos:broader ?quarter .
?quarter skos:broader ?sem . ?sem skos:broader ?year . ?year nw:year ?yearNo.
OPTIONAL {
{ SELECT ?catName ?yearNo1 ?monthNo1 (SUM(?sales1) AS ?totalSales1)
WHERE { ?o1 qb:dataSet nwi:dataset1 ; nw:Product ?prod1 ;
This query starts by selecting the category, month, and year levels. Then,
for each category, month, and year, the query selects all facts whose order
date is within a three-month window from the current month. This selec-
tion involves an elaborated condition in the FILTER clause, which covers
three cases, depending on whether the month is March or later, the month
is February, or the month is January.
Query 14.12. Personal sales amount made by an employee compared with
the total sales amount made by herself and her subordinates during 2017.
SELECT ?fName ?lName ?persSales ?subordSales
WHERE { ?emp nw:firstName ?fName ; nw:lastName ?lName .
{ SELECT ?emp (SUM(?sales) AS ?persSales)
WHERE { ?o qb:dataSet nwi:dataset1 ; nw:Employee ?emp ;
nw:OrderDate ?odate ; nw:SalesAmount ?sales .
?odate skos:broader ?month . ?month skos:broader ?quarter .
?quarter skos:broader ?sem . ?sem skos:broader ?year .
?year nw:year ?yearNo .
FILTER(?yearNo = 2017) }
GROUP BY ?emp }
{ SELECT ?emp (SUM(?sales1) AS ?subordSales)
WHERE { ?subord nw:supervisor* ?emp .
?o1 qb:dataSet nwi:dataset1 ; nw:Employee ?subord ;
nw:OrderDate ?odate1 ; nw:SalesAmount ?sales1 .
?odate1 skos:broader ?month1 . ?month1 skos:broader ?quarter1 .
?quarter1 skos:broader ?sem1 . ?sem1 skos:broader ?year1 .
?year1 nw:year ?yearNo1 .
FILTER(?yearNo1 = 2017) }
GROUP BY ?emp } }
ORDER BY ?emp
The first inner query computes the personal sales of each employee in 2017.
The second inner query exploits the recursive hierarchy Supervision with a
property path expression in SPARQL. The ‘*’ character denotes zero or more
occurrences of the property, that is, the reflexive transitive closure of the
supervision hierarchy, which yields an employee together with all her direct
and indirect subordinates. Then, the 2017 sales of all these employees are
aggregated.
Query 14.13. Total sales amount, number of products, and sum of the quan-
tities sold for each order.
SELECT ?orderNo (SUM(?sales) AS ?totalSales)
(COUNT(?prod) AS ?nbProducts) (SUM(?qty) AS ?nbUnits)
WHERE { ?o qb:dataSet nwi:dataset1 ; nw:Order ?order ;
nw:Product ?prod ; nw:SalesAmount ?sales ; nw:Quantity ?qty .
?order nw:orderNo ?orderNo . }
GROUP BY ?orderNo
ORDER BY ?orderNo
Here, we group sales by order number and compute the requested mea-
sures.
Query 14.14. For each month, total number of orders, total sales amount,
and average sales amount by order.
SELECT ?yearNo ?monthNo (COUNT(?orderNo) AS ?nbOrders)
(SUM(?totalSales) AS ?totalSalesMonth)
(AVG(?totalSales) AS ?avgSalesOrder)
WHERE {
{ SELECT ?orderNo ?odate (SUM(?sales) AS ?totalSales)
WHERE { ?o qb:dataSet nwi:dataset1 ; nw:Order ?order ;
nw:OrderDate ?odate ; nw:SalesAmount ?sales .
?order nw:orderNo ?orderNo . }
GROUP BY ?orderNo ?odate }
?odate skos:broader ?month .
?month nw:monthNumber ?monthNo ; skos:broader ?quarter .
?quarter skos:broader ?sem . ?sem skos:broader ?year . ?year nw:year ?yearNo . }
GROUP BY ?yearNo ?monthNo
ORDER BY ?yearNo ?monthNo
Here, the inner query computes the total sales by order. The outer query then rolls up the previous result to the month level and computes the requested measures.
Query 14.15. For each employee, total sales amount, number of cities, and
number of states to which she is assigned.
SELECT ?fName ?lName (SUM(?sales)/COUNT(DISTINCT ?city) AS ?totalSales)
(COUNT(DISTINCT ?city) AS ?noCities)
(COUNT(DISTINCT ?state) AS ?noStates)
WHERE { ?o qb:dataSet nwi:dataset1 ; nw:Employee ?emp ; nw:SalesAmount ?sales .
?emp nw:firstName ?fName ; nw:lastName ?lName ; skos:broader ?city .
?city skos:broader ?state . }
GROUP BY ?fName ?lName
ORDER BY ?fName ?lName
Since an employee may be assigned to several cities, each sales observation of an employee is matched with each of her cities and is thus counted several times in the sum. Dividing the summed sales amount by the number of distinct cities compensates for this double counting, while the two COUNT DISTINCT expressions compute the number of cities and states to which the employee is assigned.
14.7 Bibliographic Notes
There are many books explaining the basics of the semantic web, for exam-
ple, [102, 106]. The book [101] provides an introduction to Linked Data, a
paradigm in which data is seen as a first-class citizen of the Web, thereby
enabling the extension of the Web with a global data space – the Web of
Data. The book [207] provides a recent view of methods, technologies, and
systems related to Linked Data. A book entirely devoted to SPARQL is [63].
A recent survey on the topic is [104].
A review on the application of semantic web techologies for data warehous-
ing is [78]. Sect. 14.3 of this chapter is based on research work by Etcheverry
and Vaisman on QB4OLAP [72, 73] and on [246]. Kämpgen and Harth [125]
propose to load statistical linked data into an RDF triple store and to an-
swer OLAP queries using SPARQL. For this, they implement an OLAP
to SPARQL engine which translates OLAP queries into SPARQL. Further,
Breucker et al. [39] proposed SEO4OLAP, a tool for generating search engine
optimized web pages for every possible view of a statistical linked dataset
modeled in the RDF Data Cube Vocabulary. This approach to querying sta-
tistical linked open data is comprehensively studied in [124]. QB4OLAP has
also been used to enrich statistical data represented in the QB vocabulary
with dimensional data [252]. Matei et al. [148] propose a framework for inte-
grating OLAP and semantic web data, also based in QB4OLAP. A proposal
to perform OLAP analysis over RDF graphs is presented in [19]. Knowledge
graphs are large networks of entities, their semantic types, properties, and
relationships among entities.
14.8 Review Questions
14.1 What are the two main approaches to perform OLAP analysis with
semantic web data?
14.2 Briefly describe RDF and RDFS and their main constructs.
14.9 Exercises
Exercise 14.1. Given the Northwind data cube shown in Fig. 4.2, give the
QB4OLAP representation of the Sales fact. Provide at least two observations.
Exercise 14.2. Given the Northwind data cube shown in Fig. 4.2, give the
QB4OLAP representation of the Customer dimension.
Exercise 14.3. Do the same as Ex. 14.2 for the Employee dimension.
Exercise 14.4. Write the R2RML mapping that represents the Northwind
data warehouse using the QB4OLAP vocabulary.
Exercise 14.7. Represent the Foodmart cube given in Fig. 4.24 using the
QB4OLAP vocabulary.
Exercise 14.8. Consider the Foodmart table instances given in Fig. 14.5.
Represent sales facts as observations, adhering to the Data Structure Defini-
tion specified in the previous exercise.
Exercise 14.9. Write in SPARQL the queries over the Foodmart cube given
in Ex. 4.9.
Fig. 14.5 Table instances of the Foodmart data warehouse
Sales
StoreID DateID ProductID PromotionID CustomerID StoreSales StoreCost UnitSales
1 738 219 0 567 7.16 2.4344 3
1 738 684 0 567 12.88 5.0232 2
2 739 551 7 639 5.20 2.236 4
Product
ProductID ProductName BrandName ProductClassID
219 Best Choice Corn Chips Best Choice 12
551 Fast Low Fat Chips Fast 12
684 Gorilla Blueberry Yogurt Gorilla 6
ProductClass
ProductClassID ProductSubcategory ProductCategory ProductDepartment ProductFamily
6 Yogurt Dairy Dairy Food
12 Chips Snack Foods Snack Foods Food
Date
DateID Date WeekdayName Month Year Day Week Month Quarter
738 2018-01-07 Wednesday January 2018 7 4 1 Q1
739 2018-01-08 Thursday January 2018 8 4 1 Q1
Customer
CustID FName MI LName City State Country MaritalStatus YearlyIncome Gender Education
567 Charles L. Mank Santa Fe DF Mexico S $50K-$70K F Bachelors
639 Michael J. Troyer Kirkland WA USA M $30K-$50K M High School
Promotion
PromID PromName MediaType
0 No Promotion No Media
7 Fantastic Discounts Sunday paper, Radio, TV
Store
StoreID StoreType StoreName StoreCity StoreState StoreCountry StoreSqft
1 Supermarket Store 1 Acapulco Guerrero Mexico 23593
2 Small Grocery Store 2 Bellingham WA USA 28206
15 Recent Developments in Big Data Warehouses

Big data is usually defined as data that is so big, or that arrives at such a pace or with such a variety, that it is difficult to manage with traditional methods and tools. Social networks, the Internet of Things (IoT), smart city devices, and mobility data, among others, produce enormous volumes of data that must be analyzed for decision making. The management and analysis of these massive amounts of data demand new solutions that go beyond traditional processes and tools. All of this has great implications for data warehousing architectures and practices. For instance, big data analytics typically requires dramatically reduced data latency, that is, the time elapsed between the moment some data are collected and the moment an action based on such data is taken.
This chapter discusses the main technologies that address the challenges
introduced by big data. We start by motivating the problem in Sect. 15.1.
Distributed processing is studied in Sect. 15.2, where we address Hadoop, the
Hadoop Distributed File System (HDFS), and Spark. The section concludes
by discussing Apache Kylin, a big data warehouse system running on Hadoop
or Spark, built using the concepts covered in this book. We then study modern
database technologies designed for storing and querying large data volumes.
We first introduce in Sect. 15.3 distributed database systems. We then study new approaches, such as in-memory database systems (Sect. 15.4), column-store database systems (Sect. 15.5), NoSQL database systems (Sect. 15.6),
NewSQL database systems (Sect. 15.7), array database systems (Sect. 15.8),
the new Hybrid Transactional and Analytical Processing (HTAP) paradigm
(Sect. 15.9), which allows OLTP and OLAP processing on the same system,
and polystores (Sect. 15.10), which provide integrated access to multiple cloud
data stores. Cloud data warehouses are discussed in Sect. 15.11. We continue
in Sect. 15.12 by discussing data lakes and Delta Lake. The former is an approach aimed at storing raw data right after ingestion, without requiring a predefined schema, while the latter is a storage layer built on top of a data lake. We conclude the chapter
in Sect. 15.13 by describing future perspectives on data warehousing in the
context of big data.
The amount of data currently available from a wide variety of sources increas-
ingly poses challenges to scientific and commercial applications. Datasets are
continuously growing beyond the limits that traditional database and com-
puting technologies can handle. Applications over such amounts of data are
called big data applications or data-centric applications, and require appro-
priate solutions to be developed. These applications share some typical char-
acteristics: the amount of data, the complexity of data, and the speed at
which data arrive at the system.
Big data is normally defined using four dimensions (informally charac-
terized by the well-known “four Vs”), namely:
• Volume: Big datasets are in the range of petabytes or zettabytes.
• Variety: Relational DBMSs are designed to work on structured data,
compliant with a schema. In big data applications, data may come in any
format, not necessarily structured, such as images, text, audio, video, etc.
• Velocity: Big data applications usually deal with real-time data process-
ing, not only with batch processing, since data may come at a speed that
systems cannot store before processing.
• Veracity: The reliability of big data sources usually varies. For example,
data coming from social networks is not always reliable, while sensor data
is usually more reliable.
From the above, it follows that a single tool or DBMS cannot meet all the
data processing and storage requirements of big data applications. Big data
systems are usually decomposed into tasks that can be performed efficiently
by specific software tools. Those tasks are orchestrated and integrated by ad
hoc applications. In this way, there are tools specific for capturing data in
real time (e.g., Kafka), indexing data (e.g., ElasticSearch), storing data (e.g.,
Cassandra), and analyzing data (e.g., R, Python).
Furthermore, the continuously increasing data volumes to be processed
require the computing infrastructure to scale. This can be done in two ways.
Vertical scaling or scaling up consists in adding more resources to an exist-
ing system. This includes adding more hardware resources such as processing
power, memory, storage, or network, but also adding more software resources
such as more threads, more connections, or increasing cache sizes. On the
other hand, horizontal scaling or scaling out consists in adding additional
machines to a distributed computing infrastructure, such as a cluster or a
cloud. This enables data partitioning across multiple nodes and the use of
distributed processing algorithms where the core of the computation is per-
formed locally at each node. In this way, applications can scale out by adding
nodes as the dataset size grows. Horizontal scaling also enables elasticity,
which is defined as the degree to which a system is able to automatically adapt
to workload changes by provisioning and de-provisioning resources, such that
at each point in time the available resources match the current demand as
closely as possible.
For decades, organizations have invested in implementing enterprise data
warehouses to support their BI systems. Modern data warehousing aims at
delivering self-service BI, which empowers users at all organizational levels
to produce their own analytical reports. However, big data greatly impacts the design of data warehouses, which must not only ingest and store large data volumes arriving at high speed, but also deliver query performance under
these conditions and support new kinds of applications. For example, data
warehouses provide consistent datasets for artificial intelligence and machine
learning applications and, conversely, artificial intelligence and machine learn-
ing can enhance the capabilities of data warehouses. As an example, Google
has incorporated machine learning into its BigQuery data warehouse. Also,
new kinds of data warehouses are emerging, such as cloud data warehouses,
which we discuss in Sect. 15.11. Indeed, organizations are increasingly mov-
ing their data warehouses to the cloud to take advantage of its flexibility,
although many questions remain open in this respect.
In summary, to cope with the new requirements discussed above, new data
warehousing approaches must be developed. Data warehouses must evolve to
deliver the scalability, agility, and flexibility that data-driven applications
need. Agility and flexibility are provided by new database technologies, sup-
porting unstructured, schemaless data, and new ways of ingesting data, reduc-
ing the time required to make data available to users. Scalability is achieved
through both distributed computing frameworks and distributed data stor-
age. Scalability requirements are due to increasing data volumes. However,
in cloud applications elastic scalability is needed, so the system is able to
scale up and down dynamically as the computation requirements change. We
study these new technologies in this chapter.
tion in applications that do not require the overhead of a DBMS, which are
typical in big data scenarios. Finally, MapReduce is a very efficient model
for applications that require processing the data only once. However, when
programs require processing the data iteratively, the MapReduce approach
becomes inefficient. This is true of machine learning code or complex SQL
queries. Spark, studied in Sect. 15.2.3, is typically used in those cases.
Figure 15.1 shows an example of how MapReduce works. Consider that
orders in the Northwind database come from many sources, each from one
country. We are interested in analyzing product sales. The files in this example
contain pairs of the form (ProductKey, Quantity). In the Map phase, the
input is divided into a list of key-value pairs with ProductKey as the key and
Quantity as the value. This list is sent to the so-called Shuffle phase in which
it is sorted such that values with the same ProductKey are put together.
The output of the Shuffle phase is a collection of tuples of the form (key,
list-of-values). This is forwarded to the Reduce phase where many different
operations such as average, sum, or count can be performed. Since the key-
list pairs are independent of each other, they can be forwarded to multiple
workers for parallel execution.
[Fig. 15.1 An example of MapReduce: the Map and Shuffle phases followed by a Reduce phase applying the MAX function]
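To make these phases concrete, the following is a minimal Java sketch of the Map and Reduce functions for this example written with the Hadoop MapReduce API, computing the maximum quantity per product as in Fig. 15.1. The class and variable names are illustrative, input lines are assumed to have the form ProductKey,Quantity, and the driver class that configures and submits the job is omitted.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxQuantityPerProduct {

  // Map phase: parse each input line "ProductKey,Quantity" and emit the key-value pair
  public static class MaxMapper
      extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split(",");
      int productKey = Integer.parseInt(fields[0].trim());
      int quantity = Integer.parseInt(fields[1].trim());
      context.write(new IntWritable(productKey), new IntWritable(quantity));
    }
  }

  // Reduce phase: the Shuffle phase has grouped all quantities of a product;
  // compute the maximum quantity for each ProductKey
  public static class MaxReducer
      extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
    @Override
    protected void reduce(IntWritable productKey, Iterable<IntWritable> quantities,
        Context context) throws IOException, InterruptedException {
      int max = Integer.MIN_VALUE;
      for (IntWritable q : quantities)
        max = Math.max(max, q.get());
      context.write(productKey, new IntWritable(max));
    }
  }
}

Since the key-list pairs produced by the Shuffle phase are independent, the framework can run many instances of MaxReducer in parallel, one per group of keys.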
15.2.1 Hadoop
[Fig. 15.2 HDFS architecture: a NameNode manages the file system metadata and handles clients' metadata operations, while DataNodes, organized in racks, serve block read and write operations for clients and replicate blocks among them]
A client submits the job (e.g., a .jar file) and the configuration to the ResourceManager, which distributes them to the workers, schedules and monitors the tasks, and returns status and diagnostic information to the job client. Note that although the Hadoop framework is implemented in Java, MapReduce applications can also be written in other programming languages.
The key idea of YARN is to split the functionality of resource man-
agement and job scheduling and monitoring into separate programs. To do
this, there is a global ResourceManager and one ApplicationMaster per ap-
plication. An application is either a single job or a Directed Acyclic Graph
(DAG) of jobs. There is one NodeManager at each node in the cluster, which
works together with the ResourceManager. The ResourceManager assigns
resources to all the applications in the system. The NodeManager monitors
the usage of computing resources such as CPU, network, etc., and reports
to the ResourceManager. The ApplicationMaster negotiates resources from
the ResourceManager, and works together with the NodeManager(s) to run
and monitor tasks. Figure 15.3 shows the scheme explained above. Clients
submit jobs to the ResourceManager, which has two components: the Sched-
uler and the ApplicationsManager. The former allocates resources to the
(in general, many) running applications. The latter accepts job submissions
and negotiates the first container (memory resource) for executing the Ap-
plicationMaster, also taking care of the ApplicationMaster container in case
of failure. The ApplicationMaster negotiates resource containers with the
Scheduler, and tracks their status. Each NodeManager reports its status to
the ResourceManager. An application, for example, MapReduce, is usually
spanned into many containers (holding one MR task each), coordinated by
(in this example) the MapReduce ApplicationMaster.
The interaction between YARN and Hadoop is as follows: The NameNode
and the ResourceManager live on two different hosts, since they store key
metadata. The DataNode and the NodeManager processes are placed on
the same host (see Fig. 15.3). Since a file is split and saved to the HDFS
DataNodes, to access it, a YARN application (such as MapReduce) is written
using a YARN client, which reads data using an HDFS client. The application
looks for the file location in the NameNode, and asks the ResourceManager to
provide containers on the hosts that hold the file blocks. Since the distributed
job gets a container on a host which keeps the DataNode, the read operation
will be local and not over the network, which enhances reading performance.
In a big data environment like Hadoop, tasks are performed by specialized
modules, which compose the Hadoop ecosystem. Key components of this
ecosystem are Hive, a high-level query language for querying HDFS (stud-
ied in Sect. 15.2.2), and HBase, a column-oriented, non-relational database
(studied in Sect. 15.6). Other key components are:
• Sqoop: Standing for “SQL to Hadoop”, it is a data ingestion tool like Flume; however, while Flume works on unstructured or semi-structured data, Sqoop is used to export and import data from and to relational databases.
• Zookeeper: A service that coordinates distributed applications. It has
information about the cluster of the distributed servers it manages.
• Kafka: A distributed messaging system, used with Hadoop for fast data
transfers in real time.
15.2.2 Hive
Using Hadoop is not easy for end users not familiar with MapReduce. They
need to write MapReduce code even for simple tasks such as counting or
averaging. High-level query languages provide a solution to this problem; they
enable a higher level of abstraction than Java or other lower-level languages
supported by Hadoop. The most commonly used is Hive. Hive queries are
translated into MapReduce jobs, resulting in programs that are much smaller
than the equivalent Java ones. Besides, Hive can be extended, for example, by writing user-defined functions in Java. This also works the other way round: HiveQL queries can be embedded in programs written in other languages as well.
Hive brings the concepts of tables, columns, partitions, and SQL into the
Hadoop architecture. Hive organizes data in tables and partitions, which are
then stored into directories in the HDFS. A popular format for storing data
in Hive is ORC (standing for Optimized Row Columnar) format, which is a
columnar format for storing data. Another widely used format is Parquet,
which is also columnar and includes metadata information. Parquet is sup-
ported by many data processing systems.
In addition, Hive provides an SQL dialect called Hive Query Language
(HiveQL) for querying data stored in a Hadoop cluster. Hive works as follows.
Clients communicate with a Hive server. The request is handled by a driver,
which calls the compiler and the execution engine, which sends the job to
the MapReduce layer and receives the results, which are finally sent back to
the driver. The job is executed at the processing layer, where it is handled
by YARN, as explained above. A relational database, such as MySQL or
PostgreSQL, holds Hive metadata, which is used to translate the queries
written in HiveQL. The actual data are stored in HDFS, in the storage layer.
The Hive data model includes primitive data types such as Boolean and
int, and collection data types such as Struct, Map, and Array. Collection data
types allow many-to-many relationships to be represented, avoiding foreign
key relationships between tables. On the other hand, they introduce data
duplication and do not enforce referential integrity. We next show a simplified
representation of the table Employees from the Northwind database, where
the attributes composing a full address are stored as a Struct data type, and
the Territories attribute is an Array that contains the set of territory names to
which the employee is related. Hive has no control over how data are stored,
and supports different file and record formats. The table schema is applied
when the data are read from storage. The example below includes the file
format definition (Textfile in this case) and the delimiter characters needed
to parse each record:
CREATE TABLE Employees (
EmployeeID int, Name String,
Address Struct<Street:String, City:String,
Region:String, PostalCode:String, Country:String>,
Territories Array<String> )
ROW FORMAT
DELIMITED FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '|'
LINES TERMINATED BY '\n'
STORED AS Textfile;
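As mentioned above, HiveQL queries can also be embedded in programs written in other languages. As an illustration, the following minimal Java sketch connects to a HiveServer2 instance through JDBC and queries the Employees table defined above, navigating the Address struct with dot notation and indexing the Territories array; the host, port, database, and credentials are placeholders, not values from the original example.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
  public static void main(String[] args) throws Exception {
    // Connect to HiveServer2 through JDBC (connection parameters are placeholders)
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    Connection con = DriverManager.getConnection(
        "jdbc:hive2://localhost:10000/default", "hiveuser", "");
    Statement stmt = con.createStatement();
    // Struct fields are accessed with dot notation, array elements by index
    ResultSet rs = stmt.executeQuery(
        "SELECT Name, Address.City, Territories[0] FROM Employees");
    while (rs.next())
      System.out.println(rs.getString(1) + ", " + rs.getString(2) + ", " + rs.getString(3));
    rs.close(); stmt.close(); con.close();
  }
}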
15.2.3 Spark
In case of failure, an RDD has enough lineage information (typically, how it was derived from other RDDs) to rebuild it. We explain RDDs and provide an example later in this section.
Every Spark application needs an entry point for communicating with data
sources and performing certain operations such as reading and writing data.
In Spark version 1, three entry points existed: SparkContext, SQLContext,
and HiveContext. In Spark version 2, a new entry point called SparkSession
has been introduced, which essentially combines all functionalities available
in the above. Spark uses a master-slave architecture. The SparkContext is
used by the driver process of the Spark application running on a master
node. It communicates with a coordinator, called the cluster manager, which
manages workers running on worker nodes (the slave nodes), where executors
run a set of tasks. Thus, the SparkContext splits a job into tasks performed
by the workers. The application code, written by the user, interacts with the
driver, creating RDDs and performing a series of transformations leading to
the result. These RDD transformations are then translated into a DAG. This
logical graph is converted into a physical execution plan containing many
stages, each stage containing physical execution units, namely, the tasks.
This is then submitted to the scheduler for execution on the worker nodes.
The cluster manager then negotiates the resources. Figure 15.4 illustrates
these processes.
[Fig. 15.4 Spark architecture: the driver's SparkContext communicates with the cluster manager, which allocates executors on the worker nodes where the tasks run]
A Spark application can be submitted in two deploy modes:
• Client mode: the driver process runs on the client machine that submits the application, outside the cluster.
• Cluster mode: the client submits the driver and finishes immediately. The
master node chooses one worker node and this node launches the driver.
Thus, all the processes run inside the cluster.
In practice, a Spark application in client mode is submitted as follows,
assuming that node1 is the coordinator:
node1:$ spark-submit --master yarn --deploy-mode client
--class example.SparkAppMain $HOME/myCode.jar
Here, SparkAppMain is the name of the main program, myCode.jar is the file
to execute, and example is the name of the software package.
As mentioned above, Spark applications are built upon the notion of
RDDs. Spark provides an API for caching, partitioning, transforming, and
materializing data. An RDD can be created either from an external storage
or from another RDD. The RDD stores information about its parents to opti-
mize execution (via pipelining of operations), and to recompute the partition
in case of failure. RDDs provide an interface based on coarse-grained transfor-
mations that apply the same operation to many data items. The operations
on RDDs can be:
• Transformations: Apply a user function to every element in a partition or
to the whole partition, apply an aggregate function to the whole dataset
(through GROUP BY and SORT BY operations), and introduce depen-
dencies between RDDs to form the DAG. All transformations receive an
RDD and produce another RDD. Some operations are only available on
RDDs of key-value pairs. This is true of the join operation. Other opera-
tions include map, a one-to-one mapping; flatMap, which maps each input
value to one or more outputs (similarly to the map in MapReduce); filter,
which keeps the elements in an RDD satisfying a condition; and union.
• Actions: Launch a computation to return a value to the program or write
data to external storage. Actions receive an RDD and return values. For
example, count() receives an RDD and returns the count of its elements,
collect() returns a sequence. Other operations are reduce, lookup, and save.
The latter receives an RDD and stores it in HDFS or another storage
system.
• Other operations: Provide RDD persistence, by storing RDDs on disk.
A Spark application transforms data into RDDs. To obtain RDDs, two
strategies can be followed: parallelizing an existing Java collection in the
driver, or reading a file via a SparkContext. That means that we can read
data from different sources, but in order to process data in a distributed
fashion, we always need to transform data into RDDs, and here is where
Spark offers specific methods. We next give an example of a Java program
running a Spark application, called SparkAppMain. The application reads a
CSV file containing tweets, such that each line contains the tweet identifier,
the sentiment, the author, and the content. We want to count the number of
tweets whose sentiment satisfies a given condition.
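A minimal sketch of such an application in Java could look as follows; the sentiment value "happiness" and the position of the sentiment field in each CSV line are illustrative assumptions, and the class and package names match the spark-submit command shown above.

package example;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkAppMain {
  public static void main(String[] args) {
    // Create the SparkContext handle of the application
    SparkConf conf = new SparkConf().setAppName("SparkAppMain");
    JavaSparkContext sc = new JavaSparkContext(conf);
    // Transformation: load the CSV file passed as first argument into an RDD of lines
    JavaRDD<String> originalRDD = sc.textFile(args[0]);
    // Transformation: keep only the lines whose sentiment field matches the value sought
    JavaRDD<String> matchingRDD = originalRDD.filter(
        line -> line.split(",")[1].equals("happiness"));
    // Actions: count the elements of the RDDs; the driver aggregates the partial results
    System.out.println("Total tweets: " + originalRDD.count());
    System.out.println("Matching tweets: " + matchingRDD.count());
    sc.stop();
  }
}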
First, the Spark application creates a SparkContext handle, and from there
on, the methods can be invoked. This handle is used to load the text file into
RDDs, using the method textFile(args[0]). The RDD variable is called originalRDD. Then, another transformation operation takes place: filter, creating a
new RDD (recall that RDDs are immutable structures) called matchingRDD.
Then, an action is invoked to compute the lines satisfying the condition (us-
ing an anonymous function). Finally, the number of elements of the RDDs
is displayed. Note that each RDD computes its result, and the Driver ag-
gregates the partial results. To execute the application in YARN mode, we
would write
node1:$ spark-submit --master yarn --deploy-mode cluster
--class example.SparkAppMain $HOME/myCode.jar file:///$HOME/text_emotion.csv
Spark SQL
Spark SQL is a Spark module for structured data processing. For this,
Spark SQL uses the concepts of DataFrame and Dataset. Like an RDD,
a DataFrame is an immutable distributed collection of data. However, un-
like an RDD, a DataFrame is organized into named columns, like a table
in a relational database. DataFrames allow developers to impose a structure on a distributed collection of data, enabling higher-level abstractions and optimizations.
For writing Spark SQL applications, the typical entry point into all SQL
functionality in Spark is the SparkSession class. For example, a SparkSession
handle can be obtained in a Java program as follows:
import org.apache.spark.sql.SparkSession;
SparkSession spark = SparkSession
.builder()
.appName("a name for the application")
.config("spark.some.config.option", "some-value")
.getOrCreate();
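The handle can then be used to create DataFrames from data sources. As a minimal continuation of the snippet above (the file employees.json and its contents are assumptions introduced for illustration), a DataFrame df could be obtained as follows:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Create a DataFrame by reading a data source; the file name is illustrative
Dataset<Row> df = spark.read().json("employees.json");
df.printSchema();            // show the schema inferred from the data
df.select("Name").show();    // project one column and display the result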
Associated with the DataFrame object df, there are a series of methods that
allow it to be manipulated, for example select, filter, groupBy, and many more.
In other words, a Spark SQL program can be written over this DataFrame.
Spark SQL can also be run programmatically, that is, SQL statements can
be passed as arguments to the API. For this, views must be defined. Tem-
porary views in Spark SQL are views that are valid within a session. If a
view must be shared across several sessions, a global temporary view must
be created, associated with a system database global_temp. A view can be
used as follows.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
// Registers the Dataset as an SQL temporary view
df.createOrReplaceTempView("Employee");
Dataset<Row> sqlDF = spark.sql("SELECT * FROM Employee");
sqlDF.show();
Two or more DataFrames can be joined. For this, a view must be defined for each DataFrame, e.g., with a statement like
df2.createOrReplaceTempView("Department");
where df2 is a second DataFrame.
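The join can then be expressed in the SQL statement passed to spark.sql. A possible sketch is shown below; the column names DepartmentID and Name are assumptions about the example data, not part of the original listing.

// Join the two temporary views with a programmatically submitted SQL statement
Dataset<Row> joinDF = spark.sql(
    "SELECT e.*, d.Name AS DepartmentName " +
    "FROM Employee e JOIN Department d ON e.DepartmentID = d.DepartmentID");
joinDF.show();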
Datasets can be created in Java and Scala. They use a specialized en-
coder to serialize the objects for processing or transmitting over the network.
RDDs can be converted into Datasets using two methods. One of them in-
fers the schema, and the other one uses a programmatic interface that allows
construction of a schema and its application to an existing RDD.
Parquet files are another popular data source allowed by Spark. Spark
SQL provides support for both reading and writing Parquet files that auto-
matically capture the schema of the original data. Columnar storage is effi-
cient for summarizing data, and follows type-specific encoding. DataFrame data can be easily written to a Parquet file with a simple statement like DataFrame.write.parquet("myPath"). Analogously, Parquet files are read into Spark SQL using spark.read.parquet("myPath").
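In the Java API, for instance, this could look as follows; the path and view name are illustrative, and df is the DataFrame created earlier.

// Write a DataFrame to a Parquet file and read it back; the path is illustrative
df.write().parquet("employees.parquet");
Dataset<Row> parquetDF = spark.read().parquet("employees.parquet");
parquetDF.createOrReplaceTempView("EmployeeParquet");
spark.sql("SELECT COUNT(*) FROM EmployeeParquet").show();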
Spark SQL allows several optimizations such as caching, join hints, config-
uration of partition sizes, configuration of partitions when shuffling, among
others. Diving into this is outside the scope of this book.
15.2.4 Hadoop versus Spark

The discussion about Hadoop versus Spark has been going on since Spark first appeared. The two systems are appropriate for different scenarios, and can even work together, as we explain next.
In Sect. 15.2.1, we have seen that Hadoop is written in Java, and Map-
Reduce programs can be written on Hadoop using many different program-
ming languages. There is a software ecosystem for performing specialized
tasks on Hadoop, including data capturing, transforming, analyzing, and
querying. Further, Hadoop can also work over cloud storage, as in the case of Amazon S3 buckets or Azure blobs. Unlike Hadoop, which relies on perma-
nent disk storage, Spark processes data in RAM in a distributed way, using
RDDs, DataFrames, or Datasets, and can obtain data from HDFS, S3, or
any RDBMs. Further, Spark comes equipped with APIs for executing SQL
queries and running machine learning or graph data algorithms, using differ-
ent programming languages.
Given the characteristics above, reports indicating that Spark runs up
to two orders of magnitude faster in memory, and one order of magnitude
on disk are not surprising. Obviously, Spark is faster on machine learning
applications since MapReduce is not efficient in these cases. Anyway, there
are situations where Hadoop may be more efficient than Spark, for example
when Spark runs on YARN with other shared services, because of the RAM
overhead. For this reason, Hadoop has been found to be more efficient than
Spark for batch processing. Also, Hadoop is highly fault-tolerant because
of the data replication across many nodes. In Spark, since RDDs reside in
memory, data may become corrupted if there is a node or network failure,
although in general they can be rebuilt.
Regarding query languages, Hive QL is used to query data stored on HDFS,
to avoid writing MapReduce code. Spark SQL, on the other hand, can be used
to query data within Spark, although these data could be obtained from
different sources, and it is mainly used to query real-time data. Hive QL is
typically used with Java programs, while Spark SQL supports APIs for Scala,
Java, Python, and R. Since the original goal of Hive has been to provide data
warehouse functionality for big data, its main use is to run aggregate queries.
However, as we mentioned above, Spark SQL can be used also for querying
Hive tables.
Nevertheless, there are many cases where the two systems work together
and complement each other. When both batch and stream analysis are
needed, Hadoop may be more convenient (because of lower hardware costs),
and Spark can be used for real-time processing. Anyway, this is an ongoing
debate at the present time.
15.2.5 Kylin
ble aggregations for these two records. For example, (p1,c1,*) contributes to
the aggregation by Product and Customer. The second table from the left
shows the shuffling that MapReduce computes for the same keys. Finally, the
aggregation is computed. The key-value pairs are then stored in HBase in
columnar format.
Kylin also supports Apache Spark for building the data cube. Here, RDDs
are at the cornerstone of data cube computation. These RDDs have a parent-
child relationship between them, as the parent RDD can be used to generate
the children RDDs. With the parent RDD cached in memory, the child RDD’s
generation can be more efficient than reading from disk. Figure 15.6 describes
this process.
In summary, Kylin allows three abstraction levels of the OLAP model to
be defined. The end user works at the logical level with the star/snowflake
schema. The cube modeler is aware of the data cubes built from the logical
model and knows the cubes that are materialized. Finally, at the physical
level, the administrator takes care of the (distributed) storage in HBase.
From the technological side, the Kylin stack is composed as follows. The
logical model is stored in Hive files over Hadoop. The cubes are computed
with MapReduce or Spark and stored in HBase. Therefore, a Kylin installa-
tion requires, at least, Hadoop, YARN, Zookeeper, and HBase. In addition,
since typically a data warehouse is obtained from a relational database, Sqoop
is normally used to import data from an RDBMS into Hive. Figure 15.7 de-
picts the Kylin architecture. Queries are submitted via third-party applica-
tions, the Kylin web interface, or via JDBC/ODBC from BI tools, such as
Tableau or Saiku. Therefore, queries can be written in SQL or MDX, though
they are finally submitted via SQL. SQL is handled by Apache Calcite, an
open-source framework for building databases and DBMSs, which includes
an SQL parser, an API for building expressions in relational algebra, and
a query planning engine. An SQL query is handled as follows: If the query
matches one of the materialized cubes, it is sent by the routing tool to the
HBase repository where the cubes are stored as HFiles, thus achieving low la-
tency. Otherwise, they are evaluated by the Hive server, which will obviously
take much longer to execute.
[Fig. 15.7 The Kylin architecture: queries arrive through a REST server and are handled by the query engine, which uses metadata about the materialized cubes to route them either to the HBase cube storage (low latency) or to Hive]
becomes harder with hash partitioning, since it may require use of a different
hash function on an attribute different from the initial one.
Parallel algorithms for efficient query processing are designed to take ad-
vantage of data partitioning. This requires a trade-off between parallelism
and communication costs. Since not all operations in a program can be
parallelized, algorithms for relational algebra operators aim at maximizing
the degree of parallelism. For example, highly sequential algorithms, such as
Quicksort, are not good for parallelization. On the other hand, the sort-merge
algorithm is appropriate for parallel execution, and thus it is normally used
in shared-nothing architectures.
[Figure: a database cluster architecture, where each node runs DB-cluster middleware on top of a DBMS and the nodes communicate through an interconnect]
application, which can connect to any node in the cluster and then queries
will automatically access the right shard(s). While queries are processed in
parallel, join and filter operations are executed locally on the partitions and
data nodes. A mechanism called Adaptive Query Localization (AQL) allows
joins to be distributed across the data nodes, enhancing local computation
and reducing the number of messages being passed around the system. In
other words, joins are pushed down to the data nodes.
A typical MySQL Cluster architecture has three layers: the client, the
application, and the data layer. The core of the architecture is the data
layer, which is composed of two kinds of nodes: data nodes and management
nodes. The application layer is composed of application nodes, which receive
the clients’ requests and pass them on to the data nodes.
Data nodes are organized in groups. There can be more than one data
node per physical server, and a set of physical servers compose a cluster. The
data nodes receive read and write requests from the clients and also take care
of several tasks in the cluster as follows:
• Storage and management of both in-memory and disk-based data. MySQL
can store all data in memory or some of these data can be stored on disk.
However, only non-indexed data can be stored on disk; indexed columns must always be in memory.
• Automatic and user-defined sharding of tables.
• Synchronous replication of data between data nodes.
• Transactions, data retrieval, recovery, and rollback to a consistent state.
Management nodes provide services for the cluster as a whole. They are
responsible for publishing the cluster configuration to all nodes in the cluster
and also for startup, shutdown, and backups. They are also used when a node
wants to join the cluster and when there is a system reconfiguration. Man-
agement nodes can be stopped and restarted without affecting the ongoing
execution of the other kinds of nodes. Note that there can be more than one
management node per cluster, and they are defined in a configuration file.
Application nodes, also called SQL nodes, compose the application layer
and are instances of MySQL Server that connect to all of the data nodes
for data storage and retrieval and support the NDB Cluster storage engine.
Thus, MySQL Server is used to provide an SQL interface to the cluster. Data
nodes can also be queried using any of the APIs mentioned at the beginning
of this section. All application nodes can access all data nodes. Therefore if
a data node fails applications can simply use the remaining nodes.
In-memory data are regularly saved to disk for recovery, performing check-
points (this is coordinated for all the data nodes). Data can be stored on
disk when the data set is bigger than the available RAM and has less-strict
performance requirements. Again, this is similar to other in-memory sys-
tems that are studied in Sect. 15.4. These in-memory optimized tables pro-
vide the ability to serve real-time workloads with low latency, one of the
main features of MySQL Cluster. However, note that for large databases that exceed the available memory, disk-based tables must be used, with the corresponding impact on performance.
15.3.2 Citus
[Fig. 15.9 Distributed tables in Citus: a query over a distributed table (SELECT * FROM TABLE) is submitted to the Citus coordinator, which uses its metadata table to rewrite it into queries over the shards (SELECT * FROM TABLE_1001, TABLE_1002, ...); each node runs PostgreSQL with the Citus extension installed, and each shard is a PostgreSQL table on a data node]
There are three types of tables in a cluster, each used for different purposes.
• Distributed tables, which are horizontally partitioned across worker
nodes. Citus uses algorithmic sharding to assign rows to shards. This
assignment is made based on the value of a particular table column called
the distribution column. The cluster administrator must designate this
column when distributing a table. Figure 15.9 shows an example of this
partitioning.
• Reference tables, which are replicated on every worker. Thus, queries
on any worker can access the reference information locally, without the
network overhead of requesting rows from another node. Reference tables
are typically small, and are used to store data that is relevant to queries
running on any worker node. Examples are enumerated values such as
product categories or customer types.
• Local tables, which are traditional tables that are not distributed. This
is useful for small tables that do not participate in join queries.
Since Citus uses the distribution column to assign table rows to shards,
choosing the best distribution column for each table is crucial to put related
data together on the same physical nodes, even across different tables. This is
called data co-location. As long as the distribution column provides a mean-
ingful grouping of data, relational operations can be performed within the
groups. In Citus, a row is stored in a shard if the hash of the value in the dis-
tribution column falls within the shard’s hash range. To ensure co-location,
shards with the same hash range are always placed on the same node. As
an example, consider the Northwind data warehouse, shown in Fig. 5.4. The
Employee table can be defined as a reference table, as follows:
SELECT create_reference_table('Employee');
The fact table Sales can be defined as a distributed table with distribution
column CustomerKey.
SELECT create_distributed_table('Sales', 'CustomerKey');
That means that, for example, if the Customer table is also partitioned using
the CustomerKey attribute, it will be co-located, and thus efficiently joined,
with the Sales table.
The pg_dist_shard metadata table on the coordinator node contains a row
for each shard of each distributed table in the system. This row matches a
shardid with a range of integers in a hash space (shardminvalue, shardmax-
value). If the coordinator node wants to determine which shard holds a row
of a given table, it hashes the value of the distribution column in the row,
and checks which shard’s range contains the hashed value.
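The following Java fragment is a purely conceptual sketch of this lookup, not actual Citus code: Citus uses PostgreSQL's internal hash functions and the pg_dist_shard catalog, whereas here hashCode() and the class names are stand-ins introduced only for illustration.

import java.util.List;

// Conceptual sketch: find the shard whose hash range contains the hashed
// value of the distribution column, as recorded in pg_dist_shard
class ShardRange {
  long shardId;
  int minValue, maxValue;    // hash range boundaries (shardminvalue, shardmaxvalue)
  ShardRange(long shardId, int minValue, int maxValue) {
    this.shardId = shardId; this.minValue = minValue; this.maxValue = maxValue;
  }
}

class ShardLocator {
  static long shardFor(Object distributionValue, List<ShardRange> shards) {
    int h = distributionValue.hashCode();   // stand-in for the PostgreSQL hash function
    for (ShardRange s : shards)
      if (h >= s.minValue && h <= s.maxValue)
        return s.shardId;
    throw new IllegalStateException("No shard covers hash value " + h);
  }
}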
Citus uses a two-stage optimizer when planning SQL queries. The first
phase involves transforming the queries so that query fragments can be
pushed down and run on the workers in parallel. The distributed query plan-
ner applies several optimizations to the queries depending on the distribution
column and distribution method. The query fragments are then sent to the
workers to start the second phase of query optimization. The workers simply
apply PostgreSQL’s standard planning and execution algorithms to run the
SQL query fragment. Explaining query processing in Citus in more detail is
outside the scope of this chapter.
[Fig. 15.10: typical in-memory database architecture. Main memory holds the active data in a main store and a buffer store, together with additional structures (indexes, object guides, prejoins, ...); data are periodically merged from the buffer store into the main store. Historic (passive) data, database snapshots, and the transaction log reside in a traditional disk-based database.]
columns. The buffer store is a write-optimized data structure that holds data
that have not yet been moved to the main store. This means that a query can
need data from both the main store and the buffer. The special data struc-
ture of the buffer normally requires more space per record than the main
store. Thus, data are periodically moved from the buffer to the main store, a
process that requires a merge operation. There are also data structures used
to support special features. An example is inverted indexes for fast retrieval
of low-cardinality data.
Finally, although the data are kept in main memory, IMDBSs also store data
persistently, both to save memory space and to guarantee durability. This is done as follows.
The most recent data are kept in main memory, since these are the data
most likely to be accessed and/or updated. These data are called active.
By contrast, passive data are data not currently used by a business process,
used mostly for analytical purposes. Passive data are stored on nonvolatile
memory, even using traditional DBMSs. This supports so-called time-travel
queries, which allow the status of the database as of a certain point in time
to be known. The partitioning of data into active and passive sets is performed
by data aging algorithms. Nonvolatile memory is also used to guarantee
consistency and recovery under failure: Data updates are written in a log file,
and database snapshots are kept in nonvolatile memory, to be read in case
of failure. This combination of main and nonvolatile memory allows IMDBSs
to support OLTP transactions and OLAP analysis at the same time, which
is the basis of HTAP systems (see Sect. 15.9).
is updated and stored back into the central database. The IMDB cache can
also be used as a read-only cache, for example, to provide fast access to aux-
iliary data structures, such as look-up tables. Data aging is an operation
that removes data that are no longer needed. There are two kinds of data
aging algorithms: ones that remove old data based on a timestamp value, and
ones that remove the least recently used data.
Like the other systems described in this chapter, TimesTen uses compres-
sion of tables at the column level. This mechanism provides space reduction
for tables by eliminating duplicate values within columns, improving the per-
formance of SQL queries that must perform full table scans.
Finally, TimesTen achieves durability through transaction logging over a
disk-based version of the database. TimesTen maintains the disk-based ver-
sion using a checkpoint operation that occurs in the background, with low
impact on database applications. TimesTen also has a blocking checkpoint
that does not require transaction log files for recovery and must be initiated
by the application. TimesTen uses the transaction log to recover transactions
under failure, undo transactions that are rolled back, replicate changes to
other TimesTen databases and/or to Oracle tables, and enable applications
to detect changes to tables.
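Checkpoints can also be requested explicitly from SQL; a minimal sketch, assuming the TimesTen built-in procedures ttCkpt (fuzzy, non-blocking checkpoint) and ttCkptBlocking, is:

-- Request a fuzzy (non-blocking) checkpoint of the in-memory database
CALL ttCkpt;
-- Request a blocking checkpoint, which does not need the transaction logs for recovery
CALL ttCkptBlocking;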
Oracle commercializes an appliance called Oracle Exalytics In-Memory
Machine, which is similar to the SAP HANA appliance (Sect. 15.7.2). Exa-
lytics is composed of hardware, BI software, and an Oracle TimesTen IMDBS.
The hardware consists of a single server configured for in-memory analytics
of business intelligence workloads. Exalytics comes with the Oracle BI Foun-
dation Suite and Essbase, a multidimensional OLAP server enhanced with
a more powerful MDX syntax and a high-performance MDX query engine.
Oracle Exalytics complements the Oracle Exadata Database Machine, which
supports high performance for both OLAP and OLTP applications.
15.4.2 Redis
Redis is an in-memory key-value storage system (see Sect. 15.6 for details on key-value stores).
It enables data to be stored across nodes over a network and also supports
replication. Multiple machines provide data storage services together. These
are called master nodes. Each master node is responsible for storing parts of
the data. Different clients can visit these nodes independently, and read and
write data. To ensure data safety, there is at least one slave node for each
master node to replicate data, so when the master is down, the slave node
can replace it to restore the service.
Since Redis is a key-value storage system, the main problem is the as-
signment of keys to nodes. Redis uses a key-slot-node mapping strategy to
maintain the data distribution status. To map keys to nodes, there is first a
hash stage, where, for a given key, Redis calculates its hash value (using the
CRC16 function). All hash values are evenly assigned to slots, of which there
are 16,384 in total. In a
second stage, called the slot mapping stage, the corresponding slot is assigned
to a key based on its hash value. All slots can be dynamically assigned to
different nodes. In this case, if any node is under heavy workload, some parts
can be moved from their slots to other nodes or new nodes can even be cre-
ated. Thus, in a third stage, the corresponding node of the given key is found
based on the slot-node mapping status. In summary, given a key, it is hashed,
a slot is found, and, finally, the key is assigned to a node.
We have seen that in clustered systems such as Hadoop, important infor-
mation (e.g., number of nodes, their IP addresses and port numbers, current
status) is stored on a central node of the cluster, to simplify the design of the
distributed system, at the cost of creating a single point of failure. To avoid
this, Redis uses a fully decentralized design, which uses the Gossip protocol
to maintain this information on all nodes. All nodes in the system are fully
connected and know the current state of the system. Every time the system’s
status changes, this new information is propagated to every node. Nodes will also
randomly send PING messages to other nodes and expect to receive PONG
messages to prove that the cluster is working correctly. If any node is found
to be not working properly, all other nodes will participate in a vote to use
a slave node to replace the crashed master node. With this design, clients
can connect to any master node which, upon reception of a key-value request
from a client, checks whether this key belongs to this node using the key-
slot-node mapping. If so, it processes the request and returns the results to
the client. Otherwise, it returns an error containing the right node for this
request, so the client knows which node it should send the request to.
Redis uses two mechanisms for data protection: persistence and repli-
cation. The so-called RDB (Redis DB) persistence periodically performs
snapshots of the data in memory and writes the compressed file on disk. Al-
ternatively, the AOF (Append Only File) persistence mechanism logs every
write operation received by the server. To restore the service, these opera-
tions are processed again. Data replication allows faster recovery than persistence:
when data are replicated on different nodes, the service can be restored quickly
after a crash by replacing the master node with a slave node, whereas with
persistence the data must first be re-read from disk.
data (as in OLAP), other structures can do better. This is the case for column-
store database systems, where the values of each column (or attribute)
are stored contiguously on disk pages, so that a page holds values of a single
column (or a few columns) and a database record is thus scattered across many
different disk pages. We study this architecture next.
[Figure 15.11: disk page organization. (a) Row-store organization: each page stores complete rows, with all columns C1-C8 together. (b) Column-store organization: each column C1-C8 is stored contiguously across its own pages.]
Figure 15.11a shows the row-store organization, where records are stored
in disk pages, while Fig. 15.11b shows the column-store alternative. In most
systems, a page contains a single column. However, if a column does not fit
in a page, it is stored in as many pages as needed. When evaluating a query
over a column-store architecture, a DBMS just needs to read the values of the
columns involved in the query, thus avoiding loading into memory irrelevant
attributes. For example, consider a typical query over the Northwind data
warehouse:
SELECT CustomerName, SUM(SalesAmount)
FROM Sales S, Customer C, Product P, Date D, Employee E
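-- A plausible completion of the query (surrogate-key and attribute names are
-- assumed from the star schema of Fig. 5.4; the filter values are illustrative):
WHERE S.CustomerKey = C.CustomerKey AND S.ProductKey = P.ProductKey AND
      S.DateKey = D.DateKey AND S.EmployeeKey = E.EmployeeKey AND
      E.City = 'Berlin' AND D.Year = 2017 AND P.Discontinued = 'No'
GROUP BY CustomerName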
Depending on the query evaluation strategy, the query above may require
all columns of all the tables in the FROM clause to be accessed, totaling 51
columns. The number of columns can increase considerably in a real-world
enterprise data warehouse. However, only 12 of them are actually needed
to evaluate this query. Therefore, a row-oriented DBMS will read into main
memory a large number of columns that do not contribute to the result and
which will probably be pruned by a query optimizer. In contrast,
a column-store database system will just look for the pages containing the
columns actually used in the query. Further, it is likely that the values for
E.City, D.Year, and P.Discontinued will fit in main memory.
[Fig. 15.12 Storing columns of the Sales fact table, one table per column, in run-length-encoded form (f: first row position, v: value, l: run length). (a) Fact table Sales; (b) Column EmployeeKey: (1, e1, 5), (6, e2, 3), (9, e3, 2), ...; (c) Column CustomerKey: (1, c1, 2), (3, c2, 6), (9, c3, 2), ...; (d) Column ProductKey: (1, p1, 1), (2, p4, 4), (6, p5, 2), (8, p1, 1), (9, p2, 2), ...]
We can perform a variation of a sort-merge join which just accesses the pages
containing the required columns. Thus, we set two cursors at the beginning
of the partitions for ProductKey and CustomerKey. The first runs of values of
these attributes have lengths one and two, respectively. Thus, the first pair
(p1, c1) is produced. We advance the cursors. Now the cursor for ProductKey
is at the first position of the second run, and the cursor CustomerKey is at
the second position of the first run; thus, we build the second tuple, (p4, c1).
The run for CustomerKey = c1 is finished, so when we advance this cursor it
is positioned at the beginning of a run of length six with the value c2; the cursor
over ProductKey is at the second position of a run of length four with the
value p4. Therefore, we will now retrieve three tuples of the form (p4, c2). We
continue in this fashion. Note that once the column stores are sorted, this
join between columns can be done in linear time.
15.5.1 Vertica
column: one contains the column itself and the other contains the position in-
dex. At the ROS, partitioning and segmentation are applied to facilitate
parallelism. The former, also called intra-node partitioning, splits data hori-
zontally, based on data values, for example, by date intervals. Segmentation
(also called inter-node partitioning) splits data across nodes according to a
hash key. When the WOS is full, data are moved to the ROS by a moveout
function. To save space in the ROS, a mergeout function is applied (this is
analogous to the merge operation in Fig. 15.10).
Finally, although inserts, deletes, and updates are supported, Vertica may
not be appropriate for update-intensive applications, such as heavy OLTP
workloads that, roughly speaking, exceed 10% of the total load.
15.5.2 MonetDB
Citus (described in Sect. 15.3.2) recently added columnar storage for read-
only operations. This enhances its capability for handling OLAP queries. As
a consequence, Citus now has three features that are relevant for data ware-
houses: distribution, columnar storage, and data compression (which comes
together with columnar storage). To illustrate the relevance of columnar stor-
age for OLAP queries, we use the example below.
CREATE EXTENSION Citus;
CREATE TABLE perf_row_30 (c00 int8, c01 int8, c02 int8, c03 int8, c04 int8, ..., c28 int8, c29 int8);
CREATE TABLE perf_columnar_30 (LIKE perf_row_30) USING COLUMNAR;
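As an illustration of the kind of query meant here (the aggregate functions and the column choices are assumptions), an aggregation touching only a few of the thirty columns can be run once against each table:

-- Under columnar storage only the pages of columns c00, c01, and c02 are read
SELECT SUM(c00), AVG(c01), MAX(c02) FROM perf_row_30;
SELECT SUM(c00), AVG(c01), MAX(c02) FROM perf_columnar_30;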
On a four-machine cluster, the query took almost two minutes to run. The
same query using columnar storage took less than seven seconds, because it
only reads the columns mentioned in the query, instead of all 30 columns as
in the row-storage alternative. Running the same query over the perf_row_100
table (defined analogously, with 100 columns) takes almost five minutes:
although the number of records is the same, the table occupies more pages,
and therefore more pages must be read. The columnar alternative, however,
still takes about seven seconds, since only the required columns are accessed.
This shows that columnar storage is crucial for a large variety of OLAP queries,
particularly over wide tables. Of course, using the distribution capabilities of
Citus would further enhance query performance.
Relational databases are designed for storing and querying structured data
under strong consistency requirements. However, traditional database solu-
tions do not suffice for managing massive amounts of data. To cope with
such big data scenarios, a new generation of databases have been developed,
known as NoSQL database systems. This term accounts for databases
that do not use relations to store data, do not use SQL for querying data,
and are typically schemaless. Therefore, they go beyond the “one size fits
all” paradigm of classic relational databases, which refers to the fact that the
traditional DBMS architecture has been used to support many data-centric
applications with widely varying characteristics and requirements. NoSQL
data stores are categorized into four classes, as follows:
• Key-value stores, which consist of a set of key-value pairs with unique
keys. A value represents data of arbitrary type, structure, and size,
uniquely identified by a key. This very simple structure allows these sys-
tems to store large amounts of data, although they only support very
simple update operations.
• Wide-column stores, in which data are represented in a tabular format of
rows and a fixed number of column families, each family containing an
arbitrary number of columns that are logically related to each other and
usually accessed together. Data here are physically stored per column-
family instead of per row. The schema of a column-family is flexible as
its columns can be added or removed at runtime.
• Document stores, where values in the key-value pairs are represented as
a document encoded in standard semistructured formats such as XML,
JSON, or BSON (Binary JSON). This brings great flexibility for access-
ing the data, compared with simple key-value stores. A document has a
flexible schema, and attributes can be added or deleted at runtime.
• Graph stores, typically used to store highly connected data, where queries
require intensive computation of paths and path traversals. Although
15.6.1 HBase
Figure 15.14 depicts a scheme of the HBase data model. We can see that
there is a row identifier which acts as the primary key for the table. Columns
can be added on the fly to any column family at any time, and, of course, null
values are allowed for any column. Therefore, HBase is schemaless: there
is no fixed schema, and only the column families are defined in advance.
Further, when inserting data into HBase, there is an associated timestamp,
which is generated automatically by a so-called RegionServer or supplied by
the user, and defines the version of a value. Note that, physically, data are
stored on a column-family basis rather than on a column basis. That is, groups
of rows of the form (row key, timestamp, column family) are stored together.
With respect to the CAP theorem, since HBase is based on Hadoop, there is a single
point of failure; thus HBase favors partition tolerance and consistency, at the
expense of availability. HBase provides auto-sharding, which means that it
dynamically distributes tables when they become large.
[Figure 15.15: HBase architecture, showing the HBase API, the RegionServers, ZooKeeper, and HDFS.]
Figure 15.15 shows the HBase architecture, built on top of HDFS. The
main components are the MasterServer and the RegionServer. The Master-
Server assigns regions to a RegionServer using ZooKeeper, the task coordina-
tor introduced in Sect. 15.2.1. The clients communicate with a RegionServer
via ZooKeeper. The MasterServer also carries out load balancing of the re-
gions in the RegionServers. The RegionServers host and manage regions, split
them automatically, handle read/write requests, and communicate with the
clients directly. Each RegionServer contains a write-ahead log (called HLog)
and serves multiple regions. Each region is composed of a MemStore and
multiple store files, each called an HFile, where data live as column families.
The MemStore holds in-memory modifications to the data, and an HFile is
generated when the MemStore is full and flushed to disk. The process is fur-
ther explained in Fig. 15.16. The mapping from the regions to a RegionServer
is kept in a system table called .META, where each RegionServer maps to
a collection of regions within a range of row identifiers. Each region (i.e.,
a machine, called m*.host in the figure) holds a collection of indexed rows,
stored in column family basis. When trying to read or write data, the HBase
clients read the required region information from the .META table and com-
municate with the appropriate RegionServer. When a client asks for rows in
HBase, the process is similar to what happens in a typical DBMS: It first
reads the MemStore. If the record is not there, it reads the RegionServer’s
block cache (a least-recently-used cache), and, if that also fails, it loads the
HFiles into memory.
[Figure 15.16: mapping of regions to RegionServers through the .META table. Each entry maps a table and a starting sort key to a region identifier and its RegionServer (e.g., T1/row 1 → region 1 on m1.host, T1/row 100 → region 2 on m2.host, T1/row 200 → region 3 on m1.host, T1/row 300 → region 4 on m2.host, T1/row 400 → region 5 on m6.host, T1/row 500 → region 6 on m6.host); each RegionServer holds the corresponding ranges of rows.]
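The Orders table used in the following example can be created and populated in the HBase shell along these lines (a minimal sketch; the table has two column families, OrNbr and OrDetails, and the timestamps are generated automatically):

# Create a table with two column families and insert one row with three columns
create 'Orders', 'OrNbr', 'OrDetails'
put 'Orders', '1', 'OrNbr:nbr', '123'
put 'Orders', '1', 'OrDetails:item1', '12'
put 'Orders', '1', 'OrDetails:item2', '25'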
Here, the value 1 indicates the row number and the key-value pairs are
(nbr,123), (item1,12), and (item2, 25).
Listing the table with the command scan ’Orders’ results in:
1 column = OrNbr:nbr, timestamp = 3545355, value = 123
1 column = OrDetails: item1, timestamp = 3545445, value = 12
1 column = OrDetails: item2, timestamp = 3545567, value = 25
...
15.6.2 Cassandra
Cassandra is based on a peer-to-peer architecture, instead of a master-slave one, where all nodes have the same
role. In this architecture, one or more nodes in a cluster act as replicas for
given data. Therefore, with respect to the CAP theorem, Cassandra favors
partition tolerance and availability at the expense of consistency. Cassandra uses the
Gossip protocol to allow the nodes to communicate with each other to detect
any failing nodes in the cluster. Figure 15.17 shows the replication scheme.
[Figure 15.17: Cassandra replication scheme — a ring of six nodes (node 1 to node 6) in which each node's data are replicated on other nodes of the ring.]
[Figure: Cassandra data model — a keyspace contains column families; each row, identified by a row key (e.g., ROW KEY 1, ROW KEY 2), holds a possibly different set of columns with their values.]
The CREATE TABLE statement is used to create a table (that is, a column
family) as follows:
CREATE TABLE Customer (
CompanyName text,
CustEmail text,
City text,
State text,
Country text,
PRIMARY KEY (CompanyName)
);
Here, Cassandra will store, for each customer, the data specified in the
columns together with a system-generated timestamp. Columns can be added
and dropped with the ALTER TABLE commands.
We can insert data into the previous table as follows:
INSERT INTO Customer(CompanyName, CustEmail, City, State, Country) VALUES
('Europe Travel', '[email protected]', 'Antwerp', 'Antwerp', 'Belgium')
If we would also like the partition key to be the customer’s email, for
example, because our queries will perform a lookup based on the email, we
would need to define another table where the email is defined as the primary
key. Note that if we want to get all customers from the same country stored
together, the partition key must be defined to be the country; therefore the
primary key must be (Country, CompanyName), because the partition key
is the first element in the primary key. If we define the partition key over
multiple columns, we obtain a composite partition key. For instance, if we define
((Country, State), CompanyName) as the primary key, data are partitioned by
country and state, because the pair (Country, State) is now the first element of
the primary key.
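A minimal sketch of such a table definition (the table name is illustrative) is:

CREATE TABLE CustomerByCountryState (
   Country text,
   State text,
   CompanyName text,
   CustEmail text,
   PRIMARY KEY ((Country, State), CompanyName)
);

Here (Country, State) is the composite partition key, so that all customers of the same country and state are stored in the same partition, while CompanyName acts as a clustering column within each partition.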
a power failure, the database can be restarted from the savepoints like a
disk-based database: The database logs are applied to restore the changes
that were not captured in the savepoints, ensuring that the database can be
restored in memory to the same state as before the failure.
Finally, compression is performed using data dictionaries. For this, each
attribute value in a table is replaced by an integer code, and the correspon-
dence of codes and values is stored in a dictionary. For example, in the City
column of the Customer table, the value Berlin can be encoded as 1, and the
tuple (Berlin, 1) will be stored in the dictionary. Thus, if needed, the corre-
sponding value (Berlin, in this case) will be accessed just once. Therefore, data
movement is reduced without imposing the additional CPU load for decompression
required, for example, by run-length encoding. The compression factor achieved is
highly dependent on the data being compressed. Attributes with few distinct
values compress well (e.g., if we have many customers from the same city),
while attributes with many distinct values do not benefit as much.
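The idea behind dictionary encoding can be illustrated with plain SQL as follows. This is only a conceptual sketch, not SAP HANA's internal storage format:

-- Dictionary for the City column: each distinct value receives an integer code
CREATE TABLE CityDictionary (
   Code  INT PRIMARY KEY,
   Value VARCHAR(50) );
INSERT INTO CityDictionary VALUES (1, 'Berlin'), (2, 'Brussels');
-- The column store then keeps only the codes; the string 'Berlin' is stored
-- once in the dictionary, no matter how many customers live in Berlin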
15.7.3 VoltDB
We have seen when studying MOLAP and MDX that multidimensional ar-
rays are the natural representation for data cubes. Similarly, in most science
and engineering domains, arrays represent spatiotemporal sensor, image, and
simulation output, among other kinds of data. For example, temperature data
can be seen as a data cube with four dimensions (latitude, longitude, alti-
tude, and time), and one or more measures representing temperatures (e.g.,
minimum, maximum, average). Classic database technology does not support
arrays adequately. Therefore, most solutions are proprietary, which produces
information silos and requires data to be moved between relational and multi-
dimensional databases. Array database systems attempt to close this gap
by providing declarative query support for flexible ad-hoc analytics on large
multidimensional arrays. Potentially, array databases can offer significant ad-
vantages in terms of flexibility, functionality, extensibility, performance, and
scalability. The possibility of having data cubes in array databases may also
close the gap between the ROLAP and MOLAP approaches. There are systems
in the database market that may be considered a serious option for data
cube storage and querying.
Arrays are partitioned in subarrays, called tiles or chunks, which are the
unit of access to disk storage. This partitioning strategy impacts query per-
formance. Thus, it is desirable to allow dynamic partitioning, or to statically
define a partition strategy appropriate to the expected query workload. Ar-
ray database systems such as rasdaman and SciDB, which we introduce in
this section, use different strategies. Based on the degree of grid alignment,
tiling schemes can be classified into aligned and non-aligned. The former can
[Figure: tiling schemes — (a) aligned regular, (b) aligned irregular, (c) partially aligned, (d) totally nonaligned.]
In the query above, mdavg performs an iterative average over the array extent,
and an image is produced as a result. We do not give further details on array
algebras and refer the reader to the bibliography; the query languages of the
rasdaman and SciDB systems are commented on below.
Regarding query processing, unlike in relational DBMSs, array operations are
strongly CPU-bound. Some array operations can be easily parallelized, such
as simple aggregations, and distributed on either local or remote processing
nodes. In other cases, queries must be rewritten as parallelizable operations.
In the first case, parallelization across several cores in one node allows vertical
scalability to be exploited; in the second case, distributed processing requires
strategies based on data location, intermediate results transfer costs, and
resource availability, among others. Two main techniques are used: query
rewriting and cost-based optimization. Query rewriting takes a query and
rewrites it into an equivalent one. The array algebra facilitates this task,
since it allows equivalence rules to be defined. Different systems use different
sets of rules. For example, rasdaman (Sect. 15.8.1) has about 150 such rules.
Cost-based optimization aims at finding an efficient query execution plan
from a space of possible plans. This involves knowing the approximate costs
of processing different operations.
Array DBMSs can be classified in three groups, described next.
First, generic array DBMSs provide characteristic features such as a query
language, multi-user concurrent operation, dedicated storage management,
and so on. Examples of these kinds of system are discussed next.
Second, object-relational extensions are based on object-relational DBMSs,
which allow users to define new data types and operators. However, in these
systems, an array is a type constructor, not a data type; thus it cannot provide
abstraction, only instantiated data types. Examples of these kinds of systems
are PostGIS raster, Oracle GeoRaster, and Teradata arrays.
Finally, array tools are specific-purpose tools for operating with arrays. Ex-
amples are TensorFlow, a machine learning library, Xarray, a Python package
for working with arrays, and Xtensor, a C++ library for numerical analysis.
In addition, attempts have been made to implement partitioned array manage-
ment and processing on top of MapReduce. Examples are SciHadoop, an
experimental Hadoop plugin allowing scientists to specify logical queries over
array-based data models, which are executed as MapReduce programs. Since
MapReduce presents the problems already mentioned when iterations are
needed, SciSpark (a NASA project) aims at augmenting Spark with the abil-
ity to load, process, and deliver scientific data and results. SciSpark produces
methods to create RDDs that preserve the logical representation of structured
and dimensional data.
15.8.1 Rasdaman
Rasdaman (standing for “raster data manager”) was the first array database
system. It was developed initially as an academic project, and then became
the input of the query to those images where at least one difference pixel
value is greater than a certain value. An example query is shown below.
SELECT arr1 - arr2
FROM arr1, arr2
WHERE some_cells(arr1 - arr2 > 50)
For development, rasdaman provides a C++ API, raslib, and a Java API,
rasj, both adhering to the ODMG standard.
The core part of rasdaman responsible for query evaluation is called
rasserver. Rasdaman runs several rasserver processes, each responsible for
the evaluation of a single query. Therefore, multiple queries are dispatched to
multiple rasservers. The rasserver processes are supervised by a process called
rasmgr. Client queries are sent to rasmgr, which allocates a rasserver for the
client. Each rasserver connects to a backend database, RASBASE, which stores
metadata about persistent data, such as collection names, types, dimensions, and indexes.
Arrays also need to be persisted. This is performed on the file system as ordi-
nary files, or as blobs within RASBASE. On top of the core rasdaman there
is a geo-services layer called petascope, where the data model is composed
of coverages associated with geo-referencing information, supporting regular
and irregular grids, and implementing several OGC standards.
15.8.2 SciDB
[Figure: SciDB architecture, showing the SciDB engine with its local store and the connections to PostgreSQL, which holds the system catalog.]
Unlike in rasdaman, foreign keys are not part of the array data model
used by SciDB. SciDB does not use an index, but maps chunks of an array
to specific nodes by hashing the chunk’s coordinates. It also has a map that
allows dimensions specified with user-defined data types to be represented
internally as integers, which is called an index in the SciDB documentation.
In SciDB, users define how each attribute of an array is compressed when
the array is created. The default is no compression. Other options are zlib,
bzlib, or null filter compression. Since data are stored by attribute, where log-
ical chunks of an array are vertically partitioned into single-attribute physical
chunks, the specified compression is used on a chunk-by-chunk basis. If cer-
tain parts of a chunk are accessed more often than others, SciDB can partition
a chunk into tiles and compress on a tile-by-tile basis. Run-length encoding
(see bitmap compression in Chap. 8) is used to compress recurring sequences
of data. In addition, SciDB’s storage manager compression engine can split
or group logical chunks in order to optimize memory usage while remaining
within the limit of the buffer pool’s fixed-size slots. During the creation of an
array in SciDB the user must specify an array name, the dimensions of the
array, and at least one attribute for the array. Attributes can be added to an
array as the result of an operation in SciDB. For example:
CREATE ARRAY myArray <val:double>[I=0:9,10,0, J=0:*,10,0];
As an example of querying, an aggregation over an array AltMap with an attribute
Altitude computes the maximum values of Altitude grouped by dimension y, where
the latter refers to the traditional coordinate axis.
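A sketch of such a query in SciDB's AQL (array and attribute names follow the example just mentioned) is:

SELECT max(Altitude)
FROM AltMap
GROUP BY y;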
Many current big data applications require both to acquire huge amounts
of data in very short times, such as data coming from the Internet of Things
(IoT) or real-time data from social networks, and to perform analytics over
such data. The typical division between concurrent transaction processing
(i.e., OLTP) and analytical queries (i.e., OLAP) is not appropriate for these
applications, since they require the best from both worlds. For these data pro-
cessing requirements Gartner coined the term hybrid transactional and
analytical processing (HTAP), which combines OLTP and OLAP. The
database technologies studied in this chapter allow these kinds of systems
to be built. As we have seen, since OLTP and OLAP systems have different
requirements, they are tuned very differently. Further, the idea of general-
purpose systems has been losing strength, and specialized database systems
have appeared. Essentially, in HTAP systems, an OLTP engine continuously
updates the data, which serves OLAP queries. Thus, data freshness depends
on transactional throughput. Ideally, an HTAP system provides OLAP over
up-to-date data without affecting the performance of the OLTP system, and
also OLAP query response times should remain unaffected by OLTP traffic.
However, these goals are not fully achieved in actual systems: in practice, the
performance of either the OLTP or the OLAP part of an HTAP system is
affected by the other one.
Three main HTAP architectures have been proposed. In all of them, the data
is logically shared among the OLTP and OLAP components. However, the
various architectures differ in how they implement the sharing of data.
The two-copy, mixed-format (TCMF) architecture keeps two copies of the
data, one for OLTP in row format and one for OLAP in columnar format.
This requires the data to be converted between the two formats. To keep the
data consistent across the OLTP and OLAP components, an intermediate
data structure, referred to as delta, is used, which keeps track of the recently
modified, fresh tuples. Periodically, TCMF propagates the fresh tuples from
the OLTP side to the OLAP side by scanning the delta. Therefore, the latest
committed data might not be available to the analytical queries right away.
The single-copy, mixed-format (SCMF) architecture keeps a single copy of
the data and uses the intermediate delta data structure as the OLTP store.
The delta is in row format, whereas the main copy of the data is in columnar
format. OLTP transactions only modify the delta, whereas the OLAP queries
read both the delta and the main copy of the data.
Lastly, the single-copy, single-format (SCSF) architecture keeps a single
copy of the data and uses a single format, that is, either row or columnar,
for both OLTP and OLAP workloads. This requires mechanisms such
as multi-version concurrency control to keep multiple versions of the data,
through which analytical queries can access the most recent transactional
data.
We briefly describe examples of HTAP systems next.
15.9.1 SingleStore
15.9.2 LeanXcale
15.10 Polystores
from HDFS tables to the SQL database nodes and doing parallel joins. On the
other hand, they can access only specific data stores, usually through SQL
mappings, which is costly compared to accessing the data stores directly.
Hybrid polystores combine the advantages of loosely coupled and tightly
coupled systems, that is, they combine the high expressiveness of native
queries with massive parallelism. Their architecture follows the mediator-wrapper
approach, and the query processor can also directly access some data stores,
e.g., HDFS through MapReduce or Spark. One example
of a hybrid polystore is Spark SQL (see Sect. 15.2.3). We next describe two
other representative polystore efforts.
15.10.1 CloudMdsQL
15.10.2 BigDAWG
This query performs a join between the relational and key-value stores that
is executed in a RELATIONAL scope. The CAST operation translates Tweet
into a table when it is initially accessed. The user does not care whether the
query is executed in the relational store or in the key-value one, provided that
the prescribed island semantics are obeyed.
The middleware relies on a catalog that stores metadata about the sys-
tem components. This is maintained in a database that includes information
about (a) engine names and connection information, (b) databases, their
engine membership, and connection authentication information, (c) data ob-
jects (e.g., tables and column names) stored on each engine, (d) the shims
that integrate the engines within each island, and (e) the available casts be-
tween engines.
A data lake is a central storage repository that holds vast amounts of raw
data coming from multiple sources. The term raw data refers to data that
is kept in its original format and that has not yet been processed for a pur-
pose. Intuitively, the term data lake refers to the ad hoc nature of the data
[Figure: a combined data lake and data warehouse architecture. Data sources (operational systems, IoT, corporate data, third-party data, ...) feed a data lake holding raw data in NoSQL, document, and graph databases, processed with Hadoop, Spark, Hive, and SQL; a data warehousing layer (data warehouse, data marts, data cubes, semantic layer) supports BI, alerts, streaming and spatial data analytics, data science, machine learning, data exploration, self-service BI, and reporting.]
We discuss this next, to introduce the notion of Delta Lake, which attempts to
solve this problem.
Data in data lakes are typically stored using file formats such as ORC
or Parquet (Sect. 15.2.2), possibly partitioned. Achieving ACID properties
with these file formats is very difficult since, when multiple objects need to
be updated, users may see partial updates on individual objects. A similar
situation occurs with rollbacks. Delta Lake is an open-source storage layer
that brings transactions with ACID properties to Apache Spark and big
data workloads. The goal of Delta Lake is to unify stream and batch data
processing to solve the inconsistency problems explained above.
The main abstraction in the Delta Lake solution is the Delta table, a
Spark table that stores data as Parquet files in the file system, and maintains
a transaction log that efficiently tracks changes to the table. Data can be read,
written, and stored in the Delta format using Spark SQL. The transaction
log allows ACID transactions, so users can concurrently modify a dataset and
get consistent views.
The general problem consists in capturing changes in data sources and
merging these changes into target tables. This is called change data capture
(CDC). The basic idea is to maintain a staging table that accumulates
all updates, and then produce a final table that contains the current up-to-
date snapshot that users can query. This is done in a two-step process. Data
are read and inserted into a Parquet table acting as a staging table. This ta-
ble is read by a job running at regular intervals and the new data are loaded
into the final Delta table, which complies with the ACID properties.
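The merge step can be expressed in Spark SQL over Delta tables. The sketch below uses hypothetical table and column names (sales_staging for the staging Parquet table and sales for the final Delta table):

-- Create the final Delta table (Parquet files plus a transaction log)
CREATE TABLE sales (ProductKey INT, CustomerKey INT, SalesAmount DOUBLE)
USING DELTA;

-- Periodically merge the accumulated changes from the staging table
MERGE INTO sales AS t
USING sales_staging AS s
ON t.ProductKey = s.ProductKey AND t.CustomerKey = s.CustomerKey
WHEN MATCHED THEN UPDATE SET t.SalesAmount = s.SalesAmount
WHEN NOT MATCHED THEN INSERT *;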
The snapshots mentioned above can be accessed to revert the data to earlier
versions for audits or rollbacks. Delta Lake also allows a table schema to be
changed, and such a change is applied automatically. For this,
Delta Lake uses schema validation on write, where all new writes to a table
are checked for compatibility with the table schema at write time. If the
schema is not compatible, the transaction is cancelled. This is called schema
enforcement. If the user wants the rejected data to be added anyway, she
can enable schema evolution. When this occurs, any column present in the
Apache Spark DataFrame being written but not in the target table
is automatically added at the end of the schema.
The combined data lake and data warehouse architecture presented above,
which is increasingly being used in industry, suffers from the following prob-
lems. (a) Reliability: Keeping the data lake and data warehouse consistent
is difficult and costly. (b) Data staleness: The data in the warehouse is stale
compared to that of the data lake, with new data frequently taking days to
load. (c) Limited support for advanced analytics: None of the leading ma-
chine learning systems work well on top of data warehouses. (d) Total cost
of ownership: In addition to the ETL costs, users double the storage cost for
data extracted from the data lake into the data warehouse.
A data lakehouse is a recent architecture that aims to enable both arti-
ficial intelligence and business intelligence tasks directly on data lakes. This
is made possible due to recent technical advances, detailed next. (a) Multiple
solutions, such as Delta Lake mentioned above, add to traditional data lakes
reliable data management features such as ACID, schema evolution, time
travel, incremental consumption, etc. (b) Many data science systems have
adopted DataFrames as the abstraction for manipulating data, and recent
declarative DataFrame APIs enable machine learning workloads to directly
benefit from many optimizations in lakehouses. (c) Lakehouses provide SQL
performance on top of the massive Parquet/ORC datasets that have been
collected over the last decade. Recent results have shown that an SQL engine
over Parquet outperforms leading cloud data warehouses on the well-known
TPC-DS analytical benchmark. Nevertheless, it still remains to be proven
that a unified data platform architecture that implements data warehous-
ing functionality over open data lake file formats can provide performance
competitive with today’s data warehouse systems and considerably simplify
enterprise data architectures.
Big data has become the standard setting in today's applications, and this
state of affairs will remain in the years to come. In the last decades, the data
management domain has explored various approaches to solve the multiple
challenges posed by big data. Since traditional relational databases have lim-
itations for coping with these challenges, the initial approaches abandoned
many of their foundational aspects. The advent of the first generation of
NoSQL systems represents this situation. Although NoSQL systems provided
relevant solutions for scalability and elasticity, over the years it became clear
that many characteristics of traditional databases, in particular the ACID
properties for transaction processing, were still relevant. The advent of the
first NewSQL systems represents this situation. Around the same time, the
HTAP (Hybrid transaction and analytical processing) paradigm was pro-
posed as a solution to the limitations of the ETL (extraction, transformation,
and loading) approach for coping with big data, in particular for reducing
the latency for processing real-time data, as well as for reducing the high
overhead, complexity, and costs of maintaining two parallel systems, one for
coping with OLTP (online transaction processing) workloads and another for
OLAP (online analytical processing) workloads.
Each one of the solutions proposed during the last decades tackles some, but
not all, of the multiple challenges posed by big data. The CAP theorem was
fundamental in this respect, showing that a compromise between the various
challenges must be made depending on the application requirements. Over the
years, initial solutions were replaced by newer, more efficient ones. The long-
standing debate between MapReduce and Spark represents this situation.
Gradually, the proposed approaches blended several of the previous ones into
a single framework. For example, several of the new database systems provide
both in-memory and columnar storage. However, the most recent solutions
lack the maturity of those proposed many years ago. Examples are polyglot
database systems and delta lakes. In conclusion, what is still missing are robust
approaches that encompass many of the ones proposed in the last decades
within a single framework.
In this chapter, we surveyed the approaches proposed so far to address
the big data challenges. The choice of the technologies and systems covered
reflects the current understanding of the authors, at the time of writing this
book, about which of them will still be relevant in the years to come. Although
only time will tell whether our choice was correct, what we are completely
sure of is that big data management and analytics represents nowadays one
of the foundational disciplines of computer science.
15.14 Summary
We have studied the challenges that big data analytics brought to data ware-
housing, and the approaches that have been proposed for addressing these
challenges. We presented a comprehensive state of the art in the field. We
addressed the problem along two dimensions: distributed processing frame-
works and distributed data storage and querying. For the first dimension, we
studied the two main frameworks, namely, Hadoop and Spark, as well as high-
level query languages for them, namely, HiveQL and Spark SQL. For the sec-
ond dimension, we presented various database technologies enabling big data
processing. These are distributed database systems, in-memory database sys-
tems, columnar database systems, NoSQL database systems, NewSQL data-
base systems, and array database systems. For each of these technologies we
also presented two representative systems. We covered the HTAP paradigm,
which proposes performing OLAP and OLTP workloads on the same system.
An analogous approach is behind polystores, which aim at querying hetero-
geneous database systems through a single query language. Cutting across all of
the above, we discussed cloud data warehousing, and the classic dilemma of
on-premises versus in-cloud processing was also addressed. We concluded by
presenting the data lake approach, used for making the data available right
after the ingestion phase, including the notion of Delta Lake, an open-source
layer on top of a data lake, allowing ACID transactions on the data lake.
15.14 Compare TimesTen and Redis against the characteristics of in-
memory database systems. Which mechanisms do they use to comply
with the ACID properties?
15.15 Explain the main characteristics of column-store database systems.
15.16 Why do column-store database systems achieve better efficiency than
row-store database systems for data warehouses?
15.17 How do column-store database systems compress the data? Why is
compression important?
15.18 Explain the architectures of Vertica and MonetDB, and compare them
against a general column-store database architecture.
15.19 What do columnar capabilities add to Citus?
15.20 Define NoSQL database systems. Classify them and give examples.
15.21 Explain the CAP theorem and its relevance for NoSQL database sys-
tems.
15.22 Compare the architectures of HBase and Cassandra and classify them
with respect to the CAP theorem.
15.23 What are NewSQL database systems? What do they try to achieve?
15.24 Describe the main features of Cloud Spanner.
15.25 Compare SAP HANA and VoltDB against the NewSQL goals. Do
they achieve such goals? Elaborate on this topic.
15.26 Explain the general architecture of array database systems.
15.27 Explain the main features of rasdaman and SciDB and compare them
against each other.
15.28 What is Hybrid Transactional and Analytical Processing (HTAP)?
15.29 Explain the three typical HTAP architectures.
15.30 Discuss why SingleStore and LeanXcale are examples of HTAP archi-
tectures.
15.31 Define polystores and explain the three different kinds of polystores.
15.32 Compare CloudMdsQL and BigDAWG and explain their query pro-
cessing mechanisms.
15.33 Explain the deployment and service models in cloud computing.
15.34 What are cloud data warehouses? Explain the three possible ways of
deploying a cloud data warehouse.
15.35 Discuss pros and cons of on-premises and cloud data warehouses.
15.36 How does extraction, loading, and transformation (ELT) differ from
extraction, transformation, and loading (ETL)? Choose an application
scenario you are familiar with to motivate the use of ELT.
15.37 What are data lakes? Do they replace data warehouses?
15.38 Describe the main components of a data lake architecture.
15.39 How does a data lakehouse differ from a data lake? What would you
use a data lakehouse for?
15.40 Elaborate on a vision of future scenarios for data warehousing.
Appendix A
Graphical Notation
[Figure: notation for an entity type, shown both in short form and with attributes and identifiers: identifier attributes, simple monovalued attributes with cardinalities, multivalued attributes, and composite attributes with their component attributes (example: Client).]
[Figure: notation for a relationship type, shown both in short form and with attributes: relationship name, role cardinalities such as (0,n) and (1,n), and relationship attributes (examples: Participate, CouSec; entity type Section with attributes Semester, Year, Homepage).]
[Figure: notation for a generalization, showing a supertype and its attributes (example: Customer).]
[Figure: notation for a relational table with its attributes, primary key, and alternate key (AK) (example: Product with ProductKey, ProductNumber, ProductName, Description).]
[Figure: notation for referential integrity: a foreign key attribute (CategoryKey in Product) referencing the primary key of another table (Category).]
[Figure: a relational table with its instances (Product rows such as (1, QB876, Milk, ..., C1) and (2, QD555, Soup, ..., C2)).]
[Figure: notation for a level, shown both in short form and with its identifier and descriptive attributes (example: Product with ProductNumber, ProductName, Description, Size).]
[Figure: notation for a fact, shown both in short form and with its measures (example: Sales with measures Quantity and Amount).]
[Figure: notation for cardinalities: (0,1), (1,1), (0,n), and (1,n).]
[Figure: types of dimensions — many-to-one, role-playing, many-to-many, and one-to-one dimensions (examples: Customer, Order).]
[Figure: a balanced hierarchy, with leaf, child, parent, and root levels, the hierarchy criterion, and cardinalities (example: ProductGroups hierarchy Product - Category - Department).]
[Figure: an unbalanced hierarchy and its members.]
[Figure: a parent-child hierarchy (example: Supervision hierarchy on Employee, with Supervisor and Subordinate roles).]
[Figure: a generalized hierarchy with exclusive paths (example: CustType hierarchy where customers roll up either through Sector or through Profession, and then to Branch).]
[Figure: a ragged hierarchy (example: geography hierarchy with State, Region, and Country levels).]
[Figure: a nonstrict hierarchy with a distributing attribute (example: OrgStructure hierarchy between Employee and Section).]
[Figure: an alternative hierarchy (example: Time hierarchy on Date, going through Month to either Calendar Quarter and Calendar Year or Fiscal Quarter and Fiscal Year).]
[Figure: further hierarchy examples on the Customer and Employee dimensions (Lives hierarchy City - State, Age hierarchy with AgeGroup, and Territories hierarchy Territory - State).]
[Figure: spatial data types — Geo, with subtypes SimpleGeo (Point, Line, OrientedLine, Surface, SimpleSurface) and ComplexGeo (PointSet, LineSet, OrientedLineSet, SurfaceSet, SimpleSurfaceSet).]
[Figure: field data types — f() for spatial fields and f(,) for spatiotemporal fields.]
[Figure: topological relationship types — meets, overlaps/intersects, contains/inside, covers/coveredBy, equals, disjoint, and crosses.]
[Figure: a spatial hierarchy with a topological relationship between its levels (example: County - State, with a geo location criterion).]
[Figure: notation for a spatial fact, shown both in short form and with measures: a topological relationship among the dimensions, a measure calculated with spatial operators (Length), and a spatial measure with its spatial data type (CommonArea) (example: Highway Maintenance).]
[Figure: temporal data types — Time, with subtypes SimpleTime (Instant, Interval) and ComplexTime (InstantSet, IntervalSet).]
[Figure: synchronization predicates — meets, overlaps/intersects, contains/inside, covers/coveredBy, equals, disjoint, starts, finishes, precedes, and succeeds.]
[Figure: notation for a temporal level, shown both in short form and with its key attribute, nontemporal attributes, and temporal attributes, together with the temporality type of the level (example: Product with Number, Name, Description, Size, Distributor).]
[Figure: a nontemporal level with temporal attributes.]
[Figure: a temporal hierarchy with a synchronization relationship (example: Categories hierarchy between Product and Category).]
[Figure: notation for ETL processes — activities and subprocesses (example: Product Load with Continent, Country, and State), looping activities and looping subprocesses, splitting and merging gateways, canceled and compensated activities, and send message and correct error events.]
[Figure: notation for ETL data tasks — Update Columns, Add Columns, Drop Columns, Sort, Multicast, and Lookup with Found (Y) and NotFound (N) outcomes.]
Glossary
child level: Given two related levels in a hierarchy, the lower level, contain-
ing more detailed data. This is to be contrasted with a parent level.
cloud computing: The on-demand delivery of computing services, such as
servers, storage, databases, or software, over the Internet without direct
active management by the user.
cluster: A set of computers connected by a network that work together
so that they can be viewed as a single system. Each node in a cluster
performs the same task, controlled and scheduled by software.
coalescing: In a temporal database, the process of combining several value-
equivalent rows into one provided that their time periods overlap.
column-oriented DBMS: A database management system that uses col-
umn-oriented storage. Examples include Vertica and MonetDB. This
is to be contrasted with a row-oriented DBMS.
column-oriented storage: A storage method that organizes data by col-
umn, keeping all of the data associated with a column next to each other
in memory. This is to be contrasted with a row-oriented storage.
complex attribute: An attribute that is composed of several other at-
tributes. This is to be contrasted with a simple attribute.
composite key: A key of a relation that is composed of two or more at-
tributes.
conceptual design: The process of building a user-oriented representation
of a database or a data warehouse that does not contain any implemen-
tation considerations, which results in a conceptual model.
conceptual model: A set of modeling concepts and rules for describing
conceptual schemas.
conceptual schema: A schema that is as close as possible to the users’
perception of the data, without any implementation consideration. This
is to be contrasted with a logical schema and a physical schema.
constellation schema: A relational schema for representing multidimen-
sional data composed of multiple fact tables that share dimension tables.
continuous field: A technique to represent phenomena that change contin-
uously in space and/or time, such as altitude and temperature.
continuous trajectory: A trajectory for which an interpolation function is
used to determine the position of its object at any instant in the period
of observation. This is to be contrasted with a discrete trajectory.
conventional attribute: An attribute that has a conventional data type
as its domain. This is to be contrasted with a temporal and a spatial
attribute.
conventional data type: A data type that represents conventional al-
phanumeric information. Examples include Boolean, integer, float, and
string. This is to be contrasted with a temporal and a spatial data type.
conventional database: A database that only manipulates conventional
data. This is to be contrasted with a temporal, a spatial, and a moving-
object database.
data lake: A repository that stores structured and unstructured data in its
original format to be used for data analytics. This is to be contrasted
with delta lake.
data loading: The process of populating a data warehouse, typically done
as part of the ETL process.
data mart: A data warehouse targeted at a particular functional area
or user group in an organization. Its data can be derived from an
organization-wide data warehouse or be extracted from data sources.
data mining: The process of analyzing large amounts of data to identify
unsuspected or unknown relationships, trends, patterns, and associations
that might be of value to an organization.
data model: A set of modeling concepts and rules for describing the schema
of a database or a data warehouse.
data quality: The degree of excellence of data. Various factors contribute
to data quality, such as whether the data are consistent, nonredundant,
complete, timely, well understood, and follow business rules.
data refreshing: The process of propagating updates from data sources to
a data warehouse in order to keep it up to date. This is typically done as
part of the ETL process.
data source: A system from which data are collected in order to be in-
tegrated into a data warehouse. Such a system may be a database, an
application, a repository, or a file.
data staging area: A storage area where the ETL process is executed and
where the source data are prepared in order to be introduced into a data
warehouse or a data mart.
data transformation: The process of cleaning, aggregating, and integrat-
ing data to make it conform with business rules, domain rules, integrity
rules, and other data. This is typically done as part of the ETL process.
data type: A domain of values with associated operators. The data types
in this book include conventional, spatial, temporal, and spatiotemporal
data types.
data warehouse: A database that is targeted at analytical applications. It
contains historical data about an organization obtained from operational
and external data sources.
database: A shared collection of logically related data, and a description of
these data, designed to meet the information needs of an organization
and to support its activities.
database management system (DBMS): A software system that allows
users to define, create, manipulate, and manage a database.
database reverse engineering: The process through which the logical and
conceptual schemas of a database or of a set of files are reconstructed from
various information sources.
DataFrame: In Spark, an immutable distributed collection of data orga-
nized into named columns, which is conceptually equivalent to a relation.
This is to be contrasted with an RDD and a DataSet.