Lecture Notes
Data Warehousing
Foreword
Information assets are immensely valuable to any enterprise, and because of this, these assets must be
properly stored and readily accessible when they are needed. However, the availability of too much data
makes the extraction of the most important information difficult, if not impossible. View results from
any Google search, and you’ll see that the “data = information” equation is not always correct. That is,
too much data is simply too much.
Data warehousing is a phenomenon that grew from the huge amount of electronic data stored (start-
ing in the 90s) and from the urgent need to use those data to accomplish goals that go beyond the
routine tasks linked to daily processing. In a typical scenario, a large corporation has many branches,
and senior managers need to quantify and evaluate how each branch contributes to the global business
performance. The corporate database stores detailed data on the tasks performed by branches. To meet
the managers’ needs, tailor-made queries can be issued to retrieve the required data. In order for this
process to work, database administrators must first formulate the desired query (typically an SQL query
with aggregates) after closely studying database catalogs. Then the query is processed. This can take a
few hours because of the huge amount of data, the query complexity, and the concurrent effects of other
regular workload queries on data. Finally, a report is generated and passed to senior managers in the
form of a spreadsheet.
Many years ago, database designers realized that such an approach is hardly feasible, because it
is very demanding in terms of time and resources, and it does not always achieve the desired results.
Moreover, a mix of analytical queries and transactional routine queries inevitably slows down the sys-
tem, and this does not meet the needs of users of either type of query. Today’s advanced data warehous-
ing processes separate online analytical processing (OLAP) from online transactional processing (OLTP)
by creating a new information repository that integrates basic data from various sources, properly ar-
ranges data formats, and then makes data available for analysis and evaluation aimed at planning and
decision-making processes.
For example, some fields of application for which data warehouse technologies are successfully used
are:
• Trade: Sales and claims analyses, shipment and inventory control, customer care and public rela-
tions
• Health care service: Patient admission and discharge analysis and bookkeeping in accounts de-
partments
The field of application of data warehouse systems is not restricted to enterprises; it also ranges from epidemiology to demography, and from natural science to education. A property that is common to all fields is the need for storage and query tools to retrieve information summaries easily and
quickly from the huge amount of data stored in databases or made available on the Internet. This kind
of information allows us to study business phenomena, learn about meaningful correlations, and gain
useful knowledge.
In the following, we present an introduction to such systems and some insights into the techniques, technologies and methods underneath.
Structure
The diagram below indicates dependencies among chapters from a conceptual viewpoint.
After the Introduction (Chapter 1), we can see Data Warehousing Architectures (Chapter 2). Then,
following the four characteristics of a DW defined by W. Inmon (namely subject orientation, integration,
historicity and non-volatility), we can find chapters about OLAP and the Multidimensional Model
(Chapter 3, enabling subject-oriented data analysis), Schema and Data Integration (Chapter 8), and
Spatio-Temporal Databases (Chapter 9, covering bi-temporal databases implementing historicity and
non-volatility). Related to relational implementations of OLAP tools, we have Query Optimization (Chapter 4) and, as a specific optimization technique, Materialized Views (Chapter 5). As a next phase
in DW, we have the building of the Extraction, Transformation and Load flows (Chapter 6), which
would involve dealing with Data Quality issues (Chapter 7), as well as the implementation of the in-
tegration and bitemporality mentioned before. The last phase, making use of the data, involves the
visualization in the form of Dashboarding (Chapter 10). Finally, there is a view of improvements or
extensions in the form of A (brief) introduction to Data Warehousing 2.0 (Chapter 11).
Contents
1 Introduction
   1.1 Data Warehousing Systems
      1.1.1 The Data Warehouse
         1.1.1.1 Kinds of Data
      1.1.2 ETL Tools: Extraction, Transformation and Load
         1.1.2.1 Extraction
         1.1.2.2 Cleansing
         1.1.2.3 Transformation
         1.1.2.4 Loading
      1.1.3 Exploitation Tools
   Multimedia Materials
4 Query optimization
   4.1 Functional architecture of a DBMS
      4.1.1 Query manager
         4.1.1.1 View manager
         4.1.1.2 Security manager
         4.1.1.3 Constraint checker
         4.1.1.4 Query optimizer
      4.1.2 Execution manager
      4.1.3 Scheduler
      4.1.4 Data manager
         4.1.4.1 Buffer manager
5 Materialized Views
   5.1 Materialized Views
   5.2 Problems associated to views
      5.2.1 View Expansion
      5.2.2 Update Through Views
      5.2.3 Answering Queries Using Views
      5.2.4 View Updating
   5.3 Materialized View Selection
   Multimedia Materials
7 Data Quality
   7.1 Sources of problems in data
   7.2 Data Conflicts
      7.2.1 Classification of Data Conflicts
      7.2.2 Dealing with data conflicts
   7.3 Data quality dimensions and measures
      7.3.1 Completeness
      7.3.2 Accuracy
      7.3.3 Timeliness
      7.3.4 Consistency
      7.3.5 Trade-offs between data quality dimensions
   7.4 Data quality rules
      7.4.1 Integrity constraints and dependencies
      7.4.2 Logic properties of data quality rules
      7.4.3 Fine-tuning data quality rules
   7.5 Data quality improvement
   7.6 Object identification
   Multimedia Materials
10 Dashboarding
   10.1 Dashboard definition
   10.2 Key Performance Indicators (KPIs)
   10.3 Complexity of visualizations
   10.4 Dashboarding guidelines
   10.5 Design principles
   10.6 Multidimensional representation
References
A Acronyms
Chapter 1
Introduction
Nowadays, the free market economy is the basis of capitalism (the current global economic system), in which the production and distribution of goods are decided by businesses and consumers in the market, giving rise to the concept of supply and demand. In this scenario, being more competitive than other organizations becomes essential, and decision making arises as a key factor for an organization's success.
Decision making is based on information. The more accurate the information we get, the better the decisions we can make to gain competitive advantage. That is the main reason why information (understood as the result of processing, manipulating and organizing data in a way that adds new knowledge to the person or organization receiving it) has become a key asset in any organization. In the past, managers' ability to foresee upcoming trends was crucial, but this largely subjective scenario changed when the world became digital. Nowadays, any event can be recorded and stored for later analysis, which provides new and objective business perspectives to help managers in the decision-making process. Hence, (digital) information is a valuable asset to organizations, and it has given rise to many well-known concepts such as Information Society, Information Technologies and Information Systems, among others.
For this reason, decision making is today a hot research topic. In the literature, those applications
and technologies for gathering, providing access to, and analyzing data for the purpose of helping orga-
nization managers make better business decisions are globally known as Decision Support Systems, and
those computer-based techniques and methods used in these systems to extract, manipulate and analyze
data as Business Intelligence (BI). Specifically, BI can be defined as a broad category of applications and
technologies for gathering, integrating, analyzing, and providing access to data to help enterprise users
make better business decisions. BI applications include the activities of decision support systems, query
and reporting, online analytical processing (OLAP), statistical analysis, forecasting, and data mining.
BI implies having comprehensive knowledge of all the factors that affect an organization's business, with one main objective: the better decisions you make, the more competitive you are. Under the BI umbrella we embrace many different disciplines, such as Marketing, Geographic Information Systems (GIS), Knowledge Discovery, and Data Warehousing.
In this work we focus on the latter, which is possibly the most popular, generic solution to extract, store (conciliate), and analyze data (what is also known as the BI cycle). Fig. 1.1 depicts a data warehousing system supporting decision making. There, data extracted from several data sources are first transformed (i.e., cleaned and homogenized) prior to being loaded into the data warehouse. The data warehouse is a read-only database (meaning that data loading is not performed by the end users, who only query it), intended to be exploited by end users by means of exploitation tools (such as query and reporting, data mining and online analytical processing, OLAP). The analysis carried out over the data warehouse represents valuable, objective input used by organizations to build their business strategy. Eventually, these decisions will impact the data sources collecting data about the organization and, again, we repeat the cycle to analyze our business reality and come up with an adequate strategy suiting our necessities.
In the following sections we discuss in detail data warehousing systems and we will also focus on
their main components, such as the data warehouse, the metadata repository, the ETL tools and the
most relevant exploitation tool related to these systems: OLAP tools.
Data extracted from the sources is loaded (i.e., homogenized, cleaned and filtered) into the data warehouse. Once loaded, it is ready to be exploited by means of the exploitation tools. According to W. Inmon's well-known definition, a data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making:
• Subject oriented means that the stored data give information about a particular subject instead of the daily operations of an organization. These data are clustered together in order to undertake different analysis processes over them. This fact is represented in Fig. 1.2. There, tables from the operational sources (which were designed to boost transaction performance) are broken into small pieces of data before loading the data warehouse, which is interested in concepts such as clients or products, whose data can be spread over different transactional tables. Now, we can analyze concepts such as clients that were spread across different transactional systems;
• Integrated means that data have been gathered into the data warehouse from a variety of sources
and merged into a coherent whole;
• Time-variant means that all data in the data warehouse are identified with a particular time period (usually called Valid Time, representing when data are valid in the real world); thus, for example, historical data (which are of no interest for OLTP systems) become essential; and finally,
• Non-volatile means that data are stable in the data warehouse. Thus, more data are added but
data are never removed (usually a transaction time attribute is added to represent when things are
recorded). This enables management to gain a consistent picture of the business.
Although this definition was introduced almost 30 years ago, it still remains reasonably accurate. However, a single-subject data warehouse is currently referred to as a data mart (i.e., a local or departmental data warehouse), while data warehouses are more global, giving a general enterprise view. In the literature, we can find other definitions like the one presented in [KRTR98], where a data warehouse is defined as “a copy of transaction data specifically structured for query and analysis”; this definition, despite being simpler, is not less compelling, since it underlines the relevance of querying in a data warehousing system. The data warehouse design is focused on improving query performance instead of improving update statements (i.e., insert, update and delete) like transactional databases do. Moreover, the data warehousing system end users are high-ranked people involved in decision making rather than the low/medium-ranked people maintaining and developing the organization's information systems. The next table summarizes the main differences between an operational database and a data warehouse (or, in general, a decisional system):
Most issues pointed out in table 1.3 have already been discussed, but some others may still remain unclear. Operational systems (mainly transactional, OLTP, systems) focus on solving daily business necessities, whereas decisional systems, such as data warehousing, focus on providing a single, detailed view of the organization to help us in decision making. Usually, a decision is not made on an isolated view of current data, but by putting it in the context of its historical evolution, and data are consequently never updated or deleted, but only added. This addition of data is done in bulk loads performed by batch processes. Users never do it manually, so the data warehouse is considered “read-only” from this perspective. Thus, data warehouses tend to be massive (they incrementally integrate historical information from different sources, which is never deleted), and nowadays we can find the first petabyte data warehouses (i.e., at least one order of magnitude larger than OLTP systems). The reason is that data warehouses not only load data (including historical data) integrated from several sources but also pre-computed data aggregations derived from them. In decision making, aggregated (i.e., summarized) queries are common (e.g., revenue obtained per year, or average number of pieces provided per month) since they provide new, interesting perspectives. This is better explained in the following section.
• Operational Vs. decisional data: Our data sources contain heaps of data. An operational system stores any detail related to our daily processes. However, not all data are relevant for decision making. For example, many retailers are interested in our postal code when we shop, but they show no interest in our address, since their decision-making processes deal with city regions (represented by the postal code). Another clear example is the name and surname, which are not very interesting in the general case. In contrast, age and gender tend to be considered as first-class citizens in decision making. Indeed, one of the main challenges in populating a data warehouse is to decide which subset of data from our operational sources is needed for our decision-making processes. Some data are clearly useless for decision making (e.g., customer names), some are clearly useful (e.g., postal codes), but others are arguable or not that clear (e.g., telephone numbers per se can be useless to make decisions, but can be useful as identifiers of customers). Thus, the decisional data can be seen as a strategic window over the whole operational data. Furthermore, operational data quality is often poor, as update transactions happen constantly and data consistency can be compromised by them. For this reason, the cleaning stage (see section 1.1.2) applied when loading
data in the data warehouse is crucial to guarantee reliable and exact data.
• Historical Vs. current data: Operational sources typically store daily data; i.e., current data pro-
duced and handled by the organization processes. Old data (from now on, historical data), how-
ever, were typically moved away from the operational sources and stored in secondary storage systems, such as tapes, in the form of backups. What is current or old data depends, in the end, on
each organization and its processes. For example, consider an operational source keeping track of
each sale in a supermarket. For performance reasons, the database only stores a one year window.
In the past, older data was dumped into backup tapes and left there, but nowadays, it is kept in
the decisional systems. Note some features about this dichotomy. A decisional system obviously
needs both. Posing queries regarding large time spans is common in decision making (such as
what was the global revenue in the last 5 years). Both data, though, come from the same sources
and current data will, eventually, become old. Thus, it is critical to load data periodically so that
the decisional system can keep track of them.
• Atomic, derived and aggregated data: These concepts are related to the granularity at which data are delivered. By atomic data we refer to data at the granularity stored by the operational sources. For example, I can store my sales as follows: the user who bought the item, the shop where it was sold, the exact time (up to milliseconds), the price paid and the discount applied. However, decisional systems often allow computing derived and aggregated data from atomic data to answer interesting business questions. On the one hand, derived data result from computing a certain function over atomic data to produce non-evident knowledge. For example, we may be interested in computing the revenue obtained from a user, which can be obtained by applying the discount over the initial price and summing up all their sales. Another example would be data produced by data mining algorithms (e.g., the customer profile defined by a clustering algorithm). Normally, different attributes or values are needed to compute derived data. On the other hand, aggregated data result from applying an aggregation function over a certain value. For example, the average item price paid per user, or the global sales in 2011. Thus, they can be seen as a specific kind of derived data. Derived and aggregated data that are frequently queried are often pre-computed in decisional systems, although they can also be computed on the fly (a sketch of such a computation is shown right after this list). The reason is that previous experience suggests pre-computing the most frequent aggregates and derived data in order to boost performance (the trade-off between update and query frequencies needs to be considered).
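For illustration, the derived revenue and the per-user aggregates just mentioned could be computed with a query like the following (a sketch only: the sales table, its columns, and the assumption that the discount is stored as a fraction are all hypothetical):

-- Hypothetical atomic table: sales(user_id, shop_id, sale_time, price, discount)
SELECT user_id,
       SUM(price * (1 - discount)) AS total_revenue,  -- derived value, then aggregated per user
       AVG(price)                  AS avg_item_price  -- aggregated data
FROM sales
GROUP BY user_id;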
1.1.2.1 Extraction
Relevant data are obtained from sources in the extraction phase. You can use static extraction when a
data warehouse needs populating for the first time. Conceptually speaking, this looks like a snapshot of
operational data. Incremental extraction, used to update data warehouses regularly, captures the changes
applied to source data since the latest extraction. Incremental extraction is often based on the log main-
tained by the operational DBMS. If a timestamp is associated with operational data to record exactly
when the data are changed or added, it can be used to streamline the extraction process. Extraction
can also be source-driven if you can rewrite operational applications to asynchronously notify of the
changes being applied, or if your operational database can implement triggers associated with change
transactions for relevant data. The data to be extracted is mainly selected on the basis of its quality. In
particular, this depends on how comprehensive and accurate the constraints implemented in sources
are, how suitable the data formats are, and how clear the schemas are.
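As an illustration, a timestamp-based incremental extraction could be expressed as follows (a sketch: the orders table, its last_modified column and the :last_extraction_timestamp parameter are assumptions):

-- Static extraction (initial population): full snapshot of the source table
SELECT * FROM orders;

-- Incremental extraction: only rows changed or added since the previous extraction,
-- assuming the source records a modification timestamp
SELECT *
FROM orders
WHERE last_modified > :last_extraction_timestamp;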
1.1.2.2 Cleansing
The cleansing phase is crucial in a data warehouse system because it is supposed to improve data quality,
normally quite poor in sources. The following list includes the most frequent mistakes and inconsisten-
cies that make data “dirty”:
• Duplicate data: For example, a patient is recorded many times in a hospital patient management
system
• Inconsistent values that are logically associated: Such as addresses and ZIP codes
• Unexpected use of fields: For example, an SSN (Social Security Number) field could be used improperly to store office phone numbers
• Inconsistent values for a single entity because different practices were used: For example, to spec-
ify a country, you can use an international country abbreviation (I) or a full country name (Italy);
similar problems arise with addresses (Hamlet Rd. and Hamlet Road)
• Inconsistent values for one individual entity because of typing mistakes: Such as Hamet Road
instead of Hamlet Road
In particular, note that the last two types of mistakes are very frequent when you are managing mul-
tiple sources and are entering data manually. The main data cleansing features found in ETL tools are
rectification and homogenization. They use specific dictionaries to rectify typing mistakes and to recog-
nize synonyms, as well as rule-based cleansing to enforce domain-specific rules and define appropriate
associations between values.
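As a small illustration of how some of these checks can be automated, duplicate detection and value homogenization might be sketched in SQL as follows (table and column names are assumptions; real ETL tools offer dedicated operators for this):

-- Detect patients recorded more than once (potential duplicates)
SELECT name, birth_date, COUNT(*) AS occurrences
FROM patients
GROUP BY name, birth_date
HAVING COUNT(*) > 1;

-- Homogenize inconsistent spellings of the street type ('Rd.' vs 'Road')
UPDATE patients
SET address = REPLACE(address, 'Rd.', 'Road')
WHERE address LIKE '%Rd.%';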
1.1.2.3 Transformation
Transformation is the core of the reconciliation phase. It converts data from its operational source
format into a specific data warehouse format. Establishing a mapping between the data sources and the
data warehouse is generally made difficult by the presence of many different, heterogeneous sources.
If this is the case, a complex integration phase is required when designing your data warehouse. The
following points must be rectified in this phase:
• Loose texts may hide valuable information. For example, BigDeal LtD does not explicitly show
that this is a Limited Partnership company.
• Different formats can be used for individual data. For example, a date can be saved as a string or
as three integers.
• Conversion and normalization that operate on both storage formats and units of measure to make
data uniform.
When populating a data warehouse, you will most likely need to summarize data properly and pre-compute interesting data aggregations, so pre-aggregated data may need to be computed at this point.
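For instance, the date-format issue above could be handled during transformation with conversions like the following (a sketch assuming PostgreSQL-style conversion functions and hypothetical source tables):

-- Source A stores the date as a text field 'DD/MM/YYYY'
SELECT TO_DATE(order_date_str, 'DD/MM/YYYY') AS order_date
FROM source_a_orders;

-- Source B stores the date as three integer columns
SELECT MAKE_DATE(order_year, order_month, order_day) AS order_date
FROM source_b_orders;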
1.1.2.4 Loading
Loading into a data warehouse is the last step to take. Loading can be carried out in two ways:
• Refresh: Data warehouse data are completely rewritten. This means that older data are replaced.
Refresh is normally used in combination with static extraction to initially populate a data ware-
house.
• Update: Only those changes applied to source data are added to the data warehouse. Update is
typically carried out without deleting or modifying preexisting data. This technique is used in
combination with incremental extraction to update data warehouses regularly.
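These two strategies could be sketched as follows (a minimal sketch; table names are assumptions, and real ETL tools usually rely on bulk-load utilities rather than plain SQL):

-- Refresh: completely rewrite the warehouse table (used with static extraction)
DELETE FROM dw_sales;
INSERT INTO dw_sales SELECT * FROM staging_sales;

-- Update: append only the captured changes, without deleting or modifying
-- preexisting data (used with incremental extraction)
INSERT INTO dw_sales SELECT * FROM staging_sales_increment;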
1.1.3 Exploitation Tools
• Query & Reporting: This category embraces the evolution and optimization of the traditional
query & reporting techniques. This concept refers to an exploitation technique consisting of
querying data and generating detailed pre-defined reports to be interpreted by the end-user.
Mainly, this approach is oriented to those users who need to have regular access to the infor-
mation in an almost static way. A report is defined by a query and a layout. A query generally
implies a restriction and an aggregation of multidimensional data. For example, you can look for
the monthly receipts during the last quarter for every product category. A layout can look like a
table or a chart (diagrams, histograms, pies, and so on). Fig. 1.5 shows a few examples of layouts for a query; an SQL sketch of such a report query is given right after this list.
• Data Mining: Data mining is the exploration and analysis of large quantities of data in order to discover meaningful patterns and rules. The data mining field is a research area per se but, as the reader may note, this kind of techniques and tools suits perfectly the final goal of data warehousing systems. Typically, data from the data warehouse are dumped into plain files that feed the data mining algorithms. Thus, both techniques are traditionally used sequentially: populate the data warehouse and then select relevant data to be analyzed by data mining algorithms. Recently, though, new techniques to tighten the relationship between both worlds have been proposed, such as performing the data mining algorithms inside the data warehouse and avoiding the bottleneck produced by moving data out. However, this kind of approaches is out of the scope of this course.
• OLAP Tools: OLAP stands for On-Line Analytical Processing, a term carefully chosen to confront the OLTP acronym (On-Line Transactional Processing). Its main objective is to analyze business data from its dimensional or component perspective, unlike traditional operational (OLTP) systems. For a deeper insight into OLAP, see chapter 3.
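As an illustration of the query-and-reporting example above (monthly receipts during the last quarter for every product category), the underlying query might look as follows (a sketch over a hypothetical sales star schema; table and column names are assumptions):

SELECT p.category, t.month, SUM(s.receipts) AS monthly_receipts
FROM Sales s, Product p, Time t
WHERE s.IDProduct = p.ID AND s.IDTime = t.ID
  AND t.quarter = 'Q4-2011'             -- restriction: the last quarter
GROUP BY p.category, t.month            -- aggregation per category and month
ORDER BY p.category, t.month;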
Multimedia Materials
Data Warehouse definition (en)
Chapter 2
Data Warehousing Architectures
The following architecture properties are essential for a data warehouse system:
• Separation: Analytical and transactional processing should be kept apart as much as possible.
• Scalability: Hardware and software architectures should be easy to upgrade as the data volume,
which has to be managed and processed, and the number of users’ requirements, which have to be
met, progressively increase.
• Extensibility: The architecture should be able to host new applications and technologies without
redesigning the whole system.
• Security: Monitoring accesses is essential because of the strategic data stored in data warehouses.
• Administerability: Data warehouse management should not be overly difficult.
In the following, sections 2.1, 2.2, and 2.3 present a structure-oriented classification that depends
on the number of layers used by the architecture.
• Partners’ systems can also be available under some kind of agreement, which can include what
happens in case of changes in the availability or the format of data.
• Own operational systems are always the easiest data source, because they are well known and we have control over their changes.
For these reasons, a virtual approach to data warehouses can be successful only if analysis needs are
particularly restricted and the data volume to analyze is not huge.
• Source layer: A data warehouse system uses heterogeneous sources of data. That means that data
feeding the data warehouse might come from inside the organization or from external sources
(such as the Cloud, the Web, or from partners). Furthermore, the technologies and formats used
to store these data may also be heterogeneous (legacy2 systems, Relational databases, XML files,
plain text files, e-mails, pdf files, Excel and tabular tables, OCR files, etc.). Furthermore, some
areas have their specificities; for example, in medicine other data inputs such as images or medical
tests must be treated as first-class citizens.
• Data staging: The data stored in the sources should be extracted, cleansed to remove inconsistencies
and fill gaps, and integrated to merge heterogeneous sources into one common schema. The so-
called Extraction, Transformation, and Loading tools (ETL), can merge heterogeneous schemas,
extract, transform, cleanse, validate, filter, and load source data into a data warehouse. Tech-
nologically speaking, this stage deals with problems that are typical for distributed information systems, such as inconsistent data management and incompatible data structures. Section 1.1.2 deals with a few points that are relevant to data staging.
2 The term legacy system denotes corporate applications, typically running on mainframes or minicomputers, that are currently used for operational tasks but do not meet modern architectural principles and current standards. For this reason, accessing legacy systems and integrating them with more recent applications is a complex task. All applications that use a pre-Relational database are examples of legacy systems.
• Data warehouse layer: Information is stored in one logically centralized single repository: a data
warehouse. The data warehouse can be directly accessed, but it can also be used as a source
for creating data marts (in short, local/smaller data warehouses), which partially replicate data
warehouse contents and are designed for specific enterprise departments. Metadata repositories
(see section 2.4) store information on sources, access methods available, data staging, users, data
mart schemas, and so on.
• Analysis: In this layer, integrated data are efficiently and flexibly accessed to issue reports, dy-
namically analyze information, and simulate hypothetical business scenarios. Technologically
speaking, it should feature aggregate data navigators, complex query optimizers, and user-friendly
GUIs. Section 1.1.3 deals with different types of decision-making support analyses.
The architectural difference between data warehouses and data marts needs to be studied more closely. The
component marked as a data warehouse in Fig. 2.2 is also often called the primary data warehouse or
corporate data warehouse. It acts as a centralized storage system for all the data being stored together.
Data marts can be viewed as small, local data warehouses replicating (and pre-computing as much
as possible) the part of a primary data warehouse required for a specific application domain. More
formally, a data mart is a subset or an aggregation of the data stored in a primary data warehouse. It in-
cludes a set of information pieces relevant to a specific business area, corporate department, or category
of users. The data marts populated from a primary data warehouse are often called dependent. Al-
though data marts are not strictly necessary, they are very useful for data warehouse systems in midsize
to large enterprises because:
• they are used as building blocks while incrementally developing data warehouses;
• they mark out the information required by a specific group of users to solve queries;
• they can deliver better performance because they are smaller (i.e., only partial history, not all
sources and not necessarily the most detailed data) than primary data warehouses.
Sometimes, mainly for organization and policy purposes, you should use a different architecture in
which sources are used to directly populate data marts. These data marts are called independent. If
there is no primary data warehouse, this streamlines the design process, but it leads to the risk of incon-
sistencies between data marts. To avoid these problems, you can create a primary data warehouse and
still have independent data marts. In comparison with the standard two-layer architecture of Fig. 2.2,
the roles of data marts and data warehouses are actually inverted. In this case, the data warehouse is
populated from its data marts, and it can be directly queried to make access patterns as easy as possi-
ble. The following list sums up all the benefits of a two-layer architecture, in which a data warehouse
separates sources from analysis applications:
• In data warehouse systems, good quality information is always available, even when access to
sources is denied temporarily for technical or organizational reasons.
• Data warehouse analysis queries do not affect the management of transactions, the reliability of
which is vital for enterprises to work properly at an operational level.
• Data warehouses are logically structured according to the multidimensional model (see chapter
3), while operational sources are generally based on Relational or semi-structured models.
• A mismatch in terms of time and granularity occurs between OLTP systems, which manage current
data at a maximum level of detail, and OLAP systems, which manage historical and aggregated
data.
• Data warehouses can use specific design solutions aimed at performance optimization of analysis
and report applications.
Finally, it is worth paying attention to the fact that a few authors use the same terminology to define
different concepts. In particular, those authors consider a data warehouse as a repository of integrated
and consistent, yet operational, data, while they use a multidimensional representation of data only
in data marts. According to our terminology, this “operational view” of data warehouses essentially
corresponds to the reconciled data layer in three-layer architectures.
2.4 Metadata
In addition to those kinds of data already discussed in section 1.1.1.1, metadata represents a key aspect
for the discussed architectures. The prefix “meta-” means “more abstract” (e.g., metarule, metaheuris-
tic, metalanguage, metaknowledge, metamodel, etc.). Before coming up with a formal definition of metadata, let us first introduce the data and information concepts. According to the ISO definition,
“data is a representation of facts, concepts and instructions, done in a formalized manner, useful for
communication, interpretation and process, by human beings as well as automated means”. Informa-
tion, however, is something more. According to the ISO definition, “information, in the processing of
data and office machines, is the meaning given to data from the conventional rules used in their rep-
resentation”. The difference becomes crystal clear with an example. Consider the following datum extracted from a database: 1100. So, what information does this datum represent? To answer this, it helps to know that this datum is in binary format, that its type is integer, that it represents the age attribute (in months) of a table named dogs, and that it is updated every year (the last update was in December). All these data about the 1100 datum need to be stored in order to process it as information. This is what we know as metadata, a term applied to the data used to define other data. In short, metadata are data that allow us to interpret data as information. A typical metadata repository is the catalog of a Relational database.
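For example, the Relational catalog can be queried like any other table; the following standard information_schema query retrieves the metadata that turn the raw 1100 value into interpretable information (the dogs table is, of course, hypothetical):

-- Which columns does the dogs table have, and of which type?
SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_name = 'dogs';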
In the scope of data warehousing, metadata play an essential role because they specify the source, values, usage, and features of data warehouse data, which is fundamental to come up with and justify innovative analyses, but also because they define how data can be changed and processed at every architecture layer (hence fostering automation). Figs. 2.2 and 2.3 show that the metadata repository is closely connected to the data warehouse. Applications use it intensively to carry out data-staging and analysis tasks. One can classify metadata into two partially overlapping categories, based on the ways system administrators and end users exploit them. System administrators are interested in internal (technical) metadata because they define data sources, transformation processes, population policies, logical and physical schemas, constraints, and user profiles and permissions. External (business) metadata are relevant to end users: for example, definitions, update information, quality standards, units of measure, relevant aggregations, derivation rules and algorithms, etc.
Metadata are stored in a metadata repository which all the other architecture components can access.
A tool for metadata management should:
• allow administrators to perform system administration operations, and in particular manage se-
curity;
• allow end users to navigate and query metadata;
• use a GUI;
Chapter 3
OLAP and the Multidimensional Model
OLAP tools are intended to ease information analysis and navigation throughout the data warehouse, in order to extract knowledge relevant to the organization. This term was first introduced by E.F. Codd in 1993
[CCS93], and it was carefully chosen to confront OLTP. In the context of data warehousing, as depicted
in Fig. 3.1, OLAP tools are placed in between the data warehouse and the front-end presentation tools.
OLAP tools are precisely defined by means of the FASMI (Fast Analysis of Shared Multidimensional
Information) test [Pen08]. According to it, an OLAP tool must provide Fast query answering so as not to frustrate the end user's reasoning, offer Analysis tools, and implement security and concurrency mechanisms to Share the business Information from a Multidimensional point of view. This last feature is the most important one, since OLAP tools are conceived to exploit the data warehouse for analysis tasks based on multidimensionality.
We can say that multidimensionality plays for OLAP the same role the Relational model plays for Relational databases. Unfortunately, unlike Relational databases, there is no consensus yet about a standard multidimensional model (among other reasons, because major software vendors are not interested in reaching such an agreement). However, we can nowadays talk about a de facto multidimensional data structure (or multidimensionality) and, to some extent, about a de facto multidimensional algebra, which we present next at three different levels: (1) what reality multidimensionality models, (2) how this reality can be represented at the conceptual level, and (3) which are the alternative logical (and physical) representations.
basic operations.
For example, consider the data cube in Fig. 3.2. Now, the user could decide to slice it by setting
constraints (e.g., product = ’Eraser’, date = ’2-1-2006’, city <> ’Barcelona’, etc.) over any of the three
dimensional axes (see Fig. 3.3) or apply several constraints to more than one axis (see Fig. 3.4). In general,
after applying a multidimensional operator, we obtain another cube that we can further navigate with
other operations. As a whole, the set of operators applied over the initial cube is what we call the
navigation path.
As a result, multidimensionality enables analysts, managers, executives and in general those people
involved in decision making, to gain insight into data through fast queries and analytical tasks, allowing
them to make better decisions.
Figure 3.5: An example of roll-up / drill-down over a cube represented in tabular form
Consider Fig. 3.6, which depicts a conceptual transactional UML model (similar to ER). These models were defined to:
• Reduce the amount of redundant data
• Eliminate the need to modify many records because of one modification, which is very efficient if data change very often
However, it is also well known that such models:
• Degrade response time of queries (mainly due to the presence of join operators)
• Make it easy to introduce mistakes if the user is not an expert in computer science
These pros and cons suit operational (typically transactional) systems perfectly, but they do not suit decisional systems like data warehouses, which aim at fast answering and easy navigation of data.
As discussed, multidimensionality is based on the fact / dimension dichotomy. Dimensional con-
cepts produce the multidimensional space in which the fact is placed. Dimensional concepts are those
concepts likely to be used as a new analytical perspective, which have traditionally been classified as
dimensions, levels and descriptors. Thus, we consider that a dimension consists of a hierarchy of
levels representing different granularities (or levels of detail) for studying data, and a level containing
descriptors (i.e., level attributes). We denote by atomic level the level at the bottom of the dimension
hierarchy (i.e., that of the finest level of detail) and by All level the level at the top of the hierarchy con-
taining just one instance representing the whole set of instances in the dimension. In contrast, a fact
contains measures of analysis. Importantly, note that a fact may produce not just one but several dif-
ferent levels of data granularity. Therefore, we say that a certain granularity contains individual cells
of the same granularity from the same fact. A specific granularity of data is related to one level for
each of its associated dimensions of analysis. Finally, one fact and several dimensions for its analysis
produce what Kimball called a star schema.
Finally, note that we consider {product × day × city} in Fig. 3.2 to be the multidimensional base of the finest fact granularity level (i.e., the one related to the atomic levels of each dimension, also known as the atomic granularity). Thus, one value of each of these levels determines one cell (i.e., a sale with its price, discount, etc.). Importantly, this is a relevant feature of multidimensionality: levels determine factual data or, in other words, they can be depicted as functional dependencies (the set of levels determines the fact, and each fact cell has a single value associated with each level). That is the reason why level-fact relationships have 1-* (one-to-many) multiplicities. In the multidimensional model, *-* (many-to-many) relationships are meaningless, as they do not preserve the model constraints.
Recall now the transactional schema used as an example in Fig. 3.6. You should be able to distinguish why it cannot be considered a multidimensional schema. In contrast, check the multidimensional schemas in Figs. 3.7 and 3.8 and see the differences:
Furthermore, they only include data relevant for decision making. Thus, they are simpler and do not contain as much data as transactional schemas. Consequently, they are easier to use and understand, and queries over such schemas are fast and efficient. Note another important property of these schemas: dimensions are represented as a single object. For example, in a Relational implementation (see the next section for further details), it would imply there is only one table for the whole dimension. Clearly, this is against the second and third Relational normal forms, but denormalization of data is common in data warehousing, as we aim to boost querying performance and, in this way, we avoid joins, the most expensive Relational operator. Denormalization is acceptable in such scenarios since the users are not allowed to insert data, and they only query the data warehouse. Thus, since the ETL process is the only one responsible for inserting data, denormalizing such schemas is sound and acceptable (even encouraged).
Although the star-schema is the most popular conceptual representation, there are some others, such as the snowflake schema and the constellation schema. The first one (see the left-hand side schema in Fig. 3.9) corresponds to a normalized star-schema, i.e., without denormalizing dimensions, and hence each concept is made explicit separately. Finally, a schema that contains more than one fact sharing dimensions is called a galaxy or constellation (see the right-hand side schema in Fig. 3.9), and it facilitates drilling across (see the algebraic operations below).
the Relational database makes this fact transparent for the users), allowing them to take advantage
of a well-known and established technology, whereas MOLAP systems are based on an ad hoc logical
model that can be used to represent multidimensional data and operations directly. The underlying
multidimensional database physically stores data as flatten/serialized arrays (and the access to it is
positional), Grid-files, R*-trees or UB-trees, which are among the techniques used for this purpose (see
Fig. 3.10).
As a consequence, ROLAP tools used to deal with larger volumes of data than MOLAP tools (i.e., ad hoc multidimensional solutions), although nowadays this statement is starting to crumble. Their performance for query answering and cube browsing, however, is not as good (mainly because Relational technology was conceived for OLTP systems and tends to generate too many joins when dealing with multidimensionality). Thus, new HOLAP (Hybrid On-Line Analytical Processing) tools were proposed. The HOLAP architecture combines the ROLAP and MOLAP ones, trying to obtain the strengths of both approaches, and such tools usually allow changing from ROLAP to MOLAP and vice versa. Specifically, HOLAP takes
advantage of the standardization level and the ability to manage large amounts of data from ROLAP
implementations, and the query speed typical of MOLAP systems. HOLAP implies that the largest
amount of data should be stored in a Relational DBMS to avoid the problems caused by sparsity, and
that a multidimensional system stores only the information users most frequently need to access. If
that information is not enough to solve queries, the system will transparently access the part of the data
managed by the Relational system.
Figure 3.11: Example of a star-join schema corresponding to the right-hand star-schema in Fig. 3.8
(ignoring dimensions “Doctor” and “Therapy” for the sake of simplicity)
Although ROLAP tools have failed to dominate the OLAP market due to their severe limitations (mainly slow query answering) [Pen05], at the beginning they were the reference architecture. Indeed,
Kimball’s reference book [KRTR98] presented how a data warehouse should be implemented over a Re-
lational DBMS (Relational Database Management System) and how to retrieve data from it. To do so, he
introduced for the first time the star-join (to implement star-schemas) and snowflake schemas (to implement
the conceptual schemas with the same name). At Relational level, the star schema consists of one table
for the fact and one denormalized table for every dimension, with the latter being pointed by foreign
keys (FK) from the fact table, which compose its primary key (PK) (see Fig. 3.11). The normalized3 ver-
sion of a star schema is a snowflake schema; getting a table for each level with a FK pointing to each of
its parents in the dimension hierarchy. Nevertheless, both approaches can be conceptually generalized
into a more generic one consisting of partially normalizing the dimension tables according to our needs: completely normalizing each dimension we get a snowflake schema, and not normalizing them at all results in a star schema. In general, normalizing the dimensions requires a very good reason, since it produces very little gain in the size of the dimensions and a big loss in the performance of queries (since it generates more joins). Normalization was conceived to minimize redundancies that hinder performance in the presence of updates. However, the DW is considered read-only, and dimensional data is especially small and stable compared to factual data.
3 https://youtu.be/SSio_jhAmzg
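As an illustration of the star-join structure just described, a minimal Relational sketch could be declared as follows (table and column names mirror those used in the query examples later in this chapter; details such as data types are assumptions):

-- Denormalized dimension tables (one single table per dimension)
CREATE TABLE Product (ID INTEGER PRIMARY KEY, itemName VARCHAR(50));
CREATE TABLE Place   (ID INTEGER PRIMARY KEY, city VARCHAR(50), region VARCHAR(50));
CREATE TABLE Time    (ID INTEGER PRIMARY KEY, dayMonthYear DATE, monthYear VARCHAR(20));
-- (some DBMSs may require quoting the Time table name)

-- Fact table: the foreign keys to the dimensions compose its primary key
CREATE TABLE Sales (
  IDProduct INTEGER REFERENCES Product(ID),
  IDPlace   INTEGER REFERENCES Place(ID),
  IDTime    INTEGER REFERENCES Time(ID),
  items     INTEGER,                      -- measure
  PRIMARY KEY (IDProduct, IDPlace, IDTime)
);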
The FROM clause contains the fact table and the dimension tables (or level tables, in the case of a snowflake schema). These tables are properly linked in the WHERE clause by joins that represent concept associations (in a star schema, only between the fact and dimension tables; in a snowflake schema, also between dimension tables). The WHERE clause also contains logical clauses restricting
a specific level attribute (i.e., a descriptor) to a constant using a comparison operator (used to slice
the produced cube). The GROUP BY clause shows the identifiers of the levels at which we want to
aggregate data. Those columns in the grouping must also be in the SELECT clause to identify the values
in the result. Finally, the ORDER BY clause is designed to sort the output of the query. As output, a
cube-query will produce a single data cube (i.e., a specific level of data granularity). For example, think
of the cube-query necessary to produce the cube (shown in tabular form) in Fig. 3.12, from the star-join
schema in Fig. 3.11.
For the sake of understandability, we present the algebra by means of an example. Consider a snowflake implementation of the conceptual schema depicted in Fig. 3.13. The cube-query that retrieves the data at the atomic granularity is sketched below.
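A possible form of this atomic cube-query is the following (only a sketch: the table and column names of the snowflake implementation are assumed, with sales as the measure):

-- Atomic cube-query: one cell per combination of the atomic levels
SELECT d1.product_id, d2.city, d3.day, f.sales
FROM SalesFact f, Product d1, City d2, Day d3
WHERE f.IDProduct = d1.ID AND f.IDCity = d2.ID AND f.IDDay = d3.ID
ORDER BY d1.product_id, d2.city, d3.day;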
Note that no grouping is needed, as we are just retrieving atomic data. Next, we show how this cube-query would be modified by each of the multidimensional operators introduced next (we suggest that the reader follow Fig. 3.14 to grasp the idea behind each operator; a consolidated SQL sketch combining several of them is given right after the list):
• Selection or Dice: By means of a logic predicate over the dimension attributes, this operation
allows users to choose the subset of points of interest out of the whole n-dimensional space (re-
member to check Fig. 3.14).
Figure 3.15: Exemplification of an OLAP navigation path translation into SQL queries
In SQL, it means ANDing the corresponding comparison clause to the cube-query WHERE clause. For example, consider the atomic cube-query presented above as an example. If we want to analyze the sales data regarding the city of Barcelona, we must perform a selection over the city dimension (see Fig. 3.15).
• Roll-up: Also called “Drill-up”, it groups cells in a Cube based on an aggregation hierarchy. This
operation modifies the granularity of data by means of a many-to-one relationship which relates
instances of two levels in the same dimension. For example, it is possible to roll-up monthly sales
into yearly sales moving from “Month” to “Year” level along the temporal dimension.
In SQL, it entails replacing the identifiers of the level from which we roll up with those of the
level that we roll-up to. Thus, the SELECT, GROUP BY and ORDER BY clauses must be modi-
fied accordingly. Measures in the SELECT clause must also be summarized using an aggregation
function. In our example (see Fig. 3.15), we perform two different roll-ups: on the one hand, we
roll-up from product id to the All level. On the other hand, we roll-up from city to country.
Note that the country table is added to the FROM clause, and we replace the city identifier with
that of the country level in the SELECT, GROUP BY and ORDER BY clauses. Finally, we add the
proper links in the WHERE clause. As for rolling up from product to the All level, note that it is equivalent to removing both the product identifiers and their links.
• Drill-down: This is the counterpart of Roll-up. Thus, it removes the effect of that operation by
going down through an aggregation hierarchy, and showing more detailed data.
In SQL, Drill-down can only be performed by undoing the changes that the Roll-up operation introduced in the cube-query.
• ChangeBase: This operation reallocates exactly the same instances of a cube into a new n-dimensional
space with exactly the same number of points. Actually, it allows two different kinds of changes in
the space: rearranging the multidimensional space by reordering the dimensions, interchanging
rows and columns in the tabular representation (this is also known as pivoting), or adding/remov-
ing dimensions to/from the space.
In SQL it can be performed in two different ways. If we reorder the base (i.e., when “pivoting”),
we just need to reorder the identifiers in the ORDER BY and SELECT clauses. But if changing
the base, we need to add the new level tables to the FROM and the corresponding links to the
WHERE clause. Moreover, identifiers in the SELECT, ORDER BY and GROUP BY clauses must
be replaced appropriately. Following with the same example shown in Fig. 3.15, we can change
from {day × country × All} to {day × country}. Note that both bases are conceptually related
by means of a one-to-one relationship. Specifically, this case typically applies when dropping a dimension (i.e., rolling up to its All level and then changing the base). We roll up to the All level to represent the whole set of dimension instances as a single one, therefore producing the following base: {day × country × 1}. Now, we can changeBase to {day × country} without
introducing aggregation problems (since we changeBase through a one-to-one relationship).
• Drill-across: This operation changes the subject of analysis of the cube, by showing measures
regarding a new fact. The n-dimensional space remains exactly the same, only the data placed
in it change so that new measures can be analyzed. For example, if the cube contains data about
sales, this operation can be used to analyze data regarding stock using the same dimensions.
In SQL, we must add a new fact table to the FROM clause, its measures to the SELECT, and the
corresponding links to the WHERE clause. In general, if we are not using any semantic relation-
ship, a new fact table can always be added to the FROM clause if fact tables share the same base. In
our example, suppose that we have a stock cube sharing the same dimensions as the sales cube.
Then, we could drill-across to the stock cube and show both the stock and sales measures (see
Fig. 3.15).
• Set operations: These operations allow users to operate two cubes defined over the same n-
dimensional space. Usually, Union, Difference and Intersection are considered.
In this document, we will focus on Union. Thus, in SQL, we merge the FROM and WHERE clauses
of both queries and, finally, we OR their selection conditions in the WHERE clause. Importantly,
note that we can only union queries over the same fact table. Intuitively, it means that, in the
multidimensional model, the union is used to undo selections. We can unite our example query
with an identical one querying for data concerning Lleida instead of Barcelona.
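As announced above, the following are two hedged sketches of these operations in SQL, using the Sales
star schema employed later in this chapter (the Stock fact table, its stockUnits measure, and the assumption
that the Place table stores both city and region are introduced here only for illustration).
A roll-up from City to Region replaces the city identifier by the region one in the SELECT, GROUP BY
and ORDER BY clauses and re-aggregates the measure (in a snowflaked design, the Region table would
additionally be added to the FROM clause, as explained above):
-- Cube-query at the City level
SELECT d1.itemName, d2.city, SUM(fact.items)
FROM Sales fact, Product d1, Place d2
WHERE fact.IDProduct=d1.ID AND fact.IDPlace=d2.ID
GROUP BY d1.itemName, d2.city
ORDER BY d1.itemName, d2.city;
-- The same cube rolled up from City to Region
SELECT d1.itemName, d2.region, SUM(fact.items)
FROM Sales fact, Product d1, Place d2
WHERE fact.IDProduct=d1.ID AND fact.IDPlace=d2.ID
GROUP BY d1.itemName, d2.region
ORDER BY d1.itemName, d2.region;
A drill-across to the hypothetical Stock cube adds the new fact table to the FROM clause, its measure to
the SELECT clause, and the corresponding links to the WHERE clause; the sketch assumes both fact tables
are defined over the same base (i.e., at most one row per product and place combination), so the join does
not replicate any measure:
SELECT d1.itemName, d2.region, SUM(f1.items), SUM(f2.stockUnits)
FROM Sales f1, Stock f2, Product d1, Place d2
WHERE f1.IDProduct=d1.ID AND f1.IDPlace=d2.ID
AND f2.IDProduct=d1.ID AND f2.IDPlace=d2.ID
GROUP BY d1.itemName, d2.region
ORDER BY d1.itemName, d2.region;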
Figure 3.16: Sketch summarizing the main design differences between a transactional and a decisional
system
[Figure: the multidimensional schema of the running example — fact Sales (measure: items), with dimensions
TIME (levels Day – Month – All; descriptors dayMonthYear, monthYear), PRODUCT (levels Item – All;
descriptor item_name) and PLACE (levels City – Region – All; descriptors city, region).]
For the sake of comprehension, the reader will note a slicer over each of the three dimensions. The reason is
to restrict the number of values to be shown in the upcoming figures. In the figure, ⊕ denotes a table-wise
composition of cubes. Now, go back to the previous cube-query introduced and realize that it only retrieves
the four detailed (yellow) cells. To obtain the other cells, we need to union four queries as shown below:
SELECT d1.itemName, d2.region, d3.monthYear, SUM(fact.items)
FROM Sales fact, Product d1, Place d2, Time d3
WHERE fact.IDProduct=d1.ID AND fact.IDPlace=d2.ID AND fact.IDTime=d3.ID
AND d1.itemName IN ('Ballpoint','Rubber') AND d2.region='Catalonia'
AND d3.monthYear IN ('January2002','February2002')
GROUP BY d1.itemName, d2.region, d3.monthYear
UNION
SELECT d1.itemName, d2.region, 'Total', SUM(fact.items)
FROM Sales fact, Product d1, Place d2, Time d3
WHERE fact.IDProduct=d1.ID AND fact.IDPlace=d2.ID AND fact.IDTime=d3.ID
AND d1.itemName IN ('Ballpoint','Rubber') AND d2.region='Catalonia'
AND d3.monthYear IN ('January2002','February2002')
GROUP BY d1.itemName, d2.region
UNION
SELECT 'Total', d2.region, d3.monthYear, SUM(fact.items)
FROM Sales fact, Product d1, Place d2, Time d3
WHERE fact.IDProduct=d1.ID AND fact.IDPlace=d2.ID AND fact.IDTime=d3.ID
AND d1.itemName IN ('Ballpoint','Rubber') AND d2.region='Catalonia'
AND d3.monthYear IN ('January2002','February2002')
GROUP BY d2.region, d3.monthYear
UNION
SELECT 'Total', d2.region, 'Total', SUM(fact.items)
FROM Sales fact, Product d1, Place d2, Time d3
WHERE fact.IDProduct=d1.ID AND fact.IDPlace=d2.ID AND fact.IDTime=d3.ID
AND d1.itemName IN ('Ballpoint','Rubber') AND d2.region='Catalonia'
AND d3.monthYear IN ('January2002','February2002')
GROUP BY d2.region
ORDER BY 1, 2, 3;
Check carefully each united piece: they are almost identical. Indeed, formally, we are only changing
the GROUP BY clause to obtain the desired granularity. Furthermore, if we were interested in comput-
ing the totals for all three dimensions instead of two, we would need 7 unions, i.e., 8 different SQL
queries at different granularities, only changing their GROUP BY clause. In other words, in addition
to those granularities computed in the example introduced, we should compute the {Month, All, Item},
{Month, All, All}, {All, All, Item} and {All, All, All} granularities.
Figure 3.18 sketches the eight cubes to union (Sales per Day, Product and City; per Day and Product;
per Day and City; per Product and City; per Day; per Product; per City; and the overall total).
It is easy to realize that the number of unions to perform grows at an exponential rate with regard to
the number of dimensions. Fortunately, the SQL’99 standard provides specific syntax to save us time.
As shown in the next query, we can use the GROUPING SETS keyword in the GROUP BY clause to
produce the desired granularities (listed in parentheses after the keyword). In this way, the other query
clauses are written only once:
SELECT d1.itemName, d2.region, d3.monthYear, SUM(fact.items)
FROM Sales fact, Product d1, Place d2, Time d3
WHERE fact.IDProduct=d1.ID AND fact.IDPlace=d2.ID AND fact.IDTime=d3.ID
AND d1.itemName IN ('Ballpoint','Rubber') AND d2.region='Catalonia'
AND d3.monthYear IN ('January2002','February2002')
GROUP BY GROUPING SETS ((d1.itemName, d2.region, d3.monthYear),
                        (d1.itemName, d2.region),
                        (d2.region, d3.monthYear),
                        (d2.region))
ORDER BY d1.itemName, d2.region, d3.monthYear;
The query output would contain nine rows: the four detail cells, the two per-item totals, the two per-month
totals, and the overall total for Catalonia.
3.4.2 ROLLUP
As previously discussed, the number of totals grows at an exponential rate with regard to the number of
dimensions in the cube. As a consequence, even if we can save writing the other query clauses by using
the GROUPING SETS modifier, it can still be unwieldy in some cases. In other words, GROUPING SETS
avoids writing the same query n times, but we still have to write all the attribute combinations needed to
produce the desired granularities.
To further facilitate the computation of aggregations, the SQL’99 standard introduced the ROLLUP key-
word. Given a list of attributes, it computes all the aggregations obtained by successively disregarding the
right-most attribute in the list. Thus, it is not a set in the mathematical sense, as order does matter. Con-
sider the previous query that used the GROUPING SETS modifier and how it can be rewritten using the
ROLLUP keyword (pay attention to the attribute order in the GROUP BY and ORDER BY clauses):
SELECT d1.itemName, d2.region, d3.monthYear, SUM(fact.items)
FROM Sales fact, Product d1, Place d2, Time d3
WHERE fact.IDProduct=d1.ID AND fact.IDPlace=d2.ID AND fact.IDTime=d3.ID
AND d1.itemName IN ('Ballpoint','Rubber') AND d2.region='Catalonia'
AND d3.monthYear IN ('January2002','February2002')
GROUP BY ROLLUP (d2.region, d1.itemName, d3.monthYear)
ORDER BY d2.region, d3.monthYear, d1.itemName;
The query output would be:
itemName region monthYear items
Ballpoint Catalonia January02 275827
Rubber Catalonia January02 784172
Ballpoint Catalonia February02 290918
Rubber Catalonia February02 918012
Ballpoint Catalonia NULL 566745
Rubber Catalonia NULL 1702184
NULL Catalonia NULL 2268929
NULL NULL NULL 2268929
39
A. Abelló and P. Jovanovic Data Warehousing and OLAP
The first four rows correspond to “GROUP BY d2.region, d1.itemName, d3.monthYear”; the next
two to “GROUP BY d2.region, d1.itemName”; the next one to “GROUP BY d2.region”; and the last
one to “GROUP BY ()” (note that from the SQL’99 standard onwards it is allowed to write GROUP BY (),
i.e., with an empty grouping list). If we happen to have a single value for d2.region (e.g., Catalonia), the
measure value in the last two rows happens to be identical.
Realize, accordingly, that GROUP BY ROLLUP (a1, ..., an) corresponds to:
GROUP BY GROUPING SETS ((a1, ..., an),
                        (a1, ..., an-1),
                        ...,
                        (a1),
                        ())
Importantly, note these two relevant features. Firstly, the order in which you write the attributes in the
ROLLUP list does matter (it determines the aggregations to be performed). However, the attribute
order specified in the ORDER BY clause does not affect the query result (only how it is presented to the
user). Secondly, we can avoid producing the last row by fixing the corresponding dimension to a single
value (in our case, it is fixed to Catalonia; check the slicer over the Place dimension in the query) and
forcing this value to be present in all computed aggregations (i.e., placing it in the GROUP BY but outside
the ROLLUP expression):
SELECT d1.itemName, d2.region, d3.monthYear, SUM(fact.items)
FROM Sales fact, Product d1, Place d2, Time d3
WHERE fact.IDProduct=d1.ID AND fact.IDPlace=d2.ID AND fact.IDTime=d3.ID
AND d1.itemName IN ('Ballpoint','Rubber') AND d2.region='Catalonia'
AND d3.monthYear IN ('January2002','February2002')
GROUP BY d2.region, ROLLUP (d1.itemName, d3.monthYear)
ORDER BY d2.region, d3.monthYear, d1.itemName;
This query output is exactly the same as the one presented before but without the last row.
Note that we can rewrite the GROUPING SETS exemplifying query presented in the previous section
(which contained four aggregations) as follows:
SELECT d1.itemName, d2.region, d3.monthYear, SUM(fact.items)
FROM Sales fact, Product d1, Place d2, Time d3
WHERE fact.IDProduct=d1.ID AND fact.IDPlace=d2.ID AND fact.IDTime=d3.ID
AND d1.itemName IN ('Ballpoint','Rubber') AND d2.region='Catalonia'
AND d3.monthYear IN ('January2002','February2002')
GROUP BY GROUPING SETS ((d2.region, ROLLUP (d1.itemName, d3.monthYear)),
                        (d2.region, d3.monthYear))
ORDER BY d1.itemName, d2.region, d3.monthYear;
3.4.3 CUBE
ROLLUP saves us from writing down all attribute combinations in order to produce the desired gran-
ularities. However, we are still forced to write some of them (check the previous example). It is possible,
though, to produce all the combinations needed for the GROUPING SETS exemplifying query by using
the CUBE keyword. Check the following query; it produces the nine cells of that example in a more concise
way:
SELECT d1.itemName, d2.region, d3.monthYear, SUM(fact.items)
FROM Sales fact, Product d1, Place d2, Time d3
WHERE fact.IDProduct=d1.ID AND fact.IDPlace=d2.ID AND fact.IDTime=d3.ID
AND d1.itemName IN ('Ballpoint','Rubber') AND d2.region='Catalonia'
AND d3.monthYear IN ('January2002','February2002')
GROUP BY d2.region, CUBE (d1.itemName, d3.monthYear)
ORDER BY d1.itemName, d2.region, d3.monthYear;
In addition, CUBE and ROLLUP can be combined in the same GROUP BY expression to obtain the
desired attribute combinations, and the resulting grouping sets are the cross product of the expansions of
each modifier. For example, GROUP BY CUBE(a,b), ROLLUP(c,d) corresponds to:
GROUP BY GROUPING SETS ((a,b,c,d), (a,b,c), (a,b),
                        (a,c,d), (a,c), (a),
                        (b,c,d), (b,c), (b),
                        (c,d), (c), ())
Indeed, the SQL’99 standard allows combining CUBE, ROLLUP and GROUPING SETS to a certain
extent. We suggest practicing with a toy example to realize which combinations make sense and
which do not.
3.4.4 Conclusions
The CUBE, ROLLUP and GROUPING SETS keywords were introduced in the SQL’99 standard as mod-
ifiers of the GROUP BY clause. They were intended to facilitate the computation of aggregations and,
thus, they can be considered syntactic sugar. However, they are more than pure syntactic sugar, as they
also improve system performance, since the query optimizer receives valuable additional information
about the aggregations to be carried out.
• A transactional conceptual schema has no further restrictions than those of the application do-
main. For example, it can be modeled using any UML feature. However, multidimensional
schemas are a simplified version, characterized by star-shaped schemas.
• OLTP systems are traditionally implemented using Relational technology, which has been
proven to suit their needs. However, a multidimensional schema can be implemented via
either ROLAP or MOLAP approaches.
Multimedia Materials
Multidimensionality (en)
Multidimensionality (ca)
Benefits of the Star Schema (en)
Benefits of the Star Schema (ca)
Chapter 4
Query optimization
A Database Management System (DBMS) is much more than a file system (see [GUW09]). Thus, before
looking at the details of the query optimization process, we will see a functional architecture of a DBMS.
A functional architecture is one representing the theoretical components of the system as well as their
interaction. Concrete implementations do not necessarily follow such an architecture, since they are more
concerned with performance, while a functional architecture is defined for communication, teaching
and learning purposes.
Indeed, a DBMS offers many different functionalities, but maybe the most relevant to us is query
processing, which includes views, access control, constraint checking and, last but not least, optimiza-
tion. The latter functionality consists of deciding the best way to access the requested data (a.k.a. the access
plan). Once it has been decided, the plan is passed to the execution manager, which handles the set of
operations in a coordinated way. Finally, the data manager helps to avoid the disk bottleneck
by efficiently moving data from disk to memory and vice-versa, guaranteeing that no data is lost in case
of system failure.
1 Parallel and distributed computing dramatically influences the different components of this architecture.
From the point of view of the final user, there is no difference between querying data in a view or in a table.
However, dealing with views is not an easy problem and we are not going to go into details in this
chapter. Nevertheless, we just want to briefly show in this section the difficulties they raise.
First of all, we must take into account that data in the view can be physically stored (a.k.a. mate-
rialized) or not, which raises new difficulties. If data in the view is not physically stored, in order to
transform the query over the views into a query over the source tables, we must replace the view name
in the user query by its definition (this is known as view expansion). In some cases, it is more efficient
to instruct the DBMS to calculate the view result and store it, waiting for the queries. However, if we
do so, we have to be able to transform an arbitrary query over the tables into a query over the available
materialized views (this is known as query rewriting), which is somewhat the opposite of view expansion (in
the sense that we have to identify the view definition in the user query and replace it by the view name).
If we are able to rewrite a query, we still have to decide whether it is worth rewriting it or not, i.e., whether
the redundant data can be used to improve the performance of the query (this is known as answering queries
using views).
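A minimal SQL sketch may help fix this vocabulary (the view name and the aggregation are hypothetical,
built on the Sales schema used elsewhere in this document):
-- A view with sales aggregated per region
CREATE VIEW SalesPerRegion AS
SELECT d.region, SUM(f.items) AS totalItems
FROM Sales f, Place d
WHERE f.IDPlace = d.ID
GROUP BY d.region;
-- View expansion: a user query over the view ...
SELECT region, totalItems FROM SalesPerRegion WHERE region = 'Catalonia';
-- ... is internally transformed into a query over the base tables
SELECT d.region, SUM(f.items)
FROM Sales f, Place d
WHERE f.IDPlace = d.ID AND d.region = 'Catalonia'
GROUP BY d.region;
-- Query rewriting (when the view is materialized): a user query over the base tables
-- that matches the view definition
SELECT d.region, SUM(f.items)
FROM Sales f, Place d
WHERE f.IDPlace = d.ID
GROUP BY d.region;
-- can be answered from the stored view content instead
SELECT region, totalItems FROM SalesPerRegion;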
Finally, updating data in the presence of views is also more difficult. Firstly, we would like to allow
users to express not only queries but also updates in terms of views (this is known as update through
views), which is only possible in a few cases. Secondly, if views are materialized, changes in the tables have
to be, potentially, propagated to the views (this is known as view updating).
Ethics as well as legal issues raise the need to control access to data. We cannot allow any user to query
or modify all data in our database. This module is in charge of defining user privileges, and once this is
done, of validating user statements by checking whether the user is allowed to perform the associated
action.
1. Validation
(a) Check syntax
(b) Check permissions
(c) Expand views
(d) Check table schema
2. Optimization
(a) Semantic
(b) Syntactic
(c) Physical
3. Evaluation (i.e., disk access)
We should remember that the optimization step has three components, namely the semantic, syntactic and
physical optimizers (see details in Section 4.2).
4.1.3 Scheduler
As you know, many users (up to tens or hundreds of thousands) can work concurrently on a database. In
such a case, it is quite likely that they want to access not only the same table, but exactly the same column
and row. If so, they can interfere with one another’s tasks. The DBMS must provide some mechanism
to deal with this problem. Roughly speaking, the way to do it is to restrict the execution order of the
operations (i.e., reads, writes, commits and aborts) of the different users. On getting a command, the
scheduler can either pass it directly to the data manager, queue it waiting for the appropriate time to
be executed, or definitely cancel it (resulting in the abort of its transaction). The most basic and
commonly used mechanism to avoid interferences is Shared-eXclusive locking (see [BHG87]).
As sketched in Figure 4.2, the optimizer is a module inside the DBMS, whose input is an SQL query
and its output is an access plan expressed in a given procedural language. Its objective is to obtain an
execution algorithm as good as possible based on the contents of the database catalog:
• Available access structures (for example, B-tree, bitmaps and hash indexes).
The cost function to be minimized typically refers to machine resources such as disk space, disk
input/output, buffer space, and CPU time. In current centralized systems where the database resides
on disk storage, the emphasis is on minimizing the number of disk accesses.
The search space in this optimization problem is of exponential size with regard to the input query
(more specifically, the problem is NP-hard in the number of Relations involved), so not all possibilities can
really be explored. Indeed, finding the optimum is so computationally hard that it can even result in higher
costs than just retrieving the data. Therefore, DBMSs use heuristics to prune the vast search space,
which means that sometimes they do not obtain the optimum (although they are usually close to it). Thus,
in general, a DBMS does not find the optimal access plan, but only an approximation (in a reasonable
time). It is important to know how the optimizer works to detect such deviations and correct them,
whenever possible (for example, adding or removing some indexes, partitions, etc.).
Even though its real implementation might not be that modular, we can study the optimization
process as if it were executed in three sequential steps (i.e., Semantic, Syntactic and Physical).
a) Integrity constraints
This module applies transformations to a given query and produces equivalent queries intended to
be more efficient, for example, standardization of the query form, flattening out of nested queries, and
the like. It aims to find incorrect queries (i.e., incorrect form or contradictory). Having an incorrect
form means that there is a better way to write the query from the point of view of performance, while
being contradictory means that its result is going to be the empty set. The transformations performed
47
A. Abelló and P. Jovanovic Data Warehousing and OLAP
depend only on the declarative, that is, static, characteristics of queries and do not take into account the
actual query costs for the specific DBMS and database concerned.
Replicating clauses from one side of an equality to the other is a typical example of a transformation
performed at this phase of the optimization. This may look a bit naive, but it allows indexes over
both attributes to be used. For example, if the optimizer finds “a=b AND a=5”, it will transform it into
“a=b AND a=5 AND b=5”, so that if there is an index over “b”, it can also be used.
Another example of semantic optimization is removing disjunctions. For example, we may transform
the disjunction of two equalities over the same attribute into an “IN”. Reducing the number of
clauses in the selection predicate this way also reduces its evaluation cost. However, such a transformation
is not easily detected in complex predicates and can only be performed in the simplest cases.
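The following hedged sketch illustrates both rewritings on the Sales schema used in this document (the
exact transformations actually applied depend on each DBMS, and the literal values are just examples):
-- Replicating a condition through an equality, so that an index on Sales.IDPlace can also be used
SELECT * FROM Sales f, Place d WHERE f.IDPlace = d.ID AND d.ID = 5;
-- may be internally rewritten as
SELECT * FROM Sales f, Place d WHERE f.IDPlace = d.ID AND d.ID = 5 AND f.IDPlace = 5;
-- Removing a disjunction over the same attribute
SELECT * FROM Place d WHERE d.region = 'Catalonia' OR d.region = 'Aragon';
-- may be internally rewritten as
SELECT * FROM Place d WHERE d.region IN ('Catalonia', 'Aragon');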
In general, semantic optimization is really poor in most (if not all) DBMSs, and only useful in simple
queries. Therefore, even though SQL is considered a declarative language, the way we write the sentence
can hinder some optimizations and affect its performance. Consequently, a simple way for a user to
optimize a query (without any change in the physical schema of the database) may be just rewriting it
in a different way.
• Nodes
– Root: Result
– Internal: Algebraic Operations
– Leaves: Relations
• Edges: Direct usage
This module determines the orderings of the necessary operators to be considered by the optimizer
for each query sent to it. The objective is to reduce the size of data passing from the leaves to the root
(i.e., the output of the operations corresponding to the intermediate nodes in the tree), which can be
mainly done in two different ways: (i) reduce the number of attributes as soon as possible (i.e., reduce
the width of the table), and (ii) reduce the number of tuples as soon as possible (i.e., reduce the length
of the table). To do this, we use the following equivalence rules:
II. Commuting the precedence of selection and join (see Figure 4.4b)
III. Commuting the precedence of selection and set operations (i.e., Union, Intersection and Differ-
ence, see Figure 4.4c)
IV. Commuting the precedence of selection and projection, when the selection attribute is projected
(see Figure 4.4d)
V. Commuting the precedence of selection and projection, when the selection attribute is not projected
(see Figure 4.4e)
VI. Commuting the precedence of projection and join, when the join attributes are projected (see
Figure 4.4f)
VII. Commuting the precedence of projection and join, when some join attribute is not projected (see
Figure 4.4g)
VIII. Commuting the precedence of projection and union (see Figure 4.4h). Notice that intersection and
difference do not commute with projection. For example, given R[A, B] = {[a, 1]} and S[A, B] = {[a, 2]}, then
R[A] − S[A] = ∅, but (R − S)[A] = {[a]}. Thus, similar to the case of join, to commute intersection/difference
with projection, we need to distinguish when the primary key of the tables is being projected (see
Figure 4.4i) or not (see Figure 4.4j).
IX. Commuting join branches (see Figure 4.4k)
X. Associating join tables (see Figure 4.4l)
DBMSs use the equivalence rules to apply two heuristics that usually lead to the best access plan: (i)
execute projections as soon as possible, and (ii) execute selections as soon as possible (notice that, since
they are heuristics, sometimes this does not result in the best cost). Thus, we will follow the algorithm:
1. Split the selection predicates into simple clauses (usually, the predicate is firstly transformed into
Conjunctive Normal Form – CNF).6
2. Lower selections in the tree as much as possible.
3. Group consecutive selections (simplify them if possible).
4. Lower projections in the tree as much as possible (do not leave them just on a table, except when
one branch leaves the projection on the table and the other does not, see Figure 4.5).
5. Group consecutive projections (simplify them if possible).
It is important to notice that this algorithm only requires equivalence rules from I to VIII. The last
two equivalence rules (i.e., IX and X) would be used to generate alternative trees. However, for the sake
of understandability, we will assume this is not part of the syntactic optimization algorithm, but done
during the next physical step.
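Although these rules operate on the algebraic tree rather than on the SQL text, the effect of pushing a
selection (and a projection) below a join can be sketched in SQL on the Sales schema used in this document;
this is only a conceptual illustration of what the optimizer does internally, not something the user needs to
write:
SELECT d.region, SUM(f.items)
FROM Sales f, Place d
WHERE f.IDPlace = d.ID AND d.region = 'Catalonia'
GROUP BY d.region;
-- is evaluated as if Place were filtered and projected before the join:
SELECT d.region, SUM(f.items)
FROM Sales f,
     (SELECT ID, region FROM Place WHERE region = 'Catalonia') d
WHERE f.IDPlace = d.ID
GROUP BY d.region;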
It is also part of the syntactic optimization to simplify tautologies (R ∩ ∅ = ∅, R − R = ∅, ∅ − R =
∅, R ∩ R = R, R ∪ R = R, R ∪ ∅ = R, R − ∅ = R), and detect disconnected parts of the query, if any.
Detecting disconnected tables (those with no join condition in the predicate) in a query can be easily
done. However, in this case, no error is usually thrown. Instead, a Cartesian product is performed by
most (if not all) DBMSs. Moreover, in some cases, the tree is transformed into a Directed Acyclic
Graph (DAG) by fusing nodes if they correspond to exactly the same Relational operation with the same
parameters (for example, the same selection operation in two different subqueries of the same SQL
sentence, see Figure 4.6).
6 Example of CNF: (x OR y) AND (z OR t OR . . . ) AND . . .
• Physical structures
• Access paths
• Algorithms
This is the main module of the query processor. It employs a search strategy that explores the space
of access plans determined by the algebraic tree produced in the previous stage. It compares these plans
based on estimates of their cost and selects the overall cheapest one to be used to generate the answer
to the original query. First of all, to do it, we transform the syntactic tree into a process tree. This is the
tree associated to the syntactic tree that models the execution strategy. Similar to the syntactic one, the
interpretation of the process tree is as follows:
• Nodes
– Root: Result
– Internal: Intermediate temporary tables generated by a physical operation
– Leaves: Tables (or Indexes)
Notice that the main difference between both trees is that the new one is more expressive and con-
crete, in the sense that its nodes represent data, which can come from a user table, a temporary table, or even
an index. Besides that, the nodes now represent real steps that will be executed (e.g., most projections
disappear by being fused with the previous operation in the data flow, or by just being removed if they lie
on a table). Thus, to reduce the size of the search space that the optimization strategy must explore,
DBMSs usually impose various restrictions. Typical examples include never generating unnecessary
intermediate results (i.e., intermediate selections and projections are processed on the fly). In a real op-
timizer, selections that are not on a table would also be fused with the previous operation; however, we
will assume they are not. Moreover, grouping and sorting operations are also added now to the process
tree.
There are four steps in the physical optimizer:
1. Alternatives generation
2. Intermediate result estimation
3. Cost estimation
4. Selection of the cheapest access plan
During Step 1, we first construct all alternative algebraic trees by iterating on the number of Relations
joined so far (using equivalence rules IX and X). The memory requirements and running time grow
exponentially with the query size (i.e., number of joins) in the worst case. Most queries seen in practice,
however, involve fewer than ten joins, and the algorithm has proved to be very effective in such contexts.
For a complicated query, the number of all orderings may be enormous.
Then, the physical optimizer determines the implementation choices that exist for the execution of
each operator ordering specified by the algebraic tree. These choices are related to the available join
methods for each join (e.g., nested loops, merge join, and hash join), if/when duplicates are eliminated,
and other implementation characteristics of this sort, which are predetermined by the DBMS implemen-
tation. They are also related to the available indices for accessing each Relation, which are determined
by the physical schema of each database.
During Step 2, this module estimates the sizes of the results of (sub)queries and the frequency dis-
tributions of values in attributes of these results, which are needed by the cost model. Some commercial
DBMSs, however, base their estimation on assuming a uniform distribution (i.e., all attribute values
having the same frequency). A more accurate estimation can be obtained by using a histogram, but this
is a bit more expensive and complex to deal with.
During Step 3, this module specifies the arithmetic formulas used to estimate the cost of access
plans. For every different join method, different index-type access, and in general for every distinct
kind of step that can be found in an access plan, there is a formula that gives an (often approximate)
cost for it. For example, a full table scan can be estimated as the number of blocks in the table B times
the average disk access time D (i.e., Cost(FullTableScan) = B · D).
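For instance, with purely illustrative figures (both B and D are assumptions), a table of B = 50,000 blocks
on a disk with an average access time of D = 5 ms would cost Cost(FullTableScan) = 50,000 · 5 ms = 250 s,
i.e., more than four minutes, which already hints at why the optimizer tries hard to avoid full scans of large
fact tables.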
Despite all the work that has been done on query optimization there are many questions for which we
do not have complete answers, even for the most simple, single-query optimizations involving only Re-
lational operators.7 Moreover, several advanced query optimization issues are active topics of research.
These include parallel, distributed, semantic, and aggregate query optimization, as well as optimization
with materialized views, and expensive selection predicates.
4.3 Indexing
Indexes are data structures that allow finding some record without scanning the whole dataset. They
are based on key-information pairs called entries. Usually, there is one such entry for each record
in the dataset, but not necessarily. The key is the value of one (or more) of the attributes in the record
(typically an identifier), and the information can be a pointer, the record itself, etc.
The most used structures to organize entries are B-trees and hash-based indexes. The former create an auxiliary
hierarchical structure that can be traversed from the root, making decisions in each node depending
on the keys, until entries are reached in the leaves. On the other hand, hash-based indexes divide the
entries into buckets according to some hash function over the key. Then, on looking for some key value,
we only need to apply the same hash function to that value to know in which bucket it is.
Usually, the indexed key corresponds to a single attribute, but nothing prevents us from using the
data coming from more than one (i.e., concatenating several attributes in a given order to compose
a somehow artificial key). Obviously, if we concatenate several values (e.g., in multidimensional fact
tables the key can be easily composed of more than four attributes), the entries, and hence the corre-
sponding index, will be larger (i.e., more expensive to maintain). For example, in case of building a
B-tree structure, it will have more levels and will be more costly to traverse. Also, the structure will
require updates more often, because now it involves more attributes that can potentially be changed.
Nevertheless, it can still be worthwhile, since searches are much more precise and using multiple attributes
helps to better discriminate the records. This can also be done by using independent mono-attribute
indexes, but these are more expensive to search.
The real problem with multi-attribute indexes is that the order of the indexed attributes is relevant.
Indeed, to compare the value of the n-th attribute, we need to fix the values of the n − 1 attributes con-
catenated before it. This is especially limiting in the case of multidimensional queries, where
the keys have many attributes and the user can define the search criteria on the fly (just one attribute
missing from the user query can invalidate the use of the index).
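As a hedged illustration (the index name is hypothetical, and the foreign key attributes are those of the
Sales fact table used in this chapter), such a composite index could be declared as:
CREATE INDEX salesDimIdx ON Sales (IDProduct, IDPlace, IDTime);
A query that fixes IDProduct and IDPlace can exploit it, but a query that only restricts IDTime cannot use
it efficiently, because the two attributes concatenated before IDTime are not fixed.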
4.3.1 Bitmap indexes
A list of pointers can be easily implemented as a fixed-length vector (i.e., without extra cost or space
for separators). Thus, given pointers of four bytes and a table of ten rows, we would require 40 bytes
(i.e., 320 bits) to index it. Alternatively, we can encode the pointers purely as a bit structure, where
each bit corresponds to the existence of a value in a row in the table, and we keep a list of bits for each
different value in the domain of the attribute. Figure 4.7 exemplifies this for products (at LHS) and
Spanish autonomous communities (at RHS). Thus, if we place all bits in a matrix, we have that columns
represent domain values and rows correspond to rows in the table. A bit is set when the corresponding
row has the value. A bit set in the first row of the first column means that the first row in the table has
value “Ballpoint” for this attribute, and not any other value whose corresponding bits are not set.
Besides these matrices being much smaller than the list of pointers (i.e., 80 bits for the LHS and only 40
for the RHS vs. 320 bits for the list of ten pointers of four bytes each), they are also very easy to maintain
by just switching bits. Also, evaluating complex predicates with conjunctions, disjunctions and negations
over bitmap indexes is equivalent to performing the same operations at the bit level, which is much easier
and more efficient than operating on lists of pointers.
8 RDBMSs automatically create a B-tree associated to every primary key declared.
From the point of view of query optimization, it is important to remember that we need to account
for the number of blocks being accessed. Therefore, being R the number of records in a block and SF
the selectivity factor of the query predicate, we can easily estimate the probability of accessing a block
as 1 − (1 − SF)^R. Multiplying this probability by the number of blocks in the table gives the expected
number of table blocks being accessed. To that number, we need to add the blocks required to store the
bitmap, which, as already said, is expected to be much smaller than a B-tree, but depends on the number
of different values in the predicate (the more values, the more lists of bits we need to retrieve).
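For instance, with purely illustrative figures (both R and SF are assumptions), if each block holds R = 50
records and the predicate has a selectivity factor SF = 1%, the probability of accessing a given block is
1 − (1 − 0.01)^50 ≈ 0.4, so we should expect to touch roughly 40% of the table blocks even though only 1%
of the records qualify.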
In general, we can say that bitmap indexes are better than B-trees for multi-value queries with com-
plex predicates. Indeed, they result in optimum performance for predicates with several conditions
over different attributes (each with a low selectivity), when the selectivity factor of the overall predicate
is less than 1%. Bitmap indexes are sometimes assumed to be useful only for point queries, but in some im-
plementations (e.g., Oracle), bitmaps can be used even for range queries. Without compression, bitmap
indexes would use more space than lists of pointers for domains of 32 values or more (assuming point-
ers of four bytes). However, we can usually assume that they require less space, due to compression. It
is because of this compression that they degrade performance in the presence of concurrent modifications
of data, since they require locking too many granules for every individual update (all granules compressed
in the same block are locked at once).
4.3.1.1 Compression
Assuming mono-valued attributes, notice that every row in each matrix can only have one bit set. Con-
sequently, in the whole matrix, we have as many bits set as rows, and the matrix turns out to be
really sparse (the more values, the sparser the matrix is).9 It is well known that the sparser a
matrix, the more compressible it is.
The first technique to encode the bit matrix is known as “bit-sliced”. It simply assigns an integer
value to each domain value (e.g., Barcelona could be 0, Tarragona could be 1, Lleida could be 2 and
Girona could be 3), and keeps this assignment in a lookup table. Then, we assign the corresponding
integer to each row depending on its value. The final matrix corresponding to the bitmap index would
be the binary encoding of those integers, as exemplified in Figure 4.8.
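For instance, with the four values above, two bits per row suffice (e.g., Barcelona = 00, Tarragona = 01,
Lleida = 10 and Girona = 11), so the ten-row table of the example needs only 10 · 2 = 20 bits, instead of the
40 bits of the plain bitmap or the 320 bits of the list of pointers.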
Alternatively to bit-sliced, we can also use run-length encoding. In this case, we define a run as a
sequence of zeros with a one at the end (i.e., 0 · · · 01). Then, we need two numbers to encode the run,
where the first number (i.e., content length) determines the size of the second (i.e., true content, which
corresponds to the run length). The important thing is that content length and content (a.k.a. run
length) need to use a different coding mechanism to be able to distinguish them.10 Indeed, the content
is simply a binary encoding of the run length (e.g., for a run of length five, we would encode it like 101),
and the length of the content is simply a list of ones ending in a zero (e.g., for the previous number of
9 As a rule of thumb, we should define bitmap indexes over non-unique attributes whose distinct values have hundreds of
repetitions (i.e., ndist < |T|/100).
10 Notice that using two codes is mandatory for variable length coding, because without knowing the beginning and end of
every number we could not disambiguate the coded bits (i.e., we need some separating mark, which in this case is the length of
the next coded number).
length three, we would encode such length like 110). Thus, after the encoding, a run 000001 of length
five would be encoded like 110101.
At this moment, it is important to realize that we can operate encoded bit-vectors without decom-
pressing them. Indeed, to perform a disjunction (a.k.a. OR), all we need to do is traverse both encoded
bit-vectors in parallel, generating in the output a row at the end of a run of either of them. Being l_A(i)
(respectively l_B(i)) the length of the i-th run of attribute A (respectively attribute B):
A OR B → {a | ∃x, a = Σ_{i=1}^{x} l_A(i)} ∪ {a | ∃x, a = Σ_{i=1}^{x} l_B(i)}
For the conjunction (a.k.a. AND), we also traverse both encoded bit-vectors in parallel, but now we
generate a row in the output when the end of a run coincides in both of them. Therefore:
A AND B → {a | ∃x ∃y, a = Σ_{i=1}^{x} l_A(i) = Σ_{i=1}^{y} l_B(i)}
4.3.1.2 Indirection
At this moment, it is important to point out that the bitmaps per se just give us the relative position
in the table of the desired rows. If rows were of fixed length, we could get the physical position by
just multiplying the offset by the row size. However, rows can actually be of variable length. Therefore,
if this is the case, we need some mechanism that translates row numbers into physical positions on the
disk.
The first simple possibility is creating an auxiliary index with the row number as key and the physi-
cal position as information in the entries. Although this is feasible and would still provide some benefits
over pure B-tree indexes, there is an alternative to avoid it. All we have to do is to determine and fix a
maximum number of rows in each disk block (usually known as the Hakan factor). Thus, we will assume
each block has such a maximum number of rows and we will index them as if they all existed. For example,
if we decide (based on statistical information of the table) that each block can have a maximum of five
rows, but the first one has only four, we would index five anyway (kind of assuming the existence of
a phantom row in the block). Obviously, this nonexistent row will not have any value, because it does not
really exist, so it will only generate more zeros in the bitmap index and hence longer runs (but never
more runs). Consequently, if we properly adjust the maximum number of rows per block, the effect of
assuming the existence of these fictitious rows on the size of the compressed bitmap index is minimal.
4.4 Join
Join is the most expensive operation we can find in process trees (together with sorting). That is why
we find many different algorithms implementing it: Row Nested Loops (a.k.a. Index Join), Block Nested
Loops, Hash Join, Merge Join, etc. We are now going to pay attention to the first two, as they are
the most basic ones.11
Row Nested Loops consists of simply scanning one of the tables (i.e., the outer loop), and then trying to
find matches row by row in the other table. In its simplest (also useless) version, it scans the second
table in an inner loop. It is obvious that scanning the second table for every row in the outer loop is
the worst option. Thus, an improvement of the algorithm is using an index (if one exists) instead of
the inner loop. This option is especially worthwhile when the outer loop has few rows, so we only need to go
through the index a few times.
At this point, it is important to remember that the disk access unit is the block and not the row.
Thus, we do not bring rows one by one in the outer loop, but R rows at once. Therefore, an
alternative improvement to the algorithm, known as Block Nested Loops, is to indeed scan the second
table in an inner loop, but only once per block (not once per row). Since the whole block is in memory,
we can try to match all its rows in a single scan. Actually, we can bring into memory as many blocks as
11 From here on, we assume they are binary operators, but there are also multi-way join algorithms available in many DBMSs.
they fit (not necessarily only one). Taking this to the extreme, we might bring the whole outer table into
memory and then scan the second one only once. To achieve this, and making use of the commutative
property of the join, we will always take the smallest table for the outer loop.
SELECT ...
FROM Sales f
WHERE f.prodId IN (SELECT d1.ID FROM Product d1 WHERE d1.articleName IN ('Ballpoint','Rubber'))
AND f.placeId IN (SELECT d2.ID FROM Place d2 WHERE d2.region='Catalunya')
AND f.timeId IN (SELECT d3.ID FROM Time d3 WHERE d3.month IN ('January02','February02'))
GROUP BY ...
4.4.1.1 Star-join
Multidimensional queries involve the join of a fact table against all its many dimension tables. However,
this can be hidden in the query, so that the pattern is not obvious. Thus, some semantic optimizers try
to transform the user query so that the main query corresponds to the fact table, and the access to the
dimension tables is encapsulated in independent subqueries. On finding this new shape (exemplified in
Figure 4.9), the physical optimizer will access the dimensions first and try to extract their IDs, to later use
them to go through indexes (potentially bitmaps) over the primary key of the fact table. The results of
the subqueries can even be temporarily kept in memory to avoid retrieving them more than once from
disk, if this were necessary.
4.4.1.2 Pipelining
When tables are really large (like in the case of fact tables), they do not fit in memory. Thus, the result
of every operation in the process tree should be stored in the disk, so that the next operator can take
over the processing. Clearly, writing and reading each intermediate result generates an extra cost that
we should try to avoid. Indeed, some join algorithms allow processing one row at a time. Thus, every
row can be pipelined in memory through a series of joins, as sketched in Figure 4.10. This way, the first
row of R will be joined against S, then the result against T, and so on and so forth until the end of the
process tree. When this first row has gone through all the joins, we can start processing the next row
and continue like that until we finish the whole table R. In doing it this way, we only need to have one
row of R in memory and avoid materialization of intermediate results. This does not work with all join
algorithms, but it does with Row/Block Nested loops. It is also important to notice that for this to really
work to full potential, the process tree must be left-deep like the one in the figure. In this case, the cost
of the query is given by the different access paths of the tables (i.e., that of their indexes if we consider
using Row Nested Loops).
4.4.1.3 Join-index
When talking about indexes containing entries of key-information pairs, we assumed that both the key
and the information correspond to a single table. Nevertheless, nothing prevents us from using keys in
one table (e.g., a dimension) to index data in another one (e.g., a fact table). Thus, we can use a single
index to enter through the dimensional values and then find the rows in the fact table (i.e., somehow
joining them). For example, we could index sales by the region where they took place. The key of the index
would be the attribute region of the location dimension and the information would be a list of pointers
(or a bitmap) to the corresponding sales in the fact table.
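In systems that support this structure (e.g., Oracle’s bitmap join indexes), a hedged sketch on the example
schema could look as follows; the index name is hypothetical, the exact syntax varies across products, and
additional requirements (such as a declared primary key on the dimension) may apply:
CREATE BITMAP INDEX salesRegionBjix
ON Sales (Place.region)
FROM Sales, Place
WHERE Sales.placeId = Place.ID;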
Multimedia Materials
Bitmap Indexes (en)
Chapter 5
Materialized Views
In the 60’s, before the Relational model, several alternatives co-existed to store data (e.g., hierarchical,
network, etc.). In all these cases, the user had to know the structure of files to be able to access the data,
and any change in those files had to be reflected in the applications using them. Thus, both applications
and data files were tightly coupled. Then, the big achievement of the Relational model was to provide
a higher level of abstraction, that made data management independent of how the data were physically
stored.
By that time, ANSI1 created the SPARC2 Study Group on DataBase Management Systems. The main con-
tribution of that group was to propose a DBMS architecture (see [Jar77]). This architecture, sketched
in Figure 5.1, defined three different levels for DBMSs to implement. At the RHS, we had the physical
one, corresponding to files and data structures like indexes, partitions, etc. To its left lay the tables ac-
cording to E. Codd’s Relational abstraction. Finally, at the LHS, different views could be defined to provide
semantic relativism (i.e., each user can see the data from her viewpoint, namely terminology, format,
units, etc.).
The ANSI/SPARC architecture provided, on the one hand, logical independence (i.e., changes in the ta-
bles should not affect the views from the users’ perspective) and, on the other hand, physical independence
(i.e., changes in the files or data structures should not affect the way to access data). This Relational fea-
ture was really important, because it made a difference with regard to its predecessors. In this way, views
are like windows to a database that provide access to only a portion of the data that is either of interest
or related to an application, and at the same time allow those data to be automatically reshaped according
to user needs. However, applications do not make any distinction between tables and views, since they
can be queried in exactly the same way.
Nevertheless, internally, a table has a schema (i.e., name and attributes), while a view has a schema
but also a query that defines its content with regard to that of some tables (this is why views are some-
times called “derived relations” or “named queries”). Thus, the existence of views is linked to that of the
corresponding tables, sometimes referred to as “base tables” (e.g., you cannot drop a table that has
views defined over it).
1. View expansion (i.e., transform a query over views into a query over base tables)
2. Answering queries using views (i.e., transform a query over base tables into one over MV)
3. View updating or View maintenance (i.e., propagate changes in the base tables to the correspond-
ing MVs)
4. Update through views (i.e., propagate changes expressed over views to the corresponding base
tables)
One option, when the base tables change, is to recompute the view definition from scratch and replace
the whole current (outdated) view content. Nevertheless, it should also be obvious that this is not worthwhile
when only one single row has changed and you can modify it individually, leaving the rest unaffected.
The former is known as a “complete” update, while the latter is called “incremental”. Incremental update
is always possible (even deferred), as long as you keep track (in a separate file/table/log) of all the
information required to do it. Therefore, the problem is to decide which option is the best. When all
tuples have changed, it is better to perform a complete update; however, when very few have changed, an
incremental one is more efficient. The problem is then to decide, in the general case, when to use one
or the other, depending on the estimated number of tuples affected by changes.
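As a hedged sketch (the exact syntax and refresh options vary significantly across DBMSs; the Oracle-
flavoured clauses below are only indicative, and a fast refresh would additionally require materialized view
logs on the base tables), a materialized view and its refresh policy could be declared as:
CREATE MATERIALIZED VIEW SalesPerRegion
REFRESH FAST ON COMMIT      -- incremental update, propagated with every commit
AS SELECT d.region, SUM(f.items) AS totalItems, COUNT(*) AS numRows
   FROM Sales f, Place d
   WHERE f.IDPlace = d.ID
   GROUP BY d.region;
-- Alternatively, REFRESH COMPLETE ON DEMAND would recompute the whole content
-- only when explicitly requested.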
Resources are always finite; thus, even if there are many queries whose materialization would potentially
improve performance, often it is not possible to materialize them all. It could simply be a lack of disk space, but
usually it is rather a lack of time to keep all of them up to date. Just realize that the resources (a.k.a. time)
we can devote to maintaining MVs are limited.3 Figure 5.6 depicts the relationship between performance
gain and space used in view materialization (we could draw the same for update time instead
of MBytes in the horizontal axis). We should interpret this in the sense that, by devoting a few MBytes
to view materialization, we can improve performance a lot. However, spending more and more space
(also time) is only going to result in marginal performance improvements. This does not mean that ma-
terializing any view magically improves performance. We need to wisely find the ones maximizing the
impact.
Only considering the usage of a GROUP BY clause, the number of candidate views to be materialized
is exponential in the number of attributes of the table (and this is just worsened by also considering the
WHERE clauses of the queries). Consequently, an exhaustive search is simply impossible. Instead, we should use
heuristics and greedy algorithms to choose the MVs (see [GR09]). For example, given a set of queries whose
performance we want to optimize, we should only consider views that have exactly the same GROUP BY
clause as some of those queries, or the union of some of them.
Multimedia Materials
Materialized View Problems (en)
3 The available time to maintain MVs is usually called “update window”, and inside it, users are banned from querying.
Chapter 6
ETL
There are many situations in which we have to move data from one database to another, which im-
plies Extracting the data, then potentially performing some Transformation to finally Load them in the
destination. Some situations where we can find this are:
• Restructure data to be used by other tools
• Transactional data safekeeping
• Use multiple sources together
• Improve data quality by removing mistakes and completing missing data
• Provide measures of confidence in data (e.g., filtering out erroneous records)
This kind of data-intensive flow is so common that it has created the need for specialised tools like
Microsoft Data Transformation Services1 , Informatica Power Center2 , Pentaho Data Integration3 , and
many others.
6.2 Definition
As we said, we can find ETL flows in many situations and systems, but we should focus now on DW.
If we pay attention to the different architectures in Chapter 2, it is easy to see that the ETL layer is
explicit in the case of the two- and three-layer architectures. However, it is also implicit in some other places,
including the single-layer architecture. In Fig. 2.2, we see it between the sources and the DW, but there is also
data movement between the DW and the DMs. In Fig. 2.3, it is again explicit between the sources and the
Reconciled data, but there is also data movement from this to the DW and then to the DM. Finally, it is
not that obvious, but some data movement is implicit in the virtual DW layer of Fig. 2.1, too.
These different possibilities are generalized in Fig. 6.1, where we distinguish between ETL (cor-
responding to the two- and three-layer architectures) and ETQ, where the data are not loaded anywhere, but
directly Queried by the final user without any materialization (corresponding somehow to the single-layer
architecture). Even more generally, we can also find some authors and software providers proposing and
advocating an ELT variant, where extracted data are firstly loaded into a powerful DBMS and then
transformed inside it using ad-hoc SQL statements.
6.2.1 Extraction
Some data tasks require multiple and heterogeneous sources, which can reside on different storage supports
(e.g., hard disk, Cloud), in different formats (e.g., Relational, JSON, XML, CSV), from different origins (e.g.,
transactional systems, social networks), and with different gathering mechanisms (e.g., digital, manual).
So, it will be the first task of the ETL to resolve all these differences as data moves through the pipe.
Besides pure formatting issues, since our DW must be historical, it is especially important to also pay
attention to the differences in the temporal characteristics of the sources:
Transient: The source is simply non-temporal, so if we do not poll it frequently, we’ll miss the changes
(any change between two extractions will be lost).
Semi-periodic: The source keeps a limited number of historical values (typically the current and the
previous one), which gives us some margin to space extractions, so reducing the disturbance to
the source.
Temporal: The source keeps an unlimited number of historical values (but may still be purged period-
ically), which gives complete freedom to decide the frequency of extraction.
Fig. 6.2 sketches the different mechanisms we have to extract data from a DBMS:
a) Application-assisted: Implies that we modify the application and intercept any change of the data
before it goes to the DBMS.
66
A. Abelló and P. Jovanovic Data Warehousing and OLAP
1) This can obviously be done by modifying the code (e.g., PHP), and injecting the required instruc-
tions to push any change in the data to the ETL pipe of the DW. The obvious problem of this
approach is that it will be expensive to maintain (any application accessing the DBMS must be
modified) and hardly sustainable.
2) A more elegant alternative is to modify the Call Level Interface (i.e., the JDBC or ODBC driver) so
that every time a modification call is executed, it is replicated in the ETL pipe of the DW.
Obviously, this is more sustainable than the previous option, but it can be harder to implement and
sometimes expensive to execute (interactions with the DBMS are more expensive and this would
still impact the application performance).
b) Trigger-based: If the DBMS provides triggers, we can use them to intercept any change in the
database and propagate it to the DW (a minimal sketch of this approach is shown after this list). The
limitation is obviously that not all DBMSs provide triggers, and they may also turn out to be too
expensive.
c) Log-based: Any DBMS provides durability mechanisms in the form of either logs or incremental
backups. This can easily be used to efficiently extract the data. The main problem is that depending
on the software provider, it may be using proprietary mechanisms and formats, which are hard to
deal with.
d) Timestamp-based: Some databases are temporal (or even bi-temporal) and already attach times-
tamps to any stored value (or can be easily modified to do so). If this is allowed and does not generate
much overhead, it would be the best option, because it offers maximum flexibility to the extraction.
However, it requires having control over the source and the power to impose the modification.
e) File comparison: A simple option is to perform periodic bulk extractions of the database (a.k.a.
backups). The first time, we would push all of it to the DW and keep a copy in some staging area.
From there on, every new extraction would compare against the previous one in the staging area to
detect the changes and only push those to the DW.
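As announced in option (b), a minimal sketch of trigger-based change capture could look as follows
(MySQL-flavoured syntax; the change table, its columns and the Sales attributes are assumptions introduced
only for the example). The ETL process then periodically reads and empties SalesChanges instead of
touching the operational Sales table directly:
CREATE TABLE SalesChanges (
  changeType CHAR(1),       -- 'I' for insert, 'U' for update, 'D' for delete
  changeTime TIMESTAMP,
  saleId     INTEGER,
  items      INTEGER
);
CREATE TRIGGER salesToETL
AFTER INSERT ON Sales
FOR EACH ROW
  INSERT INTO SalesChanges
  VALUES ('I', CURRENT_TIMESTAMP, NEW.saleId, NEW.items);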
Besides these technical issues, when alternative sources exist, we should also consider the interest
and the quality of the sources in the choice. Firstly, a source should always be relevant to decision
making (i.e., it should provide some data to either a dimension or a fact table). In general, having more
data does not mean making better decisions (it can simply be harder or even confusing if data are
irrelevant or wrong)6 . On the other hand, it should not be redundant, because that would simply be a
waste of resources. If there are some alternative sources for the same data, we should choose the one
6 https://dangerousminds.net/comments/spurious_correlations_between_nicolas_cage_movies_and_swimming_pool
of highest quality in terms of completeness, accuracy, consistency and timeliness. A good reason to
extract the same data from more than one source would be that we can obtain higher quality from them
together.
6.2.2 Transformation
The second part of the ETL consists of standardizing data representation and eliminating errors in
data. It can be seen in terms of the following activities, not necessarily in this order (actually, due to
its difficulty, its development is an iterative process estimated to take 80% of the knowledge discovery
process):
Selection has the objective of keeping the same analytical power (either describing or predicting our
indicators), but with much less data. Obviously, we can decide not to use some source, but even
when taking one source, we can use it partially by reducing:
a) Length (i.e., remove tuples) by sampling (might be interesting not to remove outliers), aggre-
gating (using predefined hierarchies) or finding a representative for sets of rows (i.e., cluster).
b) Width (i.e., remove attributes) by eliminating those that are correlated, performing an analysis
of significance, or studying the information gain (w.r.t. a classification).
Integration has the purpose of crossing independent data sources, which potentially have semantic
and syntactic heterogeneities. The former requires reshaping the schema of the data, while the latter
is solved by transforming their character set and format.
Cleaning includes generating data profiles, splitting the data in some columns and standardizing values
(e.g., people or street names). However, the main purpose is to improve data quality in terms of:
• Completeness by imputing a default value (or even manually assigning different ones if not
many are missing), which can be either a constant, some value coming from a complementary
source/lookup table, the average/median/mode of all the existing ones, the average/median/mode
of the corresponding class, or the one maximizing the information gain (a minimal SQL sketch
of such an imputation is given after this list).
• Accuracy by detecting outliers performing some variance analysis, evaluating the distance to
a regression function, or identifying instances far from any cluster.
• Correctness, by checking constraints and business rules, or matching dictionaries/lookup
tables.
Feature engineering derives new characteristics of the data that are expected to have more predictive
power.
Preparation sets the data ready for a given algorithm or tool. In some cases, this requires transforming categorical attributes into numerical ones (just creating some encoding), but in others it is the other way around and numerical attributes need to be discretized (e.g., by intervals of the same size, intervals of the same probability, clustering, or analysis of entropy). Some algorithms are also affected by differences in scales, so numerical attributes need to be normalized (e.g., dividing by the maximum $\frac{x}{max}$, dividing by the domain size $\frac{|x-min|}{|max-min|}$, dividing by the standard deviation $\frac{x-\mu}{\sigma}$, or simply dividing by some power of ten $\frac{x}{10^j}$). Some other algorithms require the transformation of data into metadata by pivoting and converting rows into columns. A small sketch of some of these cleaning and preparation steps is given below.
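A minimal sketch using pandas on a tiny, made-up sales table (the column names, the imputation choices and the number of bins are all assumptions for illustration):

import pandas as pd

df = pd.DataFrame({
    "region": ["N", "N", "S", "S", None],
    "amount": [100.0, None, 250.0, 300.0, 120.0],
    "rating": [1, 5, 3, 4, 2],
})

# Cleaning/completeness: impute the mode for the class attribute and the class mean for the measure.
df["region"] = df["region"].fillna(df["region"].mode()[0])
df["amount"] = df["amount"].fillna(df.groupby("region")["amount"].transform("mean"))

# Preparation: min-max normalization of a numerical attribute ...
df["amount_norm"] = (df["amount"] - df["amount"].min()) / (df["amount"].max() - df["amount"].min())

# ... and discretization into equal-width intervals.
df["rating_bin"] = pd.cut(df["rating"], bins=3, labels=["low", "medium", "high"])

print(df)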
6.2.3 Load
The last phase is loading the data into the target of the flow7 . A typical technique to facilitate this in a DW environment is the concept of an update window. This is a period of time (typically at night or on non-working days) during which analysts cannot access the DW (so they cannot interfere with the loading). Separating user queries from ETL insertions, we firstly save the overhead of concurrency
7 As previously said, we could also send it directly to the user ready to be consumed in an ETQ
control mechanisms, which can be a significant gain by itself, but it also allows us to disable materialized view updates and index maintenance (since users are not using them). Once the load is over, we can rebuild all indexes and update all the materialized views in batch, which is much more efficient than doing it incrementally and intertwined with insertions.
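As a sketch of how such an update window can be scripted (here against PostgreSQL through psycopg2; the connection parameters, table, index and materialized view names are invented for illustration, and a real job would add error handling and scheduling):

import psycopg2

conn = psycopg2.connect("dbname=dw user=etl")  # hypothetical connection parameters
conn.autocommit = True
cur = conn.cursor()

# 1. The update window starts: analysts are not querying, so secondary indexes can be dropped.
cur.execute("DROP INDEX IF EXISTS idx_sales_date;")

# 2. Bulk load the transformed data from the staging area.
with open("staging/sales_delta.csv") as f:
    cur.copy_expert("COPY sales FROM STDIN WITH (FORMAT csv, HEADER true)", f)

# 3. Rebuild the index and refresh the materialized views in batch.
cur.execute("CREATE INDEX idx_sales_date ON sales (sale_date);")
cur.execute("REFRESH MATERIALIZED VIEW mv_sales_by_month;")

cur.close()
conn.close()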
• Hand-coded ETL. As previously discussed, to implement an ETL we do not really need an ETL tool; it can also be done programmatically with a skillful programming team. The advantage of this is that we are not limited by the features of a tool, and we can reuse legacy routines as well as know-how already available. A good example of this approach lately is MapReduce, which facilitates processing schemaless raw data in read-once datasets, cooking them before they are loaded into a DBMS for further processing. However, in general, such a programmatic approach only works for relatively small projects. If the project has a certain volume and requires some sophisticated processing, manual encoding of data transformations is not a good idea, especially from the point of view of maintenance and sustainability in the long term.
• ETL tools. Quoting Pentaho, “The goal of a valuable tool is not to make trivial problems mundane, but to make impossible problems possible”. An ETL tool, like Pentaho Data Integration, cloverETL, JasperETL, or Talend, firstly offers a GUI that facilitates the encoding of the flows. Moreover, such tools provide some metadata management functionalities and make it easy to handle complex data type conversions as well as complex dependencies, exceptions, data lineage and dependency analysis. They also facilitate common cumbersome tasks like scheduling processes, failure recovery and restart, and quality handling.
In addition, unlike other business processes, the ETL process has an important quality dimension related to the quality of data. You can find more details about data quality in Chapter 7, while we dedicate this section to ETL process quality dimensions. In particular, we emphasize the following important ETL process quality dimensions, while a more extensive list can be found in [The17].
• Performance. Time behavior and resource efficiency are the main aspects that have traditionally been examined as optimization objectives for data processing tasks (e.g., query optimization). Currently, there is almost no, or only very limited, automation of ETL process optimization in ETL tools; it mainly relies on the support provided by the data source engines (e.g., RDBMS, Apache Pig) and only covers part of the ETL process flow (e.g., pushing Relational algebra operators to the data source engine whenever possible). In addition, ETL tools like Pentaho Data Integration allow manual tuning of the ETL process, by configuring the resource allocation (memory heap size) or parallelizing ETL operations, a solution limited by the hardware capacity of the host machine. However, many research attempts have explored the topic of optimizing the complete ETL process, building upon the fundamental theory of query optimization [SVS05]. The main limitation is still the complex (“black-box”) ETL operations (those that cannot be expressed using Relational algebra).
• Reliability. Reliability of ETL processes represents the probability that an ETL process will perform its intended operation during a specified time period under given conditions. At the same time, in the presence of a failure, the process should either resume accordingly (recoverability) or be immune to the error that occurred (robustness and fault tolerance) [SWCD09]. Fault tolerance, for example, can be improved either by replicating the flow (i.e., running in parallel multiple identical instances of a flow) or by flow redundancy (i.e., providing multiple identical instances of a flow and switching to one of the remaining instances in case of a failure). For recoverability, the most usual technique is to introduce recovery checkpoints (i.e., persisting data on disk after heavy operations, such that in the case of a failure we can restart the ETL process execution from the latest checkpoint, partially reusing already processed data).
• Auditability. Auditability represents the ability of the ETL process to provide data and business
rule transparency [The17]. This includes testability, or the degree to which the process can be
tested for feasibility, functional correctness and performance prediction; and traceability which
includes the ability to trace the history of the ETL process execution steps and the quality of
documented information about runtime. Well-documented ETLs (e.g., by means of providing
conceptual and logical design) enable better auditability of the entire process and easier testing of
its specified functionalities.
• Maintainability. This is a hard-to-quantify and usually overlooked dimension of ETL process quality, which consequently increases the later development cost and the overall ETL project cost in general. Conceptual and logical designs enable better documented ETL processes, while many ETL tools also integrate ETL process documentation to allow better maintainability of the project. Potential metrics for the maintainability dimension are the size of an ETL process (number of operations and number of data sources) and its modularity (e.g., atomicity of its operations).
• Data flow is in charge of performing operations over the data themselves in order to prepare them for loading into a DW (or directly for exploitation by end users). That is, data extraction (reading from the data sources), various data transformation tasks (data cleaning, integration, format conversions, etc.), and finally loading of the data into previously created target data stores of the DW. Data flows are typically executed as a pipeline of operations (rather than a set of strictly sequential steps).
• Control flow, on the other side, is responsible for orchestrating the execution of one or more data flows. It does not work directly with data, but rather manages the execution of data processing (i.e., scheduling, starting, checking for possible errors that occurred during the execution, etc.). Unlike in a data flow, in a control flow the order of execution is strictly defined by the sequence of activities, meaning that one activity does not start its execution until all its input activities have finished. This is especially important in the case of dependent data processing, where the results of one data flow are needed before starting the execution of another.
Staging area. Typically, as a complement to the ETL tool, we need to devise a staging area where to place temporary files. This facilitates recoverability, backup of the processes, and auditing. The purpose of this area is only to support the ETL processing, and it can contain anything from plain files, through XML or JSON, to more complex Relational tables.
Multimedia Materials
ETL Operations (en)
ETL Operations (ca)
Operation Level Operation Type Pentaho Data Integration Talend Data Integration SSIS Oracle Warehouse Builder
Add constant
Constant Operator
Formula Character Map
tMap Expression Operator
Number ranges Derived Column
Attribute Attribute Value Alteration tConvertType Data Generator
Add sequence Copy Column
tReplaceList Transformation
Calculator Data Conversion
Mapping Sequence
Add a checksum
Unique Rows
Duplicate Removal tUniqRow Fuzzy Grouping Deduplicator
Unique Rows (HashSet)
Sort Sort Rows tSortRow Sort Sorter
Dataset
Reservoir Sampling Percentage Sampling
Sampling tSampleRow
Sample Rows Row Sampling
Group by tAggregateRow
Aggregation Aggregate Aggregator
Memory Group by tAggregateSortedRow
Dataset Copy tReplicate Multicast
Duplicate Row Clone Row tRowGenerator
tFilterRow
Filter Rows
Filter tMap Conditional Split Filter
Data Validator
Entry tSchemaComplianceCheck
Merge Join
Stream Lookup
Database lookup tJoin Merge Join Joiner
Join
Merge Rows tFuzzyMatch Fuzzy Lookup Key Lookup Operator
Multiway Merge Join
Fuzzy Match
Router Switch/Case tMap Conditional Split Splitter
Set Operation - Intersect Merge Rows (diff) tMap Merge Join Set Operation
Set Operation - Difference Merge Rows (diff) tMap Set Operation
Merge
Set Operation - Union Sorted MergeAppend streams tUnite Set Operation
Union All
Set field value
Set field value to a constant
String operations
Strings cut Derived Column Constant Operator
tMap
Replace in string Character Map Expression Operator
Attribute Addition tExtractRegexFields
Formula Row Count Data Generator
tAddCRCRow
Schema Split Fields Audit Transformation Mapping Input/Output parameter
Concat Fields
Add value fields changing sequence
Sample rows
Datatype Conversion Select Values tConvertType Data Conversion Anydata Cast Operator
Attribute Renaming Select Values tMap Derived Column
Projection Select Values tFilterColumns
tDenormalize
Pivoting Row Denormalizer Pivot Unpivot
Relation tDenormalizeSortedRow
Row Normalizer tNormalize
Unpivoting Unpivot Pivot
Split field to rows tSplitRow
If field value is null Constant Operator
Null if tMap Expression Operator
Value Single Value Alteration Derived Column
Modified Java Script Value tReplace Match-Merge Operator
SQL Execute Mapping Input/Output parameter
CSV file input ADO .NET / DataReader Source
Table Operator
Microsoft Excel Input tFileInputDelimited Excel Source
Flat File Operator
Source Operation Extraction Table input tDBInput Flat File Source
Dimension Operator
Text file input tFileInputExcel OLE DB Source
Cube Operator
XML Input XML Source
Text file output Dimension Processing
tFileOutpu Table Operator
Microsoft Excel Output Excel Destination
tDelimited Flat File Operator
Target Operation Loading Table output Flat File Destination
tDBOutput Dimension Operator
Text file output OLE DB Destination
tFileOutputExcel Cube Operator
XML Output SQL Server Destination
Chapter 7
Data Quality
Following from the general definition of quality, data are considered of high quality if they are fit for their intended use [BS16], meaning that the level of quality considered acceptable depends on the real needs of the end user and the purpose of the underlying data.
Data quality is sometimes wrongly reduced just to accuracy (e.g., name misspellings, wrong birth dates). However, other dimensions such as completeness, consistency, or timeliness are often necessary in order to fully characterize the quality of information.
• Data ingestion. While data are being initially retrieved, accuracy issues can be introduced due to faulty data conversions or simply by typos in manual data entry. Similarly, data timeliness can be affected if the data are fetched periodically in the form of batch feeds, while consuming them through real-time interfaces implies expecting high freshness of the data. If data are fetched by consolidating various systems, consistency errors may occur due to the heterogeneity of these source systems.
• Data processing. New errors can also be introduced while data are being processed for their final use. For instance, process automation may lead to overlooking some special cases and thus applying a default processing that results in inaccurate outputs (e.g., a missing case in a conditional statement in the code can lead to a default option). Paradoxically, new accuracy issues and other data imperfections can also be introduced during data cleansing actions (e.g., replacing missing values with default constants or precalculated aggregates like means or medians). Finally, we can also affect the completeness and consistency of our data through incautious data purging actions, by mistakenly deleting some of the data values.
• Inaction. Lastly, data quality can also be largely affected by not taking the necessary actions to prevent quality problems from occurring. For instance, our data can become inconsistent if the changes that occur in one place are not properly propagated to the rest of the data related to it. Also, various inaccuracies can occur if new system upgrades are not properly controlled, especially those that lack backward compatibility (e.g., data types or operations being changed or removed from the DBMS). Generally, the same data that previously were considered of high quality can become flawed if a new use of the data is considered (see the definition of quality above as data fitting their intended use).
Data conflicts can be classified from two orthogonal perspectives: (a) based on the level at which they occur (i.e., schema vs. instances), and (b) whether they occur within a single source or among multiple sources (see Figure 7.1).
Schema vs. instances. Schema-level conflicts are caused by errors in the design of the data schemata (e.g., duplicates caused by the lack of a unique constraint, inconsistencies due to mistakes in the referential integrity, structural schema conflicts among various data sources). Instance-level conflicts refer to errors in the actual data contents, which are typically not visible nor preventable at schema design time (e.g., misspellings, redundancy or data duplicates, and inconsistencies among various data sources). Both schema- and instance-level conflicts can be further differentiated based on the scope at which they occur: within an individual attribute, between attributes of a single record, between different records of a certain record type, and between different records of different record types.
Single- vs. Multi-source. The number of single-source data conflicts obviously largely (but not only) depends on the degree to which the source is governed by schema constraints that control permissible data values. Therefore, data sources that do not have native mechanisms to enforce a specific schema, or that are more flexible about it (e.g., CSV files, JSON documents), are more prone to errors and inconsistencies. All the errors potentially present in a single source are aggravated when multiple sources need to be integrated. Data may already come with errors from single sources, but they may additionally be represented differently, overlap, or even contradict each other among various data sources. These problems are typically caused by the fact that the source systems are developed independently.
Conflicts in data can be either prevented from occurring or corrected once they occur. Obviously, preventing data conflicts is preferable due to the typically high costs of data cleaning. This requires proper schema design with well-defined data constraints or strict enforcement of constraints (e.g., by limiting user errors in the graphical user interface). However, in some cases, and almost always in the case of multiple sources, a posteriori data cleaning actions are necessary to guarantee a required level of data quality for a particular analysis. Such actions may involve very complex sets of data transformations and are typically part of extract-transform-load (ETL) processes (see Chapter 6).
7.3.1 Completeness
Data completeness is most typically related to the issue of missing or incomplete data, and as such it is considered one of the most important data quality problems.
The most common interpretation of data completeness is the absence of null values or, more precisely, the ratio of non-null values to the total number of values. Such an interpretation can easily be represented as a measure. First, at the level of a single attribute (Ai) we have:
\[ Q_C(A_i) = \frac{|R(A_i \neq \text{null})|}{|R|} \]
Analogous ratios can be defined for a whole Relation. Beyond the mere absence of nulls, completeness can also be interpreted
as the degree to which a given dataset describes the corresponding set of real-world objects. In this case, measuring the completeness can be more difficult, as it may require additional metadata or cross-checking with the real world (e.g., possibly by sampling). Lastly, another aspect of completeness refers to whether a real-world property is represented as an attribute in a dataset or not. However, assessing this aspect also requires (often manual) comparison with the real world.
7.3.2 Accuracy
The quality of data can also be affected by measurement errors, observation biases, or improper representations. This aspect is covered by the accuracy dimension, which can be defined “as the extent to which data are correct, reliable, and certified free of errors”. The meaning of accuracy is application-dependent, and it can be expressed as the distance from the actual real-world value or the degree of detail of an attribute value. For instance, a product sales volume variable in our data warehouse can have the value 10.000$, while the value in reality (e.g., in the accounting system) is 10.500$. While in general this can be interpreted as inaccurate, in another use case, where the user is only interested in binary sales categories (low: ≤ 20.000$, high: > 20.000$), the value in the data warehousing system can be considered accurate enough.
To assess the accuracy in the general case, we need to calculate the distance $e_{A_i}$ of the value stored in our database from the actual real value.
In the case of a numerical attribute, such a distance is simply an arithmetic difference, while for other data types it may require a more complex calculation. For example, the Hamming distance calculates the number of positions at which two strings of the same length differ, while its more complex version, the Levenshtein (or Edit) distance, calculates the minimal number of character-level operations (insertions, deletions or substitutions) required to change one string into another.
Regardless of the way $e_{A_i}$ is calculated, we further assess the accuracy of an attribute as follows.
\[ Q_A(A_i) = \frac{|R(e_{A_i} \leq \varepsilon)|}{|R|} \tag{7.4} \]
Notice that the threshold ε is application specific and it determines the level of tolerance for the
accuracy of a certain attribute.
As before, we can further apply this to the whole Relation as follows:
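A plausible way to aggregate this over the whole Relation, assuming a simple unweighted average over its n attributes (the exact weighting is an assumption here), is:

\[ Q_A(R) = \frac{1}{n} \sum_{i=1}^{n} Q_A(A_i) \]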
7.3.3 Timeliness
Another problem that data may suffer from is being outdated, meaning that they were captured/extracted from the original source potentially before new changes in the source occurred. This dimension is also referred to as freshness. The level of timeliness also depends on the specific application the data are used for. For example, last year’s accounting data may be considered “old”, but if the users are actually auditing the company’s last year’s operations, these data are just what they need.
On the other hand, the real “age” of the data is also determined by the frequency of updates that may occur over the data since the last capture/extraction. For example, a system that uses the UN’s population and migration indicators may have data that are months old, but they may be considered “fresh” given that the UN publishes these indicators annually.
We thus need to consider both the value’s age (i.e., calculated from the time when the datum is committed in the database, a.k.a. Transaction Time1 , as age(v) = now − transactionTime) and its frequency of updates per time unit (fu(v)) in order to assess its timeliness.
\[ Q_T(v) = \frac{1}{1 + f_u(v) \cdot age(v)} \tag{7.6} \]
We can further extend this to an attribute (Ai ), as:
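A plausible instantiation, assuming an unweighted average of the timeliness of all values of the attribute over the tuples of Relation R, is:

\[ Q_T(A_i) = \frac{1}{|R|} \sum_{v \in A_i(R)} Q_T(v) \]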
7.3.4 Consistency
Consistency refers to the degree to which data satisfy defined semantic rules. These rules can be integrity constraints defined in the DBMS (i.e., entity - “a primary key of a customer cannot be null”, domain - “the customer age column must be of type integer”, referential - “for each order a customer record must exist”), or more complex business rules describing the relationship among attributes (e.g., “a patient must be female in order to have the attribute pregnant set to true”), typically defined in terms of check constraints or data edits2 . Recent research approaches also propose the automatic derivation of such rules from training data by applying rule induction.
Given that B is a set of such rules for Relation R, we can calculate the ratio of tuples that satisfy all the rules as:
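A plausible formulation, following the same pattern as the previous dimensions (the exact notation is an assumption), is:

\[ Q_{cons}(R, B) = \frac{|\{t \in R \mid t \text{ satisfies every rule in } B\}|}{|R|} \]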
1. Timeliness ⇐⇒ (Accuracy, Completeness, and Consistency). Having accurate (or complete or consistent) data may require either checks or data cleaning activities that take time and can thus negatively affect the timeliness dimension. Conversely, timely data may suffer from accuracy (or completeness or consistency) issues because their freshness was favored. Such trade-offs must always be made depending on the application. For instance, in some Web applications, timeliness is often preferred to accuracy, completeness and consistency, meaning that it is more important that some information arrives faster to the end user, even though it may initially have some errors or not all fields completed.
2. Completeness ⇐⇒ (Accuracy and Consistency). In this case, the choice to be made is whether it is “better” (i.e., more appropriate for a given domain) to have less but accurate and consistent data, or to have more data but possibly with errors and inconsistencies. For instance, for social statistical analysis it is often required to have significant and representative input data, thus we would favor completeness over accuracy and consistency in such a case. Conversely, when publishing scientific experiment results, it is more important to guarantee their accuracy and consistency than their completeness.
1 https://en.wikipedia.org/wiki/Transaction_time
2 https://en.wikipedia.org/wiki/Data_editing
• Inclusion dependency is maintained by defining a referential integrity constraint (foreign key) between the attributes of two database tables. Intuitively, if Y is a set of attributes that defines the primary key of table S and X is a set of attributes that defines a foreign key in table R that references S.Y, it is guaranteed that the set of tuples over attributes R.X is subsumed by the set of tuples over attributes S.Y (i.e., R.X ⊆ S.Y).
• Liveliness: A table/view is lively if there is at least one consistent DB state, so that the table/view
contains tuples.
• Redundancy: An integrity constraint is redundant if the consistency of the DB does not depend on it (none of the tuples it tries to avoid can ever exist).
• State-reachability: A given set of tuples is reachable if there is at least one consistent DB state
containing those tuples (and maybe others).
• Query containment (subsumption): A query Q1 is contained in another Q2, if the set of tuples of
Q1 is always contained in the set of tuples of Q2 in every consistent DB state.
The process starts by identifying rule imperfections, followed by the analysis of such findings and the search for patterns in the data. In the third step, the rules are enhanced to eliminate as many imperfections as possible. These three steps are executed iteratively until we are satisfied with the enhancement or we simply run out of resources. Such a process largely relies on manually verifying
samples of real-world data and comparing them with the errors identified by applying the rules. In the final step, we handle the remaining imperfections, i.e., those that could not be eliminated but that we are at least aware of and can account for in the data quality assessment.
1. Acquisition of new data, which improves data by acquiring higher-quality data to replace the
values that raise quality problems;
3. Entity resolution (or record linkage), which identifies if the data records in two (or multiple)
tables refer to the same real-world object;
4. Data and schema integration, which define a unified view of the data provided by heterogeneous
data sources. Integration has the main purpose of allowing a user to access the data stored by
heterogeneous data sources through a unified view of these data. Data integration deals with
quality issues mainly with respect to two specific activities:
• Quality-driven query processing is the task of providing query results on the basis of a quality
characterization of data at sources.
• Instance-level conflict resolution is the task of identifying and solving conflicts of values refer-
ring to the same real-world object.
5. Source trustworthiness, which selects data sources on the basis of the quality of their data;
6. Error localization and correction, which identify and eliminate data quality errors by detecting the records that do not satisfy a given set of quality rules. These techniques are mainly studied in the statistical domain. Compared to elementary data, aggregate statistical data, such as averages, sums, or maximums, are less sensitive to possibly erroneous probabilistic localization and correction of values. Techniques for error localization and correction have been proposed for inconsistencies, incomplete data, and outliers.
7. Cost optimization, defines quality improvement actions along a set of dimensions by minimizing
costs.
Table 7.1: Example of three national registries (agencies) that represent the same business [BS16]

Agency 1 | Identifier: CNCBTB765SDV | Name: Meat production of John Ngombo | Type of activity: Retail of bovine and ovine meats | Address: 35 Niagara Street | City: New York
Agency 2 | Identifier: 0111232223 | Name: John Ngombo canned meat production | Type of activity: Grocer’s shop, beverages | Address: 9 Rome Street | City: Albany
Agency 3 | Identifier: CND8TB76SSDV | Name: Meat production in New York state of John Ngombo | Type of activity: Butcher | Address: 4, Garibaldi Square | City: Long Island
The example in Table 7.1 shows that the same object in the real world (i.e., the same business) can be represented in different manners in three national registries (i.e., Agency 1, Agency 2, and Agency 3). Some differences, like the different identifiers, may simply be due to the different information systems from which these tuples come. Other attributes like name, type of activity, address, and city also differ (although with some similarities), and this can be due to several reasons, like typos, deliberately false declarations, or data updated at different times.
The high-level overview of the object identification process is depicted in Figure 7.3. Assuming for
simplicity two input data sources (A and B), the process includes the following activities:
1. Pre-processing has as a goal to normalize/standardize the data and correct evident errors (e.g.,
conversion of upper/lower cases) and reconcile different schemata.
2. Search space reduction. Performing entity resolution on the entire search space (i.e., the Cartesian product of the tuples in the inputs) would result in complexity O(n^2), n being the cardinality of the input Relations. To make this process tractable, we first need to reduce the given search space. This is typically done by means of three different methods: (a) blocking, which implies partitioning a file into mutually exclusive blocks and limiting comparisons to records within the same block, (b) sorted neighborhood, which consists of sorting a file and then moving a window of a fixed size over the file, comparing only records within the window, and (c) pruning (or filtering), which implies removing from the search space all tuples that cannot match each other, without actually comparing them.
3. Comparison and decision. This step includes an entity resolution (or record linkage) technique that first selects a comparison function used to decide whether two tuples represent the same object in the real world, by calculating the “distance” between them (see Section 8.3.4.1 for more details), and then decides if the compared tuples match, do not match, or potentially match, based on the analysis of the results of the comparison function. An example of an entity resolution algorithm (R-Swoosh [GUW09]) is presented in Figure 7.3, while more details are provided in Section 8.3.4.3. It may also happen that no decision can be made automatically and a domain expert has to be involved to make the decision.
4. Quality assessment is finally performed based on the results of the previous comparison, and data quality indicators are measured to assess whether the results are satisfactory. Minimizing possible matches is a typical goal, to avoid as much as possible the involvement of the domain expert. In addition, minimizing false positives and false negatives are also common goals in the quality assessment step.
Multimedia Materials
Data Quality Measures (en)
Data Quality Measures (ca)
Data Quality Rules (en)
Chapter 8
The integration problem appears when a user needs to be able to pose one query, and get one single
answer, so that in the preparation of the answer data coming from several DBs is processed. It is impor-
tant to notice that the users of the sources coexist (and should be affected as little as possible) with this
new global user. In order to solve this, we can take three different approaches:
a) Manually query the different databases separately (a.k.a. Superman approach), which is not realistic,
since the user needs to know the available databases, their data models, their query languages, the
way to decompose the queries and how to merge back the different results.
b) Create a new database (a.k.a. DW or ERP) containing all necessary data.
c) Build a software layer on top of the data sources that automatically splits the queries and integrates
the answers (a.k.a. federation or mediation approach).
8.2.1 Definitions
Some terms like class, entity, or attribute can have different meanings depending on the context in which they are applied. In order to avoid ambiguities, we first define how some common terms are going to be employed. With this, we hope to aid the explanations in further sections.
8.2.1.1 Concepts
In the context of this chapter, the terms class and entity are used interchangeably to define a concept of the domain, analogously to a table in Relational databases or a class in object-oriented designs; such a class is composed of attributes and might have relationships with other classes. Thus, we define a class C as a concept containing a set of attributes. Any concept can be represented by a class in a
UML Class diagram. For example, the class Playlist containing attributes name, lastModifiedData, Description, #Followers, and the class Track containing attributes trackName, artistName, note, can be represented as:
This representation is conceptual and agnostic to the way the data are physically stored. To illustrate that, we display a snippet of the file Playlists.json from Spotify, which contains the data from which the diagram in Figure 8.1 was generated. The full conceptual model for a given data source is then called the schema.
Lastly, an instance will be an instantiation of a concept of the schema; e.g., the playlist “Bjork Pitchfork” is an instance of the class Playlist.
In what follows, we use CB to represent the set of classes {CB1, CB2, ..., CBn} from one domain schema (represented in blue), and CG to represent the set of classes {CG1, CG2, ..., CGm} of another domain schema (represented in green).
Figure 8.2: CB1 ≡ CG1 where CB1 = Song and CG1 = Track
Class-level heterogeneities:
• Naming: when the classes have different names (e.g., Song in CB and Track in CG).

Attribute-level heterogeneities:
• Naming: when the attributes have different names (e.g., songName and trackName).
• Type: when two equivalent attributes have different type representations in the two domains (e.g., genre in CB and in CG).
• Single/Multi-valuation: when the attribute is single-valued in one domain but multi-valued in the other (e.g., artists in CB and CG).
• Representation (Format): when the attributes have the same type, but are represented in different formats (e.g., Album in CB and CG).
• Representation (Measure): when two attributes are numerical, but the unit used to represent them differs (e.g., duration in CB and CG).
• Representation (Scale): when two attributes are numerical, but the scale used to represent them differs (e.g., rating in CB and CG).
• Representation (Dimension): when two attributes are numerical, but the dimension used to represent them differs (e.g., fileSize in CB and CG).
• Composition: when one attribute is the result of the composition of several attributes (e.g., dateOfRelease in CB and dayOfRelease, monthOfRelease and yearOfRelease in CG).

Table 8.1: Intra-Class heterogeneities
For enabling the mediator-based system to automatically handle the heterogeneity among various
data sources, there are three major steps to be considered (see Figure 8.5):
1. Schema alignment works at the schema level, and includes establishing the correspondences (a.k.a.
mappings) among the concepts (i.e., attributes or variables) of different data sources.
2. Entity resolution, also known as record matching or record linkage, is the process focused on finding the instances (i.e., records, tuples or values) that represent the same entities in reality.
3. Record merge, also known as data fusion, is the final step of data integration, in which, after the
equivalent records are found, they are merged, either by combining them or keeping the more
“accurate” or more “trustworthy” one.
Schema alignment corresponds to what is known as schema integration, while the latter two (entity resolution and record merge) correspond to the data integration process. In what follows, we present these two processes in more detail.
1. Mediated/Integrated/Global schema, which represents a unified and reconciled view of the underly-
ing data sources [Len02]. It captures the semantics of the considered user domain, including the
complete set of concepts of the data sources and the relationships among them.
2. Attribute matching, which specifies which attributes of the data sources’ schemata correspond to
which concept of the global schema. This matching can be 1-1, but sometimes one concept of
the global schema may correspond to the combination of several attributes in the data sources’
schemata.
3. Schema mappings are then built based on the previously found attribute matchings and they spec-
ify the semantic relationships (i.e., correspondence) between the schema of a data source (i.e.,
local schema) and the previously established global schema, as well as the transformations needed
to convert the data from local to global schema.
• Sound mappings (qObtained ⊆ qDesired ) define the correspondence in which the data returned by the
mapping query is a subset of those required, but it may happen that some of the required data is
not returned by the query.
• Complete mappings (qObtained ⊇ qDesired ) on the other hand, define a correspondence in which the
data returned by the mapping query include all those data that are required, but it may happen
that some other data are returned as well (i.e., the returned data are a superset of the required
data).
• Exact mappings (qObtained = qDesired), finally, are a combination of both sound and complete mappings, meaning that the mapping query returns exactly the data required by the mapping, no more and no less.
Furthermore, we can implement schema mappings by means of two main techniques, i.e., Global As
View (GAV) and Local As View (LAV).
In GAV, we characterize each element of the global schema in terms of a view (or query) over the local schemata. Such an approach enables easier query answering over the global schema, by simply unfolding the global schema concepts in terms of the mapped data sources, which is equivalent to processing views in a centralized RDBMS. However, this technique can turn out to be too rigid, given that a change in a single data source schema may require changing multiple (and, in the extreme case, all) schema mappings.
In the case of LAV, we characterize the elements of the source schemata as views over the global schema. LAV mappings are intended to be used in approaches where changes in data source schemata are more common, given that changes in a local schema only require updating the subset of mappings for the affected data source. However, the higher flexibility of LAV mappings comes with a higher cost of answering queries over the global schema, which involves the same logic as answering queries using materialized views, already known to be a computationally complex task [Hal01].
There are also some research attempts to generalize the previous two techniques, like Global/Local
As View, which maps a query over the global schema in terms of a query over the local schemata, thus
benefiting from the expressive power of both GAV and LAV techniques [FLM+ 99], and Peer To Peer
(P2PDBMS), in which case each data source acts as an autonomous “peer”, while separate (peer-to-peer)
mappings are defined among each pair of data sources [DGLLR07].
Despite its rigidity, GAV is the most widely used schema mapping approach in production data integration systems because of its simplicity.
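A toy sketch of the GAV idea in plain Python (the two sources and the global concept Track are invented for illustration): each global concept is defined as a query, here a function, over the local sources, and a query over the global schema is answered by simply unfolding that definition.

# Two hypothetical local sources with heterogeneous schemata.
source_a = [  # Song(songName, artistName, seconds)
    {"songName": "Army of Me", "artistName": "Bjork", "seconds": 234},
]
source_b = [  # Track(trackName, artist, duration_ms)
    {"trackName": "Joga", "artist": "Bjork", "duration_ms": 305000},
]

# GAV mapping: the global concept Track(title, artist, seconds) is a view over the sources.
def global_track():
    for r in source_a:
        yield {"title": r["songName"], "artist": r["artistName"], "seconds": r["seconds"]}
    for r in source_b:
        yield {"title": r["trackName"], "artist": r["artist"], "seconds": r["duration_ms"] // 1000}

# Answering a global query = unfolding the view, much like a centralized DBMS processes a view.
bjork_tracks = [t for t in global_track() if t["artist"] == "Bjork"]
print(bjork_tracks)

A LAV approach would instead describe source_a and source_b as views over the global Track concept, and answering the same query would require reasoning about which sources can contribute tuples.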
Similarity function (≈). In the general case, in order to find how “close” two values are, we need to employ an effective similarity function. Depending on the expected level of discrepancy and the type of values, we can use different methods to measure their similarity:
• A simple Hamming distance, which, for two strings of the same length, measures the number of positions at which the corresponding symbols are different (e.g., Hamming(“Jones”, “Jomes”) = 1, Hamming(“Indiana Jones”, “Indyana Jomes”) = 2), or
• its more generalized version that works for any two strings, called Edit distance, which measures the minimum number of operations (additions, deletions, substitutions) needed to convert one string into the other (a.k.a. Levenshtein distance).
However, these measures return an absolute “distance” between two strings, which may not always be reliable for deciding whether two strings represent the same object or not (e.g., Levenshtein(“UK”, “US”) = 1 may be a much more significant distance than Levenshtein(“Unyted Kinqdon”, “United Kingdom”) = 3). To address this issue,
• Jaccard similarity represents another option for the similarity function; it measures the similarity of two sets (where strings can be seen as sets of characters) relative to their total size (i.e., $\frac{|A \cap B|}{|A \cup B|}$). Going back to the example above, we can see that Jaccard(“UK”, “US”) = 0.33 is comparatively smaller than Jaccard(“Unyted Kinqdon”, “United Kingdom”) = 0.67, hence in this case Jaccard similarity is more effective as the similarity function.
Finally, regardless of the method used, the resulting similarity value must be converted to a binary
(yes/no) decision, which is typically done by establishing a case-specific threshold (e.g., we can consider
that if Jaccard(A, B) ≥ 0.5, then A ≈ B).
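A small sketch of the three measures just discussed (the 0.5 threshold is the example value from the text; everything else is plain Python):

def hamming(a: str, b: str) -> int:
    """Number of differing positions; only defined for strings of equal length."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def jaccard(a: str, b: str) -> float:
    """Similarity of the two character sets relative to their union."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

print(hamming("Jones", "Jomes"))                        # 1
print(levenshtein("Unyted Kinqdon", "United Kingdom"))  # 3
print(jaccard("UK", "US") >= 0.5)                       # False: considered different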
Listing 8.2: R-Swoosh algorithm
The R-Swoosh algorithm considers as input a set of records from the underlying data sources (I) and starts with an empty output set O. In each iteration, it takes one record r out of I and looks for a “similar” record s in the output. If a match is found, r is removed from I and s from O, and the “merge” of these two records is put back into I so that matching and merging with other records can continue. Otherwise, the record does not match any record seen so far, and thus it is removed from I and added to the output. Notice that in this way it is guaranteed that previously merged records will potentially be matched and merged with all the other records from the input. Finally, the algorithm returns the set of all “uniquely” merged records (i.e., those that cannot be further merged with any other records in the output).
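A compact Python sketch of the behaviour described above; the match and merge functions are placeholders that a real implementation would replace with a proper similarity function and a domain-specific merge rule:

def r_swoosh(I, match, merge):
    """Return the set of maximally merged records, following the R-Swoosh idea."""
    I = list(I)   # records still to be processed
    O = []        # records with no match found so far
    while I:
        r = I.pop()
        s = next((o for o in O if match(r, o)), None)
        if s is None:
            O.append(r)               # r matches nothing seen so far
        else:
            O.remove(s)
            I.append(merge(r, s))     # the merged record goes back into I
    return O

# Toy usage: records are dicts of attribute values; two records match if they
# share any non-null value, and merging unions their values per attribute.
match = lambda a, b: any(a.get(k) and a.get(k) == b.get(k) for k in a)
merge = lambda a, b: {k: a.get(k) or b.get(k) for k in set(a) | set(b)}
records = [{"name": "J. Ngombo", "id": None},
           {"name": "J. Ngombo", "id": "0111232223"},
           {"name": "John Ngombo", "id": "0111232223"}]
print(r_swoosh(records, match, merge))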
Multimedia Materials
Overcoming Heterogeneity (en)
Overcoming Heterogeneity (ca)
Schema Integration (en)
Chapter 9
Most human and organizational behavior has some indication of time and space. It is estimated that up to 80% of data stored in databases include location or spatial components (see [VZ14]). Time also represents an important aspect of day-to-day phenomena (e.g., an employee’s salary can change over time, the time history of patient diagnoses, disease prevalence can change over time in a certain territory).
All this represents an opportunity to acquire space/time-oriented knowledge about our data. For instance, to verify that a territory is free of some infectious disease, we need to monitor the trends in the number of new cases across the territory (e.g., counties, provinces, cities, and villages within the country) and over time (e.g., in the last three years). In addition, data values themselves can have a notion of time-validity attached (e.g., a customer can have a certain home address in one period of time and another address in a different period, and both should coexist in the same database).
To enable such data analysis, we need effective and efficient methods for representing and processing spatial and temporal information. Conventional databases typically use alphanumeric values to describe geographical information (e.g., the textual names of a city/province/country, or integer values for their respective populations), and they can only represent the state of an organization at the current moment, without any further notion of the time dimension.
In this chapter, we introduce the main concepts of temporal and spatial support in RDBMS, and the
extension to model a spatio-temporal DW.
• On insert at time t1 , a right-open time interval [t1 , now) is added to the record, indicating that the
inserted record is currently in use.
• On update at time t2 , the time interval of the current record is updated to [t1 , t2 ] (i.e., closed), and
a replica of the record is created with the right-open time interval [t2 , now).
• On delete at time t3 , the record with the open time interval is updated to [t2 , t3 ] (i.e., closed), indi-
cating that the record is not in use anymore for DML operations, but the record is not physically
removed. Instead, it remains stored for keeping the complete history of record values.
By storing the complete history of the records’ values in the table, we can reproduce the state of a database table at any point of time in the past, while only the records with a currently “open” interval are used in the present.
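A small sketch of this versioning discipline (illustrative Python, using half-open validity intervals for simplicity and None standing for the open bound now):

OPEN = None  # stands for the right-open bound "now"

class TemporalTable:
    """Keeps every version of a record as (key, value, valid_from, valid_to)."""
    def __init__(self):
        self.rows = []

    def _current(self, key):
        return next(r for r in self.rows if r["key"] == key and r["to"] is OPEN)

    def insert(self, key, value, t):
        self.rows.append({"key": key, "value": value, "from": t, "to": OPEN})

    def update(self, key, value, t):
        self._current(key)["to"] = t   # close the current version
        self.insert(key, value, t)     # and open a new one

    def delete(self, key, t):
        self._current(key)["to"] = t   # close it, but keep the full history

    def as_of(self, key, t):
        """Reproduce the value of key as it was at time t."""
        for r in self.rows:
            if r["key"] == key and r["from"] <= t and (r["to"] is OPEN or t < r["to"]):
                return r["value"]
        return None

tt = TemporalTable()
tt.insert("emp1", {"salary": 1000}, t=1)
tt.update("emp1", {"salary": 1200}, t=2)
tt.delete("emp1", t=3)
print(tt.as_of("emp1", 1))   # {'salary': 1000}
print(tt.as_of("emp1", 3))   # None: deleted, but the history is preserved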
The SQL extension that supports temporal data aspects is SQL3 (see [SBJS96]), which recollects and extends the temporal support previously proposed and accepted for TSQL2 (see [S+ 95]). Although SQL3 was never accepted as a standard, its temporal support components are included in the SQL:2011 standard, implemented in many of the most widely used commercial DBMSs (e.g., Oracle, Teradata, IBM DB2, MS SQL Server). Such an extension provides support for expressing temporal information about our data, as well as for processing temporal queries (e.g., different types of temporal joins and temporal range queries). Oracle has bi-temporal support through which it can handle both transaction time (through its Flashback Data Archive component) and valid time (through its Temporal Validity feature). Flashback Data Archive allows one to monitor query trends, keep table history, navigate through the past, and recover a database table at a certain point in time. Such support is also important for leaving audit trails in the DBMS.
PostgreSQL has an extension that provides custom-made transaction time support, which is however not fully compliant with SQL:2011.1 The support is implemented through System Period Temporal Tables, which besides the current table with the records in use also store a history table where the historical versions of table records are kept. As expected, applying DML operations (UPDATE, DELETE, TRUNCATE2 ) over tables with temporal support does not cause permanent changes to the table, but rather creates new versions of the data in the history table. To automatically support the corresponding actions at the moment of applying the operations, several triggers are implemented and fired when the operations are applied. An example of the System Period Temporal Tables mechanism for the person table is shown in Figure 9.1. The current table has the same structure and name as the original table, except for four technical columns (i.e., system time - the time range specifying when this version is valid, entry id - the id of the entry that can be used to group versions of the same entry or replace the PK, transaction id - the id of the transaction that created this version, and user id - the id of the user that executed the modifying operation that created the entry). Values for these columns are always generated by the system and the user cannot set them. To ensure backward compatibility, these columns can be marked as implicitly hidden. The history table also has the same structure and indexes as the current table, but it does not have any constraints. History tables are insert-only, and the creator should prevent other users from executing updates or deletes by defining the appropriate user rights; otherwise, the history can become inconsistent. Lastly, the triggers for storing old versions of rows into the history table are inspired by referential integrity triggers, and are fired for each row after UPDATE and DELETE, and before each TRUNCATE statement.
1 https://wiki.postgresql.org/wiki/SQL2011Temporal
2 TRUNCATE operation deletes all records in a table (like an unqualified DELETE) but much faster.
1. Reference system. A coordinate-based system used to uniquely define geometrical objects and
their position (projection) within the given system.
2. Spatial data types. Extension of database types to store geometry data (e.g., points, lines, poly-
gons).
3. Spatial operations. Extension of the database operation set to manipulate previously defined
spatial data types (e.g., check if one polygon contains or overlaps with another polygon, line or
point, union/intersection/difference of polygons).
Apart from this, given the complexity of spatial data types, traditional access methods need to be
extended to allow efficient querying and processing of spatial data - spatial indexes.
• Base geometries:
– Point, which represents a single point in the reference coordinate system, defined by a pair of decimal values, i.e., longitude and latitude (e.g., a town, health facility, or airport),
3 https://www.ogc.org/standards/sfas
• Numeric, which receive as input one or more geometries and return a numeric value (e.g., length,
area, perimeter, distance, direction).
• Predicates, which receive as input one or more geometries and return a boolean value (e.g.,
IsEmpty, OnBorder, InInterior).
• Unary operations, which receive as input one geometry and return new geometry resulting from
the applied transformation (e.g., Boundary, Buffer, Centroid, ConvexHull).
• Binary operations, which receive as input two geometries and return new geometry resulting from
the applied transformation (e.g., Intersection, Union, Difference, SymDifference, Spatial join).
A special type of spatial predicate operations are those that indicate the topological relationships that exist among input geometries, i.e., topological operations (see some self-explanatory examples in Figure 9.3). Notice that Contains/Within is a more specific version of the Covers/CoveredBy operation. Contains requires that one geometry is located in the interior of the other geometry, and hence that the boundaries of the two geometries do not intersect, while Covers does not pose such a restriction and also includes the cases where the boundaries of the two geometries may intersect.
Figure 9.3: Examples of topological relationships between two geometries (in gray and in black) [VZ14]
To compare any two geometries, it is required to compare (for intersection) their exterior (i.e., all of the space not occupied by the geometry), interior (i.e., the space occupied by the geometry), and boundary (i.e., the interface between a geometry’s interior and exterior). In particular, an intersection between two geometries can result in: a point (0), a curve (1), a surface (2), or no intersection (-1). For each topological operation, it is then necessary to define one or several pattern matrices4 (analogous to a regular expression), against which the results of the intersections among the geometries’ interiors, exteriors, and boundaries are matched. A pattern matrix can have the following cells: T - there is an intersection of any type (point - 0, curve - 1, or surface - 2); F - there is no intersection (-1); * - the intersection is irrelevant (-1, 0, 1, or 2); and 0, 1, or 2 - the intersection must be of that specific type.
An example of a pattern matrix for the topological operation Contains is presented in Figure 9.4. It is easy to see from Figure 9.3 that in order for a geometry a to contain a geometry b, there must be an intersection between their interiors (of any kind, depending on the type of geometry), while the exterior of a must not intersect with the interior nor with the boundary of b (since b must be entirely contained within a).
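As an illustration, reconstructing the matrix from the description above (so the exact layout in Figure 9.4 may differ), the pattern matrix for Contains(a, b), with rows corresponding to the interior, boundary, and exterior of a and columns to those of b, would be:

\[ \begin{pmatrix} T & * & * \\ * & * & * \\ F & F & * \end{pmatrix} \]

That is, the interiors must intersect (T), the exterior of a must intersect neither the interior nor the boundary of b (F, F), and all remaining intersections are irrelevant (*).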
• Space that the index is covering. Here we distinguish between space-driven structures, which partition the entire space under consideration and map the indexed objects to that space, and data-driven structures, which partition only the part of the space containing the indexed spatial objects.
• Objects that are indexed. Here we distinguish between point access methods, a.k.a. multidimen-
sional points, which can index points inside the considered space, and spatial access methods, a.k.a.
multidimensional regions, which can index more complex spatial types, like surfaces/polygons.
Here, we will present two representative examples of spatial indexes, R-tree and Quadtree.
9.2.4.1 R-tree
The R-tree is a representative of data-driven structures and spatial access methods, and it is the basis of the most famous commercial DBMS implementations of spatial indexes. It is a height-balanced tree (based on the B+-tree) that consists of (1) leaf nodes, which contain a collection of pairs [minimum bounding rectangle (MBR) of a geometry object, reference to the geometry object on disk]; and (2) intermediate nodes, which contain collections of pairs [MBR enclosing all MBRs of the child node, reference to the child node in the R-tree]. An example of an R-tree and the corresponding indexed geometries is shown in Figure 9.6.
To perform filter or join queries over the indexed geometries, we follow the MBRs from the root of the R-tree (from coarser to finer levels) until the leaf level is reached. Similarly, to insert a new geometry, from the root of the R-tree we follow the child node that needs the minimum enlargement to fit the geometry’s MBR, until a leaf is reached. If the leaf is overfilled, the node is split and the entries are redistributed over the current and the new node. The main task of node split algorithms is to find a “proper” distribution of all geometries over the two nodes after splitting. They typically also follow the minimum enlargement criterion, but they can become very complex in general. The original exhaustive algorithm and some alternative approximations are given in [HHT05].
9.2.4.2 Quadtree
The Quadtree is a representative of space-driven structures and spatial access methods, typically used for representing binary (black & white) images, represented as a 2^n x 2^n matrix in which 1 stands for black and 0 for white. The index structure is based on a degree-four tree of maximum height n. Given that it is a space-driven structure, the root of the tree corresponds to the whole image (the entire space). Each tree node then corresponds to an array of four equivalent image parts (from upper left to lower right) and, depending on the part’s “color filling”, a node can be: (1) intermediate, represented as a circle, meaning that the part has multiple colors; or (2) a leaf, represented as a black or white square, meaning that the part is completely colored either in black or in white, respectively (see an example of a Quadtree in Figure 9.7).
Figure 9.8: MADS spatio-temporal model (spatial data types icons) [PSZ06]
The MADS conceptual model allows us to enrich traditional geographical dimensions with a specific spatial data type for each of the dimension levels (e.g., Airport as Point, City as Surface, and Country as MultiSurface; see Figure 9.10), as well as with the topological relationship that exists between the dimension levels (e.g., Airport is CoveredBy City, which is CoveredBy Country). Notice that this enriches a traditional dimension, typically based on the multiplicities of the levels in the level-to-level relationship (e.g., an Airport belongs to one and only one City), with spatial-specific semantics (e.g., an Airport is CoveredBy a City, a City is Within a Country).
• Distributive: those that can be computed in a distributed manner, i.e., partitioning the data, applying the aggregation function over the smaller subsets, and then applying the same function over the results of each partition. In a traditional DW, examples of distributive aggregation functions are sum(), count(), and min()/max(). In the case of a spatial DW, examples of distributive spatial functions are convexHull() (the smallest convex polygon that encloses all of the geometries in the input set), spatialUnion(), and spatialIntersection().
• Algebraic: those that can be computed by an algebraic function with N arguments, each of which is obtained by applying a distributive aggregation function. In a traditional DW, an example of an algebraic aggregation function is avg(), which can be computed as sum()/count() (see the small sketch after this list). In the case of a spatial DW, an example of an algebraic spatial function is the center of n points (centroid()). For a set of n points $S = \{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}$, we compute $centroid(S) = \left(\frac{1}{n}\sum_{i=1}^{n} x_i,\; \frac{1}{n}\sum_{i=1}^{n} y_i\right)$.
• Holistic: those for which there is no algebraic function that characterizes the computation, meaning that there is no way to distribute the computation of the measure value. In a traditional DW, an example of a holistic aggregation function is median(). In the case of a spatial DW, an example of a holistic aggregation function is equipartition(), which partitions convex bodies into equal-area pieces, creating the equipartition.
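A tiny illustration of the distributive/algebraic distinction on invented sample partitions: sum() and count() are computed per partition and then combined with the same function, while avg() is obtained algebraically from those two distributive results rather than recomputed over the raw data.

partitions = [
    [4.0, 8.0, 6.0],   # e.g., a measure in one region
    [10.0, 2.0],       # ... and in another
]

# Distributive: apply the function per partition, then the same function over the partial results.
partial_sums = [sum(p) for p in partitions]
partial_counts = [len(p) for p in partitions]
total_sum, total_count = sum(partial_sums), sum(partial_counts)

# Algebraic: avg() is not distributive itself, but it is an algebraic combination
# of two distributive aggregates (the same reasoning applies to centroid()).
average = total_sum / total_count
print(total_sum, total_count, average)  # 30.0 5 6.0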
Multimedia Materials
Temporal Databases (en)
Spatial Data Types (en)
Spatial Operators (en)
Conceptual Design of Spatial DW (en)
Chapter 10
Dashboarding
With contributions of Daniele Perfetti
Data visualization can be seen as the use of computers to better comprehend existing data or to extract additional (not easily seen) knowledge from data that resulted from simulations, computations, or measurements [McC87].
For instance, the importance of the graphical representation of data can be seen in the famous example of Anscombe’s quartet1 . In Figure 10.1a, we can see the four different datasets of Anscombe’s quartet. Just by analyzing their (almost identical) simple summary statistics (see Figure 10.1b), we could conclude that these four datasets are the same.
However, only after plotting the datasets (Figure 10.2) can we see how considerably they vary in their distribution. More specifically, the first chart (top left) shows a simple linear relationship, while the second one (top right) is obviously not linear, thus requiring more general regression techniques. The third chart (bottom left) is also linear, but its regression line is offset by a single outlier that lowers the correlation coefficient. Lastly, the fourth chart shows how one high-leverage point produces a high correlation coefficient even though the other points do not indicate any relationship between the x and y variables.
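The contrast can be reproduced in a few lines. The sketch below assumes pandas and seaborn (neither is used in these notes; seaborn is assumed only for its bundled copy of the quartet) and prints the nearly identical summary statistics that hide the very different shapes revealed by the plots.

# Summary statistics of Anscombe's quartet: almost identical numbers,
# very different distributions once plotted.
# Assumes seaborn (for its bundled copy of the quartet) and pandas.
import seaborn as sns

df = sns.load_dataset("anscombe")          # columns: dataset ('I'..'IV'), x, y

stats = df.groupby("dataset").agg(
    mean_x=("x", "mean"), var_x=("x", "var"),
    mean_y=("y", "mean"), var_y=("y", "var"),
)
stats["corr_xy"] = df.groupby("dataset").apply(lambda g: g["x"].corr(g["y"]))
print(stats.round(2))                      # the four rows are (almost) indistinguishable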
This is only exacerbated in the case of Big Data, where the data cannot be fully comprehended by the limited human senses and brain. Indeed, the human eye can only sense a few megabytes per second, and even this exceeds the much lower learning capacity of the brain, which is in the order of a few tens of bits per
1 https://en.wikipedia.org/wiki/Anscombe%27s_quartet
second. Therefore, as sketched in Figure 10.3, we must assume some information loss in the process of transforming raw data into knowledge that can actually be consumed by our brain. Firstly, we have to abstract the data in the form of relevant features with the appropriate units and scales, and then structure them in the form of meaningful charts, so that they convey the appropriate message.
Thus, we can identify the three main benefits of visualizing data, namely to: (1) facilitate understanding of the data by providing a simpler way to extract insights or information from large datasets, (2) facilitate pattern identification in the data, like outlier detection or identifying trends to decide on the next best steps, and (3) attract the attention of the audience by using appealing graphical images to communicate the data insights.
Dashboards have easily become one of the most popular visualization tools in business intelligence
and data warehousing, as they allow organizations to effectively measure, monitor, and manage business
performance [VZ14].
Typically, dashboards consist of three main elements, namely: (a) diagram(s) visualizing the measured metrics, (b) a heading explaining the content of the dashboard and its purpose, and (c) a short explanation (or interpretation) of the status and information shown in the diagram.
The following general characteristics are also recommended for an effective dashboard, namely: (a) that it fits in a single computer screen (thus avoiding scrolling or paging to find the right information), (b) that it contains minimal (only the necessary) information (so that it enables a fast grasp of the needed information and avoids unnecessary visual distractions), (c) that it shows the most important business performance measures to be monitored, (d) that it updates automatically from the data (daily, weekly, or whenever the data change if more timely information is needed) in order to display up-to-date information, and (e) that it allows interaction with the displayed data for a better understanding of them (e.g., enabling filtering or drilling down).
There are three main high-level classes of dashboards, depending on their purpose [VZ14]:
• Strategic dashboards, which provide a quick overview of the status of an organization for making executive decisions (long-term goals). The focus of strategic dashboards is not on what is going on right now (no real-time data), but on recent past performance.
• Operational dashboards are used to monitor an organization's operations, thus requiring more timely data to track changing activities that could require immediate attention. Operational dashboards should be kept simple, to enable rapid visual identification of the monitored measures and a quick reaction to correct them.
• Analytical dashboards should support interaction with the data, like drilling down into the underlying details, in order to enable exploration to make sense of the data. They should allow users to examine not only what is going on, but also its causes.
The measures monitored in a dashboard should also satisfy the so-called SMART properties,3 that is, they should be: (a) Specific (i.e., more detailed than the related objective), (b) Measurable (i.e., they can be measured and obtained from the data), (c) Achievable (i.e., they can be attained - similar to the Doable property of an objective), (d) Relevant (i.e., realistic, related to the real business), and (e) Timely (i.e., time-bounded, with a pre-determined deadline).
First of all, as mentioned before, the dashboard should be kept simple, containing only the necessary information and using the proper amount of visual elements. The charts in the dashboard should not be difficult to read. If end users have to think hard to read a chart, or have to ask how to interpret it, the design is probably not the most appropriate one. For instance, both charts in Figure 10.4 are too complicated (too many visual elements) for the necessary information (i.e., revenue distribution per month and quarter, or earnings per city both in percentages and absolute values) to be grasped quickly. The objective of an effective dashboard is to transmit the necessary information, not to demonstrate how skilled the dashboard designer is.
However, dashboards should neither be overly simplistic nor “boring”. It is often recommended to use the Gestalt principles5 of proximity and similarity, combining aesthetics so that the information is conveyed in a visually more pleasing and understandable manner. For instance, in Figure 10.5,
3 https://www.mindtools.com/pages/article/smart-goals.htm
4 H.A. Downs. Focusing Digital Dashboards for Use: Considering Design, Planning and Context: https://comm.eval.org/
researchtechnologyanddevelopmenteval/viewdocument/focusing-digital-dashboards-for-use
5 https://en.wikipedia.org/wiki/Gestalt_psychology
on the left, we can see a simple bar chart showing the number of participants per month. Although this chart is correct and simple enough to be understood, just by replacing the basic bars with more intuitive aesthetics, i.e., the corresponding number of participant icons, the chart (on the right) becomes much more attractive for the end user.
While the use of such aesthetics can improve the attractiveness of our charts, they must be used with care, since their overuse may cause the opposite effect and draw the attention of the end user to the design instead of to the information to be transmitted (see an example in Figure 10.6).
Finally, there are several specific guidelines or best practices suggested for achieving a “good chart”, as depicted in Figure 10.7.
Last but not least, choosing the appropriate type of chart to convey the desired information is also an important step in dashboard design. For instance, in Figure 10.8, two charts show approval ratings per month, including two groups (approval and disapproval), and both are based on the same underlying dataset. The chart in Figure 10.8a uses a stacked graph, but with absolute values. A stacked graph is actually more useful when set up as a 100% stacked graph, since it shows the proportion between the two groups much more clearly. However, as can be seen in Figure 10.8b, a line chart does a better job of showing between-group differences. In particular, in the line chart we can easily see that the approval and disapproval ratings swap places in the last year, while in the stacked chart this can be seen only by reading the numbers in the bars, which practically defeats the purpose of the visualization.
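As a small illustration of this comparison (Python with matplotlib, using made-up approval figures rather than the data behind Figure 10.8), the sketch below draws the same two series both as a 100% stacked bar chart and as a line chart; the crossing point of the two groups is immediately visible only in the latter.

# Same (made-up) approval/disapproval data shown as a 100% stacked bar chart
# and as a line chart; the swap between the two groups stands out only in the latter.
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
approve = [58, 55, 52, 49, 46, 43]          # hypothetical percentages
disapprove = [100 - a for a in approve]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

ax1.bar(months, approve, label="approve")
ax1.bar(months, disapprove, bottom=approve, label="disapprove")   # 100% stacked
ax1.set_title("100% stacked bar chart")
ax1.legend()

ax2.plot(months, approve, marker="o", label="approve")
ax2.plot(months, disapprove, marker="o", label="disapprove")      # the crossing is visible
ax2.set_title("Line chart")
ax2.legend()

plt.tight_layout()
plt.show()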
Figure 10.8: Showing between group differences (stack vs. line chart) (H. A. Downs)
These universal design principles are based on laws, guidelines, human biases, and general design considerations, and are introduced in [BHL03]; here we discuss the ones that are most applicable to dashboard design.
• Chunking refers to the technique whereby units of information are grouped into a limited number of chunks, making them easier for humans to process and remember. This technique is related to short-term memory, which can accommodate about four chunks at a time (plus or minus one). The simplest example is grouping the digits of a long phone number so that it can be read and remembered more easily, where the groups can also have a specific meaning, like the country code or the province code (e.g., 34934137889 vs. +34 93 413 7889). The same technique applies to the design of a web page containing a large amount of information, like in our case a dashboard containing many charts and their related metadata. Charts should then be grouped by topic (no more than 4 or 5 groups).
• Alignment principle suggests that the elements of a design should be aligned with one or more
other elements, creating a sense of cohesion and contributing to the design’s aesthetics and per-
ceived stability.
• Symmetry has long been associated with beauty, starting from the symmetric shapes found in nature and in the human body, and later carried over to architecture and design. Combinations of symmetries can create harmonious, interesting, and memorable designs.
• The Gutenberg diagram is based on the observation that reading gravity pulls the eyes from the top-left to the bottom-right of the display (see Figure 10.9).6 It recommends that homogeneous elements in the design be organized in such a manner as to follow the reading gravity. In the case of heterogeneous elements, the visual weight of each element will draw attention and drive the eye movement, so the layout would not apply and could unnecessarily constrain the design.
• Highlighting can be an effective technique to bring attention to specific elements of the design. However, if not used with care, it can actually be counterproductive and distract the reader, reducing the performance of the design. There are several guidelines on how highlighting can be used effectively. For instance, only up to 10% of the visible design should be highlighted; bold should preferably be used (as it distracts the reader less), but italics and underlining could also be used for highlighting titles, labels, captions, and short word sequences. Color can also be an effective highlighting technique, but it should be used less frequently and only in concert with other highlighting techniques. Other techniques, like inverting and blinking, can add more noise to the design and be even more distracting, so they should be used only when the objective is to highlight highly critical information.
• Colors can be used, as mentioned before, as an effective highlighting technique, but also to group elements or give them a specific meaning. Lastly, they can also improve the aesthetics of the design if used with care, but if used improperly they can seriously harm the form and the function of the design. There are two main concerns when it comes to using colors in our design: (a) The number of colors should be limited to what the human eye can process at a glance (about five different colors). Moreover, it is strongly suggested that the design does not rely solely on colors to transmit the intended information, given that a significant portion of the population has limited color vision. (b) Color combinations, if used properly, can improve the aesthetics of the design. Adjacent colors on the color wheel, opposing colors, or colors at the corners of a symmetric polygon inscribed in the color wheel (see Figure 10.10) are good choices. Such color combinations are typically found in nature.
• Signal-to-noise ratio refers to the ratio of relevant to irrelevant information in the design, and intuitively it should be as high as possible. For instance, in Figure 10.11, the designs on the right are drastically improved by removing the unnecessary design elements (i.e., those that do not convey any information) present in those on the left.
• Closure refers to the tendency to perceive a set of individual elements as a single recognizable
pattern, rather than considering them separately. Although the individual elements do not form
6 Note that this is only true in occidental cultures and Indo-European languages that follow this direction in writing.
a perfect pattern (see Figure 10.12), the human brain has the ability to fill in the gaps and create a holistic picture, which results in a more interesting design. Involving closure can also reduce the complexity of the design, since in some cases we can provide only a sufficient number of elements, while the brain completes the whole graphic.
• Good continuation is another principle that relates to the human perception of visual elements (i.e., the Gestalt principles). Specifically, it states that if the visual elements of a design are aligned so as to form a proper geometry, like a line, the human brain perceives them as a group and as related. Used in a design, good continuation makes the information easier to read. For instance, in Figure 10.13, the bar chart on the right is easier to read than the one on the left, as the end points of its bars form a straight, more continuous line.
• Five hat racks refers to a principle used to order the elements in a design. More specifically, it states that there exist five ways of ordering information: category, time, location, alphabet, and continuum. The first four are self-explanatory, while continuum refers to the organization of elements by magnitude (e.g., from the lowest to the highest building in the world). The ordering in the design should be chosen wisely, as different organizations dramatically influence which aspects of the information are emphasized.
• Hierarchies help lower the complexity of the design and transmit important information about the relationships between the elements, in order to increase the knowledge about the structure of the system. Hierarchies can be organized in the form of a (a) Tree, where child elements are placed below or to the right of their parent elements, (b) Nest, where child elements are placed within, contained by, their parent elements, and (c) Stair, where several child elements are stacked below or to the right of their parent element.
• Layering is used to manage the complexity of a design by organizing the information into related groupings and then showing only the elements of one group at a time. Layering can be either two-dimensional, in which case only the elements of one layer can be viewed at a time and they are revealed either in a linear way (sequentially in a line) or in a non-linear way (as a tree or graph/web); or three-dimensional, in which case layers of information are placed on top of each other and multiple layers can be viewed at once (e.g., a geographical map with a layer of capital cities or a layer of temperature).
Lastly, there are several practical guidelines for having an effective dashboard design, stemming
from the cognitive perception abilities of a person (see [Sta15] for details):
• Elements in the design should be logically and conceptually related, so that they can be observed as a whole and easily compared.
• The number of elements should be at most seven (plus two if really necessary), which is the number of elements an average person can keep in short-term memory.
• Coloring in the design should be limited to the necessary minimum, emphasizing the important information in the dashboard, but not serving as the only visualization method.
Figure 10.14: Coxcomb chart showing the causes of mortality of the British army in the east [BHL03]
With two-dimensional data we can already employ more complex visualizations to analyze the potential pair-wise relationships or correlations amongst the different data attributes (e.g., 2D scatterplots or heat maps); see an example in Figure 10.16. Such analysis can be extended to more dimensions (e.g., a 3D scatterplot), either by introducing another (depth) dimension in the coordinate system (see Figure 10.17a) or by introducing other visual elements, like size (bubble chart), color, or hue. However, such visualizations can become more complex, making them harder for the viewer to comprehend. Some of them, like the scatterplot, can be “unfolded” into a matrix so that all the data can be visualized (see Figure 10.17b).
Apart from considering the number of dimensions, the choice of an appropriate chart to visualize our multidimensional data can depend on other aspects of the data themselves or on the user's analytical needs. To guide users in choosing the correct chart for their analysis, we can use a categorization method based on specific questions (see [Per15]). The following categories/questions could be considered (a small rule-of-thumb sketch follows the list):
• Cardinality of the domain, which when low (≤ 50) may indicate the use of charts that show all the individual values (e.g., trend lines or networks), while when high it would require more cumulative or aggregated values (e.g., bar chart, heat map).
• Type of variable, where continuous variables can be shown with charts like line charts or histograms, while for categorical variables we can use, for example, a bar chart, a pie chart, or a dendrogram7 .
• Filtering is useful when we are dealing with large amounts of data or when a user only wants to report on a specific subset of data that meets certain conditions. In such a case, it is appropriate to use a chart that allows filtering the data by setting an interval or a threshold value.
• Goal of visualization, i.e., what information or message we want to convey to the viewer (e.g., composition, distribution, comparison, or relationship among different variables). Based on the specific goal, an appropriate chart should be chosen (e.g., using dendrogram charts to visualize relationships among variables, or histograms to show the distribution of a variable).
• Representation typology, which should be chosen according to the type of data the user wants to show, since some charts can be more useful than others. For example, for showing location-based data it is better to use a geographical map, and the simplest way to display a hierarchy is through a hierarchical chart.
7 https://en.wikipedia.org/wiki/Dendrogram
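The rule-of-thumb sketch announced before the list codifies a few of these categories as a toy Python function. The thresholds and chart names merely echo the text above; they are illustrative assumptions, not an algorithm prescribed in these notes or in [Per15].

# Toy chart-selection helper echoing the categories above (cardinality,
# variable type, analysis goal). Purely illustrative.

def suggest_chart(cardinality: int, variable_type: str, goal: str) -> str:
    """Return a rough chart suggestion for a single variable and analysis goal."""
    if goal == "relationship":
        return "scatterplot (or dendrogram for hierarchical relationships)"
    if goal == "composition":
        return "pie chart" if cardinality <= 50 else "100% stacked bar chart"
    if goal == "distribution":
        return "histogram" if variable_type == "continuous" else "bar chart"
    # default: comparison
    if cardinality <= 50:
        return "line chart" if variable_type == "continuous" else "bar chart"
    return "heat map (aggregated values)"

print(suggest_chart(12, "continuous", "distribution"))   # histogram
print(suggest_chart(200, "categorical", "comparison"))   # heat map (aggregated values)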
Such a categorization method for selecting the “best” chart for the specific data and user analytical needs is depicted in Figures 10.18 (for one dimension), 10.19 (for two dimensions), and 10.20 (for many dimensions).
Figure 10.18: Selecting charts for one dimension data analysis [Per15]
Figure 10.19: Selecting charts for two dimension data analysis [Per15]
Figure 10.20: Selecting charts for many dimension data analysis [Per15]
Chapter 11
Data warehousing has been around for about three decades now and has become an essential part of the information technology infrastructure.1 Data warehousing originally grew in response to the corporate need for information. Thus, a data warehouse is a construct that supplies integrated, granular, and historical data to the corporation. This simple definition has proven extremely hard to live up to. Nowadays, after all the experience gathered in data warehousing, authors have started to distinguish between first-generation and next-generation data warehouses.
Basically, there are two main concerns behind this new paradigm: the inclusion of ALL the relevant data (from inside the organization and even from external sources such as the Web) and the REAL use of metadata. However, the second one is, to some extent, a direct consequence of the first one. After the success of data warehousing, organizations broadened their view of data and wanted to include in their decision-making processes alternative data such as unstructured data (of any kind, from plain text and e-mails to voice over IP), which makes the data consolidation process even more difficult. It is at this point that metadata emerges as a keystone to keep and maintain additional semantics about the data. Specifically, consider the following shaping factors:
• In first-generation data warehouses, there was an emphasis on getting the data warehouse built
and on adding business value. In the days of first-generation data warehouses, deriving value
meant taking predominantly numeric-based, transactional data and integrating those data. Today,
deriving maximum value from corporate data means taking ALL corporate data and deriving
value from it. This means including textual, unstructured data as well as numeric, transactional
data.
• In first-generation data warehouses, not a great deal of concern was given to the medium on which data were stored or to the volume of data. But time has shown that the medium on which data are stored and the volume of data are, indeed, very large issues. In 2008, the first petabyte data warehouses were announced by Yahoo and Facebook. Such an amount of data is even beyond the limits of current relational DBMSs, and new data storage mechanisms have been proposed.2
• In first-generation data warehouses, it was recognized that integrating data was an issue. In today's world it is recognized that integrating old data is an even larger issue than it was once thought to be.
• In first-generation data warehouses, cost was almost a non-issue. In today’s world, the cost of data
warehousing is a primary concern.
1 This chapter is mainly based on the book “DW 2.0. The Architecture for the Next Generation of Data Warehousing”, published
by Morgan Kaufmann and authored by W.H. Inmon, Derek Strauss and Genia Neushloss, 2008 [ISN08].
2 Namely, the NoSQL -Not Only SQL- wave has focused on the storage problem for huge data stores.
• In first-generation data warehousing, metadata was neglected. In today's world, metadata and master data management are large, burning issues. Concepts such as data provenance, ETL, and data lineage are built on top of metadata.
• In the early days of first-generation data warehouses, data warehouses were thought of as a novelty.
In today’s world, data warehouses are thought to be the foundation on which the competitive use
of information is based. Data warehouses have become essential.
• In the early days of data warehousing, the emphasis was on merely constructing the data ware-
house. In today’s world, it is recognized that the data warehouse needs to be malleable over time
so that it can keep up with changing business requirements (which, indeed, change frequently).
All these issues represent challenges in themselves, and the community and major vendors are still working on them. However, it is not among the objectives of this course to gain further insight into them; the interested reader is referred to [ISN08].
Bibliography
[BHG87] Philip A. Bernstein, Vassos Hadzilacos, and Nathan Goodman. Concurrency Control and Recovery in
Database Systems. Addison-Wesley, 1987.
[BHL03] Jill Butler, Kritina Holden, and William Lidwell. Universal Principles of Design. Rockport Publishers, Gloucester, MA, USA, 2003.
[BS16] Carlo Batini and Monica Scannapieco. Data and Information Quality: Dimensions, Principles and Tech-
niques. Springer, 2016.
[CCS93] E. F Codd, S.B. Codd, and C.T. Salley. Providing OLAP (On Line Analytical Processing) to Users-
Analysts: an IT Mandate. E. F. Codd and Associates, 1993.
[Che76] P. P. S. Chen. The Entity-Relationship Model: Toward a Unified View of Data. ACM Transactions on
Database Systems, 1(1):9–36, 1976.
[DGLLR07] Giuseppe De Giacomo, Domenico Lembo, Maurizio Lenzerini, and Riccardo Rosati. On reconciling
data exchange, data integration, and peer data management. In Proceedings of the twenty-sixth ACM
SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 133–142, 2007.
[Do] Hong Hai Do. Data conflicts. In Ling Liu and M. Tamer Özsu, editors, Encyclopedia of Database Systems,
Second Edition, pages 565–569. Springer.
[DS15] Xin Luna Dong and Divesh Srivastava. Big data integration. Synthesis Lectures on Data Management,
7(1):1–198, 2015.
[FH76] Ivan P Fellegi and David Holt. A systematic approach to automatic edit and imputation. Journal of the
American Statistical association, 71(353):17–35, 1976.
[FLM+ 99] Marc Friedman, Alon Y. Levy, Todd D. Millstein, et al. Navigational plans for data integration. AAAI/IAAI, 1999:67–73, 1999.
[FM18] Ariel Fuxman and Renée J. Miller. Schema mapping. In Ling Liu and M. Tamer Özsu, editors, Ency-
clopedia of Database Systems, Second Edition. Springer, 2018.
[GR09] M. Golfarelli and S. Rizzi. Data Warehouse Design. Modern Principles and Methodologies. McGraw-Hill,
2009.
[GSSC95] Manuel Garcı́a-Solaco, Fèlix Saltor, and Malú Castellanos. Semantic Heterogeneity in Multidatabase
Systems, pages 129–202. Prentice Hall International (UK) Ltd., GBR, 1995.
[GUW09] Hector Garcia-Molina, Jeffrey D. Ullman, and Jennifer Widom. Database systems - the complete book (2.
ed.). Pearson Education, 2009.
[Hal01] Alon Y Halevy. Answering queries using views: A survey. The VLDB Journal, 10(4):270–294, 2001.
[HHT05] Marios Hadjieleftheriou, Erik G. Hoel, and Vassilis J. Tsotras. Sail: A spatial index library for efficient
application integration. GeoInformatica, 9(4):367–389, 2005.
[Inm92] W. H. Inmon. Building the Data Warehouse. John Wiley & Sons, Inc., 1992.
[Ioa96] Yannis E. Ioannidis. Query optimization. ACM Comput. Surv., 28(1):121–123, 1996.
[ISN08] W.H. Inmon, Derek Strauss, and Genia Neushloss. DW 2.0. The Architecture for the Next Generation of
Data Warehousing. Morgan Kaufmann, 2008.
[Jar77] Donald Jardine. The ANSI/SPARC DBMS Model. North-Holland, 1977.
[Kim96] R. Kimball. The Data Warehouse Toolkit: Practical Techniques for Building Dimensional Data Warehouses.
John Wiley & Sons, Inc., 1996.
[KRTR98] R. Kimball, L. Reeves, W. Thornthwaite, and M. Ross. The Data Warehouse Lifecycle Toolkit: Expert
Methods for Designing, Developing and Deploying Data Warehouses. John Wiley & Sons, Inc., 1998.
[Len02] M. Lenzerini. Data Integration: A Theoretical Perspective. In Proc. of 21th ACM SIGACT-SIGMOD-
SIGART Symposium on Principles of Database Systems, pages 233–246. ACM, 2002.
[LS09] A. Labrinidis and Y. Sismanis. View Maintenance. In Ling Liu and M. Tamer Özsu, editors, Encyclope-
dia of Database Systems, pages 3326–3328. Springer, 2009.
[May07] Arkady Maydanchik. Data quality assessment. Technics publications, 2007.
[McC87] Bruce Howard McCormick. Visualization in scientific computing. Computer graphics, 21(6), 1987.
[MTT18] Yannis Manolopoulos, Yannis Theodoridis, and Vassilis J. Tsotras. Spatial indexing techniques. In Ling
Liu and M. Tamer Özsu, editors, Encyclopedia of Database Systems, Second Edition. Springer, 2018.
[Pen05] Nigel Pendse. Market Segment Analysis. OLAP Report, Last updated on February 11, 2005.
http://www.olapreport.com/Segments.htm (Last access: July 2009).
[Pen08] Nigel Pendse. What is OLAP? OLAP Report, Last updated on March 3, 2008.
http://www.olapreport.com/fasmi.htm (Last access: July 2009).
[Per15] Daniele Perfetti. Business and visualization requirements for OLAP analysis of chagas disease data.
Master’s thesis, University of Bologna, Italy, 2015.
[PSZ06] Christine Parent, Stefano Spaccapietra, and Esteban Zimányi. Conceptual modeling for traditional and
spatio-temporal applications: The MADS approach. Springer Science & Business Media, 2006.
[S+ 95] R. T. Snodgrass et al. The TSQL2 Temporal Query Language. 1995.
[Sat18] Kai-Uwe Sattler. Data quality dimensions. In Ling Liu and M. Tamer Özsu, editors, Encyclopedia of
Database Systems, Second Edition, pages 612–615. Springer, 2018.
[SBJS96] Richard T. Snodgrass, Michael H. Böhlen, Christian S. Jensen, and Andreas Steiner. Adding valid time to SQL/Temporal. ANSI X3H2-96-501r2, ISO/IEC JTC, 1, 1996.
[Sta15] Miroslaw Staron. Dashboard Development Guide: How to Build Sustainable and Useful Dashboards to Support Software Development and Maintenance. 2015.
[SV03] Alkis Simitsis and Panos Vassiliadis. A methodology for the conceptual modeling of ETL processes.
In Johann Eder, Roland T. Mittermeir, and Barbara Pernici, editors, The 15th Conference on Advanced
Information Systems Engineering (CAiSE ’03), Klagenfurt/Velden, Austria, 16-20 June, 2003, Workshops
Proceedings, Information Systems for a Connected Society, volume 75 of CEUR Workshop Proceedings.
CEUR-WS.org, 2003.
[SV18] Alkis Simitsis and Panos Vassiliadis. Extraction, Transformation, and Loading. In Ling Liu and
M. Tamer Özsu, editors, Encyclopedia of Database Systems, Second Edition, pages 1432–1440. Springer,
2018.
[SVS05] Alkis Simitsis, Panos Vassiliadis, and Timos K. Sellis. Optimizing ETL processes in data warehouses. In
Karl Aberer, Michael J. Franklin, and Shojiro Nishio, editors, Proceedings of the 21st International Con-
ference on Data Engineering, ICDE 2005, 5-8 April 2005, Tokyo, Japan, pages 564–575. IEEE Computer
Society, 2005.
[SWCD09] Alkis Simitsis, Kevin Wilkinson, Malú Castellanos, and Umeshwar Dayal. Qox-driven ETL design:
reducing the cost of ETL consulting engagements. In Ugur Çetintemel, Stanley B. Zdonik, Donald
Kossmann, and Nesime Tatbul, editors, Proceedings of the ACM SIGMOD International Conference on
Management of Data, SIGMOD 2009, Providence, Rhode Island, USA, June 29 - July 2, 2009, pages 953–
960. ACM, 2009.
[The17] Vasileios Theodorous. Automating User-Centered Design of Data-Intensive Processes. PhD thesis, Uni-
versitat Politècnica de Catalunya, Barcelona, January 2017.
[TL03] Juan Trujillo and Sergio Luján-Mora. A UML based approach for modeling ETL processes in data
warehouses. In Il-Yeol Song, Stephen W. Liddle, Tok Wang Ling, and Peter Scheuermann, editors,
Conceptual Modeling - ER 2003, 22nd International Conference on Conceptual Modeling, Chicago, IL, USA,
October 13-16, 2003, Proceedings, volume 2813 of Lecture Notes in Computer Science, pages 307–320.
Springer, 2003.
[Vas09] V. Vassalos. Answering Queries Using Views. In Ling Liu and M. Tamer Özsu, editors, Encyclopedia of
Database Systems, pages 92–98. Springer, 2009.
[Vel09] Y. Velegrakis. Updates through Views. In Ling Liu and M. Tamer Özsu, editors, Encyclopedia of
Database Systems, pages 3244–3247. Springer, 2009.
[VZ14] Alejandro A. Vaisman and Esteban Zimányi. Data Warehouse Systems - Design and Implementation.
Data-Centric Systems and Applications. Springer, 2014.
Appendix A
Acronyms
BI Business Intelligence
CPU Central Processing Unit
CSV Comma-separated Values
DAG Directed Acyclic Graph
DB Database
DBMS Database Management System
DDL Data Definition Language (subset of SQL)
DM Data Mart
DML Data Manipulation Language (subset of SQL)
DW Data Warehouse
ELT Extraction, Load and Transformation
ER Entity-Relationship
ERP Enterprise Resource Planning
ETL Extraction, Transformation and Load
ETQ Extraction, Transformation and Query
FASMI Fast Analysis of Shared Multidimensional Information
FK Foreign Key
GUI Graphical User Interface
HOLAP Hybrid OLAP
I/O Input/Output
ISO International Organization for Standardization
JDBC Java Database Connectivity
LHS Left Hand Side
MBR Minimum Bounding Rectangle
MOLAP Multidimensional OLAP
NUMA Non-Uniform Memory Access
ODBC Open Database Connectivity
OLAP On-Line Analytical Processing
OLTP On-Line Transactional Processing
PDI Pentaho Data Integration
PK Primary Key
RDBMS Relational DBMS
RHS Right Hand Side
ROLAP Relational OLAP
SQL Structured Query Language
UML Unified Modelling Language
Appendix B
Glossary of terms
ACID: Transaction model adopted by all RDBMSs and some NoSQL systems. It stands for:
• Atomicity: Either all operations inside the transaction are committed, or none of them is executed.
• Consistency: If the database was consistent before the transaction, so it is after it.
• Isolation: The execution of a transaction does not interfere with the execution of others, and vice-versa.
• Durability: If a transaction is committed, its changes in the database cannot be lost.
Access Path/Method: Concrete data structure (a.k.a. index) and algorithm used to retrieve the data required by
one of the leaves of the process tree.
Access Plan: Procedural description of the execution of a query. This is the result of query optimization and is
typically summarised/depicted in terms of a process tree.
ANSI/SPARC: The Standards Planning And Requirements Committee of the American National Standards Insti-
tute defined a three-level architecture in 1975 to abstract users from the physical storage, which is still in use
in the current RDBMSs.
Architecture: A system (either hardware or software) architecture is the blueprint of the system that corresponds
to its internal structure, typically in terms of independent modules or components and their interactions.
It is said to be “functional” if it explicitly shows the functionalities of the system. An architecture can
be centralized if the system is expected to run in a single location/machine, or distributed if it includes
the components needed to orchestrate the execution in multiple and independent locations/machines. The
architecture is said to be parallel if different components can work at the same time (i.e., not necessarily
one after another). Distributed and parallel architectures are typically organized in terms of one coordinator
component and several workers orchestrated by it.
Block: The transfer unit between the disk and the memory. This is also the smallest piece of disk assigned to a file.
Catalog: Part of the database corresponding to the metadata.
Data (or Database) Model: Set of data structures, constraints and query language used to describe the Database
(e.g., Relational, Object-oriented, Semi-structured, Multidimensional).
Data Type: Kind of content stored in an attribute or column (e.g., integer, real, string, date, timestamp). Besides
basic data types, we can also find complex (or user-defined) ones (e.g., an structured address which is com-
posed by the string of the street name, the house number, zip code, etc.).
Database: Set of interrelated files or datasets (a.k.a. tables or Relations in the Relational model). According to its use, a database is called “operational” or “transactional” if it is used in the operation of the company (e.g., to register and contact customers, or manage the accounting), or “decisional” if it is used to make decisions (e.g., to create dashboards or predictive models). According to the location of the files, a database can be centralized if they reside in a single machine, or distributed if they reside in many different machines.
Database Management System (DBMS): Software system that manages data. The most popular ones are those following the Relational model (a.k.a. RDBMSs), or the more modern mixture of Relational and object features (a.k.a. Object-Relational DBMSs). A Distributed DBMS (a.k.a. DDBMS) is one able to manage distributed databases; otherwise, it is said to be centralized.
Entry: Smallest component of indexes, composed by a key and the information associated to it. In B-trees, we find
entries in the leaves, while in hash indexes, the entries are in the buckets.
ETL: Data flow in charge of extracting the data from the sources, transforming them (i.e., formatting, normalizing,
checking constraints and quality, cleaning, integrating, etc.), and finally loading the result in the DW.
Metadata: Data that describe the user data (e.g., their schema, their data types, their location, their integrity constraints, their creation date, their description, their owner). They are usually stored in the catalog of the DBMS, or in an external, dedicated repository.
Process tree: Representation at the physical level of the access plan of a query.
Relation: Set of tuples representing objects or facts in the real world. They are usually represented as tables and stored in files, but they can also be derived in the form of views. Each tuple then corresponds to a row in the table and a record in the file. Tuples, in turn, are composed of attributes, which correspond to columns in the table and fields in the records of the file. In statistical terms, a relation represents a dataset containing instances described by features.
Relational Model: Theoretical model of traditional databases consisting of tables with rows (a.k.a. tuples) and
columns (a.k.a. attributes), defined by Edgar Codd in 1970. Its main structure is the Relation.
Schema: Description of the data in a database following the corresponding data model (e.g., in the Relational
model, we describe the data by giving the relation name, attribute names, data types, primary keys, foreign
keys, etc.). We can find schemas representing the same data at the conceptual (i.e., close to human con-
cepts and far from performance issues; e.g., UML), logical (i.e., theoretico-mathematical abstraction; e.g.,
Relational model) or physical (i.e., concrete data structures used in a DBMS; e.g., indexes and partitions in
PostgreSQL) levels.
Selectivity [Factor]: Percentage of data returned by a query or algebraic operator, with regard to the maximum possible number of elements returned. We say that selectivity is “low” when the query or operator returns most of the elements in the input (i.e., the percentage is high).
Syntactic tree: Abstraction of the process tree in terms of Relational algebra operators.
Transaction: Set of operations over a database that are executed as a whole (even if they affect different files/datasets/tables). Transactions are a key concept in operational databases.
Transactional: Refers to databases or systems (a.k.a. OLTP) based on transactions, and mainly used in the opera-
tion of the business (i.e., not decisional/analytical).