Lecture 10: Online Analytical Processing (OLAP)
Learning Goals
What is OLAP?
Relationship between DWH & OLAP
Data Warehouse & OLAP go together.
Analysis supported by OLAP
The data warehouse provides the best support for analysis, while OLAP carries out the
analysis task. Although there is more to OLAP technology than data warehousing, the
classic statement of OLAP is "decision making is an iterative process which must
involve the users". This statement implicitly points to the not-so-obvious fact that while
OLTP systems keep the users' records and assist them in performing the routine
operational tasks of their daily work, OLAP systems generate the information users
require to do the non-routine parts of their jobs. OLAP systems deal with unanticipated
circumstances, for which it is very difficult, if not impossible, to plan ahead. If the data
warehouse does not provide users with helpful information quickly and in an intuitive
format, the users will be put off by it and will revert to their old, familiar, yet inefficient
procedures. Data warehouse developers sometimes refer to the importance of
intuitiveness to the users as the "field of dreams" scenario, i.e. naively assuming that if
you build it, they will come and use it.
One valid option is to start with reference to time, and the query corresponding to the
thought process of the decision maker turns out to be "What were the quarterly sales
during the last year?" The result of this query would be sales figures for the four
quarters of the last year. There are a number of possible outcomes: for example, sales
could be down in all four quarters, or down in three out of four quarters (with four
possible combinations), or down in any two quarters (with six possible combinations).
For the sake of simplicity, we assume here that sales were down in only one quarter. If
the results were displayed as histograms, the decision maker would immediately pick
out the quarter in which sales were down, which in this case turns out to be the last
quarter.
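As a concrete illustration, here is a minimal pandas sketch of such a quarterly aggregate; the table and its column names (date, amount) are hypothetical and not taken from the lecture's data set:

```python
import pandas as pd

# Hypothetical transaction-level sales data; column names are illustrative only.
sales = pd.DataFrame({
    "date":   pd.to_datetime(["2023-02-10", "2023-05-03", "2023-08-21", "2023-11-15",
                              "2023-03-07", "2023-12-02"]),
    "amount": [120.0, 150.0, 140.0, 60.0, 130.0, 55.0],
})

# "What were the quarterly sales during the last year?"
quarterly = sales.groupby(sales["date"].dt.quarter)["amount"].sum()
print(quarterly)   # one total per quarter; a histogram of these values would
                   # immediately expose the quarter in which sales were down
```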
Now that the loss-making quarter has been identified, the decision maker would naturally like to
probe further, to get to the root of the problem, and would like to know what happened
during the last quarter that resulted in a fall in sales. The objective here is to identify the
root cause; once identified, the problem can be fixed. So the obvious question that comes
to the decision maker's mind is: what is special about the last quarter? Before jumping
directly into the last quarter and drawing hasty conclusions, the careful decision maker
wants to look at the sales data of the whole last year, but from a different perspective;
therefore, his/her question is translated into two queries: (i) What were the quarterly sales
at the regional level during the last year? (ii) What were the quarterly sales at the product
level during the last year? There could be another sequence of questions, i.e. probing
deeper into the loss-making quarter directly, which is followed in the next sequence of queries.
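The two perspectives correspond to two group-by aggregates. A minimal sketch, again with hypothetical column names (quarter, region, product, amount), might look like this:

```python
import pandas as pd

# Hypothetical detail table; the column names are assumptions for illustration.
sales = pd.DataFrame({
    "quarter": [1, 1, 2, 3, 4, 4, 4],
    "region":  ["North", "South", "North", "South", "North", "North", "South"],
    "product": ["A", "B", "A", "B", "A", "B", "A"],
    "amount":  [100, 90, 110, 95, 40, 35, 90],
})

# (i) Quarterly sales at the regional level
by_region = sales.groupby(["quarter", "region"])["amount"].sum().unstack("region")

# (ii) Quarterly sales at the product level
by_product = sales.groupby(["quarter", "product"])["amount"].sum().unstack("product")

print(by_region)
print(by_product)
```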
It may seem that the decision maker was overly careful, but he was not. Recognize that
the benefit of looking at a whole year of data is that it enables the decision maker to
identify whether there was a declining trend that resulted in the lowest sales during the
last quarter, or whether sales went down only during the last quarter. The finding was
that sales went down only during the last quarter, so the decision maker probed further
by looking at the sales on a monthly basis during the last quarter, with reference to
product as well as with reference to region. Here is the finding: the sales were doing all
right with reference to product, but they were found to be down with reference to
region, specifically the Northern region.
Now the decision maker would like to know what happened in the Northern region in
the last quarter with reference to the products, as looking at the Northern region alone
would not give the complete picture. Although not explicitly shown in the figure, the
current exploratory query of the decision maker is further processed by taking the sales
figures of the last quarter on a monthly basis, coupled with the products purchased at
store level. Now the hard work pays off, as the problem has finally been identified, i.e.
the high cost of products purchased during the last quarter in the Northern region.
Actually it was not hard work in the literal sense: using an OLAP tool, this analysis
would have taken no more than a couple of minutes, maybe even less.
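The drill-down described above can be sketched in the same style; the monthly detail table and its columns (month, region, product, store, sales, cost) are purely illustrative assumptions:

```python
import pandas as pd

# Hypothetical monthly detail for the last quarter; names are illustrative.
q4 = pd.DataFrame({
    "month":   [10, 10, 11, 11, 12, 12],
    "region":  ["North", "South", "North", "South", "North", "South"],
    "product": ["A", "B", "A", "B", "B", "A"],
    "store":   ["N-01", "S-01", "N-02", "S-01", "N-01", "S-02"],
    "sales":   [20, 45, 18, 47, 15, 46],
    "cost":    [18, 30, 17, 31, 14, 30],
})

# Step 1: monthly sales in the last quarter, by region -- the Northern region stands out.
print(q4.groupby(["month", "region"])["sales"].sum().unstack("region"))

# Step 2: drill into the Northern region at store/product level, bringing in cost.
north = q4[q4["region"] == "North"]
print(north.groupby(["month", "store", "product"])[["sales", "cost"]].sum())
```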
Analysis is Ad-hoc
Analysis is interactive (user driven)
Analysis is iterative
Answer to one question leads to a dozen more
Analysis is directional (see the sketch after this list)
Drill Down
Roll Up
Drill Across
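As a rough sketch of these three directions, assuming two hypothetical fact tables that share the same time dimensions: roll-up moves to a coarser level, drill-down moves to a finer level, and drill-across combines facts over shared dimensions.

```python
import pandas as pd

# Hypothetical monthly sales and returns facts sharing the (quarter, month) dimensions.
sales = pd.DataFrame({"quarter": [4, 4, 4], "month": [10, 11, 12], "sales": [53, 65, 61]})
returns = pd.DataFrame({"quarter": [4, 4, 4], "month": [10, 11, 12], "returns": [5, 9, 12]})

# Roll up: aggregate months to the coarser quarter level.
rolled_up = sales.groupby("quarter")["sales"].sum()

# Drill down: go from the quarter back to its constituent months.
drilled_down = sales[sales["quarter"] == 4].set_index("month")["sales"]

# Drill across: combine two facts over their shared dimensions.
drilled_across = sales.merge(returns, on=["quarter", "month"])

print(rolled_up, drilled_down, drilled_across, sep="\n")
```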
Now if we carefully analyze the last example, some points become very clear. The first
is that analysis is ad-hoc, i.e. there is no known order to the sequence of questions. The
order of the questions actually depends on the outcomes of the questions already asked,
the domain knowledge of the decision maker, and the inferences made by the decision maker
about the results retrieved. Furthermore, it is not possible to perform such an analysis
with a fixed set of queries, as it is not possible to know in advance what questions will
be asked and, more importantly, what answers or results will be returned for those
questions. In a nutshell, not only is the querying ad-hoc, but the universal set of
pre-programmed queries would also be very large; hence it is a very complex problem.
As you may have experienced while using an MIS system, or while developing one,
such systems are programmer driven. One might disagree with that, arguing that
development comes after system analysis, when the user has given his/her requirements.
The problem with decision support systems is that the users are unaware of their
requirements until they have used the system. Furthermore, the number and type of
queries are fixed as set by the programmer; thus, if the decision maker has to use the
OLTP system for decision making, he is forced to use only those queries that have been
pre-written, and the process of decision making is therefore not driven by the decision
maker. As we have seen, decision making should be interactive, i.e. there should be
nothing in between the decision support system and the decision maker.
In a traditional MIS system, there is an almost linear sequence of queries, i.e. one query
followed by another, with a limited number of queries to go back to and almost fixed
and rigid results. As a consequence, running the same set of queries in the same order
will not result in anything new. As we discussed in the example, the decision support
process is exploratory; depending on the results obtained, the queries are further
fine-tuned or the data is probed more deeply, so it is an iterative process. Going through
this loop of queries as required by the decision maker fine-tunes the results, and the
process ultimately produces the answers to the questions.
10.3 Challenges
Now that we have identified how decision making proceeds, this could lure someone
with an OLTP or MIS mindset into approaching decision support using the traditional
tools and techniques. Let's look at this (wrong) approach one point at a time. We
identified one very important point: decision support requires many more, and much
richer, queries than an MIS system. But even if more queries are written, the queries
still remain predefined, and the process of decision making still continues to be driven
by the programmer (who is not a decision maker) instead of the decision maker. When
something is programmed, it gets more or less hard-wired, i.e. it cannot be changed
significantly, although minor changes are possible by changing parameter values.
That is where the fault lies, i.e. hard-wiring, because decision making has no particular
order associated with it.
One could naively say: OK, if the decision making process has to be user driven, let the
end user or business user do the query development himself/herself. This is a very
unrealistic assumption, as the business user is not a programmer, does not know SQL,
and does not need to know SQL. An apparently realistic solution would be to generate
the SQL on the fly. This approach does make the system user driven without requiring
the user to write SQL, but even if on-the-fly SQL generation can be achieved, the data
sets used in a data warehouse are so large that there is no way the system can remain
interactive!
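To make the idea of on-the-fly SQL generation concrete, here is a purely illustrative sketch; the generate_sql helper, the sales_fact table, and the column names are assumptions, not part of any particular tool:

```python
# A purely illustrative generator: the user picks dimensions and a measure,
# and the tool emits the corresponding aggregate query.
def generate_sql(dimensions, measure, table="sales_fact"):
    dims = ", ".join(dimensions)
    return (f"SELECT {dims}, SUM({measure}) AS total_{measure}\n"
            f"FROM {table}\n"
            f"GROUP BY {dims};")

print(generate_sql(["quarter", "region"], "sales_amount"))
```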
Challenges
Contradiction
Want to compute answers in advance, but don't know the questions
Solution
Compute answers to “all” possible “queries”. But how?
So what is the solution then? We need a paradigm shift from the way we have been
thinking. In a decision support environment, we are looking at the big picture; we are
interested in all sales figures across a time interval or at a particular point in time, but
not for a particular person. So the solution lies in pre-computing all possible aggregates,
which would then correspond to all possible queries. Here a query is a multidimensional
aggregate at a certain level of the hierarchy of the business.
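One way to picture "all possible queries" is to enumerate every subset of the dimensions and pre-compute the corresponding aggregate. The sketch below does this by brute force on a tiny hypothetical fact table; a real OLAP engine does the equivalent at scale:

```python
import itertools
import pandas as pd

# Hypothetical fact data with three dimensions and one measure.
facts = pd.DataFrame({
    "quarter": [1, 1, 2, 2],
    "region":  ["North", "South", "North", "South"],
    "product": ["A", "A", "B", "B"],
    "sales":   [100, 90, 110, 95],
})

dimensions = ["quarter", "region", "product"]
cube = {}

# One aggregate per subset of dimensions: 2^3 = 8 aggregates, including the grand total.
for r in range(len(dimensions) + 1):
    for dims in itertools.combinations(dimensions, r):
        if dims:
            cube[dims] = facts.groupby(list(dims))["sales"].sum()
        else:
            cube[()] = facts["sales"].sum()   # the "all" aggregate

print(cube[("quarter", "region")])
```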
To understand this concept, let's look at Figure-10.2, which shows multiple levels or
hierarchies of geography. Pakistan is logically divided into provinces, each province is
administratively divided into divisions, each division is divided into districts, and within
districts are cities. Consider a chain store with outlets in the major cities of Pakistan,
with large cities such as Karachi and Lahore having multiple outlets, as you may have
observed yourself. Now the sales actually take place at the business outlet, which could
be in a zone, such as Defense or Gulberg in our example; of course, there may be many
more stores in a city, as already discussed.
“All” possible queries (level aggregates)
The foundation for design in this environment is the use of dimensional modeling
techniques, which focus on the concepts of "facts" and "dimensions" for organizing
data. Facts are the quantities, i.e. numbers that can be aggregated (e.g. sales amount,
units sold, etc.), that we measure, and dimensions are how we filter and report on those
quantities (e.g. by geography, product, date, etc.). We will discuss dimensional modeling
techniques in detail; in fact, a number of lectures will be devoted to them.
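A minimal sketch of this idea, assuming a hypothetical fact table and a geography dimension that carries the hierarchy described above (zone, city, province), is shown below; the dimension is used to filter and report on the facts:

```python
import pandas as pd

# Hypothetical star-schema fragment: a fact table of measures keyed by dimension ids,
# and a geography dimension carrying the hierarchy described above.
geography = pd.DataFrame({
    "geo_id":   [1, 2, 3],
    "zone":     ["Defense", "Gulberg", "Saddar"],
    "city":     ["Lahore", "Lahore", "Karachi"],
    "province": ["Punjab", "Punjab", "Sindh"],
})

fact_sales = pd.DataFrame({
    "geo_id":     [1, 2, 3, 3],
    "date":       pd.to_datetime(["2023-10-01", "2023-10-01", "2023-10-02", "2023-11-05"]),
    "units_sold": [10, 7, 12, 9],
    "sales_amt":  [500.0, 350.0, 600.0, 450.0],
})

# Dimensions filter and group the facts, e.g. sales rolled up to the province level.
report = fact_sales.merge(geography, on="geo_id").groupby("province")["sales_amt"].sum()
print(report)
```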
Figure-10.3 shows that OLAP data cubes are generated from the transactional databases
(after Extract, Transform, Load). A multidimensional cube, i.e. a MOLAP cube, is
shown here. The data retrieved by exploring the MOLAP cube is then used as a source
to generate reports, charts, etc. In typical MOLAP environments, the results are
displayed as tables along with different types of charts, such as histograms, pie charts, etc.
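As a rough illustration of such a tabular display, the following sketch (hypothetical data) pivots a cube slice into the kind of cross-tab a MOLAP front end would render next to a chart:

```python
import pandas as pd

# Hypothetical cube slice displayed as a cross-tab.
data = pd.DataFrame({
    "quarter": [1, 1, 2, 2, 3, 3, 4, 4],
    "region":  ["North", "South"] * 4,
    "sales":   [100, 90, 110, 95, 105, 92, 55, 93],
})

# margins=True adds an "All" row and column, i.e. the roll-up totals.
crosstab = data.pivot_table(index="quarter", columns="region",
                            values="sales", aggfunc="sum", margins=True)
print(crosstab)   # such a table is typically shown alongside a bar or pie chart
```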
10.6 Difference between OLTP and OLAP
The table summarizes the fundamental differences between traditional OLTP systems
and typical OLAP applications. In OLTP operations, the user changes the database via
transactions on detailed data; a typical transaction in a banking environment may
transfer money from one account to another. In OLAP applications, the typical user is
an analyst who is interested in selecting data needed for decision support. He/she is
primarily not interested in detailed data, but usually in aggregates over large sets of
data, as these give the big picture. A typical OLAP query would be to find the average
amount of money withdrawn from an ATM by customers who are male and aged
between 15 and 25 years, from (say) Jinnah Super Market, Islamabad, after 8 pm. For
this kind of query there are no DML operations, and the DBMS contents do not change.
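Expressed over a hypothetical table of ATM withdrawals (all column names are assumptions), this read-only aggregate query might look as follows:

```python
import pandas as pd

# Hypothetical ATM withdrawal records; columns are assumptions for illustration.
withdrawals = pd.DataFrame({
    "gender":   ["M", "M", "F", "M"],
    "age":      [22, 19, 23, 40],
    "location": ["Jinnah Super Market"] * 4,
    "hour":     [21, 20, 22, 21],      # 24-hour clock
    "amount":   [3000, 5000, 2000, 10000],
})

# Read-only aggregate: no rows are inserted, updated, or deleted.
mask = ((withdrawals["gender"] == "M")
        & withdrawals["age"].between(15, 25)
        & (withdrawals["location"] == "Jinnah Super Market")
        & (withdrawals["hour"] >= 20))
print(withdrawals.loc[mask, "amount"].mean())
```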
Fast: Delivers information to the user at a fairly constant rate, i.e. O(1) time. Most
queries are answered in under 5 seconds.
Analysis: Performs basic numerical and statistical analysis of the data, pre-defined by
an application developer or defined ad hoc by the user.
Shared: Implements the security and concurrency requirements needed for sharing data
across a multi-user environment.
Multidimensional: Provides a multidimensional conceptual view of the data, including
full support for hierarchies.
Information: Accesses all the data and information necessary and relevant for the
application, wherever it may reside, and is not limited by volume.
Explanation:
FAST means that the system is targeted to deliver a response to most end-user queries
in under 5 seconds, with the simplest analyses taking no more than a second and very
few queries taking more than 20 seconds (for various reasons to be discussed). If
queries take significantly longer, the thought process is broken, as users are likely to get
distracted, and consequently the quality of analysis suffers. This speed is not easy to
achieve with large amounts of data, particularly if on-the-fly, ad-hoc complex
calculations are required.
ANALYSIS means that the system is capable of performing any business logic and
statistical analysis that is applicable to the application and the user, while keeping it
easy enough for the user, i.e. at the point-and-click level. It is absolutely necessary to
allow the target user to define and execute new ad-hoc calculations/queries as part of
the analysis and to view the results in any desired way/format, without having to do
any programming.
SHARED means that the system is not supposed to be a stand-alone system and should
implement all the security requirements for confidentiality (possibly down to the cell
level) in a multi-user environment. The other point somewhat contradicts the read-only
behavior discussed earlier, i.e. writing back: if multiple DML operations need to be
performed, concurrent update locking should be handled at an appropriate level. It is
true that not all applications require users to write data back, but it is the case for a
considerable majority.
MULTIDIMENSIONAL is the key requirement, to the letter, and lies at the heart of the
cube concept of OLAP. The system must provide a multidimensional logical view of
the aggregated data, including full support for hierarchies and multiple hierarchies, as
this is certainly the most logical way to analyze organizations and businesses. There
is no "magical" minimum number of dimensions that must be handled, as that is too
application specific.