Data Warehousing - C02 - OLAP

The document discusses OLAP and how it supports analytical processing and ad-hoc querying of data warehouses. It describes challenges with traditional querying and how OLAP addresses these by pre-computing answers to all possible queries through data cubes that allow fast drill-down, roll-up, slice and dice, and pivot operations.

Uploaded by

Thanh Hà Trần

Data Warehousing

Lecture-2
Online Analytical Processing
(OLAP)

1
DWH & OLAP

•Relationship between DWH & OLAP

•Data Warehouse & OLAP go together.

•Analysis supported by OLAP

2
Supporting the human thought process
THOUGHT PROCESS
• An enterprise-wide fall in profit.
• Profit down by a large percentage, consistently, during the last quarter only. Rest is OK.
• What is special about the last quarter?
• Products alone doing OK, but the North region is most problematic.
• OK. So the problem is the high cost of products purchased in the North.

QUERY SEQUENCE
• What were the quarterly sales during last year?
• What were the quarterly sales at regional level during last year?
• What were the quarterly sales at product level during last year?
• What were the monthly sales for last quarter, grouped by product?
• What were the monthly sales for last quarter, grouped by region?
• What were the monthly sales of products in the North at store level, grouped by products purchased?

How many such query sequences can be programmed in advance?


3
Analysis of last example
•Analysis is Ad-hoc
•Analysis is interactive (user driven)
•Analysis is iterative
• Answer to one question leads to a dozen more

•Analysis is directional (more in subsequent slides)
• Drill Down
• Roll Up
• Pivot

4
Challenges…
•Not feasible to write predefined queries.
• Fails to remain user-driven (becomes programmer-driven).
• Fails to remain ad-hoc and hence is not interactive.

•Enable ad-hoc query support
• Business user cannot build his/her own queries (does not know SQL, should not know it).
• On-the-go SQL generation and execution too slow.

5
Challenges

•Contradiction
• Want to compute answers in advance, but don't know
the questions

•Solution
• Compute answers to “all” possible “queries”. But
how?

• NOTE: Queries are multidimensional aggregates at some level

6
OLAP: Facts & Dimensions

• FACTS: Quantitative values (numbers) or “measures.”


• e.g., units sold, sales $, Co, Kg etc.

• DIMENSIONS: Descriptive categories.


• e.g., time, geography, product etc.

• Dimensions are often organized in hierarchies representing levels of detail in the data (e.g., week, month, quarter, year, decade etc.).
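
The split between facts and dimensions can be illustrated with a minimal Python sketch. The rows, the field names (`month`, `city`, `product`, `units_sold`) and the helper functions are hypothetical and chosen only for illustration: the measure is summed, while the dimension values decide how the sums are grouped, and moving from month to quarter is one step up the time hierarchy.

```python
from collections import defaultdict

# Hypothetical fact rows: three dimension values plus one measure (units_sold).
facts = [
    {"month": "2001-01", "city": "Karachi", "product": "Milk", "units_sold": 23},
    {"month": "2001-02", "city": "Karachi", "product": "Milk", "units_sold": 8},
    {"month": "2001-04", "city": "Lahore", "product": "Bread", "units_sold": 45},
]


def quarter_of(month: str) -> str:
    """Map a 'YYYY-MM' month to its quarter, one level up the time hierarchy."""
    year, mm = month.split("-")
    return f"{year}-Q{(int(mm) - 1) // 3 + 1}"


def total_by(rows, key_fn, measure="units_sold"):
    """Sum the measure for each group produced by key_fn."""
    totals = defaultdict(int)
    for row in rows:
        totals[key_fn(row)] += row[measure]
    return dict(totals)


by_month = total_by(facts, lambda r: r["month"])
by_quarter = total_by(facts, lambda r: quarter_of(r["month"]))
by_city = total_by(facts, lambda r: r["city"])
```

The same fact rows answer all three groupings; only the dimension (and the hierarchy level) used as the grouping key changes.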

7
Where Does OLAP Fit In?
• It is a classification of applications, NOT a database
design technique.

• Analytical processing uses multi-level aggregates, instead of record level access.

• Objective is to support very
I. fast
II. iterative and
III. ad-hoc decision-making.

8
Where does OLAP fit in?

[Figure: diagram containing the elements Transaction Data, Data Loading, Data Cube (MOLAP), OLAP, Presentation Tools, Reports and Decision Maker.]

9
OLTP vs. OLAP
Feature                          OLTP                      OLAP
Level of data                    Detailed                  Aggregated
Amount of data per transaction   Small                     Large
Views                            Pre-defined               User-defined
Typical write operation          Update, insert, delete    Bulk insert
"Age" of data                    Current (60-90 days)      Historical (5-10 years) and also current
Number of users                  High                      Low-Med
Tables                           Flat tables               Multi-dimensional tables
Database size                    Med (10^9 B – 10^12 B)    High (10^12 B – 10^15 B)
Query optimizing                 Requires experience       Already "optimized"
Data availability                High                      Low-Med
10
OLAP FASMI Test
Fast: Delivers information to the user at a fairly constant rate. Most queries answered in under five seconds.

Analysis: Performs basic numerical and statistical analysis of the data, pre-defined by an application developer or defined ad hoc by the user.

Shared: Implements the security requirements necessary for sharing potentially confidential data across a large user population.

Multi-dimensional: The essential characteristic of OLAP.

Information: Accesses all the data and information necessary and relevant for the application, wherever it may reside and not limited by volume.

...from the OLAP Report by Pendse and Creeth.

11
OLAP Implementations

1. MOLAP: OLAP implemented with a multi-dimensional data structure.

2. ROLAP: OLAP implemented with a relational database.

3. HOLAP: OLAP implemented as a hybrid of MOLAP and ROLAP.

4. DOLAP: OLAP implemented for desktop decision support environments.

12
Multidimensional OLAP (MOLAP)

13
MOLAP Implementations
OLAP has historically been implemented using a multi-dimensional data structure or “cube”.

▣ Dimensions are key business factors for analysis:
◾Geographies (city, district, division, province,...)
◾Products (item, product category, product department,...)
◾Dates (day, week, month, quarter, year,...)

▣ Very high performance achieved by O(1) time lookup into the “cube” data structure to retrieve pre-aggregated results.
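
The idea of a constant-time lookup into pre-aggregated results can be sketched as follows. This is only an illustration under stated assumptions, not how any particular MOLAP product stores its cubes: the rows, the `ALL` marker and `build_cube` are hypothetical, and a Python dictionary (average-case constant-time lookup) stands in for the multi-dimensional array.

```python
from collections import defaultdict
from itertools import product

ALL = "ALL"  # hypothetical marker meaning "aggregated over this dimension"

# Hypothetical detail rows: (product, zone, week, sales).
rows = [
    ("Milk", "N", "w1", 23),
    ("Milk", "S", "w1", 7),
    ("Bread", "N", "w2", 8),
]


def build_cube(detail_rows):
    """Pre-aggregate every combination of (value or ALL) for each dimension."""
    cube = defaultdict(int)
    for prod, zone, week, sales in detail_rows:
        # Each detail row contributes to 2**3 = 8 cells of the cube.
        for key in product((prod, ALL), (zone, ALL), (week, ALL)):
            cube[key] += sales
    return dict(cube)


cube = build_cube(rows)

# Query time: a single hash lookup, with no scan of the detail rows.
milk_total = cube[("Milk", ALL, ALL)]
north_total = cube[(ALL, "N", ALL)]
grand_total = cube[(ALL, ALL, ALL)]
```

All the work is done at load time, which is why queries are fast and why loading the cube is expensive.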

14
MOLAP Implementations
🞭 No standard query language for querying MOLAP
- No SQL !

🞭 Vendors provide proprietary languages allowing business users to create queries that involve pivots, drilling down, or rolling up.
- E.g. MDX of Microsoft

- Languages generally involve extensive visual (click and drag) support.

- Application Programming Interfaces (APIs) are also provided for probing the cubes.

15
Aggregations in MOLAP

• Sales volume as a function of (i) product, (ii) time, and (iii) geography

• A cube structure created to handle this.

Dimensions: Product, Geography, Time
• Product: Industry, Category, Product
• Geography: Province, Division, District, City
• Time: Year, Quarter, Month / Week, Day

[Figure: sales cube with axes Product (Milk, Bread, Eggs, Butter, Jam, Juice), Time (w1 to w6) and Zone (N, E, W, S); the values 23, 8, 45, 13, 12 and 10 are shown against the six products.]
16
Cube operations
• Drill down: get more details
• e.g., given summarized sales as above, find the breakdown of sales by city within each region, or within Sindh

• Rollup: summarize data
• e.g., given sales data, summarize sales for last year by product category and region

• Slice and dice: select and project
• e.g.: Sales of soft-drinks in Karachi during last quarter

• Pivot: change the view of data
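
The four operations can be sketched on a tiny in-memory fact list. Everything here is an assumption made for illustration (the rows, the column layout in `COLS`, and the helper names `aggregate` and `select`); a real OLAP tool performs the same steps against the cube.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, quarter, region, city, category, amount).
FACTS = [
    (2001, "Q1", "Sindh", "Karachi", "Soft drinks", 100),
    (2001, "Q1", "Sindh", "Hyderabad", "Soft drinks", 40),
    (2001, "Q2", "Sindh", "Karachi", "Juices", 60),
    (2001, "Q2", "Punjab", "Lahore", "Soft drinks", 80),
]
COLS = {"year": 0, "quarter": 1, "region": 2, "city": 3, "category": 4}
AMOUNT = 5  # index of the measure


def aggregate(rows, dims):
    """Sum the amount for each combination of the given dimensions."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[COLS[d]] for d in dims)] += row[AMOUNT]
    return dict(totals)


def select(rows, **fixed):
    """Keep rows where every named dimension equals the given value."""
    return [r for r in rows if all(r[COLS[d]] == v for d, v in fixed.items())]


rolled_up = aggregate(FACTS, ["region"])             # roll up to region
drilled_down = aggregate(FACTS, ["region", "city"])  # drill down to city
sliced = select(FACTS, region="Sindh")               # slice: fix one dimension
diced = select(FACTS, city="Karachi", category="Soft drinks", quarter="Q1")
by_region_quarter = aggregate(FACTS, ["region", "quarter"])
pivoted = aggregate(FACTS, ["quarter", "region"])    # pivot: swap the axes
```

Drill-down and roll-up differ only in how many hierarchy levels appear in the grouping key; pivoting leaves the numbers unchanged and only reorders the axes.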

17
Querying the cube
[Figure: bar charts of sales for Juices and Soda Drinks. Yearly values (2001, 2002) drill down to quarterly values (Q1 to Q4 of each year) and roll up again; a further drill-down shows individual products (OJ, RK, 8UP, PK, MJ, BU, AJ) by quarter.]

18
Querying the cube: Pivoting
[Figure: two bar charts of the same sales data. The first shows Juices and Soda Drinks for 2001 and 2002; the pivoted chart shows 2001 and 2002 for each product (Orange juice, Mango juice, Apple juice, Rola-Kola, 8-UP, Bubbly-UP, Pola-Kola).]

19
MOLAP evaluation
Advantages of MOLAP:

• Instant response (pre-calculated aggregates).

• Impossible to ask a question without an answer.

• Value added functions (ranking, % change).

20
MOLAP evaluation
Drawbacks of MOLAP:

• Long load time (pre-calculating the cube may take days!).

• Very sparse cube (wastage of space) for high cardinality (sometimes in small hundreds).
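
Why high cardinality leads to a sparse cube can be seen with a little arithmetic, sketched below. The cardinalities and the number of populated cells are made-up figures used only to show the calculation.

```python
def cube_density(cardinalities, populated_cells):
    """Return (total cells in the cube, fraction of cells that hold data)."""
    total_cells = 1
    for cardinality in cardinalities:
        total_cells *= cardinality
    return total_cells, populated_cells / total_cells


# Hypothetical figures: 300 products x 200 stores x 365 days, with
# 1,000,000 (product, store, day) combinations that actually had a sale.
total, density = cube_density([300, 200, 365], 1_000_000)
```

With these assumed numbers the cube has 21.9 million cells and fewer than 5% of them hold data; the rest is allocated but empty space.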

21
MOLAP Implementation issues
Maintenance issue: Every data item received must be aggregated into every cube (assuming “to-date” summaries are maintained). Lot of work.

Storage issue: As dimensions get less detailed (e.g., year vs. day) cubes get much smaller, but storage consequences for building hundreds of cubes can be significant. Lot of space.

22
Partitioned Cubes
• To overcome the space limitation of MOLAP, the cube is partitioned.

• The divide-and-conquer cube partitioning approach helps alleviate the scalability limitations of MOLAP implementation.

• One logical cube of data can be spread across multiple physical cubes on separate (or same) servers.

• Ideal cube partitioning is completely invisible to end users.

• Performance degradation does occur in case of a join across partitioned cubes.

23
Partitioned Cubes: What does it look like?

[Figure: sales data cube (axes Time, Product, Geography) partitioned at a major cotton products sale outlet into Men's clothing, Children's clothing and Bed linen.]
24
Virtual Cubes
Used to query two dissimilar cubes by creating a third “virtual” cube by a join between two cubes.

• Logically similar to a relational view i.e. linking two (or more) cubes along common dimension(s).

• Biggest advantage is saving in space by eliminating storage of redundant information.

Example: Joining the store cube and the list price cube along the product dimension, to calculate the sale price without redundant storage of the sale price data.
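
A rough sketch of the store cube / list price cube example follows. The two dictionaries and the generator `virtual_sales` are hypothetical stand-ins for real cubes; the point is only that the joined values are computed on demand along the shared product dimension and never stored.

```python
# Hypothetical "store" cube: units sold per (product, store).
store_cube = {("P1", "S1"): 10, ("P1", "S2"): 4, ("P2", "S1"): 3}

# Hypothetical "list price" cube: price per product.
price_cube = {"P1": 25.0, "P2": 100.0}


def virtual_sales(store, prices):
    """Yield (product, store, amount) by joining the two cubes on product.

    Values are produced lazily, so the sale amounts are never materialized
    in either cube.
    """
    for (prod, store_id), units in store.items():
        yield prod, store_id, units * prices[prod]


# Materialized here only to inspect the result of the virtual cube.
sales = {(p, s): amount for p, s, amount in virtual_sales(store_cube, price_cube)}
```

This mirrors a relational view: the definition of the join is kept, not its result.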

25
Relational OLAP (ROLAP)

26
The necessity of ROLAP
Issue of scalability i.e. curse of dimensionality for MOLAP

• Deployment of significantly larger dimension tables, as compared to MOLAP, using secondary storage.

• Aggregate awareness allows using pre-built summary tables by some front-end tools.

• Star schema designs usually used to facilitate ROLAP querying (in next lecture).

27
ROLAP as a “Cube”
🞭 OLAP data is stored in a relational database (e.g. a star schema)
🞭 The fact table can be visualized as an “un-rolled” cube.
🞭 So where is the cube?
🞭 It’s a matter of perception
🞭 Visualize the fact table as an elementary cube.

Fact Table

Month   Product   Zone   Sale K Rs.
M1      P1        Z1     250
M2      P2        Z1     500

[Figure: the fact table drawn as a cube with axes Product and Time.]
28
How to create “Cube” in ROLAP
• Cube is a logical entity containing values of a certain fact at a
certain aggregation level at an intersection of a combination of
dimensions.

• The following table can be created using 3 queries

SUM(Sales_Amt) by Product_ID (rows) and Month_ID (columns):

Product_ID   M1   M2   M3   ALL
P1
P2
P3
Total
29
How to create “Cube” in ROLAP using SQL
🞭 For the table entries, without the totals
SELECT S.Month_Id, S.Product_Id,
SUM(S.Sales_Amt)
FROM Sales S
GROUP BY S.Month_Id, S.Product_Id;

🞭 For the row totals
SELECT S.Product_Id, SUM(S.Sales_Amt)
FROM Sales S
GROUP BY S.Product_Id;

🞭 For the column totals
SELECT S.Month_Id, SUM(S.Sales_Amt)
FROM Sales S
GROUP BY S.Month_Id;
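
The three GROUP BY queries can be tried out with Python's built-in `sqlite3` module, as in the sketch below. The table contents are made up for illustration, the `Sales` table is given the alias `S` so that the `S.` prefixes resolve, and `ORDER BY` is added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Sales (Month_Id TEXT, Product_Id TEXT, Sales_Amt INTEGER)"
)
# Hypothetical detail rows; note the two separate sales of P1 in M1.
conn.executemany(
    "INSERT INTO Sales VALUES (?, ?, ?)",
    [("M1", "P1", 250), ("M1", "P1", 50), ("M2", "P1", 100), ("M2", "P2", 500)],
)

# Table entries, without the totals.
cells = conn.execute(
    "SELECT S.Month_Id, S.Product_Id, SUM(S.Sales_Amt) "
    "FROM Sales S GROUP BY S.Month_Id, S.Product_Id "
    "ORDER BY S.Month_Id, S.Product_Id"
).fetchall()

# Row totals (one per product).
row_totals = conn.execute(
    "SELECT S.Product_Id, SUM(S.Sales_Amt) "
    "FROM Sales S GROUP BY S.Product_Id ORDER BY S.Product_Id"
).fetchall()

# Column totals (one per month).
col_totals = conn.execute(
    "SELECT S.Month_Id, SUM(S.Sales_Amt) "
    "FROM Sales S GROUP BY S.Month_Id ORDER BY S.Month_Id"
).fetchall()

conn.close()
```

Each query scans the detail table again, which is the inefficiency the slides go on to discuss.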

30
Problem With Simple Approach
• Number of required queries increases exponentially with the increase in number of dimensions.

• It's wasteful to compute all queries.

• In the example, the first query can do most of the work of the other two queries

• If we could save that result and aggregate over Month_Id and Product_Id, we could compute the other queries more efficiently
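
The saving can be sketched in a few lines of Python: once the result of the finest query is kept, the totals are obtained by summing over that small result rather than over the detail rows. The dictionary `base` is a hypothetical saved result, and `collapse` is an illustrative helper, not part of any library.

```python
from collections import defaultdict

# Hypothetical saved result of the first query:
# (Month_Id, Product_Id) -> SUM(Sales_Amt).
base = {("M1", "P1"): 300, ("M2", "P1"): 100, ("M2", "P2"): 500}


def collapse(aggregate, keep_index):
    """Sum the saved aggregate over every key position except keep_index."""
    totals = defaultdict(int)
    for key, amount in aggregate.items():
        totals[key[keep_index]] += amount
    return dict(totals)


col_totals = collapse(base, 0)    # per month, aggregated over Product_Id
row_totals = collapse(base, 1)    # per product, aggregated over Month_Id
grand_total = sum(base.values())  # the Total / ALL cell
```

This works because SUM is additive; measures such as averages or distinct counts would need extra care.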

31
CUBE Clause

• The CUBE clause is part of SQL:1999

• GROUP BY CUBE (v1, v2, …, vn)

• Equivalent to a collection of GROUP BYs, one for each of the subsets of v1, v2, …, vn
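
The "one GROUP BY per subset" reading of CUBE can be emulated in plain Python, as sketched below. This is an emulation for illustration only, not a database implementation: the function name `cube` and the sample rows are assumptions, and `None` plays the role that NULL plays in SQL for a column that has been aggregated away.

```python
from collections import defaultdict
from itertools import combinations


def cube(rows, dims, measure):
    """Emulate GROUP BY CUBE(dims): one grouping per subset of dims.

    Dimensions that are aggregated away appear as None in the result key.
    Assumes None is not used as a real dimension value.
    """
    result = {}
    for size in range(len(dims) + 1):
        for subset in combinations(dims, size):
            groups = defaultdict(int)
            for row in rows:
                key = tuple(row[d] if d in subset else None for d in dims)
                groups[key] += row[measure]
            result.update(groups)
    return result


# Hypothetical rows matching the Sales example.
sales_rows = [
    {"Month_Id": "M1", "Product_Id": "P1", "Sales_Amt": 300},
    {"Month_Id": "M2", "Product_Id": "P1", "Sales_Amt": 100},
    {"Month_Id": "M2", "Product_Id": "P2", "Sales_Amt": 500},
]
cube_result = cube(sales_rows, ["Month_Id", "Product_Id"], "Sales_Amt")
```

With n dimensions there are 2^n subsets, which is the exponential growth in queries mentioned earlier; here n = 2 gives four groupings.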

32
ROLAP & Space Requirement
If one is not careful, with the increase in number of dimensions, the number of summary tables gets very large

Consider the example discussed earlier with the following two dimensions on the fact table...

Time: Day, Week, Month, Quarter, Year, All Days

Product: Item, Sub-Category, Category, All Products

33
EXAMPLE: ROLAP & Space Requirement
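A naïve implementation will require all combinations of summary tables at each and every aggregation level.

With 6 Time levels and 4 Product levels this gives 6 × 4 = 24 summary tables; adding in geography results in 120 tables (which corresponds to five Geography levels).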
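
The count is simply the product of the number of levels in each dimension, as the short sketch below shows. The Time and Product levels are those listed on the previous slide; the five Geography levels are an assumption chosen to be consistent with the figure of 120.

```python
from math import prod

levels = {
    "Time": ["Day", "Week", "Month", "Quarter", "Year", "All Days"],
    "Product": ["Item", "Sub-Category", "Category", "All Products"],
}


def summary_table_count(dimension_levels):
    """One summary table per combination of one level from each dimension."""
    return prod(len(v) for v in dimension_levels.values())


two_dims = summary_table_count(levels)

# Assumption: a Geography dimension with five levels.
levels_with_geo = dict(
    levels,
    Geography=["City", "District", "Division", "Province", "All Geographies"],
)
three_dims = summary_table_count(levels_with_geo)
```

Every extra dimension multiplies the count, which is why the naïve approach does not scale.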

34
ROLAP Issues

• Maintenance.

• Non-standard hierarchy of dimensions.

• Non-standard conventions.

• Explosion of storage space requirement.

• Aggregation pitfalls.

35
ROLAP Issue: Maintenance

Summary tables are more of a maintenance issue (similar to MOLAP) than a storage issue.
• Notice that summary tables get much smaller as dimensions get less detailed (e.g., year vs. day).
• Should plan for twice the size of the unsummarized data for ROLAP summaries in most environments.
• Assuming "to-date" summaries, every detail record that is received into the warehouse must aggregate into EVERY summary table.

36
ROLAP Issue: Hierarchies
Dimensions are NOT always simple hierarchies
Dimensions can be more than simple hierarchies such as item, subcategory, category, etc.
The product dimension might also branch off by trade style that crosses simple hierarchy boundaries, such as:
• Looking at sales of air conditioners that cross manufacturer boundaries, such as COY1, COY2, COY3 etc.
• Looking at sales of all “green colored” items that even cross product categories (washing machine, refrigerator, split-AC, etc.).
• Looking at a combination of both.
37
ROLAP Issue: Convention
Conventions are NOT absolute

Example: What is a calendar year? What is a week?

• Calendar:
01 Jan. to 31 Dec. or
01 Jul. to 30 Jun. or
01 Sep. to 31 Aug.

• Week:
Mon. to Sat. or Thu. to Wed.

38
ROLAP Issue: Storage space explosion

Summary tables required for non-standard grouping

Summary tables required along different definitions of year, week etc.

Brute force approach would quickly overwhelm the system storage capacity due to a combinatorial explosion.

39
ROLAP Issues: Aggregation pitfalls

• Coarser granularity correspondingly decreases potential cardinality.

• Aggregating whatever that can be aggregated.

• Throwing away the detail data after aggregation.

40
How to Reduce Summary tables?
Many ROLAP products have developed means to reduce the number of summary tables by:

• Building summaries on-the-fly as required by end-user applications.

• Enhancing performance on common queries at coarser granularities.

• Providing smart tools to assist DBAs in selecting the "best" aggregations to build i.e. trade-off between speed and space.

41
Performance vs. Space Trade-Off

• Maximum performance boost implies using lots of disk space for storing every pre-calculation.

• Minimum performance boost implies no disk space with zero pre-calculation.

• Using meta data to determine best level of pre-aggregation from which all other aggregates can be computed.
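
One way to picture the use of meta data is a small "aggregate navigator" that picks the cheapest stored summary able to answer a query. The sketch below is hypothetical throughout: the summary names, row counts and level lists are invented, Week is left out because weeks do not nest cleanly inside months, and the measures are assumed to be additive.

```python
# Hypothetical hierarchy levels, ordered from finest to coarsest.
TIME_LEVELS = ["Day", "Month", "Quarter", "Year"]
PRODUCT_LEVELS = ["Item", "Category"]

# Hypothetical meta data about the summary tables that have been built.
SUMMARIES = [
    {"name": "sales_day_item", "time": "Day", "product": "Item", "rows": 1_000_000},
    {"name": "sales_month_item", "time": "Month", "product": "Item", "rows": 40_000},
    {"name": "sales_quarter_category", "time": "Quarter", "product": "Category", "rows": 200},
]


def can_answer(summary, time_level, product_level):
    """A summary can answer a query if it is at least as fine on every dimension."""
    return (
        TIME_LEVELS.index(summary["time"]) <= TIME_LEVELS.index(time_level)
        and PRODUCT_LEVELS.index(summary["product"]) <= PRODUCT_LEVELS.index(product_level)
    )


def best_summary(time_level, product_level):
    """Pick the smallest stored summary from which the query can be computed."""
    candidates = [s for s in SUMMARIES if can_answer(s, time_level, product_level)]
    return min(candidates, key=lambda s: s["rows"])["name"]


yearly_by_category = best_summary("Year", "Category")
quarterly_by_item = best_summary("Quarter", "Item")
daily_by_item = best_summary("Day", "Item")
```

Storing fewer summaries saves space but pushes more queries onto larger tables, which is the trade-off described above.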

42
Performance vs. Space Trade-off using Wizard

[Figure: % Gain (20 to 100) plotted against aggregation size in MB (2 to 8), with one region labelled "Aggregation answers most queries" and another labelled "Aggregation answers few queries".]

Hybrid OLAP (HOLAP)

44
HOLAP

• Target is to get the best of both worlds.
• HOLAP is a combination of ROLAP and MOLAP
• HOLAP (Hybrid OLAP) allows co-existence of pre-built MOLAP cubes alongside relational OLAP or ROLAP structures.
• HOLAP servers allow storing large volumes of detailed data
45
Other Types of OLAP

• Web OLAP (WOLAP)
• Desktop OLAP (DOLAP)
• Mobile OLAP (MOLAP; not to be confused with Multidimensional OLAP)
• Spatial OLAP (SOLAP)
• Real-time OLAP (ROLAP; not to be confused with Relational OLAP)
• Cloud OLAP (COLAP)
• Big Data OLAP (BOLAP)
• In-memory OLAP (IOLAP)

46
