Data Warehousing & DATA MINING (SE-409) : Lecture-4


Online Analytical Processing (OLAP)

Huma Ayub
Software Engineering department

University of Engineering and Technology, Taxila


DWH & OLAP

• Relationship between DWH & OLAP
• Data Warehouse & OLAP go together.
• Analysis supported by OLAP

Ahsan Abdullah 2
Supporting the human thought process

THOUGHT PROCESS
• An enterprise-wide fall in profit.
• Profit is down by a large percentage, consistently, during the last quarter only. The rest is OK.
• What is special about the last quarter?
• Products alone are doing OK, but the North region is most problematic.
• OK. So the problem is the high cost of products purchased in the North.

QUERY SEQUENCE
• What were the quarterly sales during last year?
• What were the quarterly sales at regional level during last year?
• What were the quarterly sales at product level during last year?
• What were the monthly sales for last quarter, grouped by product?
• What were the monthly sales for last quarter, grouped by region?
• What were the monthly sales of products in the North at store level, grouped by products purchased?

How many such query sequences can be programmed in advance?

Analysis of last example
• Analysis is ad hoc [no predefined sequence]
• Analysis is interactive (user-driven) [content changes with each click]
• Analysis is iterative
– Answer to one question leads to a dozen more
• Analysis is directional [more details in subsequent slides]
– Drill Down [Year -> month -> week]
– Roll Up
– Pivot

Challenges…
• Not feasible to write predefined queries.
– Fails to remain user-driven (becomes programmer-driven).
– Fails to remain ad hoc and hence is not interactive.

• Enable ad hoc query support
– Business users cannot build their own queries (they do not know SQL, and should not need to).
– On-the-go SQL generation and execution is too slow.

Challenges
• Contradiction
– Want to compute answers in advance, but don't know the questions

• Solution
– Compute answers to “all” possible “queries”. But how?
– NOTE: Queries are multidimensional aggregates at some level

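To make "all possible queries" concrete: with n dimensions there are 2^n ways to choose which dimensions to group by, and each choice is one multidimensional aggregate. The sketch below is only an illustration under assumed data; the fact rows, the dimension names, and the function all_aggregates are hypothetical and not taken from the lecture.

```python
from itertools import combinations
from collections import defaultdict

# Hypothetical fact rows: (product, region, quarter, sales)
FACTS = [
    ("Milk", "North", "Q1", 10),
    ("Milk", "South", "Q1", 5),
    ("Bread", "North", "Q2", 7),
    ("Bread", "North", "Q1", 3),
]
DIMENSIONS = ("product", "region", "quarter")

def all_aggregates(facts, dimensions):
    """Return {group-by dimensions: {key: summed sales}} for every subset of dimensions."""
    n = len(dimensions)
    result = {}
    for k in range(n + 1):
        for idx in combinations(range(n), k):
            totals = defaultdict(int)
            for row in facts:
                key = tuple(row[i] for i in idx)   # values of the chosen dimensions
                totals[key] += row[-1]             # last field is the measure
            result[tuple(dimensions[i] for i in idx)] = dict(totals)
    return result

cube = all_aggregates(FACTS, DIMENSIONS)
print(len(cube))                        # 8 group-bys for 3 dimensions
print(cube[()][()])                     # 25, the grand total
print(cube[("region",)][("North",)])    # 20, total sales in North
```

Every question in a drill-down or roll-up sequence then becomes a lookup into one of these precomputed group-bys instead of a fresh scan of the fact rows.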
“All” possible queries (level aggregates)
Level      Example members
ALL        ALL
Province   Frontier ... Punjab
Division   Mardan ... Peshawar; Lahore ... Multan
District   Peshawar; Lahore
City       Lahore ... Gujranwala
Zone       Defense ... Gulberg

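Once dimensions have hierarchies, each dimension can be fixed at any one of its levels or collapsed to ALL, so the number of level aggregates is the product of (levels + 1) over the dimensions. The sketch below counts them and rolls zone-level sales up the geography hierarchy. The geography levels and the Defense/Gulberg path follow the slide; the time and product hierarchies, the sales figures, and the function names are assumptions for illustration.

```python
from math import prod

# Levels per dimension, coarsest first. Geography is from the slide;
# the time and product hierarchies are illustrative assumptions.
HIERARCHIES = {
    "geography": ["province", "division", "district", "city", "zone"],
    "time": ["year", "quarter", "month", "day"],
    "product": ["category", "item"],
}

def count_level_aggregates(hierarchies):
    """Each dimension is fixed at one of its levels or collapsed to ALL."""
    return prod(len(levels) + 1 for levels in hierarchies.values())

# Ancestors of each zone, one entry per geography level (path from the slide).
ZONE_PATH = {
    "Defense": ("Punjab", "Lahore", "Lahore", "Lahore", "Defense"),
    "Gulberg": ("Punjab", "Lahore", "Lahore", "Lahore", "Gulberg"),
}
ZONE_SALES = {"Defense": 120, "Gulberg": 80}   # hypothetical figures

def roll_up(level):
    """Sum zone-level sales up to the given geography level."""
    depth = HIERARCHIES["geography"].index(level)
    totals = {}
    for zone, amount in ZONE_SALES.items():
        key = ZONE_PATH[zone][depth]
        totals[key] = totals.get(key, 0) + amount
    return totals

print(count_level_aggregates(HIERARCHIES))   # 6 * 5 * 3 = 90
print(roll_up("city"))                       # {'Lahore': 200}
```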
OLAP: Facts & Dimensions

• FACTS: Quantitative values (numbers) or “measures.”
– e.g., units sold, sales $, °C, Kg etc.

• DIMENSIONS: Descriptive categories.
– e.g., time, geography, product etc.
– Dimensions are often organized in hierarchies representing levels of detail in the data (e.g., week, month, quarter, year, decade etc.).

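As a rough illustration of the split between facts and dimensions, the sketch below derives the levels of a time hierarchy from a calendar date and attaches them to one fact row. The field names, the sample values, and the function time_dimension are hypothetical; the week number follows the ISO calendar, which is one possible convention.

```python
from datetime import date

def time_dimension(d):
    """Map a calendar date to the levels of a time hierarchy."""
    return {
        "day": d.isoformat(),
        "week": d.isocalendar()[1],           # ISO week number
        "month": d.month,
        "quarter": (d.month - 1) // 3 + 1,
        "year": d.year,
        "decade": d.year // 10 * 10,
    }

# One fact row: descriptive dimension values plus numeric measures.
fact = {
    "time": time_dimension(date(2024, 5, 15)),
    "geography": "Lahore",
    "product": "Milk",
    "units_sold": 12,     # measure
    "sales": 30.0,        # measure
}

print(fact["time"]["quarter"], fact["time"]["year"])   # 2 2024
```

The measures (units_sold, sales) are what gets summed; the dimension values decide which group each row falls into at a given level.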
Where does OLAP fit in?

[Figure: pipeline diagram with the elements Transaction Data, Data Loading (ELT), Data Cube (MOLAP), OLAP, Presentation Tools, Reports, and Decision Maker.]

OLTP vs. OLAP
Feature                         | OLTP                             | OLAP
Level of data                   | Detailed                         | Aggregated
Amount of data per transaction  | Small                            | Large
Views                           | Pre-defined [Programmer]         | User-defined
Typical write operation         | Update, insert, delete           | Bulk insert
“Age” of data                   | Current (60-90 days)             | Historical (5-10 years) and also current [Active DW]
Number of users                 | High                             | Low-Med
Tables                          | Flat tables [Highly normalized]  | Multi-dimensional tables
Database size                   | Med (10^9 B – 10^12 B)           | High (10^12 B – 10^15 B)
Query optimizing                | Requires experience              | Already “optimized”
Data availability               | High                             | Low-Med

OLAP FASMI Test
Fast: Delivers information to the user at a fairly constant rate. Most queries are answered in under five seconds.

Analysis: Performs basic numerical and statistical analysis of the data, pre-defined by an application developer or defined ad hoc by the user.

Shared: Implements the security requirements necessary for sharing potentially confidential data across a large user population.

Multi-dimensional: The essential characteristic of OLAP.

Information: Accesses all the data and information necessary and relevant for the application, wherever it may reside and not limited by volume.

...from the OLAP Report by Pendse and Creeth.

Multidimensional OLAP (MOLAP)

OLAP Implementations
1. MOLAP: OLAP implemented with a multi-dimensional data structure.
2. ROLAP: OLAP implemented with a relational database.
3. HOLAP: OLAP implemented as a hybrid of MOLAP and ROLAP.
4. DOLAP: OLAP implemented for desktop decision support environments.

MOLAP Implementations
OLAP has historically been implemented using a multi-dimensional data structure or “cube”.

• Dimensions are key business factors for analysis:
– Geographies (city, district, division, province,...)
– Products (item, product category, product department,...)
– Dates (day, week, month, quarter, year,...)

• Very high performance is achieved by O(1) time lookup into the “cube” data structure to retrieve pre-aggregated results.

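One way to picture the O(1) lookup is a flat array in which every dimension also has an ALL member: a query computes the cell's position arithmetically and reads one slot, with no scanning. This is a minimal sketch under assumed data, not how any particular MOLAP product stores its cubes; the member lists, facts, and function names are hypothetical.

```python
from itertools import product

# Members of each dimension, with an extra "ALL" member for pre-aggregated totals.
DIMS = [
    ["Milk", "Bread", "ALL"],     # product
    ["North", "South", "ALL"],    # geography
    ["w1", "w2", "ALL"],          # time
]
INDEX = [{m: i for i, m in enumerate(members)} for members in DIMS]
SIZES = [len(members) for members in DIMS]

def offset(coords):
    """Row-major position of a cell in the flat array."""
    pos = 0
    for index, size, member in zip(INDEX, SIZES, coords):
        pos = pos * size + index[member]
    return pos

def build_cube(facts):
    """Pre-aggregate: add each fact to its own cell and to every ALL cell covering it."""
    cells = [0] * (SIZES[0] * SIZES[1] * SIZES[2])
    for *coords, amount in facts:
        for target in product(*[(c, "ALL") for c in coords]):
            cells[offset(target)] += amount
    return cells

FACTS = [("Milk", "North", "w1", 23), ("Bread", "North", "w1", 8),
         ("Milk", "South", "w2", 5)]
CELLS = build_cube(FACTS)

def lookup(product_name, region, week):
    """Constant-time retrieval: one offset computation, one array read."""
    return CELLS[offset((product_name, region, week))]

print(lookup("Milk", "ALL", "ALL"))    # 28
print(lookup("ALL", "North", "w1"))    # 31
```

The cost has moved to load time: build_cube touches eight cells per fact here, which is the trade-off the later slides on MOLAP drawbacks describe.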
MOLAP Implementations
• No standard query language for querying MOLAP
- No SQL!

• Vendors provide proprietary languages allowing business users to create queries that involve pivots, drilling down, or rolling up.
- E.g. MDX of Microsoft
- Languages generally involve extensive visual (click and drag) support.
- Application Programming Interfaces (APIs) are also provided for probing the cubes.

Aggregations in MOLAP
 Sales volume as a function of (i) product, (ii) time,
and (iii) geography

 A cube structure created to handle this.


S
Dimensions: Product, Geography, Time W
E
Hierarchical summarization paths N
Milk
Industry Province Year 23

Product
Bread 8
Category Division Quarter Eggs 45
Butter 13
Product District Month Week 12
Jam
Juice 10
City Day
w1 w2 w3 w4 w5 w6

Zone Ahsan Abdullah


Time 16
Cube operations
• Drill down: get more details
– e.g., given summarized sales as above, find the breakdown of sales by city within each region, or within Sindh

• Rollup: summarize data
– e.g., given sales data, summarize sales for last year by product category and region

• Slice and dice: select and project
– e.g., sales of soft drinks in Karachi during last quarter

• Pivot: change the view of data

[Figures: illustrations of Roll-Up, Drill Down, Slice and Dice (two slides), and Pivot.]
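The cube operations can be sketched on a small dictionary of cells keyed by (product, city, month). This is an illustration under assumed data, not a vendor API; the cell values and the helper names group_by, slice_, dice, and pivot are hypothetical.

```python
from collections import defaultdict

AXES = ("product", "city", "month")
# Hypothetical base cells: (product, city, month) -> sales
CELLS = {
    ("Soft drinks", "Karachi", "Oct"): 40,
    ("Soft drinks", "Karachi", "Nov"): 35,
    ("Soft drinks", "Hyderabad", "Oct"): 12,
    ("Juice", "Karachi", "Oct"): 9,
}

def group_by(cells, keep):
    """Roll up: sum over every axis not listed in keep.
    Drilling down is the reverse: call again with more axes in keep."""
    idx = [AXES.index(a) for a in keep]
    out = defaultdict(int)
    for key, value in cells.items():
        out[tuple(key[i] for i in idx)] += value
    return dict(out)

def slice_(cells, axis, member):
    """Slice: fix one axis to a single member."""
    i = AXES.index(axis)
    return {k: v for k, v in cells.items() if k[i] == member}

def dice(cells, **allowed):
    """Dice: keep cells whose members fall in the allowed sets on several axes."""
    return {k: v for k, v in cells.items()
            if all(k[AXES.index(a)] in members for a, members in allowed.items())}

def pivot(table):
    """Pivot: swap the two axes of a two-dimensional view."""
    return {(b, a): v for (a, b), v in table.items()}

by_product = group_by(CELLS, ["product"])                 # rolled-up view
by_product_city = group_by(CELLS, ["product", "city"])    # drill down by city
karachi_soft = dice(CELLS, product={"Soft drinks"}, city={"Karachi"})

print(by_product[("Soft drinks",)])                       # 87
print(sum(karachi_soft.values()))                         # 75
print(pivot(by_product_city)[("Karachi", "Soft drinks")]) # 75
```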
MOLAP evaluation
Advantages of MOLAP:
• Instant response (pre-calculated aggregates).
• Impossible to ask a question without an answer.
• Value-added functions (sorted order, mean, pie charts, ranking, % change).

MOLAP evaluation
Drawbacks of MOLAP:
• Long load time (pre-calculating the cube may take days!): the curse of dimensionality.
• Very sparse cube (wastage of space) for high cardinality (sometimes even in the small hundreds), e.g., the number of heaters sold in Jacobabad or Sibi.

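Sparsity can be quantified as the fraction of cells in the full cube that actually hold data. The arithmetic below is a sketch with made-up sizes; the cardinalities and the count of non-empty cells are assumptions chosen only to show the calculation.

```python
from math import prod

def cube_density(cardinalities, non_empty_cells):
    """Fraction of cells in the full cube that actually hold data."""
    return non_empty_cells / prod(cardinalities)

# Hypothetical sizes: 1,000 products x 500 stores x 365 days.
CARDINALITIES = [1000, 500, 365]
NON_EMPTY = 1_825_000            # assumed number of (product, store, day) cells with sales

total_cells = prod(CARDINALITIES)
density = cube_density(CARDINALITIES, NON_EMPTY)

print(f"{total_cells:,} cells, {density:.0%} populated")   # 182,500,000 cells, 1% populated
```

A dense array would allocate all 182.5 million cells even though 99% of them are empty, which is the wastage of space the slide refers to.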
MOLAP Implementation issues
Maintenance issue: Every data item received must be aggregated into every cube (assuming “to-date” summaries are maintained). A lot of work.

Storage issue: As dimensions get less detailed (e.g., year vs. day) cubes get much smaller, but the storage consequences of building hundreds of cubes can be significant. A lot of space.

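Both issues show up in a small sketch that keeps one to-date summary cube per time granularity: every incoming data item must be added to every cube, and the coarser cubes end up with fewer cells. The layout, names, and sample facts are assumptions for illustration only.

```python
from collections import defaultdict
from datetime import date

# One to-date summary cube per time granularity (hypothetical layout).
CUBES = {"day": defaultdict(int), "month": defaultdict(int), "year": defaultdict(int)}

def time_key(level, d):
    """Label of the time member that date d belongs to at the given level."""
    if level == "day":
        return d.isoformat()
    if level == "month":
        return f"{d.year}-{d.month:02d}"
    return str(d.year)

def load_fact(product, d, amount):
    """Aggregate one incoming data item into every cube; return the number of cubes touched."""
    for level, cube in CUBES.items():
        cube[(product, time_key(level, d))] += amount
    return len(CUBES)

load_fact("Milk", date(2024, 1, 5), 10)
load_fact("Milk", date(2024, 1, 6), 4)
load_fact("Milk", date(2024, 2, 1), 6)

SIZES = {level: len(cube) for level, cube in CUBES.items()}
print(SIZES)                                # {'day': 3, 'month': 2, 'year': 1}
print(CUBES["year"][("Milk", "2024")])      # 20
```

With hundreds of cubes instead of three, the per-item work and the total storage grow accordingly, even though each coarser cube is individually small.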
Partitioned Cubes
• To overcome the space limitation of MOLAP, the cube is partitioned.
• The divide-and-conquer cube partitioning approach helps overcome the scalability limitations of a MOLAP implementation.
• One logical cube of data can be spread across multiple physical cubes on separate (or the same) servers.
• Ideal cube partitioning is completely invisible to end users.
• Performance degradation does occur in case of a join across partitioned cubes.

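A rough sketch of the idea: one logical cube object routes each cell to a physical partition chosen by product group, and queries are answered without the caller knowing which partitions were read. The class PartitionedCube, the product names, and the figures are hypothetical; the partition names follow the clothing example used in these slides.

```python
class PartitionedCube:
    """One logical cube whose cells live in several physical partitions."""

    def __init__(self, partition_of):
        self.partition_of = partition_of   # product -> partition name
        self.partitions = {}               # partition name -> {cell key: value}

    def add(self, product, region, month, amount):
        part = self.partitions.setdefault(self.partition_of[product], {})
        key = (product, region, month)
        part[key] = part.get(key, 0) + amount

    def total(self, product=None):
        """Return (sum, partitions scanned); product=None spans every partition."""
        if product is not None:
            names = [self.partition_of[product]]
        else:
            names = list(self.partitions)
        total = 0
        for name in names:
            for (p, _, _), value in self.partitions.get(name, {}).items():
                if product is None or p == product:
                    total += value
        return total, len(names)

cube = PartitionedCube({
    "Shirt": "Men's clothing",
    "Romper": "Children's clothing",
    "Bedsheet": "Bed linen",
})
cube.add("Shirt", "North", "Jan", 5)
cube.add("Shirt", "South", "Jan", 3)
cube.add("Romper", "North", "Jan", 2)
cube.add("Bedsheet", "North", "Feb", 4)

print(cube.total("Shirt"))   # (8, 1): one partition is enough
print(cube.total())          # (14, 3): the query spans all partitions
```

The second query illustrates the degradation noted above: a question that cuts across product groups has to visit every partition.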
Partitioned Cubes: How it looks

[Figure: sales data cube (axes Time, Product, Geography) partitioned at a major cotton products sale outlet into Men's clothing, Children's clothing, and Bed linen.]

Virtual Cubes
Used to query two dissimilar cubes by creating a third “virtual” cube by a join between two cubes.

• Logically similar to a relational view, i.e., linking two (or more) cubes along common dimension(s).
• Biggest advantage is saving in space by eliminating storage of redundant information.

Example: Joining the store cube and the list price cube along the product dimension, to calculate the sale price without redundant storage of the sale price data.

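The store cube and list price cube example can be sketched as a generator that joins the two cubes along the product dimension and computes the sale amount on demand, so the derived values are never stored. The cube contents, store names, prices, and the function virtual_sales_cube are hypothetical.

```python
# Store cube: (product, store) -> units sold. Hypothetical values.
STORE_CUBE = {
    ("Milk", "Store A"): 10,
    ("Bread", "Store A"): 4,
    ("Milk", "Store B"): 3,
}
# List price cube: product -> price per unit. Hypothetical values.
LIST_PRICE_CUBE = {"Milk": 2.5, "Bread": 1.0}

def virtual_sales_cube():
    """Yield derived cells by joining the two cubes on product; nothing is stored."""
    for (product, store), units in STORE_CUBE.items():
        if product in LIST_PRICE_CUBE:          # join condition on the common dimension
            yield (product, store), units * LIST_PRICE_CUBE[product]

sales = dict(virtual_sales_cube())
print(sales[("Milk", "Store A")])   # 25.0
```

Like a relational view, the virtual cube trades a little computation at query time for not having to keep a third, redundant cube up to date.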
