OLAP and Data Warehousing: Slides Courtesy Of: Julia Stoyanovitch
Surajit Chaudhuri
Microsoft Research, Redmond, WA, USA
[email protected]
Umeshwar Dayal
Hewlett-Packard Labs., Palo Alto, CA, USA
[email protected]
What is OLAP?
On-Line Analytical Processing
Information technology to help the knowledge
worker (executive, manager, analyst) make faster
and better decisions.
OLAP is an element of decision support systems
(DSS).
OLTP Queries: Examples
OLAP Queries: Examples
Analyze comparative sales of the different colors
of VW Golf by state
OLAP Query Tools
Goal of OLAP is to support ad-hoc querying for
the business analyst (Power user)
Business analysts are familiar with spreadsheets
Extend spreadsheet analysis model to work with
warehouse data
» Large data set
» Semantically enriched to understand business terms
(e.g., time, geography)
» Combined with reporting features
Multidimensional view of data is the foundation of
OLAP.
© Surajit Chaudhuri, Umeshwar Dayal
Multidimensional Data Model
Database is a set of facts (points) in a
multidimensional space
A fact has a measure dimension
» quantity that is analyzed, e.g., sale amount, budget
A set of dimensions with respect to which data is
analyzed
» e.g., store, product, date associated with a sale amount
Dimensions form a sparsely populated coordinate
system
Each dimension has a set of attributes
» e.g., owner, city and county of store
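[Figure: sales cube with dimensions State (WI, CA, NY), Color (Red, Silver, Black), and Date (year, month, day); sample cell values: Red 10, Silver 15, Black 10. Dimension attributes shown: Product, Category; City, State; Date, Week, Month, Quarter.]
[Figure: cross-tabulation of sales by state and year, with "Sum by Year" and "Sum by State" margins. Rows shown: 2003: 63, 81, sum 144; 2005: 75, 35, sum 110.]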
Generalization: The Data Cube
Base tuples
Aggregate tuples:
» one aggregation for each subset of dimensions
(powerset)
» exponential number of subsets, but can optimize the
computation
Example
» N = 3 dimensions
– model = {Golf, Jetta}
– color = {red, black, white}
– state = {NY, CA, WI}
» How many aggregate tuples in the data cube?
– face – 1D agg; edge – 2D agg; corner – 3D agg
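The count asked for in the example can be worked out by enumeration. The following is a minimal sketch, assuming every combination of model, color, and state occurs in the base data (a dense cube); the marker `ALL` and the function name `cube_cells` are illustrative and do not come from the slides.

```python
from itertools import product

ALL = "ALL"  # hypothetical marker for a dimension that has been aggregated away

# Dimension domains from the example above.
domains = {
    "model": ["Golf", "Jetta"],
    "color": ["red", "black", "white"],
    "state": ["NY", "CA", "WI"],
}

def cube_cells(domains):
    """Enumerate every cell of the data cube, assuming all base combinations exist."""
    extended = [values + [ALL] for values in domains.values()]
    return list(product(*extended))

cells = cube_cells(domains)
base = [c for c in cells if ALL not in c]
aggregates = [c for c in cells if ALL in c]

# Group aggregate cells by how many dimensions were aggregated away.
by_level = {}
for c in aggregates:
    k = c.count(ALL)
    by_level[k] = by_level.get(k, 0) + 1

print(len(cells), len(base), len(aggregates))  # 48 18 30
print(by_level)  # {1: 21, 2: 8, 3: 1}
```

Under the dense-cube assumption the cube holds (2+1)(3+1)(3+1) = 48 cells: 18 base tuples and 30 aggregate tuples, of which 21 aggregate over one dimension, 8 over two, and 1 over all three.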
Operations on Multidimensional Data Model
Aggregation (roll-up) of detailed data to create summary
data
Navigation to detailed data (drill-down) from summary
Selection (slice) defines a subcube
– Project the cube on fewer dimensions by specifying
coordinates of remaining dimensions
– e.g., sales where state = NY and month = Jan
Calculation
– Within a dimension, e.g., (sales - expense) by state
– Across dimensions
Ranking
– top 3% of states by average sales
Window Queries
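Several of the operations listed above can be sketched over a tiny in-memory fact table. This is an illustration only: the rows and their values are made up, and the ranking step shows a full ordering by average sales rather than a percentage cutoff, since three states are too few for a "top 3%" to be meaningful.

```python
from collections import defaultdict

# Hypothetical fact rows: (state, month, sales, expense). Values are made up.
facts = [
    ("NY", "Jan", 100, 60),
    ("NY", "Feb", 80, 50),
    ("CA", "Jan", 120, 90),
    ("CA", "Feb", 150, 70),
    ("WI", "Jan", 40, 30),
]

# Slice: fix coordinates on some dimensions (state = NY and month = Jan).
slice_ny_jan = [f for f in facts if f[0] == "NY" and f[1] == "Jan"]

# Roll-up plus calculation: (sales - expense) by state.
margin_by_state = defaultdict(int)
for state, _month, sales, expense in facts:
    margin_by_state[state] += sales - expense

# Ranking: states ordered by average sales, highest first.
totals = defaultdict(lambda: [0, 0])  # state -> [sum of sales, row count]
for state, _month, sales, _expense in facts:
    totals[state][0] += sales
    totals[state][1] += 1
ranked = sorted(totals, key=lambda s: totals[s][0] / totals[s][1], reverse=True)

print(slice_ny_jan)           # [('NY', 'Jan', 100, 60)]
print(dict(margin_by_state))  # {'NY': 70, 'CA': 110, 'WI': 10}
print(ranked)                 # ['CA', 'NY', 'WI']
```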
Roll-up and Drill-Down
Roll-Up: Use of aggregation
» dimension reduction:
– e.g., total sales by state by color -> total sales by state
» navigating attribute hierarchy:
– e.g., sales by city -> total sales by state -> total sales by
country
– e.g., total sales by city and year -> total sales by state and year
-> total sales by country
Drill-Down: Inverse operation of roll-up
» Provides the data set that was aggregated
– e.g., show the “base” data behind the total sales figure for the state of CA
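Navigating an attribute hierarchy can be sketched with a pair of child-to-parent maps. The hierarchy, the sales figures, and the helper names `roll_up` and `drill_down` below are hypothetical and chosen only to illustrate the idea.

```python
from collections import defaultdict

# Hypothetical attribute hierarchy for the store dimension: city -> state -> country.
city_to_state = {"LA": "CA", "SF": "CA", "NYC": "NY"}
state_to_country = {"CA": "USA", "NY": "USA"}

# Hypothetical base data: total sales by city.
sales_by_city = {"LA": 30, "SF": 20, "NYC": 50}

def roll_up(totals, parent_of):
    """Aggregate totals one level up the hierarchy using a child -> parent map."""
    out = defaultdict(int)
    for child, amount in totals.items():
        out[parent_of[child]] += amount
    return dict(out)

def drill_down(totals, parent_of, parent):
    """Return the finer-grained entries that were aggregated into `parent`."""
    return {c: v for c, v in totals.items() if parent_of[c] == parent}

sales_by_state = roll_up(sales_by_city, city_to_state)
sales_by_country = roll_up(sales_by_state, state_to_country)

print(sales_by_state)    # {'CA': 50, 'NY': 50}
print(sales_by_country)  # {'USA': 100}
print(drill_down(sales_by_city, city_to_state, "CA"))  # {'LA': 30, 'SF': 20}
```

Drill-down needs the finer-grained data to still be available; the aggregate alone cannot be inverted.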
More Examples
Q: Given a query, which values from the CUBE do
we need to retrieve?
Pivot (Rotate)
[Figure: sales cube with axes Product (Juice, Cola, Milk, Cream, Toothpaste, Soap), City (LA, SF, NY), and Month (1 to 7); sample cell values: Juice 10, Cola 50, Milk 20, Cream 12, Toothpaste 15, Soap 10.]
Fact data: Sales volume in $100
Result: cross tabulation
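A cross tabulation and its rotation can be sketched as follows. The fact rows are made up for illustration (they are not the values from the figure), and `cross_tab` is a hypothetical helper, not an operator from any particular OLAP tool.

```python
from collections import defaultdict

# Hypothetical fact rows: (product, city, month, sales volume). Values are made up.
facts = [
    ("Juice", "NY", 1, 10),
    ("Juice", "LA", 1, 4),
    ("Cola", "NY", 1, 50),
    ("Cola", "NY", 2, 5),
    ("Milk", "SF", 2, 20),
]

def cross_tab(rows, row_idx, col_idx, measure_idx=3):
    """Sum the measure into a nested dict keyed by the row and column dimensions."""
    table = defaultdict(lambda: defaultdict(int))
    for r in rows:
        table[r[row_idx]][r[col_idx]] += r[measure_idx]
    return {k: dict(v) for k, v in table.items()}

# Product on the rows, city on the columns, summed over month.
by_product_city = cross_tab(facts, row_idx=0, col_idx=1)
# Pivot (rotate): the same data with the axes swapped.
by_city_product = cross_tab(facts, row_idx=1, col_idx=0)

print(by_product_city)
# {'Juice': {'NY': 10, 'LA': 4}, 'Cola': {'NY': 55}, 'Milk': {'SF': 20}}
print(by_city_product["NY"])  # {'Juice': 10, 'Cola': 55}
```

Pivoting changes only which dimensions label the rows and columns; the aggregated cell values are the same in both views.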
Warehouse Database Schema
Entity-Relationship design techniques not
appropriate
Design should reflect multidimensional view
Typical schemas:
» Star Schema
» Snowflake Schema
» Fact Constellation Schema
Bit Map Index
An alternative representation of RID-list
Comparison, join and aggregation operations are
reduced to bit arithmetic
Especially advantageous for low-cardinality
domains
» Significant reduction in space and I/O (30:1)
» Adapted for higher cardinality domains
» Compression (e.g., run-length encoding) exploited
» Upper Bound of 2R words for any bitmap over R rows
[Hasan & Sinha, 1997]
Example bitmaps over a domain with values 1 to 5 (one column per value, one row per tuple):
1 2 3 4 5
0 0 1 0 0
0 0 0 0 1
0 0 0 0 1
0 0 0 1 0
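The reduction of comparisons to bit arithmetic can be sketched with Python integers standing in for bit vectors. The column values and helper names are hypothetical. The sketch is uncompressed, so it does not show run-length encoding or the adaptations for higher-cardinality domains mentioned above.

```python
# Hypothetical column values for six rows; row i is represented by bit i.
color = ["red", "black", "red", "silver", "black", "red"]
state = ["NY", "CA", "CA", "NY", "CA", "NY"]

def build_bitmap_index(column):
    """Map each distinct value to an integer whose set bits mark matching rows."""
    index = {}
    for row_id, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row_id)
    return index

def rows_of(bitmap):
    """Decode a bitmap back into the list of row ids."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

color_idx = build_bitmap_index(color)
state_idx = build_bitmap_index(state)

# color = 'red' AND state = 'NY' becomes a single bitwise AND.
red_ny = color_idx["red"] & state_idx["NY"]
# color = 'red' OR color = 'black' becomes a bitwise OR.
red_or_black = color_idx["red"] | color_idx["black"]

print(rows_of(red_ny))               # [0, 5]
print(bin(red_or_black).count("1"))  # 5 (a COUNT computed as a population count)
```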
Join Index
Traditional index maps the value in a column to a
list of rows with that value
Join index maintains relationships between the
attribute values of a dimension and the matching
rows in the fact table
Join index may span multiple dimensions
(composite join index)
» Use the join index to identify regions of the Cartesian product
that are of interest
» e.g., few people in Southern California may buy umbrellas
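A join index can be sketched as a map from a dimension attribute value to the ids of the matching fact rows. Everything below is hypothetical (the tables, the ids, and the amounts), and list positions stand in for row ids.

```python
from collections import defaultdict

# Hypothetical dimension tables keyed by surrogate id. All values are made up.
stores = {1: {"city": "LA", "state": "CA"}, 2: {"city": "NYC", "state": "NY"}}
products = {10: {"name": "umbrella"}, 11: {"name": "cola"}}

# Fact rows: (store_id, product_id, sale amount). List position acts as the row id.
fact = [(1, 11, 5), (2, 10, 7), (1, 11, 3), (2, 11, 4), (1, 10, 1)]

# Join index on one dimension attribute: store.state -> matching fact row ids.
state_ji = defaultdict(list)
for rid, (store_id, _product_id, _amount) in enumerate(fact):
    state_ji[stores[store_id]["state"]].append(rid)

# Composite join index spanning two dimensions: (state, product name) -> row ids.
composite_ji = defaultdict(list)
for rid, (store_id, product_id, _amount) in enumerate(fact):
    key = (stores[store_id]["state"], products[product_id]["name"])
    composite_ji[key].append(rid)

# Total sales in CA, reading only the fact rows the index points to.
ca_total = sum(fact[rid][2] for rid in state_ji["CA"])

print(dict(state_ji))                    # {'CA': [0, 2, 4], 'NY': [1, 3]}
print(composite_ji[("CA", "umbrella")])  # [4]
print(ca_total)                          # 9
```

The composite index makes sparse regions visible directly: a (state, product) key with few or no row ids marks a region of the Cartesian product that a query can skip.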
[Figure: data warehouse architecture. Data sources (operational dbs, external sources) -> Extract, Transform, Transport -> Data Warehouse and Data Marts, with a Metadata Repository -> Serve -> OLAP Servers -> Front-End Tools (OLAP, Query/Reporting, Data Mining).]
[Figure: load pipeline. Source tables -> Scan -> Sort runs -> Merge runs -> Table insert -> Target tables.]