Schemas
∙ Data Models
− relations
− stars & snowflakes
− cubes
∙ Operators
− slice & dice
− roll-up, drill down
− pivoting
− other
CSE601 1
Multi-Dimensional Data
∙ Measures - numerical (and additive) data
tracked by the business that can be analyzed
and examined
∙ Dimensions - business parameters that
define a transaction; relatively static data
such as lookup or reference tables
∙ Example: an analyst may want to view sales
data (measure) by geography, by time, and
by product (dimensions)
The Multi-Dimensional Model
“Sales by product line over the past six months”
“Sales by store between 1990 and 1995”
[Figure labels: Store Info; key columns joining the fact table to the dimension tables; numerical measures; ...]
Multidimensional Modeling
Dimensional Modeling
∙ Dimensions are organized into hierarchies
− E.g., Time dimension: days → weeks → quarters
− E.g., Product dimension: product → product line →
brand
∙ Dimensions have attributes
Time     Store
Date     StoreID
Month    City
Year     State
         Country
         Region
Dimension Hierarchies
Store Dimension: Total → Region → District → Stores
Product Dimension: Total → Manufacturer → Brand → Products
Schema Design
∙ Most data warehouses use a star schema to represent
the multi-dimensional model.
∙ Each dimension is represented by a dimension table that
describes it.
∙ A fact table connects to all dimension tables through a
multi-way join. Each tuple in the fact table consists of a
pointer (foreign key) to each dimension table, which gives its
multi-dimensional coordinates, plus the measures stored for
those coordinates.
∙ The links between the fact table in the center and the
dimension tables at the extremities form a star shape.
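A minimal sketch of the star layout described above, using Python's built-in sqlite3 module. All table names, column names, and rows here (sales_fact, time_dim, store_dim, product_dim, the cities and brands) are hypothetical and chosen only for illustration; the slides do not prescribe them.

```python
import sqlite3

# Hypothetical star schema: one fact table in the center, three dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim    (time_key INTEGER PRIMARY KEY, date TEXT, month TEXT, year INTEGER);
CREATE TABLE store_dim   (store_key INTEGER PRIMARY KEY, city TEXT, state TEXT, region TEXT);
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, product TEXT, brand TEXT);
-- Each fact row holds one key per dimension (its coordinates) plus the measures.
CREATE TABLE sales_fact (
    time_key    INTEGER REFERENCES time_dim(time_key),
    store_key   INTEGER REFERENCES store_dim(store_key),
    product_key INTEGER REFERENCES product_dim(product_key),
    units       INTEGER,
    dollars     REAL,
    PRIMARY KEY (time_key, store_key, product_key)
);
INSERT INTO time_dim VALUES (1, '1995-01-15', '1995-01', 1995), (2, '1995-02-03', '1995-02', 1995);
INSERT INTO store_dim VALUES (1, 'Buffalo', 'NY', 'East'), (2, 'Los Angeles', 'CA', 'West');
INSERT INTO product_dim VALUES (1, 'Milk', 'BrandA'), (2, 'Bread', 'BrandB');
INSERT INTO sales_fact VALUES (1, 1, 1, 10, 25.0), (1, 2, 2, 4, 8.0),
                              (2, 1, 2, 6, 12.0),  (2, 2, 1, 3, 7.5);
""")

# Star join: the fact table is joined to only the dimensions the question needs.
sales_by_region_month = conn.execute("""
    SELECT s.region, t.month, SUM(f.dollars)
    FROM sales_fact f
    JOIN store_dim s ON f.store_key = s.store_key
    JOIN time_dim  t ON f.time_key  = t.time_key
    GROUP BY s.region, t.month
    ORDER BY s.region, t.month
""").fetchall()
print(sales_by_region_month)
```

The query touches two of the three dimension tables; a question about brands would join product_dim instead, without any change to the fact table.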
Star Schema (in RDBMS)
Star Schema Example
Star Schema with Sample Data
The “Classic” Star Schema
⬥ A relational model with a one-to-many relationship
between dimension table and fact table.
⬥ A single fact table, with detail and summary data
⬥ Fact table primary key has only one key column per
dimension
⬥ Each dimension is a single table, highly denormalized
∙ Benefits: easy to understand, intuitive mapping between the
business entities, easy to define hierarchies, fewer physical
joins, low maintenance, very simple metadata
∙ Drawbacks: summary data in the fact table yields poorer
performance for summary levels; huge dimension tables can be a problem
Need for Aggregates
Aggregating Fact Tables
∙ Aggregate fact tables are summaries of the
most granular data at higher levels along the
dimension hierarchies.
[Figure: sales star schema with the hierarchy levels marked on each dimension]
Product dimension: Product key, Product, Category, Department
Store dimension: Store key, Store name, Territory, Region
Time dimension: Time key, Date, Month, Quarter, Year
Fact table: Product key, Time key, Store key, Unit sales, Sale dollars
Multi-way aggregates: Territory – Category – Month (data values at higher level)
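One way to picture a multi-way aggregate such as Territory – Category – Month is as a second fact table derived from the most granular one. The sketch below assumes hypothetical table names (sales_fact, product_dim, store_dim, time_dim, sales_agg_terr_cat_month) and made-up rows; only the column names follow the figure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, product TEXT, category TEXT);
CREATE TABLE store_dim   (store_key INTEGER PRIMARY KEY, store_name TEXT, territory TEXT);
CREATE TABLE time_dim    (time_key INTEGER PRIMARY KEY, date TEXT, month TEXT);
CREATE TABLE sales_fact  (product_key INTEGER, time_key INTEGER, store_key INTEGER,
                          unit_sales INTEGER, sale_dollars REAL);
INSERT INTO product_dim VALUES (1, 'Milk', 'Dairy'), (2, 'Cream', 'Dairy'), (3, 'Soap', 'Household');
INSERT INTO store_dim   VALUES (1, 'Store 1', 'North'), (2, 'Store 2', 'North');
INSERT INTO time_dim    VALUES (1, '1995-01-03', '1995-01'), (2, '1995-01-20', '1995-01');
INSERT INTO sales_fact  VALUES (1, 1, 1, 5, 10.0), (2, 1, 2, 2, 6.0),
                               (1, 2, 2, 3, 6.0),  (3, 2, 1, 4, 8.0);

-- Aggregate fact table: one row per (territory, category, month),
-- i.e. each dimension is summarized one level up its hierarchy.
CREATE TABLE sales_agg_terr_cat_month AS
SELECT s.territory, p.category, t.month,
       SUM(f.unit_sales)   AS unit_sales,
       SUM(f.sale_dollars) AS sale_dollars
FROM sales_fact f
JOIN product_dim p ON f.product_key = p.product_key
JOIN store_dim   s ON f.store_key   = s.store_key
JOIN time_dim    t ON f.time_key    = t.time_key
GROUP BY s.territory, p.category, t.month;
""")

agg_rows = conn.execute(
    "SELECT * FROM sales_agg_terr_cat_month ORDER BY category"
).fetchall()
print(agg_rows)
```

Queries at the territory/category/month level can then read the small aggregate table instead of scanning the detailed fact table.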
The “Fact Constellation” Schema
District Fact Table    Region Fact Table
District_ID            Region_ID
PRODUCT_KEY            PRODUCT_KEY
PERIOD_KEY             PERIOD_KEY
Dollars                Dollars
Units                  Units
Price                  Price
Aggregate Fact Tables
[Figure: diagram of several fact tables and dimension tables]
Snowflake Schema
Sales: Snowflake Schema
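[Figure: Sales fact table (Product key, Time key, Customer key, ….) with snowflaked dimensions]
Product dimension: Product (Product key, Product name, Product code, Brand key) → Brand (Brand key, Brand name, Category key) → Product category (Category key, Category name)
Salesrep dimension: Salesrep (Salesrep key, Salesperson name, Territory key) → Territory (Territory key, Territory name, Region key) → Region (Region key, Region name)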
Snowflaking
Snowflake Schema
∙ Advantages:
− Small saving in storage space
− Normalized structures are easier to update and maintain
∙ Disadvantages:
− Schema is less intuitive, and end-users are put off by the
complexity
− Browsing through the contents is more difficult
− Query performance degrades because of the additional joins
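To make the "additional joins" point concrete, here is a small sketch of a snowflaked product dimension queried with sqlite3. The table names (category, brand, product, sales_fact), the single amount measure, and the rows are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Snowflaked product dimension: product -> brand -> category, each normalized.
CREATE TABLE category (category_key INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE brand    (brand_key INTEGER PRIMARY KEY, brand_name TEXT,
                       category_key INTEGER REFERENCES category(category_key));
CREATE TABLE product  (product_key INTEGER PRIMARY KEY, product_name TEXT,
                       brand_key INTEGER REFERENCES brand(brand_key));
CREATE TABLE sales_fact (product_key INTEGER REFERENCES product(product_key), amount REAL);
INSERT INTO category VALUES (1, 'Dairy'), (2, 'Bakery');
INSERT INTO brand    VALUES (1, 'BrandA', 1), (2, 'BrandB', 2);
INSERT INTO product  VALUES (1, 'Milk', 1), (2, 'Cream', 1), (3, 'Bread', 2);
INSERT INTO sales_fact VALUES (1, 34.0), (2, 32.0), (3, 56.0);
""")

# Sales by category needs three joins here; with a denormalized (star)
# product dimension that carries category_name directly, one join would do.
sales_by_category = conn.execute("""
    SELECT c.category_name, SUM(f.amount)
    FROM sales_fact f
    JOIN product  p ON f.product_key  = p.product_key
    JOIN brand    b ON p.brand_key    = b.brand_key
    JOIN category c ON b.category_key = c.category_key
    GROUP BY c.category_name
    ORDER BY c.category_name
""").fetchall()
print(sales_by_category)
```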
What is the Best Design?
Aggregates
∙ Add up amounts for day 1
∙ In SQL: SELECT sum(amt) FROM SALE
WHERE date = 1
∙ Result: 81
Aggregates
∙ Add up amounts by day
∙ In SQL: SELECT date, sum(amt) FROM SALE
GROUP BY date
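The two aggregate queries above can be tried with Python's sqlite3 module. The six SALE rows and the column names prodId and storeId are assumptions: the slides show the table only as a figure, so the rows were picked to be consistent with the day 1 total of 81.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)")
# Illustrative rows (assumed, not taken from the slides).
conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?)", [
    ("p1", "s1", 1, 12), ("p2", "s1", 1, 11), ("p1", "s3", 1, 50),
    ("p2", "s2", 1, 8),  ("p1", "s1", 2, 44), ("p1", "s2", 2, 4),
])

# Add up amounts for day 1.
day1_total = conn.execute("SELECT sum(amt) FROM sale WHERE date = 1").fetchone()[0]

# Add up amounts by day.
by_day = conn.execute(
    "SELECT date, sum(amt) FROM sale GROUP BY date ORDER BY date"
).fetchall()

print(day1_total, by_day)
```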
Another Example
∙ Add up amounts by day, product
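∙ In SQL: SELECT date, prodId, sum(amt) FROM SALE
GROUP BY date, prodId
∙ rollup: from (date, prodId) totals to date totals
∙ drill-down: from date totals back to (date, prodId) totals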
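A short sketch of the finer grouping and of rolling it up again, using sqlite3. As before, the SALE rows and the prodId/storeId column names are assumed for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)")
# Illustrative rows (assumed, not taken from the slides).
conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?)", [
    ("p1", "s1", 1, 12), ("p2", "s1", 1, 11), ("p1", "s3", 1, 50),
    ("p2", "s2", 1, 8),  ("p1", "s1", 2, 44), ("p1", "s2", 2, 4),
])

# Drill-down level: totals by (day, product).
by_day_product = conn.execute("""
    SELECT date, prodId, sum(amt) FROM sale
    GROUP BY date, prodId
    ORDER BY date, prodId
""").fetchall()

# Rollup: dropping prodId from the grouping is the same as summing
# the (day, product) totals over all products of a day.
rolled_up = {}
for date, prod_id, total in by_day_product:
    rolled_up[date] = rolled_up.get(date, 0) + total

print(by_day_product, rolled_up)
```

Because sum is additive, the rollup can be computed from the finer totals without going back to the base table.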
Aggregates
∙ Operators: sum, count, max, min,
median, avg
∙ “Having” clause
∙ Using dimension hierarchy
− average by region (store hierarchy)
− maximum by month (date hierarchy)
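The sketch below illustrates a HAVING clause and an aggregate that uses a dimension hierarchy, again with sqlite3. The SALE rows, the store table, and the threshold of 50 are assumptions; the store-to-region assignment follows the "Aggregation Using Hierarchies" slide (s1 in Region A; s2, s3 in Region B).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sale  (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER);
CREATE TABLE store (storeId TEXT PRIMARY KEY, region TEXT);
-- Illustrative rows (assumed).
INSERT INTO sale VALUES ('p1','s1',1,12), ('p2','s1',1,11), ('p1','s3',1,50),
                        ('p2','s2',1,8),  ('p1','s1',2,44), ('p1','s2',2,4);
INSERT INTO store VALUES ('s1','A'), ('s2','B'), ('s3','B');
""")

# HAVING filters groups after aggregation: keep only days whose total exceeds 50.
busy_days = conn.execute("""
    SELECT date, sum(amt) FROM sale
    GROUP BY date
    HAVING sum(amt) > 50
    ORDER BY date
""").fetchall()

# Using the store hierarchy: average sale amount by region.
avg_by_region = dict(conn.execute("""
    SELECT st.region, avg(s.amt)
    FROM sale s JOIN store st ON s.storeId = st.storeId
    GROUP BY st.region
""").fetchall())

print(busy_days, avg_by_region)
```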
Data Cube
dimensions = 2
3-D Cube
[Figure: 3-D cube with layers day 1 and day 2]
dimensions = 3
Example
Dimensions: Time, Product, Store
Attributes: Product (upc, price, …), Store …, …
Hierarchies:
Product → Brand → …
Day → Week → Quarter
Store → Region → Country
[Figure: cube with axes Store (NY, SF, LA), Product (Juice, Milk, Coke, Cream, Soap, Bread), and Time (M T W Th F S S); visible cell values: Juice 10, Milk 34, Coke 56, Cream 32, Soap 12, Bread 56; arrows: roll-up to region, roll-up to brand, roll-up to week]
56 units of bread sold in LA on M
Cube Aggregation: Roll-up
Example: computing sums
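[Figure: sales tables for day 1, day 2, ... summed step by step up to the grand total 129; rollup moves toward the total, drill-down moves back toward the detail]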
Cube Operators for Roll-up
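[Figure: cube with layers day 1, day 2, ... and aggregate cells, where * means all values of that dimension]
Labeled cells: sale(s1,*,*), sale(s2,p2,*), sale(*,*,*); value shown: 129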
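The sale(...) notation with * can be read as a function that sums every cell matching the given coordinates. A minimal Python sketch, assuming the argument order (store, product, day) and an illustrative set of rows picked so that the grand total is 129:

```python
# Illustrative base cells (assumed): (product, store, day, amount).
ROWS = [
    ("p1", "s1", 1, 12), ("p2", "s1", 1, 11), ("p1", "s3", 1, 50),
    ("p2", "s2", 1, 8),  ("p1", "s1", 2, 44), ("p1", "s2", 2, 4),
]

def sale(store="*", product="*", day="*"):
    """Sum the amounts of all cells matching the coordinates; '*' matches anything."""
    return sum(
        amt
        for (p, s, d, amt) in ROWS
        if store in ("*", s) and product in ("*", p) and day in ("*", d)
    )

print(sale("s1", "*", "*"))   # all products and days for store s1
print(sale("s2", "p2", "*"))  # product p2 in store s2, all days
print(sale("*", "*", "*"))    # grand total
```

A real cube would typically precompute and store these aggregate cells rather than scan the base rows on every call; the scan is used here only to keep the sketch short.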
Extended Cube
[Figure: cube with layers day 1 and day 2, extended with * entries; labeled cell: sale(*,p2,*)]
Aggregation Using Hierarchies
[Figure: cube with layers day 1 and day 2; hierarchy store → region → country]
(store s1 in Region A;
stores s2, s3 in Region B)
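Rolling up along the store → region → country hierarchy amounts to replacing each store by its parent and re-summing. In the sketch below the store-to-region mapping is the one stated above; the base rows and the single country "X" are assumptions for illustration.

```python
from collections import defaultdict

# Illustrative base cells (assumed): (product, store, day, amount).
ROWS = [
    ("p1", "s1", 1, 12), ("p2", "s1", 1, 11), ("p1", "s3", 1, 50),
    ("p2", "s2", 1, 8),  ("p1", "s1", 2, 44), ("p1", "s2", 2, 4),
]
REGION_OF = {"s1": "A", "s2": "B", "s3": "B"}   # from the slide
COUNTRY_OF = {"A": "X", "B": "X"}               # hypothetical country

def roll_up(rows, parent_of):
    """Replace the middle coordinate by its parent in the hierarchy and re-sum."""
    totals = defaultdict(int)
    for product, member, day, amt in rows:
        totals[(product, parent_of[member], day)] += amt
    return dict(totals)

by_region = roll_up(ROWS, REGION_OF)
# The region-level result can itself be rolled up one more level.
by_country = roll_up(
    [(p, r, d, amt) for (p, r, d), amt in by_region.items()], COUNTRY_OF
)
print(by_region)
print(by_country)
```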
Slicing
[Figure: cube with layers day 1 and day 2; the slice TIME = day 1 selects the day 1 layer]
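A slice fixes one dimension to a single value and keeps the remaining dimensions. A minimal sketch, assuming the cube is stored as a dictionary keyed by (store, product, day) with illustrative cell values:

```python
# Illustrative cube cells (assumed): (store, product, day) -> amount.
CUBE = {
    ("s1", "p1", 1): 12, ("s1", "p2", 1): 11, ("s3", "p1", 1): 50,
    ("s2", "p2", 1): 8,  ("s1", "p1", 2): 44, ("s2", "p1", 2): 4,
}

def slice_cube(cube, axis, value):
    """Keep the cells whose coordinate on `axis` equals `value`, dropping that axis."""
    return {
        key[:axis] + key[axis + 1:]: amt
        for key, amt in cube.items()
        if key[axis] == value
    }

# TIME = day 1: the result is a 2-D (store, product) table.
day1 = slice_cube(CUBE, axis=2, value=1)
print(day1)
```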
Slicing & Pivoting
Summary of Operations
∙ Aggregation (roll-up)
− aggregate (summarize) data to the next higher level of a
dimension hierarchy
− e.g., total sales by city, year → total sales by region, year
∙ Navigation to detailed data (drill-down)
∙ Selection (slice) defines a subcube
− e.g., sales where city =‘Gainesville’ and date = ‘1/15/90’
∙ Calculation and ranking
− e.g., top 3% of cities by average income
∙ Visualization operations (e.g., Pivot)
∙ Time functions
− e.g., time average
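Of the operations listed, pivoting changes only the presentation: the same cells are shown with rows and columns exchanged. A minimal sketch on a 2-D cross-tab held as nested dictionaries; the pivot helper and the cell values are assumptions for illustration.

```python
def pivot(table):
    """Swap the row and column dimensions of a nested-dict cross-tab."""
    out = {}
    for row, cols in table.items():
        for col, value in cols.items():
            out.setdefault(col, {})[row] = value
    return out

# Illustrative cross-tab (assumed values): product rows, store columns.
by_product = {"p1": {"s1": 12, "s3": 50}, "p2": {"s1": 11, "s2": 8}}

# After pivoting: store rows, product columns; no value changes.
by_store = pivot(by_product)
print(by_store)
```

Pivoting twice returns the original table, which is a quick check that no cell is lost or altered.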
Query & Analysis Tools
∙ Query Building
∙ Report Writers (comparisons, growth, graphs,…)
∙ Spreadsheet Systems
∙ Web Interfaces
∙ Data Mining