The Problem: Data Warehouse Design
Data Warehouse design
• The design of a data warehouse differs from the design of
a traditional database
o Data have different characteristics
o Design is based on the available data sources
o Design is driven by different criteria
[Figure: design phases in sequence: source selection, source analysis, conceptual design, logical design, physical design]
Data Warehouse Design
• Data Warehouses are based on the multidimensional model
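[Figure: design flow. User requirements feed conceptual design, which produces the fact schema; workload, data volume, and the logical model feed logical design, which produces the logical schema; workload, data volume, and the DBMS feed physical design, which produces the physical schema]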
Requirements elicitation
• In order to select facts, it is important to understand
the users' requirements
Conceptual Model
Fact Schema
Let us analyze all the representation needs and possibilities
[Figure: fact schema for the fact SALE, with dimensions Product, Date, and Shop and measures Quantity, Gross income, Unitary Price, and Nr_tickets]
DFM and E/R
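[Figure: the SALE fact in DFM notation (dimensions Product, Date, Shop; measures Quantity, Gross income, Unitary Price, Nr_tickets) next to the corresponding E/R schema, where SALE connects Product (Product_id, category), Shop (Shop_id), and date and carries quantity, unitary price, gross income, and Nr_tickets]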
Dimensional attribute
• A dimensional attribute must assume discrete values, so that
it can contribute to represent a dimension
Hierarchy
• A dimensional hierarchy is a directed tree where
o Nodes are dimensional attributes
o Edges describe n:1 associations between pairs of dimensional attributes
o Root is the considered dimension
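A minimal sketch of a hierarchy stored as n:1 mappings, in Python. The attributes date, month, and year follow the SALE example; the helper name `roll_up` and the sample values are illustrative assumptions, not part of the original slides.

```python
# Each edge of the hierarchy is an n:1 association, so it can be stored as a
# plain mapping from child value to parent value (sample values are made up).
date_to_month = {
    "2001-10-10": "2001-10",
    "2001-10-11": "2001-10",
    "2001-11-02": "2001-11",
}
month_to_year = {"2001-10": "2001", "2001-11": "2001"}

# Directed path from the root (the dimension) up to coarser attributes.
date_hierarchy = [date_to_month, month_to_year]


def roll_up(value, edges):
    """Follow the n:1 edges in order, from the dimension to a coarser attribute."""
    for edge in edges:
        value = edge[value]
    return value


month = roll_up("2001-10-10", date_hierarchy[:1])  # one step: date -> month
year = roll_up("2001-10-10", date_hierarchy)       # two steps: date -> month -> year
print(month, year)
```

Because every edge is n:1, each value of the dimension maps to exactly one value of any attribute further up the tree, which is what makes grouping by that attribute well defined.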
[Figure: example hierarchy on the date dimension, including the attributes holiday and week]
✓ Example of a primary event: on 10/10/2001, ten ‘Brillo’ detergent packets were sold at the BigShop for a total
amount of 25 euros
Events and aggregations (2)
• A hierarchy describes how it is possible to group and
select primary events
[Figure: cube of primary events over the dimensions part, shop, and date, and its aggregations by type, city, and month; annotations: sparsity, aggregation operators]
Events and aggregations (3)
• Given a set of dimensional attributes (pattern), each tuple of
their values identifies a secondary event that aggregates (all)
the corresponding primary events
• For each measure, a value is associated with the
secondary event; this value summarizes the values taken by
that measure in the corresponding primary events
• For example, the sales can be grouped by Product, Shop, and Month:
✓ in October 2001, 230 ‘Brillo’ detergent packets were sold
at the BigShop for a total amount of 575 euros
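As an illustration, the sketch below groups primary events into secondary events for the pattern (product, shop, month), in Python. The first row is the example from the slides; the other rows, the field names, and the helper names `month_of` and `aggregate` are assumptions made for the example.

```python
from collections import defaultdict

# Primary events: one tuple of dimension values plus its measures.
primary_events = [
    {"date": "2001-10-10", "product": "Brillo", "shop": "BigShop",
     "quantity": 10, "gross_income": 25.0},
    {"date": "2001-10-11", "product": "Brillo", "shop": "BigShop",
     "quantity": 4, "gross_income": 10.0},
    {"date": "2001-11-02", "product": "Brillo", "shop": "BigShop",
     "quantity": 6, "gross_income": 15.0},
]


def month_of(date):
    """Roll a date (YYYY-MM-DD) up to its month (YYYY-MM)."""
    return date[:7]


def aggregate(events):
    """Sum the measures of all primary events sharing (product, shop, month)."""
    totals = defaultdict(lambda: {"quantity": 0, "gross_income": 0.0})
    for event in events:
        key = (event["product"], event["shop"], month_of(event["date"]))
        totals[key]["quantity"] += event["quantity"]
        totals[key]["gross_income"] += event["gross_income"]
    return dict(totals)


secondary_events = aggregate(primary_events)
for pattern_values, measures in secondary_events.items():
    print(pattern_values, measures)
```

Each key of `secondary_events` is one tuple of values for the pattern, and its measures summarize all primary events that roll up to it.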
Secondary event
• The sales can be further grouped by Product, Month, and City
• If we consider city, product and month as dimensional
attributes, the tuple
(city: ‘Rome’, product: ‘Brillo’, month: 10/2001)
identifies another secondary event
• It aggregates all the sales of the product ‘Brillo’ in
the shops of ‘Rome’ during October 2001
Descriptive attributes
• A descriptive attribute contains additional information about a
dimensional attribute
• They are uniquely determined by the corresponding dimensional attribute
• They are relevant for analytical purposes only as selection predicates
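[Figure: SALE fact schema with dimensions Product, date, and Shop and measures Quantity, Gross income, Unitary Price, and Nr_tickets; address and phone are descriptive attributes of Shop]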
Optional edges
• Some edges of a fact schema could be optional
[Figure: SALE fact schema (dimensions date and Shop, descriptive attributes address and phone) illustrating an optional edge]
Optional dimensions
The attribute Promotion:
• only assumes a value for products in promotion
• the other sales are characterized by the remaining attributes
[Figure: SALE fact schema with dimensions Product (with attribute Diet), date, and Shop (with address and phone), plus the optional dimension Promotion]
Cross-dimensional attributes
• A cross-dimensional attribute is a dimensional or a descriptive
attribute whose value is obtained by combining values of
some dimensional attributes
✓ For example, VAT (IVA) is computed based on the product category and the state
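A cross-dimensional attribute can be pictured as a lookup keyed by a combination of attribute values. The Python sketch below assumes a hypothetical rate table; the categories, states, and rates are placeholders for illustration, not real tax figures.

```python
# Hypothetical VAT rates keyed by (product category, state).
# The numbers are placeholders, not real tax rates.
vat_rate = {
    ("food", "Italy"): 0.04,
    ("detergent", "Italy"): 0.22,
    ("food", "France"): 0.055,
}


def vat_for(category, state):
    """The value depends on both attributes; neither one determines it alone."""
    return vat_rate[(category, state)]


print(vat_for("food", "Italy"))
print(vat_for("food", "France"))
```

The same category yields different values in different states, so the attribute cannot be attached to the product hierarchy or to the shop hierarchy alone.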
Convergence
• It is related to the structure of a hierarchy
✓ Two dimensional attributes can be connected by two or more distinct directed paths
✓ For example:
Shop → city → county → state
or
Shop → sale district → state
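A small Python sketch of the two paths from shop to state. The sample shops, cities, and districts are made up, and the helper names `follow` and `converges` are assumptions; the check only expresses that both paths must reach the same state for every shop.

```python
# Path A: shop -> city -> county -> state (sample values are made up).
shop_to_city = {"BigShop": "Rome", "SmallShop": "Milan"}
city_to_county = {"Rome": "Rome county", "Milan": "Milan county"}
county_to_state = {"Rome county": "Italy", "Milan county": "Italy"}

# Path B: shop -> sale district -> state.
shop_to_sale_district = {"BigShop": "Centre", "SmallShop": "North"}
sale_district_to_state = {"Centre": "Italy", "North": "Italy"}

path_a = [shop_to_city, city_to_county, county_to_state]
path_b = [shop_to_sale_district, sale_district_to_state]


def follow(value, edges):
    """Follow a directed path of n:1 edges starting from a shop."""
    for edge in edges:
        value = edge[value]
    return value


def converges(shops):
    """True if both paths lead every shop to the same state."""
    return all(follow(s, path_a) == follow(s, path_b) for s in shops)


print(converges(shop_to_city.keys()))
```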
Example
[Figure: Product hierarchy with the attributes type, category, and trademark, showing a convergence]
Hierarchy Sharing
• In a fact schema, some portions of a hierarchy might be
duplicated
• As a shorthand we allow hierarchy sharing
• If the sharing starts with a dimension attribute, it is necessary
to indicate the roles on the incoming edges
• Necessary condition: the uniqueness of the value must hold on
both branches
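[Figure: CALL fact with measure Duration and dimensions date, time, caller number, and called number; the two number dimensions share one hierarchy (district, phone use), and the incoming edges are labeled with the roles caller and called]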
Multiple edges
• Recall: the dimension values must be uniquely determined by
the fact
• Some attributes, or some dimensions, may be related by a
many-to-many relationship
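[Figure: SALE fact with measures Number and Gross income and dimensions book and date (date → month → year); book is connected to author by a multiple edge]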
Measure Aggregation
• Aggregation requires specifying an operator that combines the values
of the primary events into a single value for a
secondary event (e.g. sum of sold quantity aggregated by
month)
• A measure is additive w.r.t. a given dimension iff the SUM
operator is applicable to that measure along that dimension
Measure Classification:
Additivity
• Additive measures (flow or rate measures): Can be
meaningfully summarized using addition along all dimensions
o E.g., sales amount can be summarized when the hierarchies in Store, Time, and Product
dimensions are traversed
• Semiadditive measures (stock or level measures): Can be
meaningfully summarized using addition along some (not all)
dimensions
o E.g., inventory quantities can be aggregated in the Store dimension, but cannot be
aggregated in the Time dimension
• Nonadditive measures (value-per-unit measures): Cannot be
meaningfully summarized using addition along any dimension
o E.g., item price, cost per unit, exchange rate
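The semiadditive case can be checked on a few rows. The Python sketch below uses made-up stock levels; the helper names are assumptions. Summing across storehouses on one date gives a real total, while summing one storehouse across dates gives a number that was never in stock, so an operator such as AVG is used along time instead.

```python
# (date, storehouse, level_in_stock): made-up sample data.
stock_levels = [
    ("2001-10-01", "S1", 100),
    ("2001-10-01", "S2", 50),
    ("2001-10-02", "S1", 90),
    ("2001-10-02", "S2", 60),
]


def total_by_date(rows):
    """SUM along the storehouse dimension: meaningful for a stock measure."""
    totals = {}
    for date, _storehouse, level in rows:
        totals[date] = totals.get(date, 0) + level
    return totals


def average_over_time(rows, storehouse):
    """AVG along the time dimension: a sensible replacement for SUM."""
    levels = [level for _date, store, level in rows if store == storehouse]
    return sum(levels) / len(levels)


per_day = total_by_date(stock_levels)
# SUM along time: the result is not a quantity that was ever in stock.
naive_sum_s1 = sum(level for _d, store, level in stock_levels if store == "S1")
avg_s1 = average_over_time(stock_levels, "S1")
print(per_day, naive_sum_s1, avg_s1)
```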
BUT
Sum(tickets with type(product) =t1) = 3 !!!
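The figure this remark belongs to is missing. One plausible reading is that the number of tickets cannot be summed along the product hierarchy, because a single ticket can contain several products of the same type. The Python sketch below illustrates that reading with made-up sales lines; the data and function names are assumptions.

```python
# (ticket_nr, product, product_type): made-up sales lines.
sale_lines = [
    (1, "p1", "t1"),
    (1, "p2", "t1"),  # same ticket, second product of type t1
    (2, "p1", "t1"),
]


def tickets_per_product(lines):
    """Number of distinct tickets in which each product appears."""
    tickets = {}
    for ticket, product, _product_type in lines:
        tickets.setdefault(product, set()).add(ticket)
    return {product: len(found) for product, found in tickets.items()}


def distinct_tickets_for_type(lines, product_type):
    """Number of distinct tickets containing at least one product of the type."""
    return len({ticket for ticket, _p, ptype in lines if ptype == product_type})


per_product = tickets_per_product(sale_lines)
summed = sum(per_product.values())                    # adds ticket 1 twice
actual = distinct_tickets_for_type(sale_lines, "t1")  # counts each ticket once
print(per_product, summed, actual)
```

Summing the per-product counts gives 3, while only 2 tickets exist, so under this reading the measure has to be recomputed from the primary events rather than added up.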
Aggregability
[Figure: INVENTORY fact with measures Level in stock (aggregated with AVG, MIN) and Incoming Qty; dimensions date → month → year and storehouse → city → state, with descriptive attribute address]
[Figure: ATTENDANCE fact with measure count; visible labels: course, student, semester, Faculty]
Conceptual design
• Conceptual design takes into account the documentation
related to the integrated, reconciled input database
o Conceptual schema (e.g. Entity/Relationship)
o Logical schema (e.g. relational, XML… )
Top-down methodology
1. Fact definition (a subject-oriented collection of data !!)
2. For each fact:
   1. Attribute tree definition
   2. Attribute tree editing
   3. Dimension definition
   4. Measure definition
   5. Fact schema creation
Starting from the E/R schema
[Figure: E/R schema of the sales database; visible labels include marketing group, manager, division, division head, district nr., state, type, category, and country, with cardinalities (1,1) and (1,N)]
Starting from the Relational Schema
Product(product,weight,dimension,trademark:TradeMark,type:Type)
Shop(shop,address,phone,salemanager,(districtnr,state):District,city:City)
Ticket(nrticket,date,shop:Shop)
Sale(product:Product,nrticket:Ticket,quantity,unitaryprice)
Storehouse(storehouse,address)
City(city,country:Country)
Country(country,state:State)
State(state)
District(district,state:State)
Prod_Storehouse(product:Product,storehouse:Storehouse)
TradeMark(trademark,madein:City)
Type(type,marketinggroup:MarketingGroup,category:Category)
MarketingGroup(marketinggroup,manager)
Category(category,division:Division)
Division(division,divisionhead)
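A schema like this can be navigated mechanically: starting from the relation chosen as fact, each attribute becomes a node and each foreign key is followed recursively. The Python sketch below is a simplified illustration in the spirit of the attribute tree definition step, not the full procedure. The dictionary encoding and the name `build_tree` are assumptions, and only part of the schema is transcribed (the composite reference from Shop to District and the Storehouse relations are left out).

```python
# relation -> (plain attributes, {foreign-key attribute: referenced relation})
# Partial transcription of the relational schema; the encoding is an assumption.
schema = {
    "Sale": (["quantity", "unitaryprice"],
             {"product": "Product", "nrticket": "Ticket"}),
    "Product": (["weight", "dimension"],
                {"trademark": "TradeMark", "type": "Type"}),
    "Ticket": (["date"], {"shop": "Shop"}),
    "Shop": (["address", "phone", "salemanager"], {"city": "City"}),
    "City": ([], {"country": "Country"}),
    "Country": ([], {"state": "State"}),
    "State": ([], {}),
    "TradeMark": ([], {"madein": "City"}),
    "Type": ([], {"marketinggroup": "MarketingGroup", "category": "Category"}),
    "MarketingGroup": (["manager"], {}),
    "Category": ([], {"division": "Division"}),
    "Division": (["divisionhead"], {}),
}


def build_tree(relation):
    """Return the nested dict of attributes reachable from `relation`."""
    attributes, foreign_keys = schema[relation]
    tree = {name: {} for name in attributes}  # plain attributes are leaves
    for fk_attribute, target in foreign_keys.items():
        tree[fk_attribute] = build_tree(target)  # follow the many-to-one reference
    return tree


tree = build_tree("Sale")
print(sorted(tree))
```

The recursion terminates because this portion of the schema has no cyclic references; a schema with cycles would need a visited set.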
Fact definition
• Facts correspond to events that dynamically happen in the
organization
• Good fact candidates: entities or relationships representing
frequently updated data
Attribute tree: example
[Figure: attribute tree for the sale fact, rooted in product + ticket nr.; branches include quantity, unitary price, product (type, category, department, dept. head, marketing group, manager, trademark, prod + storehouse, storehouse, address), and ticket nr. (date, shop, address, phone, sales manager, city, country, state, district nr. + state)]
Attribute tree editing: example
[Figure: the attribute tree before and after editing; the edited tree, rooted in product + ticket nr., keeps quantity, product (weight, type, category, department, marketing group, trademark) and shop (city, country, state, district nr. + state)]
Dimension definition
• Dimensions can be chosen among the children of the root
• Time should always be a dimension
o Historical source: time is an attribute
o Snapshot source: time is not always represented explicitly. In this case it is necessary to
add time.
[Figure: edited attribute tree with the chosen dimensions marked]
Measure definition
• If the fact identifier (set of attributes) is included in the set of
dimensions, then numerical attributes that are children of the
root (fact) are measures
• Further measures are defined by applying aggregate functions
to numerical attributes of the tree
o Generally: sum, average, min, max, count
[Figure: edited attribute tree with the chosen measures marked]
Glossary
• In the glossary, an expression is associated with each measure
o The expression describes how we obtain the measure at the different levels of
aggregation starting from the attributes of the source schema
[Figure: edited attribute tree for the sale fact, with the following glossary]
Quantity = SUM(Sale.quantity)
Gross income=SUM(Sale.quantity*Sale.unitaryprice)
Unitary price=AVG(Sale.unitaryprice)
Nr-tickets=COUNT(*)
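The glossary expressions can be tried out directly in SQL. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the two tables are reduced versions of Sale and Ticket from the relational schema, the rows are made up, and grouping by product, date, and shop is an assumption about the chosen dimensions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Ticket(nrticket INTEGER PRIMARY KEY, date TEXT, shop TEXT);
CREATE TABLE Sale(product TEXT, nrticket INTEGER, quantity INTEGER, unitaryprice REAL);
INSERT INTO Ticket VALUES (1, '2001-10-10', 'BigShop'), (2, '2001-10-10', 'BigShop');
INSERT INTO Sale VALUES ('Brillo', 1, 6, 2.5), ('Brillo', 2, 4, 2.5);
""")

# One row per primary event (product, date, shop), with the glossary measures.
rows = conn.execute("""
SELECT s.product, t.date, t.shop,
       SUM(s.quantity)                  AS quantity,
       SUM(s.quantity * s.unitaryprice) AS gross_income,
       AVG(s.unitaryprice)              AS unitary_price,
       COUNT(*)                         AS nr_tickets
FROM Sale AS s
JOIN Ticket AS t ON t.nrticket = s.nrticket
GROUP BY s.product, t.date, t.shop
""").fetchall()

for row in rows:
    print(row)
```

With these rows the query returns a single event: ten packets for a gross income of 25, an average unit price of 2.5, and two tickets.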
Fact schema creation
• The attribute tree is translated into a fact schema including
dimensions and measures
o Dimension hierarchies correspond to the subtrees rooted in the different dimensions
(the attributes with the finest granularity)
o The fact name corresponds to the name of the selected entity
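The translation can be sketched as a small data transformation in Python. The tree below is a hypothetical, simplified version of the edited attribute tree, and the names `subtree` and `fact_schema` are assumptions made for the example.

```python
# Hypothetical, simplified attribute tree as parent -> children.
tree = {
    "sale": ["product", "date", "shop", "quantity", "unitaryprice"],
    "product": ["type", "trademark"],
    "type": ["category", "marketing group"],
    "date": ["month"],
    "month": ["year"],
    "shop": ["city"],
    "city": ["country"],
}


def subtree(node):
    """Nested dict of everything below `node` in the attribute tree."""
    return {child: subtree(child) for child in tree.get(node, [])}


def fact_schema(fact, dimensions, measures):
    """Each dimension becomes the root of its hierarchy; measures are listed as-is."""
    return {
        "fact": fact,
        "measures": list(measures),
        "dimensions": {dim: subtree(dim) for dim in dimensions},
    }


schema = fact_schema(
    "sale",
    dimensions=["product", "date", "shop"],
    measures=["Quantity", "Gross income", "Unitary price", "Nr_tickets"],
)
print(schema["dimensions"]["date"])
```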
Exercise
• The ER schema is a portion of a database related to a video
content streaming service. Starting from this DB, we want to
build a DW to make decisions regarding the catalog of
contents for the following season and advertising to
customers.
• In particular, we want to analyze:
o Which TV series were preferred in the last year (highest number of
views); it must also be possible to get details about individual seasons or
single episodes;
o Which series are the most successful (highest number of views) for a type of customer
or a geographical area
[Figure: E/R schema of the streaming service. Entities: TV Network, Series, Season, Episode, Cast, Customer, Subscription, View; relationships: Production, Include, composition, has, associated, of, from; visible attributes include ID, Name, Surname, URL, Description, Cost, Title, Date, and Length; cardinalities are (1,1) and (1,N)]