
The Problem: Data Warehouse Design

Uploaded by

fernando8morea

Data Warehouse design

• Milano, XX month 20XX


Cinzia Cappiello
Academic Year 2023-2024

The problem

Relational DBs have the following problems:
• Complexity of the applications
• High response time for answering complex queries

Consequences:
• Raw data are used at the operational level
• Raw data are scarcely used at the strategic level
Data Warehouse design
• The design of a data warehouse is different from the design of
a traditional db
o Data have different characteristics
o Design is based on the available data sources
o Design is driven by different criteria

• The design of a data warehouse aims to maintain a low
number of entities but high coverage

Data Warehouse Design

[Figure: design process.
 Inputs: user requirements, internal DBs, further info sources
 Analysis: source selection, translation into a common conceptual model, source analysis
 Integration: integration of the conceptual schemata
 Design: conceptual design, logical design, physical design]
Data Warehouse Design
• Data Warehouses are based on the multidimensional model

• A standard conceptual model for DW does not exist

• The Entity/Relationship model cannot be used in the DW
conceptual design

[Figure: design phases with their inputs and outputs.
 Conceptual design: user requirements, reconciled schema → fact schema
 Logical design: workload, data volume, logical model → logical schema
 Physical design: workload, data volume, DBMS → physical schema]
Requirements elicitation
• In order to select facts, it is important to understand what
the users' requirements are

• Requirements elicitation is conducted by interviewing the
people who have to perform the analysis

Conceptual Model

Fact Schema
Let us analyze all the representation needs and possibilities

[Figure: fact schema.
 Fact: SALE
 Dimensions: Product, Date, Shop
 Measures: Quantity, Gross income, Unitary Price, Nr_tickets]

From E/R to Dimensional Fact Model (DFM)
• A fact describes an entity or an N-to-M relationship among its
dimensions. Entities that are often updated (e.g., sales) are
good candidates for being transformed into facts.

• The fact value must uniquely determine the value of each
dimension, e.g. a sale uniquely determines the day on which it
was made. This is represented as
sale → day, month, year

• Naming convention: the dimensions of the same fact schema
must have distinct names
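
The two rules above (distinct dimension names, and a fact occurrence determining one value per dimension) can be sketched in a few lines of Python. This is only an illustration: the `FactSchema` class, the `determines_dimensions` helper, and the `sale_id` identifier are hypothetical names, and the rows are made-up sample data.

```python
from dataclasses import dataclass, field


@dataclass
class FactSchema:
    """Hypothetical container for a DFM fact: a name, dimensions, measures."""
    name: str
    dimensions: list
    measures: list = field(default_factory=list)

    def __post_init__(self) -> None:
        # Naming convention from the slide: dimension names must be distinct.
        if len(set(self.dimensions)) != len(self.dimensions):
            raise ValueError(f"duplicate dimension name in fact {self.name!r}")


def determines_dimensions(events, fact_key, dimensions):
    """Check that each fact occurrence maps to exactly one value per dimension."""
    seen = {}
    for event in events:
        coords = tuple(event[d] for d in dimensions)
        # setdefault stores the first coordinates seen for this fact key.
        if seen.setdefault(event[fact_key], coords) != coords:
            return False
    return True


sale = FactSchema(
    name="SALE",
    dimensions=["Product", "Date", "Shop"],
    measures=["Quantity", "Gross income", "Unitary Price", "Nr_tickets"],
)

# Illustrative rows only; sale_id is an assumed identifier of a sale.
rows = [
    {"sale_id": 1, "Product": "Brillo", "Date": "2001-10-10", "Shop": "BigShop"},
    {"sale_id": 2, "Product": "Brillo", "Date": "2001-10-11", "Shop": "BigShop"},
]
ok = determines_dimensions(rows, "sale_id", sale.dimensions)
```

If the same `sale_id` appeared with two different dates, the check would return `False`, which is the situation the slide rules out.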

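DFM and E/R
[Figure: the same SALE fact in the two notations.
 DFM: fact SALE with dimensions Product, Date, Shop and measures Quantity, Gross income, Unitary Price, Nr_tickets
 E/R: relationship SALE between Product (Product_id, category) and Shop (Shop_id), with attributes gross income, unitary price, quantity, Nr_tickets, date]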

Dimensional attribute
• A dimensional attribute must assume discrete values, so that
it can contribute to representing a dimension

• Dimensional attributes can be organized into hierarchies

Hierarchy
• A dimensional hierarchy is a directed tree where
o Nodes are dimensional attributes
o Edges describe n:1 associations between pairs of dimensional attributes
o The root is the considered dimension

[Figure: hierarchy rooted in date.
 date → month → trimester → year
 date → week
 date → holiday]
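
As a rough sketch, each n:1 edge of the date hierarchy can be seen as a function from a finer attribute to a coarser one, and rolling up is function composition along a path. The tuple encodings of month, trimester and week below are assumptions made for the example; the holiday edge is left out because it would need a calendar of holidays.

```python
from datetime import date

# Each edge is an n:1 function from a finer attribute to a coarser one.
# Edge names follow the hierarchy above; the encodings are assumptions.
EDGES = {
    ("date", "month"): lambda d: (d.year, d.month),
    ("date", "week"): lambda d: tuple(d.isocalendar())[:2],  # (ISO year, ISO week)
    ("month", "trimester"): lambda m: (m[0], (m[1] - 1) // 3 + 1),
    ("trimester", "year"): lambda t: t[0],
}


def roll_up(value, path):
    """Follow a path of attributes, e.g. ["date", "month", "trimester"]."""
    for child, parent in zip(path, path[1:]):
        value = EDGES[(child, parent)](value)
    return value


day = date(2001, 10, 10)
month = roll_up(day, ["date", "month"])
trimester = roll_up(day, ["date", "month", "trimester"])
year = roll_up(day, ["date", "month", "trimester", "year"])
```

Many dates map to the same month, and many months to the same trimester, which is exactly the n:1 property required of the edges.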


Events and aggregations


• A primary event is an occurrence of a fact; it is represented by
means of a tuple of values

o On 10/10/2001, ten ‘Brillo’ detergent packets were sold at the BigShop for a total
amount of 25 euros

Events and aggregations (2)
• A hierarchy describes how it is possible to group and
select primary events

• The root of a hierarchy represents the finest aggregation
granularity present in the warehouse (e.g. sales one by
one, or by day, or by week, depending on what the
designer deems appropriate)


Events and aggregations

[Figure: data cubes at decreasing levels of detail. Axis labels: part, date; part type, month, shop; part type, month, shop city. Annotations: Sparsity, Aggregation operators.]
Events and aggregations (3)
• Given a set of dimensional attributes (pattern), each tuple of
their values identifies a secondary event that aggregates (all)
the corresponding primary events
• For each measure, a value is associated with the
secondary event; this value summarizes the values taken by
that measure in the corresponding primary events
• For example, the sales can be grouped by Product and Month:
o in October 2001, 230 ‘Brillo’ detergent packets were sold
at the BigShop for a total amount of 575 euros


Secondary event
• The sales can be further grouped by Product, Month, and City
• If we consider city, product and month as dimensional
attributes, the tuple
(city: ‘Rome’, product: ‘Brillo’, month: 10/2001)
identifies another secondary event
• It aggregates all the sales related to the product ‘Brillo’ in
shops of ‘Rome’ during the month October 2001
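
A minimal sketch of how secondary events arise from primary events: group the primary events by a pattern and summarize each measure, here with SUM. Only the first row comes from the slides; the other rows and the shop-to-city mapping are made up for illustration, so the totals do not reproduce the 230 packets / 575 euros of the example.

```python
from collections import defaultdict

# Primary events: (date, product, shop, quantity, gross_income).
# The first row is the slide's example; the others are made up.
PRIMARY_EVENTS = [
    ("2001-10-10", "Brillo", "BigShop", 10, 25.0),
    ("2001-10-11", "Brillo", "BigShop", 4, 10.0),
    ("2001-10-11", "Brillo", "SmallShop", 6, 15.0),
    ("2001-11-02", "Brillo", "BigShop", 2, 5.0),
]

# Assumed shop -> city mapping (one step up the shop hierarchy).
SHOP_CITY = {"BigShop": "Rome", "SmallShop": "Rome"}


def aggregate(events, pattern):
    """Group primary events by `pattern` (event -> tuple) and SUM the measures."""
    totals = defaultdict(lambda: [0, 0.0])
    for event in events:
        key = pattern(event)
        totals[key][0] += event[3]  # quantity
        totals[key][1] += event[4]  # gross income
    return {key: tuple(values) for key, values in totals.items()}


def by_product_month_shop(event):
    return (event[1], event[0][:7], event[2])


def by_product_month_city(event):
    return (event[1], event[0][:7], SHOP_CITY[event[2]])


per_shop = aggregate(PRIMARY_EVENTS, by_product_month_shop)
per_city = aggregate(PRIMARY_EVENTS, by_product_month_city)
```

Each key of `per_city`, such as `("Brillo", "2001-10", "Rome")`, plays the role of the tuple that identifies a secondary event.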

Descriptive attributes
• A descriptive attribute contains additional information about a
dimensional attribute
• They are uniquely determined by the corresponding dimensional attribute
• They are relevant for analytical purposes only as selection predicates

[Figure: fact SALE (measures Quantity, Gross income, Unitary Price, Nr_tickets) with dimensions Product, date, Shop; address and phone are descriptive attributes of Shop.]

Optional edges
• Some edges of a fact schema could be optional

[Figure: fact SALE with dimensions Product, date, Shop (descriptive attributes address, phone). The edge from Product to Diet is optional: Diet only assumes a value for food.]
Optional dimensions
The attribute Promotion:
• only assumes a value for products in promotion
• the other sales are characterized by the remaining attributes

[Figure: fact SALE with dimensions Product (optional attribute Diet), date, Shop (descriptive attributes address, phone), and the optional dimension Promotion.]

Cross-dimensional attributes
• A cross-dimensional attribute is a dimensional or a descriptive
attribute whose value is obtained by combining values of
some dimensional attributes
o For example, IVA (VAT) is computed based on the product category and the state
Convergence
• It is related to the structure of a hierarchy
o Two dimensional attributes can be connected by two or more distinct directed paths
o For example:
Shop → city → county → state
or
Shop → sale district → state
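
A convergence is only sound if both paths lead every shop to the same state. The following sketch checks this on made-up data; all the mappings and the helper names `follow` and `converges` are assumptions introduced for the example.

```python
# Assumed sample data: each dict is one n:1 edge of the shop hierarchy.
SHOP_CITY = {"BigShop": "Rome", "MiniShop": "Milan"}
CITY_COUNTY = {"Rome": "RM", "Milan": "MI"}
COUNTY_STATE = {"RM": "Italy", "MI": "Italy"}
SHOP_DISTRICT = {"BigShop": "D1", "MiniShop": "D2"}
DISTRICT_STATE = {"D1": "Italy", "D2": "Italy"}


def follow(value, edges):
    """Apply a sequence of n:1 edges to a starting value."""
    for edge in edges:
        value = edge[value]
    return value


def converges(shops, path_a, path_b):
    """True if both paths lead every shop to the same final value."""
    return all(follow(s, path_a) == follow(s, path_b) for s in shops)


via_city = [SHOP_CITY, CITY_COUNTY, COUNTY_STATE]
via_district = [SHOP_DISTRICT, DISTRICT_STATE]
consistent = converges(SHOP_CITY, via_city, via_district)
```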


Example
[Figure: fact SALE with measures Quantity, Gross income, Unitary price, Nr. tickets.
 Product → type → category; Product → trademark
 date → month → trimester → year; date → holiday
 Shop → city → county → state; Shop → sale district → state (convergence on state)]
Hierarchy Sharing
• In a fact schema, some portions of a hierarchy might be
duplicated
• As a shorthand we allow hierarchy sharing
• If the sharing starts with a dimensional attribute, it is necessary
to indicate the roles on the incoming edges
• Necessary condition: the uniqueness of the value must hold on
both branches


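Hierarchy Sharing

[Figure: fact CALL with measures Number and Duration. Labels: date, time, caller number, called number, phone, district, use. The roles caller and called label the two edges entering the shared phone number hierarchy.]

It is in fact a shorthand to represent the duplication of the whole hierarchy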
Multiple edges
• Recall: the dimension values must be uniquely determined by
the fact
• Some attributes, or some dimensions, may be related by a
many-to-many relationship

[Figure: fact SALE with measures Number and Gross income; dimensions book and date → month → year; book and author are connected by a multiple edge.]

• We denote them by multiple edges
• They are dealt with in a special way at logical design time

Measure Aggregation
• Aggregation requires specifying an operator to combine values
related to primary events into a single value related to a
secondary event (e.g. sum of sold quantity aggregated by
month)
• A measure is additive w.r.t. a given dimension iff the SUM
operator is applicable to that measure along that dimension

Measure Classification: Additivity
• Additive measures (flow or rate measures): Can be
meaningfully summarized using addition along all dimensions
o E.g., sales amount can be summarized when the hierarchies in Store, Time, and Product
dimensions are traversed
• Semiadditive measures (stock or level measures): Can be
meaningfully summarized using addition along some (not all)
dimensions
o E.g., inventory quantities can be aggregated in the Store dimension, but cannot be
aggregated in the Time dimension
• Nonadditive measures (value-per-unit measures): Cannot be
meaningfully summarized using addition along any dimension
o E.g., item price, cost per unit, exchange rate
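
To make the semiadditive case concrete, the sketch below uses made-up inventory snapshots. Summing the stock level across stores on one day gives a meaningful total, while summing one store across days does not; along time, operators such as AVG or MIN remain meaningful.

```python
from statistics import mean

# Made-up inventory snapshots: (store, day, level_in_stock).
SNAPSHOTS = [
    ("S1", "2001-10-01", 100),
    ("S1", "2001-10-02", 80),
    ("S2", "2001-10-01", 50),
    ("S2", "2001-10-02", 70),
]

# Summing across stores on a single day is meaningful: total stock that day.
stock_on_oct_1 = sum(level for _, day, level in SNAPSHOTS if day == "2001-10-01")

# Summing one store across days is not: 100 + 80 units never existed together.
s1_levels = [level for store, _, level in SNAPSHOTS if store == "S1"]
misleading_sum = sum(s1_levels)

# Along time, other operators such as AVG or MIN still make sense.
s1_avg = mean(s1_levels)
s1_min = min(s1_levels)
```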

Elzbieta Malinowski & Esteban Zimányi 2008

The n. of tickets is non-additive (and in general non-aggregable) w.r.t. the product
• By n. of tickets we mean the n. of purchases, i.e. the ticket count
• The association between product and ticket is many-to-many
• E.g. by summing up the ticket count on the product type we count the same ticket twice
if it contains two products of that type

Ticket  Product  Type
S1      P1       T1
S1      P2       T1
S2      P1       T1
S2      P3       T2

how many tickets containing p1? → 2
how many tickets containing p2? → 1
how many tickets containing p3? → 1
how many tickets with products of type t1? → 2

BUT
Sum(tickets with type(product) = t1) = 3 !!!
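
The numbers in this example can be reproduced with a short sketch that compares the wrong roll-up (summing the per-product ticket counts) with the right one (counting distinct tickets). The function and variable names are illustrative only.

```python
# Rows of the Ticket / Product / Type table in the example.
ROWS = [
    ("S1", "P1", "T1"),
    ("S1", "P2", "T1"),
    ("S2", "P1", "T1"),
    ("S2", "P3", "T2"),
]


def tickets_per_product(rows):
    """Number of distinct tickets that contain each product."""
    tickets = {}
    for ticket, product, _ in rows:
        tickets.setdefault(product, set()).add(ticket)
    return {product: len(found) for product, found in tickets.items()}


per_product = tickets_per_product(ROWS)
product_type = {product: ptype for _, product, ptype in ROWS}

# Wrong: roll up by summing the per-product ticket counts.
summed_t1 = sum(n for p, n in per_product.items() if product_type[p] == "T1")

# Right: count the distinct tickets with at least one T1 product.
distinct_t1 = len({ticket for ticket, _, ptype in ROWS if ptype == "T1"})
```

Ticket S1 contains both P1 and P2, so the sum counts it twice; this is why the ticket count has to be recomputed from the primary events rather than added up.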

Aggregability

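[Figure: fact INVENTORY with measures Level in stock and Incoming Qty.
 date → month → year
 storehouse → city → state; address is attached to storehouse
 The arc between Level in stock and the date dimension is labelled AVG, MIN]

This arc means that the measure Level in stock is non-additive w.r.t. the time
dimension, but it is possible to aggregate it using the AVG and MIN operators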

Empty fact schemata

[Figure: fact Course attendance with dimensions student, course (with attribute Faculty) and semester → year; the only value shown in the fact is count.]

A fact schema is empty if there are no measures.
In this case, the default measure is the count
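
For an empty fact, aggregation reduces to counting the primary events that fall into each secondary event. A small sketch with made-up attendance rows (the names and the "YYYY-Sn" semester encoding are assumptions):

```python
from collections import Counter

# Made-up primary events of an empty fact: only coordinates, no measures.
ATTENDANCE = [
    ("alice", "Databases", "2023-S1"),
    ("bob", "Databases", "2023-S1"),
    ("alice", "Algorithms", "2023-S2"),
]

# Secondary events by (course, semester): the value is the event count.
per_course = Counter((course, semester) for _, course, semester in ATTENDANCE)

# Rolling semester up to year (assumed encoding "YYYY-Sn").
per_year = Counter(semester[:4] for _, _, semester in ATTENDANCE)
```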

Conceptual design


Conceptual design
• Conceptual design takes into account the documentation
related to the integrated, reconciled input database
o Conceptual schema (e.g. Entity/Relationship)
o Logical schema (e.g. relational, XML… )

Top-down methodology
1. Fact definition (a subject-oriented collection of data!)
2. For each fact:
   1. Attribute tree definition
   2. Attribute tree editing
   3. Dimension definition
   4. Measure definition
   5. Fact schema creation


Starting from the E/R schema

[Figure: E/R schema of the source database. Entities: Product, Type, Category, Division, Marketing Group, TradeMark, Storehouse, Ticket, Shop, Sale District, City, Country, State. The sale relationship links Product and Ticket and carries quantity and unitary price.]
Starting from the Relational Schema
Product(product,weight,dimension,trademark:TradeMark,type:Type)
Shop(shop,address,phone,salemanager,(districtnr,state):District,city:City)
Ticket(nrticket,date,shop:Shop)
Sale(product:Product,nrticket:Ticket,quantity,unitaryprice)
Storehouse(storehouse,address)
City(city,country:Country)
Country(country,state:State)
State(state)
District(district,state:State)
Prod_Storehouse(product:Product,storehouse:Storehouse)
TradeMark(trademark,madein:City)
Type(type,marketinggroup:MarketingGroup,category:Category)
MarketingGroup(marketinggroup,manager)
Category(category,division:Division)
Division(division,divisionhead)


Fact definition
• Facts correspond to events that dynamically happen in the
organization

o In an E/R schema, a fact can correspond to an entity F or to an association among n entities
E1, E2, …, En
o In a relational schema, a fact corresponds to a relation (table) R

Fact definition
• Good fact candidates: entities or relationships representing
frequently updated data

• Static archives: NO!

• Remark: when a fact is identified, it becomes the root of a
new fact schema


Attribute tree definition


• The attribute tree is composed of:
o Nodes, corresponding to attributes (simple or complex) of the source schema
o Root, corresponding to the primary key of the fact F
o For each node, the corresponding attribute uniquely determines its descendant
attributes
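
As a simplified sketch of this definition, the tree can be built by starting from the key of the fact and recursively following functional dependencies (non-key attributes and foreign keys). The `DETERMINES` table below encodes only a fragment of the relational schema shown earlier, and `build_tree` and `render` are hypothetical helper names; a complete procedure would also have to deal with cycles and with attributes that share a name.

```python
# Functional dependencies read off a fragment of the relational schema:
# key attribute -> attributes it determines (non-key attributes, foreign keys).
# Only part of the schema is encoded, to keep the sketch short.
DETERMINES = {
    "product+nrticket": ["product", "nrticket", "quantity", "unitaryprice"],
    "product": ["weight", "dimension", "trademark", "type"],
    "nrticket": ["date", "shop"],
    "shop": ["address", "phone", "salemanager", "city"],
    "type": ["marketinggroup", "category"],
    "category": ["division"],
}


def build_tree(root):
    """Recursively expand `root` into a nested dict {attribute: subtree}."""
    return {child: build_tree(child) for child in DETERMINES.get(root, [])}


def render(tree, indent=0):
    """Return the tree as indented text lines, one attribute per line."""
    lines = []
    for node, subtree in tree.items():
        lines.append("  " * indent + node)
        lines.extend(render(subtree, indent + 1))
    return lines


# The root is the primary key of the Sale relation.
tree = build_tree("product+nrticket")
print("\n".join(render(tree)))
```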

Attribute tree: example
[Figure: attribute tree of the sale fact, rooted in product + ticket nr.
 Nodes shown: quantity, unitary price, product, ticket nr.
 Around product: weight, dimension, trademark (city, country, state), type (category, department, dept head; marketing group, manager)
 Around ticket nr.: date, shop (address, phone, sales manager, city, country, state, district nr. + state, district nr.)
 Also shown: prod + storehouse, storehouse, address]

Attribute tree editing


• The editing phase allows removing attributes that are
irrelevant for the data mart
o Pruning of a node v: the subtree rooted in v is deleted
o Grafting of a node v: the children of v are directly connected to the parent of v, and v itself is removed
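
Both operations are easy to sketch on a tree stored as nested dictionaries, where the top-level dictionary holds the children of the root. The representation and the function names are assumptions made for this example, which uses a fragment of the attribute tree.

```python
def prune(tree, target):
    """Return a copy of `tree` without `target` and its whole subtree."""
    return {
        node: prune(subtree, target)
        for node, subtree in tree.items()
        if node != target
    }


def graft(tree, target):
    """Return a copy of `tree` where `target` is removed and its children take its place."""
    result = {}
    for node, subtree in tree.items():
        if node == target:
            # The children of the grafted node move up to its parent.
            result.update(graft(subtree, target))
        else:
            result[node] = graft(subtree, target)
    return result


# A fragment of the example attribute tree, as nested dicts {attribute: subtree}.
tree = {
    "product": {"weight": {}, "dimension": {}, "type": {"category": {}}},
    "ticket nr.": {"date": {}, "shop": {"address": {}, "phone": {}}},
    "quantity": {},
}

edited = prune(tree, "dimension")      # drop an irrelevant attribute
edited = graft(edited, "ticket nr.")   # date and shop become children of the root
```

Both functions return new trees, so the original tree stays available for comparison.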

Attribute tree editing: example
[Figure: the attribute tree of the sale fact with all the nodes of the previous example, as the starting point for editing.]

Attribute tree editing: example

[Figure: the attribute tree after editing. The nodes dimension, district nr., prod + storehouse, storehouse (with its address) and one of the two country nodes no longer appear; ticket nr. no longer appears as a separate node, while date and shop remain.]
Dimension definition
• Dimensions can be chosen among the children of the root
• Time should always be a dimension
o Historical source: time is an attribute
o Snapshot source: time is not always explicitly represented. In this case it is necessary to
add it.


Dimensions definition: example

[Figure: the edited attribute tree, with the nodes chosen as dimensions highlighted (legend: dimension).]
Measure definition
• If the fact identifier (set of attributes) is included in the set of
dimensions, then numerical attributes that are children of the
root (fact) are measures
• Further measures are defined by applying aggregate functions
to numerical attributes of the tree
o Generally: sum, average, min, max, count

• It is possible that a fact has no measures (empty)


Measure definition: example

[Figure: the edited attribute tree, with the nodes chosen as measures highlighted (legend: measure).]
Glossary
• In the glossary, an expression is associated with each measure
o The expression describes how we obtain the measure at the different levels of
aggregation starting from the attributes of the source schema


Measure definition: example

[Figure: the edited attribute tree, followed by the glossary of the measures.]

Quantity = SUM(Sale.quantity)
Gross income=SUM(Sale.quantity*Sale.unitaryprice)
Unitary price=AVG(Sale.unitaryprice)
Nr-tickets=COUNT(*)
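
The glossary expressions can be evaluated with a short sketch that groups the source rows by the chosen dimensions and applies SUM, AVG and COUNT. The rows are made up and `measures_by` is a hypothetical helper; the row layout assumes Sale has already been joined with Ticket to obtain date and shop.

```python
from collections import defaultdict

# Made-up rows of Sale joined with Ticket:
# (product, nrticket, date, shop, quantity, unitaryprice).
SALES = [
    ("Brillo", 1, "2001-10-10", "BigShop", 10, 2.5),
    ("Brillo", 2, "2001-10-10", "BigShop", 2, 2.5),
    ("Soap", 2, "2001-10-10", "BigShop", 1, 1.0),
]


def measures_by(rows, key):
    """Group rows by `key` and evaluate the four glossary expressions per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    result = {}
    for group_key, group in groups.items():
        result[group_key] = {
            "Quantity": sum(r[4] for r in group),                    # SUM(quantity)
            "Gross income": sum(r[4] * r[5] for r in group),         # SUM(quantity*unitaryprice)
            "Unitary price": sum(r[5] for r in group) / len(group),  # AVG(unitaryprice)
            "Nr-tickets": len(group),                                # COUNT(*)
        }
    return result


by_product_date_shop = measures_by(SALES, lambda r: (r[0], r[2], r[3]))
```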

Fact schema creation
• The attribute tree is translated into a fact schema including
dimensions and measures
o Dimension hierarchies correspond to subtrees having as roots the different dimensions
(with the finest granularity)
o The fact name corresponds to the name of the selected entity


Fact schema creation: example

[Figure: fact SALE with measures Quantity, Gross income, Unitary price (AVG), Nr-tickets.
 product → type → category → department → dept head; type → marketing group → manager; product → trademark; product → weight
 date → month → trimester → year; date → holiday
 shop → city → county → state; shop → sales district; shop → manager; phone and address are marked as descriptive attributes]
Exercise
• The E/R schema is a portion of a database related to a video
content streaming service. Starting from this DB, we want to
build a DW to make decisions regarding the catalog of
contents for the following season and advertising to
customers.
• In particular, we want to analyze:
o Which TV series were preferred in the last year (highest number of
views); it must also be possible to get details about the individual seasons or
single episodes;
o Which are the most successful series (highest number of views) for a type of customer
or a geographical area


[Figure: E/R schema of the streaming service. Entity and relationship labels: TV Network, Production, Series, Include, Season, composition, Episode, Cast, Actor, Customer, Subscription, View, City, Province, Nation. Attribute labels: ID, Name, Surname, URL, Description, Cost, Title, Data, Length.]
