Data Warehouse Data Modelling - Vincent Rainardi

The document discusses best practices for data modeling in a data warehouse, including defining dimensions, facts, and measures based on business events; establishing a star schema with dimension tables linked to a central fact table; and handling slowly changing dimensions and degenerate dimensions to properly model historical data over time. It provides a case study example of modeling subscription data for a publishing company to facilitate analysis of metrics like subscriptions by customer, publication, and media type.
DATA WAREHOUSE DATA MODELLING

SQLbits IV
Manchester
28th March 2009

Vincent Rainardi
2
Vincent Rainardi

•Data warehousing & BI


•Data warehousing book on SQL Server
•Data warehousing articles in SQLServerCentral.com
[email protected]

About you
•Data warehousing
•Data modelling
•Dimensional modelling
3
Data Warehouse Data Modelling

•What is it
•Why is it important
•How to do it (case study)
•Miscellaneous topics (time permitting)
•Questions
4
Data Warehouse

A data warehouse is a system that retrieves and consolidates data periodically from source systems into a dimensional or normalized data store. It usually keeps years of history and is queried for business intelligence or other analytical activities. It is typically updated in batches, not every time a transaction happens in the source system.
5
Data Store

•Flat files
•Cubes
•Database
•Relational
•Normalised
•Denormalised
•Dimensional
•Flat

•Stage
•Operational Data Store (ODS)
•Normalized Data Store (NDS)
•Dimensional Data Store (DDS)
•Multi-dimensional Database (MDB)
•Metadata
•Data Quality
•Standing Data
6
Data Model
Defines how the data is arranged within the data store
Defines the relationships between entities (elements)

The data model most appropriate for a data store depends on the function of the data store.
Stage: Dimensional? Normalised?
ODS: Dimensional? Flat?

Dimensional                      Normalised
•Particular business events      •All business events
•Query oriented                  •Efficient to update
•Large data packets              •Small data packets
•Multiple versions               •Single version
•Analytics                       •Operational
7
Why is it important

• Functionality: it defines the data warehouse (what’s available and what’s not)
• Foundation on which ETL, DQ, reports and cubes are built (costly to rectify)
• Performance: loading and query

[Diagram: ETL, DQ, report and cube resting on the data model]
8
Case Study: Valerie Media Group

Publish and send newsletters, articles, white papers, news alerts
• Daily, weekly, monthly
• IT, travel, health care, consumer retail (Business Unit)
• Email, RSS, text, web site

Publications are managed by business units.
Customers subscribe via agencies.

The business needs to analyze subscriptions by: customer demographic, publication type, media and cost
9
Business Events
• Event 1: A customer subscribes via an agent to a publication issued by a business unit, to be delivered via a certain media
• Event 2: A business unit sends a certain edition of a publication to 2M subscribers via a certain network, on a certain media
• Other events: customer payment/refund, renewal, publish a new pub, deactivate/reactivate a pub, change email address, agency payment, cancel subscription, ...
10
Source System
11
Star Schema

dimension dimension

dimension fact dimension

dimension dimension

Dimensional Model aka Kimball method


Query performance (OLAP) and flexibility
12
Steps

1. Identify event, dimensions, measures
2. Define grain
3. Add attributes and measures
4. Add natural keys
5. Add surrogate keys
6. Add role-playing dimensions
7. Add degenerate dimensions
8. Add junk dimensions
9. Add fact key
13
Event, Dimension, Measure

Subscription Event

Event: a point in the business process
A customer subscribes via an agent to a publication issued by a business unit to be delivered via a certain media

Dimension: party/object involved in the event
The who, what, whom: customer, publication, BU, media, agent (+ when, where)

Measure: the amount in the event
unit, fee, discount, paid
14
Dimensions

Date Customer

Media Subscription Agent

Business Unit Publication

Grain: a row in this fact table corresponds to ...
A customer subscribes to a publication
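
The star around the Subscription fact can be sketched in a few lines. This is only an illustration, assuming SQLite through Python's `sqlite3` module rather than SQL Server; the table names, column names and sample rows (apart from the customer name Andy, which appears later in the deck) are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical, cut-down dimensions (three of the seven on the slide)
    CREATE TABLE dim_customer    (customer_key INTEGER PRIMARY KEY, customer_name TEXT);
    CREATE TABLE dim_publication (publication_key INTEGER PRIMARY KEY, publication_title TEXT);
    CREATE TABLE dim_media       (media_key INTEGER PRIMARY KEY, media_name TEXT);

    -- Grain: one row per customer subscribing to a publication
    CREATE TABLE fact_subscription (
        customer_key    INTEGER REFERENCES dim_customer(customer_key),
        publication_key INTEGER REFERENCES dim_publication(publication_key),
        media_key       INTEGER REFERENCES dim_media(media_key),
        fee             REAL
    );

    INSERT INTO dim_customer    VALUES (1, 'Andy'), (2, 'Beth');
    INSERT INTO dim_publication VALUES (1, 'IT Weekly'), (2, 'Travel Monthly');
    INSERT INTO dim_media       VALUES (1, 'Email'), (2, 'RSS');
    INSERT INTO fact_subscription VALUES (1, 1, 1, 10.0), (2, 1, 1, 10.0), (2, 2, 2, 5.0);
""")

# Analyse subscriptions by media: join the fact to one dimension and aggregate
by_media = conn.execute("""
    SELECT m.media_name, COUNT(*) AS subscriptions, SUM(f.fee) AS total_fee
    FROM fact_subscription f
    JOIN dim_media m ON m.media_key = f.media_key
    GROUP BY m.media_name
    ORDER BY m.media_name
""").fetchall()
```

Swapping `dim_media` for another dimension in the join gives the same analysis by customer or by publication, which is the flexibility the star schema is chosen for.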
15
Attributes & Measures

Date: Date, Month, Year, ...
Customer: Customer Name, Address, Email Address, Registration Date, ...
Media: Media Code, Media Name, Format, ...
Agent: Agent Name, Category, Fee Type, Active Subscribers, ...
Business Unit: Short Name, Industry, Manager, ...
Publication: Publication Title, Frequency, Editor, First Edition Date, ...
Subscription (fact): Unit, Fee, Discount, Paid

Grain: a customer subscribes to a publication
16
Natural Key

Date: Date, Month, Year
Customer: Customer ID, Customer Name, Address, Email Address, Registration Date
Media: Media Code, Media Name, Format
Agent: Agent ID, Agent Name, Category, Fee Type, Active Subscribers
Business Unit: Business Unit ID, Short Name, Industry, Manager
Publication: Publication ID, Publication Title, Frequency, Editor, First Edition Date
Subscription (fact): Unit, Fee, Discount, Paid

The primary key in the source system
17
Surrogate Keys

Why:
• Multiple sources
• Change of natural key
• Maintain history
• Unknown, N/A, Late Arriving
• Performance

How:
• Integer
• Identity
• 0, -1
• Dim PK
• Clustered index
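
A minimal sketch of these points, assuming SQLite via Python's `sqlite3` in place of SQL Server (where an `IDENTITY` column would play the same role). The table, the column names, the `source_system` column and the choice of 0 for the unknown row are illustrative assumptions, not taken from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_customer (
        customer_key  INTEGER PRIMARY KEY,  -- surrogate key: integer, auto-assigned, dimension PK
        customer_id   TEXT NOT NULL,        -- natural key from the source system
        source_system TEXT NOT NULL,        -- hypothetical column: which source the row came from
        customer_name TEXT
    )
""")

# Reserved row for unknown / not applicable / late-arriving customers
conn.execute("INSERT INTO dim_customer VALUES (0, 'UNKNOWN', 'N/A', 'Unknown')")

def add_customer(customer_id, source_system, name):
    """Insert a dimension row and return the surrogate key the database assigned."""
    cur = conn.execute(
        "INSERT INTO dim_customer (customer_id, source_system, customer_name) VALUES (?, ?, ?)",
        (customer_id, source_system, name),
    )
    return cur.lastrowid

def lookup_customer_key(customer_id, source_system):
    """Return the surrogate key for a natural key, or 0 (unknown) if there is no match."""
    row = conn.execute(
        "SELECT customer_key FROM dim_customer WHERE customer_id = ? AND source_system = ?",
        (customer_id, source_system),
    ).fetchone()
    return row[0] if row else 0

# The same natural key arriving from two sources gets two distinct surrogate keys
key_web = add_customer("C100", "web", "Andy")
key_agency = add_customer("C100", "agency", "Beth")
missing = lookup_customer_key("C999", "web")
```

Falling back to key 0 lets a fact row load even when its customer has not reached the dimension yet.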
18
Result
19
What Date?

Role-playing dimension
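
One way to picture a role-playing dimension: a single date dimension joined to the fact table more than once, each time under a different alias. The sketch below assumes SQLite via Python's `sqlite3`; the two roles (subscription start and end date) and all names are hypothetical, since the slide does not say which dates the fact carries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, full_date TEXT, year INTEGER);

    -- Two foreign keys point at the same dimension table, one per role
    CREATE TABLE fact_subscription (
        start_date_key INTEGER REFERENCES dim_date(date_key),
        end_date_key   INTEGER REFERENCES dim_date(date_key),
        fee            REAL
    );

    INSERT INTO dim_date VALUES (1, '2009-03-28', 2009), (2, '2010-03-27', 2010);
    INSERT INTO fact_subscription VALUES (1, 2, 50.0);
""")

# dim_date plays two roles in one query, distinguished only by alias
rows = conn.execute("""
    SELECT s.full_date AS start_date, e.full_date AS end_date, f.fee
    FROM fact_subscription f
    JOIN dim_date s ON s.date_key = f.start_date_key
    JOIN dim_date e ON e.date_key = f.end_date_key
""").fetchall()
```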
20
Degenerate Dimension

The identifier (PK) of a transaction table


21
Junk Dimension

Low cardinality
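
A junk dimension can be pre-populated with every combination of a few low-cardinality attributes, so the fact table carries one key instead of several flag columns. The attributes and values below are hypothetical examples, not from the case study.

```python
from itertools import product

# Hypothetical low-cardinality attributes
payment_methods = ["card", "direct debit", "invoice"]
auto_renew_flags = ["Y", "N"]
gift_flags = ["Y", "N"]

# One junk dimension row per combination, keyed by a small integer surrogate key
junk_dim = {
    combo: key
    for key, combo in enumerate(
        product(payment_methods, auto_renew_flags, gift_flags), start=1
    )
}

def junk_key(payment_method, auto_renew, gift):
    """Look up the single junk dimension key for a combination of flags."""
    return junk_dim[(payment_method, auto_renew, gift)]

row_count = len(junk_dim)  # 3 * 2 * 2 combinations
```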
22
Fact Key

• To enable referring to a fact table row
• SQL Server: clustered index

• Identity
• Bigint
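
A rough sketch of a fact key, again assuming SQLite via Python's `sqlite3` (in SQL Server the key column would typically be a `BIGINT IDENTITY`). The `subscription_id` column illustrates the degenerate dimension from the earlier slide; all names and values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE fact_subscription (
        subscription_fact_key INTEGER PRIMARY KEY AUTOINCREMENT,  -- fact key (64-bit in SQLite)
        customer_key    INTEGER NOT NULL,
        publication_key INTEGER NOT NULL,
        subscription_id TEXT,   -- degenerate dimension: source transaction identifier
        fee             REAL
    )
""")

def load_fact(customer_key, publication_key, subscription_id, fee):
    """Insert one fact row and return its generated fact key."""
    cur = conn.execute(
        "INSERT INTO fact_subscription (customer_key, publication_key, subscription_id, fee) "
        "VALUES (?, ?, ?, ?)",
        (customer_key, publication_key, subscription_id, fee),
    )
    return cur.lastrowid

first_key = load_fact(1, 1, "S-0001", 10.0)
second_key = load_fact(2, 1, "S-0002", 10.0)

# The fact key makes it possible to refer to exactly one row, e.g. to correct it
conn.execute(
    "UPDATE fact_subscription SET fee = ? WHERE subscription_fact_key = ?",
    (12.0, second_key),
)
fees = conn.execute(
    "SELECT subscription_fact_key, fee FROM fact_subscription ORDER BY subscription_fact_key"
).fetchall()
```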
23
Result
24
So Far
• Event, Dimensions, Measures
• Grain
• Attributes & Measures
• Natural Keys
• Surrogate Keys
• Role-playing Dimension
• Degenerate Dimension
• Junk Dimension
• Fact Key

Next
• Slowly Changing Dimension
• Snowflake
25
Slowly Changing Dimension
Type 1: Overwrite old values
Before:
Key  Name  Email
1    Andy  [email protected]
After:
Key  Name  Email
1    Andy  [email protected]

Type 2: Create a new row (keep old values)
Before:
Key  Name  Email
1    Andy  [email protected]
After:
Key  Name  Email
1    Andy  [email protected]
2    Andy  [email protected]

Type 3: Put old values in another column
Before:
Key  Name  Email
1    Andy  [email protected]
After:
Key  Name  Email              Previous Email
1    Andy  [email protected]  [email protected]
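
The three types can be sketched as plain functions over a list of rows. This is an illustration only; the email addresses `old@example.com` and `new@example.com` are placeholders, because the addresses on the slide are not legible in this copy.

```python
def scd_type1(rows, key, new_email):
    """Type 1: overwrite the old value in place; no history is kept."""
    for row in rows:
        if row["key"] == key:
            row["email"] = new_email
    return rows

def scd_type2(rows, key, new_email):
    """Type 2: leave the old row untouched and append a new row with a new key."""
    old = next(row for row in rows if row["key"] == key)
    new_key = max(row["key"] for row in rows) + 1
    rows.append({"key": new_key, "name": old["name"], "email": new_email})
    return rows

def scd_type3(rows, key, new_email):
    """Type 3: move the old value to a 'previous' column, then overwrite."""
    for row in rows:
        if row["key"] == key:
            row["previous_email"] = row["email"]
            row["email"] = new_email
    return rows

def before():
    """The single 'Before' row from the slide, with a placeholder address."""
    return [{"key": 1, "name": "Andy", "email": "old@example.com"}]

type1_after = scd_type1(before(), 1, "new@example.com")
type2_after = scd_type2(before(), 1, "new@example.com")
type3_after = scd_type3(before(), 1, "new@example.com")
```

Only type 2 changes the row count, which is why it is the one that needs a surrogate key distinct from the natural key.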
26
Slowly Changing Dimension Type 2
Key  Name  Email              Valid From   Valid To     Current
1    Andy  [email protected]  1900-01-01   2009-03-27   N
2    Andy  [email protected]  2009-03-28   9999-12-31   Y

• Valid From & Valid To (a.k.a. Effective Date & Expiry Date)
  • To put the right surrogate key in the fact table
  • Datetime (not date)
• Current Flag: to query the current version

Not all attributes are type 2:
• Attribute 1,2,3: type 1 (update)
• Attribute 4,5,6: type 2 (new row)
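
A small sketch of how the validity range picks the surrogate key for a fact row. It reuses the two rows from the table above; the natural key `C1`, the end-of-day time on Valid To, the inclusive range check and the fallback key 0 are assumptions made for the example.

```python
from datetime import datetime

UNKNOWN_KEY = 0  # assumed surrogate key when no version matches

# The two versions of Andy, with datetimes rather than dates
dim_customer = [
    {"key": 1, "customer_id": "C1",
     "valid_from": datetime(1900, 1, 1),
     "valid_to": datetime(2009, 3, 27, 23, 59, 59), "current": "N"},
    {"key": 2, "customer_id": "C1",
     "valid_from": datetime(2009, 3, 28),
     "valid_to": datetime(9999, 12, 31, 23, 59, 59), "current": "Y"},
]

def surrogate_key_at(customer_id, event_time):
    """Return the key of the dimension version that was valid when the event happened."""
    for row in dim_customer:
        if (row["customer_id"] == customer_id
                and row["valid_from"] <= event_time <= row["valid_to"]):
            return row["key"]
    return UNKNOWN_KEY

def current_key(customer_id):
    """Return the key of the current version, using the current flag."""
    for row in dim_customer:
        if row["customer_id"] == customer_id and row["current"] == "Y":
            return row["key"]
    return UNKNOWN_KEY

key_before_change = surrogate_key_at("C1", datetime(2009, 3, 27, 15, 0))
key_after_change = surrogate_key_at("C1", datetime(2009, 3, 28, 9, 0))
```

With date-only columns, two changes on the same day could not be told apart, which is one reading of the "datetime (not date)" advice.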
27
Snowflake
[Diagram: a fact table linked to main dimensions, each main dimension linked in turn to further dimension tables]
28
Snowflake

Product, product group, product category
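
The product hierarchy might be snowflaked as three linked tables, as in the sketch below. It assumes SQLite via Python's `sqlite3`; the table names, column names and sample rows are hypothetical. The view at the end shows the same data flattened into the single wide dimension a star schema would use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Snowflake: each level of the hierarchy is its own table, linked FK to PK
    CREATE TABLE dim_product_category (
        product_category_key INTEGER PRIMARY KEY,
        category_name        TEXT
    );
    CREATE TABLE dim_product_group (
        product_group_key    INTEGER PRIMARY KEY,
        group_name           TEXT,
        product_category_key INTEGER REFERENCES dim_product_category(product_category_key)
    );
    CREATE TABLE dim_product (
        product_key       INTEGER PRIMARY KEY,
        product_name      TEXT,
        product_group_key INTEGER REFERENCES dim_product_group(product_group_key)
    );

    INSERT INTO dim_product_category VALUES (1, 'Newsletters');
    INSERT INTO dim_product_group    VALUES (1, 'IT', 1), (2, 'Travel', 1);
    INSERT INTO dim_product          VALUES (1, 'IT Weekly', 1), (2, 'Travel Monthly', 2);

    -- Star-style flattening of the same hierarchy
    CREATE VIEW v_dim_product_flat AS
    SELECT p.product_key, p.product_name, g.group_name, c.category_name
    FROM dim_product p
    JOIN dim_product_group g    ON g.product_group_key = p.product_group_key
    JOIN dim_product_category c ON c.product_category_key = g.product_category_key;
""")

flat = conn.execute("SELECT * FROM v_dim_product_flat ORDER BY product_key").fetchall()
```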


29
Miscellaneous Topics

•What is it
•Why is it important
•How to do it
•Miscellaneous topics
•Smart Date Key
•Dimensional Grain
•Real Time Fact Table

•Questions
30
Smart Date Key
8 digit integer YYYYMMDD

Why use Smart Date Key?
• Fact table partitioning
• Reference dimension
• Measure group partition
• No lookup (everywhere)

Why not?
• Multiple sources X
• Change of natural key X
• Maintain history X
• Unknown, N/A, Late Arriving X
• Performance X

Unknown date?
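
The conversion itself is simple arithmetic, sketched below. Reserving 0 for an unknown date is only one possible answer to the question on the slide and is an assumption here.

```python
from datetime import date

UNKNOWN_DATE_KEY = 0  # assumed reserved key for an unknown date

def smart_date_key(d):
    """Convert a date to an 8-digit YYYYMMDD integer, with no dimension lookup."""
    if d is None:
        return UNKNOWN_DATE_KEY
    return d.year * 10000 + d.month * 100 + d.day

def date_from_key(key):
    """Convert a YYYYMMDD integer back to a date; the unknown key maps to None."""
    if key == UNKNOWN_DATE_KEY:
        return None
    return date(key // 10000, key // 100 % 100, key % 100)

talk_date_key = smart_date_key(date(2009, 3, 28))
```

Because the key can be computed from the date alone, the load can fill it in without joining to the date dimension, and the keys sort in date order for partitioning.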
31
Dimension Grain
• Dim Product Line (PL): 2 attributes, product_key
• Dim Product (P): 10 attributes, product_grp_key
• Dim Product Group (PG): 5 attributes
Combine into 1 dimension?

Snowflake (3 tables, linked FK-PK; attribute counts in brackets):
Fact 1 -> PL (2) -> P (10) -> PG (5)
Fact 2 -> P (10) -> PG (5)
Fact 3 -> PG (5)

Star (attribute counts in brackets):
Fact 1 -> PL (17)
Fact 2 -> P (15)
Fact 3 -> PG (5)

3 tables:
• Different surrogate keys
• More flexible (attributes)
1 table with 3 views:
• Same surrogate keys
• Simpler load
32
Real Time Fact Table
Updated every time a transaction happens in the source system
• Today’s transactions only
• Stored in surrogate keys
• Limited dim updates -> unknown SK
• Heap
• Union with main fact table on query

• Depends on frequency: telco, retail, insurance, utilities, CRM
• 1-2 fact tables only: transactional, narrow table
• Stored in natural keys: look up SK on query
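
A sketch of the first variant (surrogate keys, a small table for today's rows, union on query), assuming SQLite via Python's `sqlite3`. SQLite has no true heap tables, so leaving out keys and indexes only approximates that point; all names and values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Main fact table, loaded in batch
    CREATE TABLE fact_subscription (date_key INTEGER, customer_key INTEGER, fee REAL);

    -- Real-time table: today's transactions only, no keys or indexes to slow inserts
    CREATE TABLE fact_subscription_today (date_key INTEGER, customer_key INTEGER, fee REAL);

    -- Queries read both tables through one view
    CREATE VIEW v_fact_subscription AS
        SELECT date_key, customer_key, fee FROM fact_subscription
        UNION ALL
        SELECT date_key, customer_key, fee FROM fact_subscription_today;

    INSERT INTO fact_subscription VALUES (20090327, 1, 10.0);
""")

def record_transaction(date_key, customer_key, fee):
    """Insert a row as soon as the source transaction happens."""
    conn.execute(
        "INSERT INTO fact_subscription_today VALUES (?, ?, ?)",
        (date_key, customer_key, fee),
    )

# The customer is not in the dimension yet, so the row carries the unknown key 0
record_transaction(20090328, 0, 5.0)

total_rows, total_fee = conn.execute(
    "SELECT COUNT(*), SUM(fee) FROM v_fact_subscription"
).fetchone()
```

In this arrangement the nightly batch would move the day's rows into the main table and empty the real-time table.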
33
Questions
• Event, dimensions, measures
• Grain
• Attributes and measures
• Natural keys
• Surrogate keys
• Role-playing dimensions
• Degenerate dimensions
• Junk dimensions
• Fact key
• Slowly Changing Dimension
• Snowflake
• Smart Date Key
• Dimensional Grain
• Real Time Fact Table
34
Further Resources

•Kimball & Ross: Data Warehouse Toolkit
•Imhoff, Galemmo, Geiger: Mastering Data Warehouse Design
•Kimball Group’s articles: www.kimballgroup.com
•Kimball Forum: forum.kimballgroup.com
