
Agile

Data Warehouse
Design
Collaborative Dimensional Modeling,
from Whiteboard to Star Schema
Lawrence Corr
with Jim Stagnitto

Agile Data Warehouse Design is a step-by-step guide for capturing data warehousing/business intelligence (DW/BI)
requirements and turning them into high performance dimensional models in the most direct way: by modelstorming
(data modeling + brainstorming) with BI stakeholders.

This book describes BEAM✲, an agile approach to dimensional modeling, for improving communication between data
warehouse designers, BI stakeholders and the whole DW/BI development team. BEAM✲ provides tools and techniques
that will encourage DW/BI designers and developers to move away from their keyboards and entity relationship based
tools and model interactively with their colleagues. The result is everyone thinks dimensionally from the outset!
Developers understand how to efficiently implement dimensional modeling solutions. Business stakeholders feel
ownership of the data warehouse they have created, and can already imagine how they will use it to answer their
business questions.

Within this book, you will learn:

✲ Agile dimensional modeling using Business Event Analysis & Modeling (BEAM✲)
✲ Modelstorming: data modeling that is quicker, more inclusive, more productive, and frankly more fun!
✲ Telling dimensional data stories using the 7Ws (who, what, when, where, how many, why and how)
✲ Modeling by example not abstraction; using data story themes, not crow’s feet, to describe detail
✲ Storyboarding the data warehouse to discover conformed dimensions and plan iterative development
✲ Visual modeling: sketching timelines, charts and grids to model complex process measurement – simply
✲ Agile design documentation: enhancing star schemas with BEAM✲ dimensional shorthand notation
✲ Solving difficult DW/BI performance and usability problems with proven dimensional design patterns

Lawrence Corr is a data warehouse designer and educator. As Principal of DecisionOne Consulting, he helps
organizations to review and simplify their data warehouse designs, and advises on visual data modeling techniques.
He regularly holds agile modeling workshops worldwide and has taught dimensional DW/BI skills to thousands of
business/IT professionals.

Jim Stagnitto is a data warehouse and master data management architect specializing in the healthcare, financial
services, and information service industries. He is the founder of the data warehousing and data mining consulting
firm Llumino.

decisionone.co.uk
modelstorming.com
Agile Data Warehouse Design
Collaborative Dimensional Modeling,
from Whiteboard to Star Schema

Lawrence Corr
with Jim Stagnitto
Agile Data Warehouse Design
by Lawrence Corr with Jim Stagnitto

Copyright © 2011, 2012, 2013 by Lawrence Corr. All rights reserved.

No part of this book may be reproduced in any form or by any electronic or mechanical means including information
presentation, storage and retrieval systems, without permission in writing from the copyright holder. The only
exception is by a reviewer, who may quote short excerpts in a review.

This eBook is free of copy protection or functionality restrictions. You may view or print it for
personal use as you see fit. You may make copies for your own personal use (e.g., one for use while
traveling and one on a home computer or backup device) but you may not give the eBook file to
other people. The file is personalized with an email address on the cover and other identifying
information and belongs to that email user. Ownership cannot be transferred or sold. You may print the eBook but
it has been formatted specifically for on-screen viewing with no blank pages so margins and facing pages will not
print correctly. Generally it is cheaper and more efficient to order a paperback copy than print out the entire book.

Published by DecisionOne Press, Burwood House, Leeds LS28 7UJ, UK.


Email: [email protected], Tel: +44 7971 964824.

Proofing: Laurence Hesketh, Geoff Hurrell
Illustrators: Gill Guile and Lawrence Corr
Cover Design: After Escher
Printing History:
November 2011: First Edition. January 2012: Revision. October 2012: Revision. May 2013: Revision.

Non-Printing History:
May 2013: eBook First Edition.

Displayed on recycled pixels

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks.
Where those designations appear in this book, and DecisionOne Press was aware of a trademark claim, the
designations have been printed in caps or initial caps.

Limit of Liability/Disclaimer of Warranty: The publisher and the author make no representations or warranties with
respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties,
including without limitation warranties of fitness for a particular purpose. No warranty may be created or extended
by sales or promotional materials. The advice and strategies contained herein may not be suitable for every situation.
Neither the publisher nor the author shall be liable for damages arising herefrom. The fact that an organization or
Website is referred to in this work as a citation and/or a potential source of further information does not mean that
the author or the publisher endorses the information the organization or Website may provide or recommendations
it may make. Further, readers should be aware that Internet Websites listed in this work may have changed or
disappeared between when this work was written and when it is read.

ISBN: 978-0-9568172-0-4 [2013-05-10]


For Lucy Corr

1923-2009

Thank you for listening


ABOUT THE AUTHORS
Lawrence Corr is a data warehouse designer and educator. As Principal of DecisionOne Consulting, he
helps organizations to improve their Business Intelligence systems through the use of visual data model-
ing techniques. He regularly teaches agile dimensional modeling courses worldwide and has taught
DW/BI skills to thousands of IT professionals since 2000.

Lawrence has developed and reviewed data warehouses for healthcare, telecommunications, engineering,
broadcasting, financial services and public service organizations. He held the position of data warehouse
practice leader at Linc Systems Corporation, CT, USA and vice-president of data warehousing products
at Teleran Technologies, NJ, USA. Lawrence was also a Ralph Kimball Associate and has taught data
warehousing classes for Kimball University in Europe and South Africa and contributed to Kimball
articles and design tips. He lives in Yorkshire, England with his wife Mary and daughter Aimee. Lawrence can be
contacted at:

[email protected]

Jim Stagnitto is a data warehouse and master data management architect specializing in the healthcare,
financial services and information service industries. He is the founder of the data warehousing and data
mining consulting firm Llumino.

Jim has been a guest contributor for Ralph Kimball’s Intelligent Enterprise column, and a contributing
author to Ralph Kimball & Joe Caserta’s The Data Warehouse ETL Toolkit. He lives in Bucks County,
PA, USA with his wife Lori, and their happy brood of pets. Jim can be contacted at:

[email protected]
ACKNOWLEDGEMENTS
We would like to express our gratitude to everyone who made this book about BEAM✲ possible
(using BEAM✲ notation):
CONTENTS
INTRODUCTION ................................................................................................................................. XVII
PART I: MODELSTORMING ................................................................................................................... 1

CHAPTER 1
HOW TO MODEL A DATA WAREHOUSE.............................................................................................. 3
OLTP VS. DW/BI: TWO DIFFERENT WORLDS ......................................................................................... 4
The Case Against Entity-Relationship Modeling .......................................................................... 5
Advantages of ER Modeling for OLTP ...................................................................................... 6
Disadvantages of ER Modeling for Data Warehousing............................................................. 6
The Case For Dimensional Modeling........................................................................................... 7
Star Schemas............................................................................................................................ 8
Fact and Dimension Tables ...................................................................................................... 8
Advantages of Dimensional Modeling for Data Warehousing................................................... 9
DATA WAREHOUSE ANALYSIS AND DESIGN .......................................................................................... 11
Data-Driven Analysis.................................................................................................................. 11
Reporting-Driven Analysis.......................................................................................................... 12
Proactive DW/BI Analysis and Design ....................................................................................... 13
Benefits of Proactive Design for Data Warehousing ............................................................... 14
Challenges of Proactive Analysis for Data Warehousing........................................................ 15
Proactive Reporting-Driven Analysis Challenges.................................................................... 15
Proactive Data-Driven Analysis Challenges............................................................................ 15
Data then Requirements: a ‘Chicken or the egg’ Conundrum................................................. 15
Agile Data Warehouse Design ................................................................................................... 16
Agile Data Modeling ................................................................................................................... 17
Agile Dimensional Modeling....................................................................................................... 18
Agile Dimensional Modeling and Traditional DW/BI Analysis ................................................. 19
Agile Data-Driven Analysis...................................................................................................... 19
Agile Reporting-Driven Analysis.............................................................................................. 19
Requirements for Agile Dimensional Modeling ....................................................................... 19
BEAM✲ ............................................................................................................................................ 21
Data Stories and the 7Ws Framework ....................................................................................... 21
Diagrams and Notation .............................................................................................................. 21
BEAM✲ (Example Data) Tables ............................................................................................. 21
BEAM✲ Short Codes.............................................................................................................. 22
Comparing BEAM✲ and Entity-Relationship Diagrams.......................................................... 22
Data Model Types ................................................................................................................... 23
BEAM✲ Diagram Types ......................................................................................................... 24
SUMMARY .......................................................................................................................................... 26

CHAPTER 2
MODELING BUSINESS EVENTS.......................................................................................................... 27
DATA STORIES ................................................................................................................................... 28
Story Types ................................................................................................................................ 28
Discrete Events ....................................................................................................................... 29
Evolving Events....................................................................................................................... 29
Recurring Events..................................................................................................................... 29
Events and Fact Tables .......................................................................................................... 30
The 7Ws..................................................................................................................................... 31
Thinking Dimensionally ........................................................................................................... 31
Using the 7Ws: BEAM✲ Sequence ........................................................................................ 32
BEAM✲ IN ACTION: TELLING STORIES ................................................................................................ 33
1. Discover an Event: Ask “Who Does What?” .......................................................................... 33
Focus on One Event at a Time ............................................................................................... 34
Identifying the Responsible Subject ........................................................................................ 34
2. Document the Event: BEAM✲ Table ..................................................................................... 35
3. Describe the Event: Using the 7Ws ....................................................................................... 37
When?..................................................................................................................................... 37
Collecting Event Stories .......................................................................................................... 38
Event Story Themes................................................................................................................ 39
Typical Stories......................................................................................................................... 40
Different Stories ...................................................................................................................... 40
Repeat Stories ........................................................................................................................ 40
Missing Stories........................................................................................................................ 41
Group Stories .......................................................................................................................... 42
Additional When Details? ........................................................................................................ 43
Determining the Story Type .................................................................................................... 44
Recurring Event ...................................................................................................................... 44
Evolving Event ........................................................................................................................ 44
Discrete Event......................................................................................................................... 45
Who?....................................................................................................................................... 45
What?...................................................................................................................................... 46
Where?.................................................................................................................................... 46
How Many? ............................................................................................................................. 50
Unit of Measure....................................................................................................................... 51
Durations................................................................................................................................. 51
Derived Quantities................................................................................................................... 52
Why? ....................................................................................................................................... 52
How? ....................................................................................................................................... 53
Event Granularity .................................................................................................................... 54
Sufficient Detail ....................................................................................................................... 55
Naming the Event.................................................................................................................... 55
Completing the Event Documentation..................................................................................... 56
THE NEXT EVENT? ............................................................................................................................. 57
SUMMARY .......................................................................................................................................... 58

CHAPTER 3
MODELING BUSINESS DIMENSIONS ................................................................................................. 59
DIMENSIONS ....................................................................................................................................... 60
Dimension Stories ...................................................................................................................... 60
Discovering Dimensions............................................................................................................. 61
DOCUMENTING DIMENSIONS ................................................................................................................ 62
Dimension Subject ..................................................................................................................... 62
Dimension Granularity and Business Keys ................................................................................ 63
DIMENSIONAL ATTRIBUTES .................................................................................................................. 64
Attribute Scope........................................................................................................................... 64
Attribute Examples ..................................................................................................................... 67
Descriptive Attributes .............................................................................................................. 68
Mandatory Attributes ............................................................................................................... 69
Missing Values ........................................................................................................................ 69
Exclusive Attributes and Defining Characteristics...................................................................... 70
Using the 7Ws to Discover Attributes......................................................................................... 71
DIMENSIONAL HIERARCHIES ................................................................................................................ 73
Why Are Hierarchies Important? ................................................................................................ 73
Hierarchy Types ......................................................................................................................... 75
Balanced Hierarchies .............................................................................................................. 75
Ragged Hierarchies ................................................................................................................ 76
Variable Depth Hierarchies ..................................................................................................... 76
Multi-Parent Hierarchies.......................................................................................................... 77
Hierarchy Charts ........................................................................................................................ 77
Modeling Hierarchy Types ...................................................................................................... 78
Modeling Queries .................................................................................................................... 79
Discovering Hierarchical Attributes and Levels.......................................................................... 80
Hierarchy Attributes at the Same Level................................................................................... 82
Hierarchy Attributes that Don’t Belong .................................................................................... 82
Hierarchy Attributes at the Wrong Level ................................................................................. 82
Completing a Hierarchy.............................................................................................................. 83
DIMENSIONAL HISTORY ....................................................................................................................... 84
Current Value Attributes............................................................................................................. 84
Corrections and Fixed Value Attributes...................................................................................... 85
Historic Value Attributes............................................................................................................. 86
Telling Change Stories............................................................................................................... 86
Documenting CV Change Stories ........................................................................................... 88
Documenting HV Change Stories ........................................................................................... 88
Business Keys and Change Stories........................................................................................ 89
Detecting Corrections: Group Change Rules.......................................................................... 89
Effective Dating ....................................................................................................................... 91
Documenting the Dimension Type ............................................................................................. 91
Minor Events .............................................................................................................................. 92
HV Attributes: Dimension-Only Minor Events ......................................................................... 92
Minor Events within Major Events ........................................................................................... 93
SUFFICIENT ATTRIBUTES?................................................................................................................... 93
SUMMARY .......................................................................................................................................... 94

CHAPTER 4
MODELING BUSINESS PROCESSES.................................................................................................. 95
MODELING MULTIPLE EVENTS WITH AGILITY ........................................................................................ 96
Conformed Dimensions.............................................................................................................. 97
The Data Warehouse Bus........................................................................................................ 100
The Event Matrix ...................................................................................................................... 102
Event Sequences ..................................................................................................................... 104
Time/Value Sequence........................................................................................................... 104
Process Sequence ................................................................................................................ 104
Modeling Process Sequences as Evolving Events ............................................................... 105
Using Process Sequences to Enrich Events......................................................................... 105
MODELSTORMING WITH AN EVENT MATRIX ......................................................................................... 105
Adding the First Event to the Matrix ......................................................................................... 106
Modeling the Next Event .......................................................................................................... 107
Role-Playing Dimensions ......................................................................................................... 108
Discovering Process Sequences ............................................................................................. 112
Using the Matrix to Find Missing Events .................................................................................. 115
Using the Matrix to Find Missing Event Details........................................................................ 116
PLAYING THE EVENT RATING GAME ................................................................................................... 116
MODELING THE NEXT DETAILED EVENT .............................................................................................. 120
Reusing Conformed Dimensions and Examples...................................................................... 120
Using Abbreviations in Event Stories .................................................................................... 121
Adding New Examples to Conformed Dimensions ............................................................... 122
Modeling New Details and Dimensions.................................................................................... 122
Completing the Event............................................................................................................... 125
SUFFICIENT EVENTS? ....................................................................................................................... 126
SUMMARY ........................................................................................................................................ 128

CHAPTER 5
MODELING STAR SCHEMAS............................................................................................................. 129
AGILE DATA PROFILING ..................................................................................................................... 130
Identifying Candidate Data Sources......................................................................... 131
Data Profiling Techniques ........................................................................................ 132
Missing Values ...................................................................................................... 132
Unique Values and Frequency.............................................................................. 133
Data Ranges and Lengths .................................................................................... 133
Automating Your Own Data Profiling Checks ....................................................... 134
No Source Yet: Proactive DW/BI Design ................................................................. 134
Annotating the Model with Data Profiling Results .................................................... 135
Data Sources and Data Types .............................................................................. 135
Additional Data...................................................................................................... 137
Unavailable Data................................................................................................... 137
Nulls and Mismatched Attribute Descriptions........................................................ 137
MODEL REVIEW AND SPRINT PLANNING ............................................................................................. 138
Team Estimating ...................................................................................................... 138
Running a Model Review ......................................................................................... 139
Sprint Planning......................................................................................................... 140
STAR SCHEMA DESIGN ..................................................................................................................... 141
Adding Keys to a Dimensional Model ...................................................................... 141
Choosing Primary Keys: Business Keys vs. Surrogate Keys................................ 141
Benefits of Data Warehouse Surrogate Keys ....................................................... 142
Insulate the Data Warehouse from Business Key Change ................................... 143
Cope with Multiple Business Keys for a Dimension .............................................. 143
Track Dimensional History Efficiently.................................................................... 143
Handle Missing Dimensional Values..................................................................... 143
Support Multi-Level Dimensions ........................................................................... 144
Protect Sensitive Information ................................................................................ 144
Reduce Fact Table Size........................................................................................ 144
Improve Query Performance................................................................................. 145
Enforce Referential Integrity Efficiently ................................................................. 145
Slowly Changing Dimensions................................................................................... 146
Overwriting History: Type 1 Slowly Changing Dimensions ................................... 147
Tracking History: Type 2 Slowly Changing Dimensions........................................ 147
Current Values or Historical Values? Why Not Both? ........................................... 148
Updating the Dimensions ......................................................................................... 149
Adding Surrogate Keys ......................................................................................... 149
ETL and Audit Attributes ....................................................................................... 150
Time Dimensions ..................................................................................................... 151
Modeling Fact Tables............................................................................................... 152
Replace Event Details with Dimension Foreign Keys ........................................... 152
Modeling Degenerate Dimensions ........................................................................ 153
Modeling Facts...................................................................................................... 153
Drawing Enhanced Star Schema Diagrams............................................................. 154
Create a Separate Diagram for Each Fact Table.................................................. 154
Use a Consistent Star Schema Layout ................................................................. 155
Display BEAM✲ Short Codes on Star Schemas .................................................. 155
Avoid the Snowflake Schema Anti-pattern............................................................ 156
Do Create Rollup Dimensions............................................................................... 157
CREATING PHYSICAL SCHEMAS ......................................................................................................... 157
Choose BI-Friendly Naming Conventions ................................................................ 158
Use Data Domains ................................................................................................... 158
PROTOTYPING THE DW/BI DESIGN .................................................................................................... 159
THE DATA WAREHOUSE MATRIX ........................................................................................................ 160
SUMMARY ........................................................................................................................ 162

PART II: DIMENSIONAL DESIGN PATTERNS .................................................................................. 163


CHAPTER 6
WHO AND WHAT: DESIGN PATTERNS FOR PEOPLE AND ORGANIZATIONS, PRODUCTS AND
SERVICES ........................................................................................................................................... 165
CUSTOMER DIMENSIONS ................................................................................................................... 166
Mini-Dimension Pattern............................................................................................ 166
Sensible Snowflaking Pattern .................................................................................. 170
Swappable Dimension Patterns ............................................................................... 172
Customer Relationships: Embedded Whos ............................................................. 173
Recursive Relationship ......................................................................................... 174
Variable-Depth Hierarchies ................................................................................... 175
Hierarchy Map Pattern ............................................................................................. 176
Hierarchy Maps and Type 2 Slowly Changing Dimensions .................................. 179
Using a Hierarchy Map.......................................................................................... 179
Displaying a Hierarchy .......................................................................................... 180
Hierarchy Sequence.............................................................................................. 181
Drilling Down on Hierarchy Maps.......................................................................... 182
Querying Multiple Parents..................................................................................... 182
Building Hierarchy Maps ....................................................................................... 183
Tracking History for Variable-Depth Hierarchies................................................... 183
Historical Value Recursive Keys ........................................................................... 184
The Recursive Key Ripple Effect .......................................................................... 184
Ripple Effect Benefits............................................................................................ 185
Ripple Effect Problems.......................................................................................... 185
EMPLOYEE DIMENSIONS.................................................................................................................... 186
Hybrid SCD View Pattern......................................................................................... 186
Previous Value Attribute Pattern .............................................................................. 188
Human Resources Hierarchies ................................................................................ 189
Multi-Valued Hierarchy Map Pattern ........................................................................ 190
Additional Multi-Valued Hierarchy Map Attributes................................................. 191
Handling Multiple Weighting Factors..................................................................... 192
Updating a Hierarchy Map ....................................................................................... 192
Historical Multi-Valued Hierarchy Maps ................................................................... 193
PRODUCT AND SERVICE DIMENSIONS ................................................................................................ 195
Describing Heterogeneous Products ....................................................................... 196
Balancing Ragged Product Hierarchies ................................................................... 196
Multi-Level Dimension Pattern ................................................................................. 198
Parts Explosion Hierarchy Map Pattern ................................................................... 201
SUMMARY ........................................................................................................................ 202

CHAPTER 7
WHEN AND WHERE: DESIGN PATTERNS FOR TIME AND LOCATION ........................................ 203
TIME DIMENSIONS............................................................................................................................. 204
Calendar Dimensions............................................................................................... 205
Date Keys.............................................................................................................. 206
ISO Date Keys ...................................................................................................... 207
Epoch-Based Date Keys ....................................................................................... 207
Populating the Calendar........................................................................................ 208
BI Tools and Calendar Dimensions....................................................................... 208
Period Calendars ..................................................................................................... 209
Month Dimensions ................................................................................................ 209
Offset Calendars ................................................................................................... 210
Year-to-Date Comparisons ...................................................................................... 210
Fact-Specific Calendar Pattern ............................................................................. 212
Using Fact State Information in Report Footers.................................................... 213
Conformed Date Ranges ...................................................................................... 214
CLOCK DIMENSIONS ......................................................................................................................... 214
Day Clock Pattern - Date and Time Relationships................................................... 215
Time Keys ................................................................................................................ 216
INTERNATIONAL TIME ........................................................................................................................ 217
Multinational Calendar Pattern................................................................................. 218
Date Version Keys ................................................................................................... 220
INTERNATIONAL TRAVEL .................................................................................................................... 221
Time Dimensions or Time Facts? ............................................................................ 224
NATIONAL LANGUAGE DIMENSIONS .................................................................................................... 225
National Language Calendars.................................................................................. 225
Swappable National Language Dimensions Pattern................................................ 225
SUMMARY ........................................................................................................................ 226

CHAPTER 8
HOW MANY: DESIGN PATTERNS FOR HIGH PERFORMANCE FACT TABLES AND FLEXIBLE
MEASURES ......................................................................................................................................... 227
FACT TABLE TYPES .......................................................................................................................... 228
Transaction Fact Table ............................................................................................ 228
Periodic Snapshot .................................................................................................... 229
Accumulating Snapshots.......................................................................................... 231
FACT TABLE GRANULARITY ............................................................................................................... 233
MODELING EVOLVING EVENTS ........................................................................................................... 233
Evolving Event Measures......................................................................................... 237
Event Counts......................................................................................................... 237
State Counts ......................................................................................................... 237
Durations............................................................................................................... 238
Additional Process Performance Measures .......................................................... 238
Event Timelines........................................................................................................ 238
Using Timelines for Documentation ...................................................................... 240
Using Timelines for Business Intelligence............................................................. 240
Developing Accumulating Snapshots....................................................................... 241
FACT TYPES ..................................................................................................................................... 242
Fully Additive Facts .................................................................................................. 243
Non-Additive Facts................................................................................................... 243
Semi-Additive Facts ................................................................................................. 244
Averaging Issues................................................................................................... 244
Counting Issues .................................................................................................... 245
Heterogeneous Facts Pattern .................................................................................. 246
Factless Fact Pattern ............................................................................................... 248
FACT TABLE OPTIMIZATION ............................................................................................................... 249
Downsizing............................................................................................................... 249
Indexing.................................................................................................................... 250
Partitioning ............................................................................................................... 251
Aggregation.............................................................................................................. 252
Lost Dimension Aggregate Pattern ....................................................................... 252
Shrunken Dimension Aggregate Pattern............................................................... 253
Collapsed Dimension Aggregate Pattern .............................................................. 254
Aggregation Guidelines......................................................................................... 254
Drill-Across Query Pattern ....................................................................................... 255
Derived Fact Table Patterns .................................................................................... 258
SUMMARY ........................................................................................................................ 260

CHAPTER 9
WHY AND HOW: DESIGN PATTERNS FOR CAUSE AND EFFECT ................................................ 261
WHY DIMENSIONS............................................................................................................................. 262
Internal Why Dimensions ......................................................................................... 262
Unstructured Why Dimensions................................................................................. 263
External Why Dimensions ........................................................................................ 264
MULTI-VALUED DIMENSIONS ............................................................................................................. 265
Weighting Factor Pattern ......................................................................................... 265
Modeling Multi-Valued Groups................................................................................. 267
Multi-Valued Bridge Pattern ..................................................................................... 268
Optional Bridge Pattern............................................................................................ 270
Pivoted Dimension Pattern....................................................................................... 273
HOW DIMENSIONS ............................................................................................................................ 276
Too Many Degenerate Dimensions?........................................................................ 277
Creating How Dimensions........................................................................................ 277
Range Band Dimension Pattern............................................................................... 278
Step Dimension Pattern ........................................................................................... 279
Audit Dimension Pattern .......................................................................................... 281
SUMMARY ........................................................................................................................ 283

APPENDIX A: THE AGILE MANIFESTO ............................................................................................ 285


MANIFESTO FOR AGILE SOFTWARE DEVELOPMENT ............................................................................. 285
THE TWELVE PRINCIPLES OF AGILE SOFTWARE .................................................................................. 285
APPENDIX B: BEAM✲ NOTATION AND SHORT CODES ............................................................... 287
TABLE CODES .................................................................................................................................. 287
COLUMN CODES ............................................................................................................................... 289
APPENDIX C: RESOURCES FOR AGILE DIMENSIONAL MODELERS........................................... 293
TOOLS: HARDWARE AND SOFTWARE .................................................................................................. 293
BOOKS ............................................................................................................................................. 294
Agile Software Development................................................................................. 294
Visual Thinking, Collaboration and Facilitation ..................................................... 294
Dimensional Modeling........................................................................................... 294
Dimensional Modeling Case Studies .................................................................... 294
ETL........................................................................................................................ 294
Database Technology–Specific Advice................................................................. 294
WEBSITES ........................................................................................................................................ 295
INTRODUCTION
Dimensional modeling is responsible for today's DW/BI successes, yet we still struggle to deliver enough BI

Dimensional modeling, since it was first popularized by Ralph Kimball in the mid-1990s, has become the accepted (data modeling) technique for designing the high performance data warehouses that underpin the success of today's business intelligence (BI) applications. Yet, with an ever increasing number of BI initiatives stumbling long before they reach the data modeling phase, it has become clear that Data Warehousing/Business Intelligence (DW/BI) needs new techniques that can revolutionize BI requirements analysis in the same way that dimensional modeling has revolutionized BI database design.

Agile techniques can help, but they must address data warehouse design, not just BI application development

Agile, with its mantra of creating business value through the early and frequent delivery of working software and responding to change, has had just such a revolutionary effect on the world of application development. Can it take on the challenges of DW/BI? Agile's emphasis on collaboration and incremental development, coupled with techniques such as Scrum and User Stories, will certainly improve BI application development—once a data warehouse is in place. But to truly have an impact on DW/BI, agile must also address data warehouse design itself. Unfortunately, the agile approaches that have emerged, so far, are vague and non-prescriptive in this one key area. For agile BI to be more than a marketing reboot of business-as-usual business intelligence, it must be agile DW/BI and we, DW/BI professionals, must do what every true agilist would recommend: adapt agile to meet our needs while still upholding its values and principles (see Appendix A: The Agile Manifesto). At the same time, agilists coming afresh to DW/BI, for their part, must learn our hard-won data lessons.

This book is about BEAM✲: an agile approach to dimensional modeling

With that aim in mind, this book introduces BEAM✲ (Business Event Analysis & Modeling): a set of collaborative techniques for modelstorming BI data requirements and translating them into dimensional models on an agile timescale. We call the BEAM✲ approach "modelstorming" because it combines data modeling and brainstorming techniques for rapidly creating inclusive, understandable models that fully engage BI stakeholders.

BEAM✲ is used for modelstorming BI requirements directly with BI stakeholders

BEAM✲ modelers achieve this by asking stakeholders to tell data stories, using the 7W dimensional types—who, what, when, where, how many, why, and how—to describe the business events they need to measure. BEAM✲ models support modelstorming by differing radically from conventional entity-relationship (ER) based models. BEAM✲ uses tabular notation and example data stories to define business events in a format that is instantly recognizable to spreadsheet-literate BI stakeholders, yet easily translated into atomic-detailed star schemas. By doing so, BEAM✲ bridges the business-IT gap, creates consensus on data definitions and generates a sense of business ownership and pride in the resulting database design.


Who Is This Book For?


This book is for the whole agile DW/BI team, to help you not only gather requirements but also communicate design ideas

This book is intended for data modelers, business analysts, data architects, and developers working on data warehouses and business intelligence systems. All members of an agile DW/BI team—not just those directly responsible for gathering BI requirements or designing the data warehouse—will find the BEAM✲ notation a powerful addition to standard entity-relationship diagrams for communicating dimensional design ideas and estimating data tasks with their colleagues. To get the most from this book, readers should have a basic knowledge of database concepts such as tables, columns, rows, keys, and joins.

It is aimed at both new and experienced DW/BI practitioners. It's a quick-study guide to dimensional modeling and a source of new dimensional design patterns

For those new to data warehousing, this book provides a quick-study introduction to dimensional modeling techniques. For those of you who would like more background on the techniques covered, the later chapters and Appendix C provide references to case studies in other texts that will help you gain additional business insight. Experienced data warehousing professionals will find that this book offers a fresh perspective on familiar dimensional modeling patterns, covering many in more detail than previously available, and adding several new ones. For all readers, this book offers a radically new agile way of engaging with business users and kick-starting their next warehouse development project.

Meet The Modelstormers or How To Use This Book


Hello, I'm over here and I'm your fast track through this book

You may have already noticed the marginalia (non-contagious) on your left at the moment. This provides a "fast track" summary for readers in a hurry. This agile path through our text was inspired by David A. Taylor's object technology series of books. The margins of this book also contain a cast of anything but marginal characters. They are the modelstormers you need on your agile DW/BI team. We used them to highlight key features in the text such as tips, warnings, references and example modeling dialogues. They appear in the following order (in Chapters 1-9):

The bright modeler, not surprisingly, has some bright ideas. His tips, techniques and
practical modeling advice, distilled from the current topic, will help you improve
your design.

The experienced dimensional modeler has seen it all before. He’s here to warn you
when an activity or decision can steal your time, sanity or agility. Later in the book
he follows the pattern users (see below) to tell you about the consequences or side
effects of using their recommended design patterns. He would still recommend
you use their patterns though—just with a little care.

The note takers are the members of the team who always read the full instructions
before they use that new gadget or technique. They’re always here to tell you to
“make a note of that” when there is extra information on the current topic.

The agilists will let you know when we're being particularly agile. They wave their
banner whenever a design technique supports a core value of the agile manifesto or
principle of agile software development. These are listed in Appendix A.

The modelstormers appear en masse when we describe collaborative modeling and team planning, particularly when we offer practical advice and tips on using whiteboards and other inclusive tools for modelstorming.

The scribe appears whenever we introduce new BEAM✲ diagrams, notation conventions or short codes for rapidly documenting your designs. All the scribe’s short codes are listed in Appendix B.

The agile modeler engages with stakeholders and facilitates modelstorming. She is
here to ask example BEAM✲ questions, using the 7Ws, to get stakeholders to tell
their data stories.

The stakeholders are the subject matter experts, operational IT staff, BI users and BI consumers, who know the data sources, or know the data they want—anyone who can help define the data warehouse who is not a member of the DW/BI development team. They are here to provide example answers to the agile modeler’s questions, tell data stories and pose their own tricky BI questions.

The bookworm points you to further reading on the current topic. All her reading
recommendations are gathered in Appendix C.

The agile developer appears when we have some practical advice about using software tools or there is something useful you can download.

The head scratcher has interesting/vexing DW/BI problems or requirements that the data warehouse design is going to have to address.

The pattern users have a solution to the head scratcher’s problems. They’re going to
use tried and tested dimensional modeling design patterns, some new in print.

How This Book Is Organized


This book has two parts. The first part covers agile dimensional modeling for BI
data requirements gathering, while the second part covers dimensional design
patterns for efficient and flexible star schema design.

Part I: Modelstorming
Collaborative modeling with BI stakeholders
Part I describes how to modelstorm BI stakeholders’ data requirements, validate these requirements using agile data profiling, review and prioritize them with stakeholders, estimate their ETL tasks as a team, and convert them into star schemas. It illustrates how agile data modeling can be used to replace traditional BI requirements gathering with accelerated database design, followed by BI prototyping to capture the real reporting and analysis requirements. Chapter 1 provides an introduction to dimensional modeling. Chapters 2 to 4 provide a step-by-step guide for using BEAM✲ to model business events and dimensions. Chapter 5 describes how BEAM✲ models are validated and translated into physical dimensional models and development sprint plans.

Chapter 1: How to Model a Data Warehouse


Why we need new agile approaches for gathering BI requirements. Why they should be dimensional. What they should look like
Data warehouses and operational systems: Understanding the motivation for using dimensional modeling as the basis for agile database design.
Dimensional modeling fundamentals: Contrasting dimensional modeling with entity-relationship (ER) modeling, and learning the basic concepts and vocabulary of facts, dimensions, and star schemas that will be used throughout the book.
Agile data modeling for analysis and design: The BI requirement gathering problem. The challenges and opportunities of proactive DW/BI. The benefits of agile data warehousing. Why model with BI stakeholders? The case for modelstorming: using agile dimensional modeling to gather BI data requirements.
Introduction to BEAM✲: Comparison of BEAM✲ and ER diagrams.

Chapter 2: Modeling Business Events


Step-by-step modeling of a business event using BEAM✲
Discovering business events: Using subjects, verbs, and objects to discover business events and tell data stories.
Documenting business events: Using whiteboards and spreadsheets and BEAM✲ tables to collaboratively model events.
Discovering event details: Using the 7Ws: who, what, when, where, how many, why, and how to discover atomic-level event details. Using prepositions to connect details to events, and data story themes to define and document them. Using BEAM✲ short codes to document event story types (discrete, recurring, and evolving) and potential fact table granularity.

Chapter 3: Modeling Business Dimensions


Step-by-step modeling of dimensions and hierarchies
Modeling “detail about detail”: Discovering dimensions and documenting their attributes with stakeholders. Telling dimension stories and overcoming weak narratives.
Discovering dimensional hierarchies: Using hierarchy charts to model hierarchical relationships and discover additional dimensional attributes.
Documenting historical value requirements: Using change stories and BEAM✲ short codes to define and document slowly changing dimension policies for supporting current (as is) and historically correct (as was) analysis views.

Chapter 4: Modeling Business Processes


Step-by-step modeling multiple business events and conformed dimensions
Modeling multiple business events: Modelstorming with an event matrix to storyboard a data warehouse design by identifying and documenting the relationships between events and dimensions. Using event stories to prioritize requirements and plan development sprints.
Modeling for agile data warehouse development: Defining and reusing conformed dimensions. Generalizing dimensions and documenting their roles. Supporting incremental development and creating a data warehouse bus architecture.

Chapter 5: Modeling Star Schemas


Validating stakeholder models and converting them into star schemas
Agile data profiling: Reviewing and adapting stakeholder models to data realities. Using BEAM✲ annotation to document data sources and physical data types, provide feedback to stakeholders on model viability and help estimate ETL tasks as a team.
Converting BEAM✲ tables to star schemas: Defining and using surrogate keys to complete dimension tables, and convert event tables to fact tables. Using BEAM✲ technical codes to document the database design decisions and generate database schemas using the BEAM✲Modelstormer spreadsheet. Prototyping to define BI reporting requirements. Creating enhanced star schemas and physical dimensional matrices for a technical audience.

Part II: Dimensional Design Patterns


Collaborative modeling within the DW/BI team. Using design patterns associated with each of the 7W dimensional types
Part II covers dimensional modeling techniques for designing high-performance star schemas. For this, we take a design pattern approach using a combination of BEAM✲ and star schema ER notation to capture significant DW/BI requirements, explain their associated issues/problems, and document pattern solutions and the consequences of implementing them. We have organized these design patterns around the 7W dimensional types discovered in Part I. By using the 7Ws to examine the complexities of modeling customers and employees (who), products and services (what), time (when), location (where), business measures (how many), cause (why), and effect (how), we document new and established dimensional techniques from a dimensional perspective for the first time.

Chapter 6: Who and What: People and Organizations, Products and Services
Design patterns for customer, employee and product dimensions
Modeling customers, employees, and organizations: Handling large, rapidly changing dimension populations. Tracking changes using mini-dimensions.
Mixed business models: Using exclusive attributes and swappable dimensions to model heterogeneous customers (businesses and consumers) and products (tangible goods and services).
Advanced slowly changing patterns: Modeling micro and macro-level change. Supporting simultaneous current, historical, and previous value reporting requirements using hybrid SCD views.
Representing complex hierarchical relationships: Using hierarchy maps to handle recursive hierarchies, such as customer ownership, employee HR reporting structures, and product composition (component bill of materials and product bundles).
Supporting variation within business events: Using multi-level dimensions to describe events with variable granularity such as sales transactions assigned to individual employees or to teams, web advertisement impressions for single products or whole product categories.

Chapter 7: When and Where: Time and Location


Design patterns for time and location dimensions
Modeling time dimensionally: Using separate calendar and clock dimensions and defining date keys.
Year-to-date (YTD) analysis: Using fact state tables and fact-specific calendars to support correct YTD comparisons.
Time of day bracketing: Designing custom business clocks that vary by day of week or time of year.
Multinational calendars: Modeling multinational dimensions that cope with time and location. Supporting time zones and national language reporting.
Modeling movement: Overloading events with additional time and location dimensions to understand journeys and trajectories.

Chapter 8: How Many: Facts, Measures, and KPIs


Design patterns for modeling efficient fact tables and flexible facts
Designing fact tables for performance and ease of use: Defining the three basic fact table patterns: transactions, periodic snapshots, and accumulating snapshots. Using event timelines to model accumulating snapshots as evolving events.
Providing the basis for flexible measures and KPIs: Defining atomic-level additive facts. Documenting semi-additive and non-additive facts, and understanding their limitations.
Fact table performance optimization: Using indexing, partitioning, and aggregation to improve fact table ETL and query performance.
Cross-process analysis: Combining the results from multiple fact tables using drill-across processing and multi-pass queries. Building derived fact tables and consolidated data marts to simplify query processing.

Chapter 9: Why and How: Cause and Effect


Design patterns for modeling cause and effect
Modeling causal factors: Using promotions, weather, and other causal dimensions to explain why events occur and why facts vary. Using text dimensions to handle unstructured reasons and exception descriptions.
Modeling event descriptions: Using how dimensions to collect any additional descriptions of an event. Consolidating excessive degenerate dimensions as how dimensions, and combining small why and how dimensions.
Multi-valued dimensions: Using bridge tables and weighting factors to handle fact allocation (‘splitting the atom’) when dimensions have multiple values for each atomic-level fact. Using optional bridge tables and multi-level dimensions to efficiently handle barely multi-valued dimensions. Using pivoted dimensions to support complex multi-valued constraints.
Providing additional how dimensions: Using step dimensions for understanding sequential behavior, audit dimensions for tracking data quality/lineage, and range band dimensions for treating facts as dimensions.

Appendix A: The Agile Manifesto


Appendix A lists the four values of, and the twelve principles behind, the manifesto
for agile software development.

Appendix B: BEAM✲ Table Notation and Short Codes


Appendix B summarizes the BEAM✲ notation used throughout this book for
modeling data requirements, recording data profiling results and representing
physical dimensional modeling design decisions.

Appendix C: Resources for Agile Dimensional Modelers


Appendix C lists books, websites, and tools (hardware and software) that will help
you adopt and adapt the ideas contained in the book.

Companion Website
Visit modelstorming.com to download the BEAM✲Modelstormer spreadsheet
and other templates that accompany this book. On the site you will find example
models and code listings together with links to articles, books, and the worldwide
schedule of training courses and workshops on BEAM✲ and agile data warehouse
design. Register your paperback copy online to receive a discounted eBook version.
PART I: MODELSTORMING
AGILE DIMENSIONAL MODELING, FROM WHITEBOARD TO STAR SCHEMA

Dimensional Modeling: it's too important to be left to data modelers alone


— Anon.

Chapter 1: How to Model a Data Warehouse

Chapter 2: Modeling Business Events

Chapter 3: Modeling Business Dimensions

Chapter 4: Modeling Business Processes

Chapter 5: Modeling Star Schemas


1

HOW TO MODEL A DATA WAREHOUSE


Essentially, all models are wrong, but some are useful.
— George E. P. Box

Dimensional modeling supports data warehouse design
In this first chapter we set out the motivation for adopting an agile approach to data warehouse design. We start by summarizing the fundamental differences between data warehouses and online transaction processing (OLTP) databases to show why they need to be designed using very different data modeling techniques. We then contrast entity-relationship and dimensional modeling and explain why dimensional models are optimal for data warehousing/business intelligence (DW/BI). While doing so we also describe how dimensional modeling enables incremental design and delivery: key principles of agile software development.

Collaborative dimensional modeling supports agile data warehouse analysis and design
Readers who are familiar with the benefits of traditional dimensional modeling may wish to skip to Data Warehouse Analysis and Design on Page 11 where we begin the case for agile dimensional modeling. There, we take a step back in the DW/BI development lifecycle and examine the traditional approaches to data requirements analysis, and highlight their shortcomings in dealing with ever more complex data sources and aggressive BI delivery schedules. We then describe how agile data modeling can significantly improve matters by actively involving business stakeholders in the analysis and design process. We finish by introducing BEAM✲ (Business Event Analysis and Modeling): the set of agile techniques for collaborative dimensional modeling described throughout this book.

Chapter 1 Topics At a Glance
Differences between operational systems and data warehouses
Entity-relationship (ER) modeling vs. dimensional modeling
Data-driven analysis and reporting requirements analysis limitations
Proactive data warehouse design challenges
Introduction to BEAM✲: an agile dimensional modeling method


OLTP vs. DW/BI: Two Different Worlds


OLTP and DW/BI have radically different DBMS requirements
Operational systems and data warehouses have fundamentally different purposes. Operational systems support the execution of business processes, while data warehouses support the evaluation of business processes. To execute efficiently, operational systems must be optimized for online transaction processing (OLTP). In contrast, data warehouses must be optimized for query processing and ease of use. Table 1-1 highlights the very different usage patterns and database management system (DBMS) demands of the two types of system.

Table 1-1 Comparison between OLTP databases and data warehouses

Criteria | OLTP Database | Data Warehouse
Purpose | Execute individual business processes (“turning the handles”) | Evaluate multiple business processes (“watching the wheels turn”)
Transaction type | Insert, select, update, delete | Select
Transaction style | Predefined: predictable, stable | Ad-hoc: unpredictable, volatile
Optimized for | Update efficiency and write consistency | Query performance and usability
Update frequency | Real-time: when business events occur | Periodic (daily) via scheduled ETL (extract, transform, load); moving to near real-time
Update concurrency | High | Low
Historical data access | Current and recent periods | Current + several years of history
Selection criteria | Precise, narrow | Fuzzy, broad
Comparisons | Infrequent | Frequent
Query complexity | Low | High
Tables/joins per transaction | Few (1–3) | Many (10+)
Rows per transaction | Tens | Millions
Transactions per day | Millions | Thousands
Data volumes | Gigabytes–Terabytes | Terabytes–Petabytes (many sources, history)
Data | Mainly raw detailed data | Detailed data, summarized data, derived data
Design technique | Entity-Relationship modeling (normalization) | Dimensional modeling
Data model diagram | ER diagram | Star schema

The Case Against Entity-Relationship Modeling


ER modeling is used to design OLTP databases
Entity-Relationship (ER) modeling is the standard approach to data modeling for OLTP database design. It classifies all data as one of three things: an entity, a relationship, or an attribute. Figure 1-1 shows an example entity-level ER diagram (ERD). Entities are shown as boxes and relationships as lines linking the boxes. The cardinality of each relationship—the number of possible matching values on either side of the relationship—is shown using crow’s feet for many, | for one, and O for zero (also known as optionality).

Figure 1-1 Entity-Relationship diagram (ERD)

Entities become tables, attributes become columns
Within a relational database, entities are implemented as tables and their attributes as columns. Relationships are implemented either as columns within existing tables or as additional tables depending on their cardinality. One-to-one (1:1) and many-to-one (M:1) relationships are implemented as columns, whereas many-to-many (M:M) relationships are implemented using additional tables, creating additional M:1 relationships.
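As an illustrative sketch of these cardinality rules (the table and column names below are invented for the example, not taken from the book's Figure 1-1 model), a M:1 relationship becomes a foreign key column while a M:M relationship becomes an extra table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# A M:1 relationship (many orders per customer) is implemented as a
# foreign key column on the "many" side.
cur.execute("""CREATE TABLE customer (
    customer_id   INTEGER PRIMARY KEY,
    customer_name TEXT)""")
cur.execute("""CREATE TABLE order_header (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customer (customer_id))""")

# A M:M relationship (an order contains many products; a product appears
# on many orders) is implemented as an additional table, which creates
# two new M:1 relationships back to the original entities.
cur.execute("""CREATE TABLE product (
    product_id   INTEGER PRIMARY KEY,
    product_name TEXT)""")
cur.execute("""CREATE TABLE order_line (
    order_id   INTEGER REFERENCES order_header (order_id),
    product_id INTEGER REFERENCES product (product_id),
    quantity   INTEGER,
    PRIMARY KEY (order_id, product_id))""")
```

Note how resolving a single M:M relationship adds a whole table; this is why a physical 3NF model grows so much larger than its entity-level diagram.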

ER models are typically in third normal form (3NF)
ER modeling is associated with normalization in general, and third normal form (3NF) in particular. ER modeling and normalization have very specific technical goals: to reduce data redundancy and make explicit the 1:1 and M:1 relationships within the data that can be enforced by relational database management systems.

Advantages of ER Modeling for OLTP


3NF is efficient for transaction processing
Normalized databases with few, if any, data redundancies have one huge advantage for OLTP: they make write transactions (inserts, updates, and deletes) very efficient. By removing data redundancies, transactions are kept as small and simple as possible. For example, the repeat usage of a service by a telecom’s customer is recorded using tiny references to the customer and service: no unnecessary details are rerecorded each time. When a customer or service detail changes (typically) only a single row in a single table needs to be updated. This helps avoid update anomalies that would otherwise leave a database in an inconsistent state.
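A minimal sqlite3 sketch of this advantage (the names and rows are invented for illustration): because the customer's details live in exactly one row, a change is a one-row update no matter how many usage records reference that customer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("""CREATE TABLE service_usage (
    usage_id    INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customer (customer_id),
    service_id  INTEGER,
    used_at     TEXT)""")

cur.execute("INSERT INTO customer VALUES (1, 'Acme Telecom')")
# Each usage row holds only a tiny customer reference; the customer's
# details are never re-recorded.
cur.executemany(
    "INSERT INTO service_usage (customer_id, service_id, used_at) VALUES (?, ?, ?)",
    [(1, 10, "2011-01-01"), (1, 10, "2011-01-02"), (1, 11, "2011-01-03")])

# When a customer detail changes, only one row in one table is touched,
# however many usage records point at it.
cur.execute("UPDATE customer SET name = 'Acme Communications' WHERE customer_id = 1")
print(cur.rowcount)  # 1
```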

Higher forms of normalization are available, but most ER modelers are satisfied
when their models are in 3NF. There is even a mnemonic to remind everyone that
data in 3NF depends on “The key, the whole key, and nothing but the key, so help
me Codd”—in memory of Edgar (Ted) Codd, inventor of the relational model.

Disadvantages of ER Modeling for Data Warehousing


3NF is inefficient for query processing
Even though 3NF makes it easier to get data in, it has a huge disadvantage for BI and data warehousing: it makes it harder to get the data out. Normalization proliferates tables and join paths, making queries (SQL selects) less efficient and harder to code correctly. For example, looking at the Figure 1-1 ERD, could you estimate how many ways PRODUCT CATEGORY can be joined to ORDER TRANSACTION? A physical 3NF version of the model would contain at least 20 more tables to resolve the M:M relationships. Faced with such 3NF databases, even the simplest BI query requires multiple tables to be joined through multiple intermediate tables. These long join paths are difficult to optimize and queries invariably run slowly.

3NF models are difficult to understand
More importantly, queries will only produce the right answers if users navigate the right join paths, i.e., ask the right questions in SQL terms. If the wrong joins are used, they unknowingly get answers to some other (potentially meaningless) questions. 3NF models are complex for both people and machines. Specialist hardware (data warehouse appliances) is improving query/join performance all the time, but the human problems are far more difficult to solve. Smart BI software can hide database schema complexity behind a semantic layer, but that merely moves the burden of understanding a 3NF model from BI users at query time to BI developers at configuration time. That’s a good move but it’s not enough. 3NF models remain too complex for business stakeholders to review and quality assure (QA).

History further complicates 3NF
ER models are further complicated by data warehousing requirements to track history in full to support valid ‘like-for-like’ comparisons over time. Providing a true historical perspective of business events requires that many otherwise simple descriptive attributes become time relationships, i.e., existing M:1 relationships become M:M relationships that translate into even more physical tables and complex join paths. Such temporal database designs can defeat even the smartest BI tools and developers.

Large readable ER diagrams are difficult to draw: all those overlapping lines
Laying out a readable ERD for any non-trivial data model isn’t easy. The mnemonic “dead crows fly east” encourages modelers to keep crows’ feet pointing up or to the left. Theoretically this should keep the high-volume volatile entities (transactions) top left and the low-volume stable entities (lookup tables) bottom right. However, this layout seldom survives as modelers attempt to increase readability by moving closely related or commonly used entities together. The task rapidly descends into an exercise in trying to reduce overlapping lines. Most ERDs are visually overwhelming for BI stakeholders and developers who need simpler, human-scale diagrams to aid their communication and understanding.

The Case For Dimensional Modeling


Dimensional models appeal to spreadsheet-savvy BI users
Dimensional models define business processes and their individual events in terms of measurements (facts) and descriptions (dimensions), which can be used to filter, group, and aggregate the measurements. Data cubes are often used to visualize simple dimensional models, as in Figure 1-2, which shows the multidimensional analysis of a sales process with three dimensions: PRODUCT (what), TIME (when), and LOCATION (where). At the intersection of these dimensional values there are interesting facts such as the quantity sold, sales revenue, and sales costs. This perspective on the data appeals to many BI users because the three-dimensional cube can be thought of as a stack of two-dimensional spreadsheets. For example, one spreadsheet for each location contains rows for products, columns for time periods, and revenue figures in each cell.

Figure 1-2 Multidimensional analysis
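The "stack of spreadsheets" intuition can be sketched in a few lines of Python (products, months, locations, and figures are invented for the example): a three-dimensional cube held as a dict, sliced into one two-dimensional sheet per location, with an additive roll-up across the other dimensions.

```python
# A tiny 3-D sales cube keyed by (product, month, location).
cube = {
    ("Shirt", "Jan", "London"): 120, ("Shirt", "Feb", "London"): 90,
    ("Hat",   "Jan", "London"): 40,  ("Hat",   "Feb", "London"): 55,
    ("Shirt", "Jan", "Paris"):  80,  ("Hat",   "Feb", "Paris"):  30,
}

def sheet_for(location):
    """One 2-D 'spreadsheet' per location: rows = products, columns = months."""
    return {(p, m): v for (p, m, loc), v in cube.items() if loc == location}

london = sheet_for("London")
print(london[("Shirt", "Jan")])  # 120

# Aggregating 'up' a dimension: total revenue per product across all
# months and locations -- the kind of additive roll-up a dimensional
# model is designed to support.
totals = {}
for (product, _, _), value in cube.items():
    totals[product] = totals.get(product, 0) + value
print(totals["Shirt"])  # 120 + 90 + 80 = 290
```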

Star Schemas
Star schemas are used to visualize dimensional models
Real-world dimensional models are used to measure far more complex business processes (with more dimensions) in far greater detail than could be attempted using spreadsheets. While it is difficult to envision models with more than three dimensions as multi-dimensional cubes (they wouldn’t actually be cubes), they can easily be represented using star schema diagrams. Figure 1-3 shows a classic star schema for retail sales containing a fourth (causal) dimension: PROMOTION, in addition to the dimensional attributes and facts from the previous cube example.

Figure 1-3 Sales star schema

Star schema is also the term used to describe the physical implementation of a
dimensional model as relational tables.

Star schema diagrams are non-normalized (N3NF) ER representations of dimensional models. When drawn in a database modeling tool they can be used to generate the SQL for creating fact and dimension tables in relational database management systems. Star schemas are also used to document and define the data cubes of multidimensional databases.

ER diagrams work best for viewing a small number of tables at one time. How
many tables? About as many as in a dimensional model: a star schema.

Fact and Dimension Tables


Star schemas are comprised of fact and dimension tables
A star schema is comprised of a central fact table surrounded by a number of dimension tables. The fact table contains facts: the numeric (quantitative) measures of a business event. The dimension tables contain mainly textual (qualitative) descriptions of the event and provide the context for the measures. The fact table also contains dimensional foreign keys; to an ER modeler it represents a M:M relationship between the dimensions. A subset of the dimensional foreign keys form a composite primary key for the fact table and defines its granularity, or level of detail.

The term dimension in this book refers to a dimension table whereas dimensional
attribute refers to a column in a dimension table.

Dimensional hierarchies support drill-down analysis
Dimensions contain sets of descriptive (dimensional) attributes that are used to filter data and group facts for aggregation. Their role is to provide good report row headers and title/heading/footnote filter descriptions. Dimensional attributes often have a hierarchical relationship that allows BI tools to provide drill-down analysis. For example, drilling down from Quarter to Month, Country to Store, and Category to Product.

Dimensions are small, fact tables are large
Not all dimensional attributes are text. Dimensions can contain numbers and dates too, but these are generally used like the textual attributes to filter and group the facts rather than to calculate aggregate measures. Despite their width, dimensions are tiny relative to fact tables. Most dimensions contain considerably less than a million rows.

The most useful facts are additive measures that can be aggregated using any
combination of the available dimensions. The most useful dimensions provide
rich sets of descriptive attributes that are familiar to BI users.
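These ideas can be sketched with sqlite3 as a toy version of a sales star schema (loosely echoing Figure 1-3, but with the TIME and PROMOTION dimensions omitted and all names and figures invented): dimension tables supply the grouping attributes, and the additive fact sums correctly under any combination of them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension tables: descriptive attributes for filtering and grouping.
cur.execute("""CREATE TABLE product_dim (
    product_key INTEGER PRIMARY KEY, product_name TEXT, category TEXT)""")
cur.execute("""CREATE TABLE store_dim (
    store_key INTEGER PRIMARY KEY, store_name TEXT, country TEXT)""")

# Fact table: dimensional foreign keys plus additive measures.
cur.execute("""CREATE TABLE sales_fact (
    product_key   INTEGER REFERENCES product_dim (product_key),
    store_key     INTEGER REFERENCES store_dim (store_key),
    quantity_sold INTEGER,
    sales_revenue REAL)""")

cur.executemany("INSERT INTO product_dim VALUES (?, ?, ?)",
                [(1, "Shirt", "Clothing"), (2, "Hat", "Clothing"), (3, "Mug", "Homeware")])
cur.executemany("INSERT INTO store_dim VALUES (?, ?, ?)",
                [(1, "Oxford St", "UK"), (2, "Rue de Rivoli", "France")])
cur.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)",
                [(1, 1, 3, 60.0), (2, 1, 1, 15.0), (1, 2, 2, 40.0), (3, 2, 5, 25.0)])

# One short join path from each dimension to the facts: revenue by
# category and country, summing the additive measure.
rows = cur.execute("""
    SELECT p.category, s.country, SUM(f.sales_revenue)
    FROM sales_fact f
    JOIN product_dim p ON p.product_key = f.product_key
    JOIN store_dim   s ON s.store_key   = f.store_key
    GROUP BY p.category, s.country
    ORDER BY p.category, s.country""").fetchall()
print(rows)
# [('Clothing', 'France', 40.0), ('Clothing', 'UK', 75.0), ('Homeware', 'France', 25.0)]
```

Swapping `p.category` for `p.product_name` in the GROUP BY is the drill-down from Category to Product; no new join paths are needed.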

Advantages of Dimensional Modeling for Data Warehousing


Dimensional models maximize query performance and usability
The most obvious advantage of a dimensional model, noticeable in Figure 1-3, is its simplicity. The small number of tables and joins, coupled with the explicit facts in the center of the diagram, makes it easy to think about how sales can be measured and easy to construct the necessary queries. For example, if BI users want to explore product sales by store, only one short join path exists between PRODUCT and STORE: through the SALES FACT table. Limiting the number of tables involved and the length of the join paths in this way maximizes query performance by leveraging DBMS features such as star-join optimization (which processes multiple joins to a fact table in a single pass).

Dimensional models are process-oriented. They represent business processes described using the 7Ws framework
A deeper, less immediately obvious benefit of dimensional models is that they are process-oriented. They are not just the result of some aggressive physical data model optimization (that has denormalized a logical 3NF ER model into a smaller number of tables) to overcome the limitations of databases to cope with join intensive BI queries. Instead, the best dimensional models are the result of asking questions to discover which business processes need to be measured, how they should be described in business terms and how they should be measured. The resulting dimensions and fact tables are not arbitrary collections of denormalized data but the 7Ws that describe the full details of each individual business event worth measuring.

The 7Ws Framework
Who is involved?
What did they do? To what is it done?
When did it happen?
Where did it take place?
HoW many or much was recorded – how can it be measured?
Why did it happen?
HoW did it happen – in what manner?

The 7Ws are interrogatives: question forming words
The 7Ws are an extension of the 5 or 6Ws that are often cited as the checklist in essay writing and investigative journalism for getting the ‘full’ story. Each W is an interrogative: a word or phrase used to make questions. The 7Ws are especially useful for data warehouse data modeling because they focus the design on BI activity: asking questions.

Fact tables represent verbs (they record business process activity). The facts they contain and the dimensions that surround them are nouns, each classifiable as one of the 7Ws. Six of the Ws: who, what, when, where, why, and how, represent dimension types. The 7th W: how many, represents facts. BEAM✲ data stories use the 7Ws to discover these important verb and noun combinations.

Star schemas usually contain 8-20 dimensions
Detailed dimensional models usually contain more than 6 dimensions because any of the 6Ws can appear multiple times. For example, an order fulfillment process could be modeled with 3 who dimensions: CUSTOMER, EMPLOYEE, and CARRIER, and 2 when dimensions: ORDER DATE and DELIVERY DATE. Having said that, most dimensional models do not have many more than 10 or 12 dimensions. Even the most complex business events rarely have 20 dimensions.
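A quick sketch of how such an order fulfillment event classifies under the 7Ws (the who and when entries come from the example above; the what, where, how many, why, and how entries are plausible but invented for illustration):

```python
# Classifying an order fulfillment event's details by the 7Ws.
# The same W can appear more than once; the 'how many' entries are the
# facts, everything else is a dimension.
order_fulfillment = {
    "who":      ["CUSTOMER", "EMPLOYEE", "CARRIER"],
    "what":     ["PRODUCT"],
    "when":     ["ORDER DATE", "DELIVERY DATE"],
    "where":    ["DELIVERY ADDRESS"],      # invented for illustration
    "how many": ["QUANTITY", "REVENUE"],   # invented for illustration
    "why":      ["PROMOTION"],             # invented for illustration
    "how":      ["ORDER STATUS"],          # invented for illustration
}

dimensions = [d for w, details in order_fulfillment.items()
              if w != "how many" for d in details]
facts = order_fulfillment["how many"]
print(len(dimensions), facts)  # 9 ['QUANTITY', 'REVENUE']
```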

Star schemas support agile, incremental BI
The deep benefit of process-oriented dimensional modeling is that it naturally breaks data warehouse scope, design and development into manageable chunks consisting of just the individual business processes that need to be measured next. Modeling each business process as a separate star schema supports incremental design, development and usage. Agile dimensional modelers and BI stakeholders can concentrate on one business process at a time to fully understand how it should be measured. Agile development teams can build and incrementally deliver individual star schemas earlier than monolithic designs. Agile BI users can gain early value by analyzing these business processes initially in isolation and then grow into more valuable, sophisticated cross-process analysis. Why develop ten stars when one or two can be delivered far sooner with less investment ‘at risk’?

Dimensional modeling provides a well-defined unit of delivery—the star schema—which supports the agile principles: “Satisfy the customer through early and continuous delivery of valuable software.” and “Deliver working software frequently, from a couple of weeks to a couple of months, with a preference to the shorter time scale.”

Data Warehouse Analysis and Design


Analysis techniques are required to discover BI data requirements
Both 3NF ER modeling and dimensional modeling are primarily database design techniques (one arguably more suited to data warehouse design than the other). Prior to using either to design data structures for meeting BI information requirements, some form of analysis is required to discover these requirements. The two approaches commonly used to obtain data warehousing requirements are data-driven analysis (also known as supply driven) and reporting-driven analysis (also known as demand driven). While most modern data warehousing initiatives use some combination of the two, Figure 1-4 shows the analysis and design bias of early 3NF enterprise data warehouses compared to that of more recent dimensional data warehouses and data marts.

Figure 1-4 Data warehouse analysis and design biases

Data-Driven Analysis
Pure data-driven analysis avoided early user involvement
Using a data-driven approach, data requirements are obtained by analyzing operational data sources. This form of analysis was adopted by many early IT-led data warehousing initiatives to the exclusion of all others. User involvement was avoided as it was mistakenly felt that data warehouse design was simply a matter of re-modeling multiple data sources using ER techniques to produce a single ‘perfect’ 3NF model. Only after that was built, would it then be time to approach the users for their BI requirements.

Leading to DW designs that did not meet BI user needs

Unfortunately, without user input to prioritize data requirements and set a manageable scope, these early data warehouse designs were time-consuming and expensive to build. Also, being heavily influenced by the OLTP perspective of the source data, they were difficult to query and rarely answered the most pressing business questions. Pure data-driven analysis and design became known as the “build it and they will come” or “field of dreams” approach, and eventually died out, to be replaced by hybrid methods that included user requirements analysis, source data profiling, and dimensional modeling.

Packaged apps are especially challenging data sources to analyze

Data-driven analysis has benefited greatly from the use of modern data profiling tools and methods, but despite their availability it has become increasingly problematic as operational data models have grown in complexity. This is especially true where the operational systems are packaged applications, such as Enterprise Resource Planning (ERP) systems built on highly generic data models.

IT staff are comfortable with data-driven analysis

In spite of its problems, data-driven analysis continues to be a major source of data requirements for many data warehousing projects because it falls well within the technical comfort zone of IT staff who would rather not get too involved with business stakeholders and BI users.

Reporting-Driven Analysis
Reporting requirements are gathered by interviewing potential BI users in small groups

Using a reporting-driven approach, data requirements are obtained by analyzing the BI users’ reporting requirements. These requirements are gathered by interviewing stakeholders one at a time or in small groups. Following rounds of meetings, analysts’ interview notes and detailed report definitions (typically spreadsheet or word processor mock-ups) are cross-referenced to produce a consolidated list of data requirements that are verified against available data sources. The resulting requirements documentation is then presented to the stakeholders for ratification. After they have signed off the requirements, the documentation is eventually used to drive the data modeling process and subsequent BI development.

User involvement helps to create more successful DWs

Reporting-driven analysis focuses the data warehouse design on efficiently prioritizing the stakeholders’ most urgent reporting requirements and can lead to timely, successful deployments when the scope is managed carefully.

Accretive BI reporting requirements are impossible to capture in full, in advance

Unfortunately, reporting-driven analysis is not without its problems. It is time-consuming to interview enough people to gather ‘all’ the reporting requirements needed to attain an enterprise or even a cross-departmental perspective. Getting stakeholders to think beyond ‘the next set of reports’ and describe longer-term requirements in sufficient detail takes considerable interviewing skills. Even experienced business analysts with generous requirement gathering budgets struggle, because detailed analytical requirements are by their very nature accretive: they gradually build up layer upon layer. BI users find it difficult to articulate future information needs beyond the ‘next reports’, because those needs depend upon the answers the ‘next reports’ will provide, and the unexpected new business initiatives those answers will trigger. The ensuing steps of collating requirements, feeding them back to business stakeholders, gaining consensus on data terms, and obtaining sign-off can also be an extremely lengthy process.

Focusing too closely on current reports alone leads to inflexible dimensional models

Over-reliance on reporting requirements has led to many initially successful data warehouse designs that fail to handle change in the longer term. This typically occurs when inexperienced dimensional modelers produce designs that match the current report requests too closely, rather than treating these reports as clues to discovering the underlying business processes that should be modeled in greater detail to provide true BI flexibility. The problem is often exacerbated by initial requirements analysis taking so long that there isn’t the budget or willpower to swiftly iterate and discover the real BI requirements as they evolve. The resulting inflexible designs have led some industry pundits to unfairly brand dimensional modeling as too report-centric: suitable at the data mart level for satisfying the current reporting needs of individual departments, but unsuitable for enterprise data warehouse design. This is sadly misleading, because dimensional modeling has no such limitation when used correctly to iteratively and incrementally model atomic-level detailed business processes rather than reverse engineer summary report requests.

Proactive DW/BI Analysis and Design


Early DWs were reactive to OLTP reporting problems

Historically, data warehousing has lagged behind OLTP development (in technology as well as chronology). Data warehouses were often built long after well-established operational systems were found to be inadequate for reporting purposes, and significant BI backlogs had built up. This reactive approach is illustrated on the example timeline in Figure 1-5.

Figure 1-5
Reactive DW
timeline

The lag between OLTP and DW roll-out is disappearing

Today, DW/BI has caught up and become proactive. The two different worlds of OLTP and DW/BI have become parallel worlds where many new data warehouses need to go live/be developed concurrently with their new operational source systems, as shown on the Figure 1-6 timeline.

Figure 1-6
Proactive DW
timeline

Proactive DW/BI addresses operational demands, avoids interim solutions and preempts BI performance problems

DW/BI has steadily become proactive for a number of business-led reasons:

DW/BI itself has become more operational. The (largely technical) distinction between operational and analytical reporting has blurred. Increasingly, sophisticated operational processes are leveraging the power of (near real-time) BI and stakeholders want a one-stop shop for all reporting needs: the data warehouse.

Organizations (especially those that already have DW/BI success) now realize that, sooner rather than later, each major new operational system will need its own data mart or need to be integrated with an existing data warehouse.

BI stakeholders simply don’t want to support ‘less than perfect’ interim reporting solutions and suffer BI backlogs.

Benefits of Proactive Design for Data Warehousing


Proactive DW design can improve the data available for BI

When data warehouse design preempts detailed operational data modeling it can help BI stakeholders set the data agenda, i.e., stipulate their ideal information requirements whilst the new OLTP system is still in development and enhancements can easily be incorporated. This is especially significant for the definition of mandatory data. Vital BI attributes that might have been viewed as optional or insignificant from a purely operational perspective can be specified as not null and captured from day one—before operational users develop bad habits that might have them (inadvertently) circumvent the same enhancements made later. Agile OLTP development teams should welcome these ‘early arriving changes’.

Proactive DW design can streamline ETL change data capture

ETL processes are often thought of as difficult or impossible to develop without access to stable data sources. However, when a data source hasn’t been defined or is still a moving target, it gives the agile ETL team the chance to define its ‘perfect’ data extraction interface specification based on the proactive data warehouse model, and pass that on to the OLTP development team. This is a great opportunity for ETL designers to ensure that adequate change data capture functionality (e.g., consistently maintained timestamps and update reason codes) is built into all data sources so that ETL processes can easily detect when data has changed and for what reason: whether genuine change has occurred to previously correct values (which must be tracked historically) or mistakes have been corrected (which need no history).
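The distinction between genuine changes (keep history) and corrections (overwrite silently) can be sketched in a few lines. This is an illustrative sketch only: the record layout and the reason codes ‘CHG’/‘COR’ are invented stand-ins for whatever an OLTP team might agree to supply, not notation from this book.

```python
from dataclasses import dataclass, field

@dataclass
class DimensionRow:
    natural_key: str
    value: str
    history: list = field(default_factory=list)  # prior values kept for genuine changes

def apply_change(row: DimensionRow, new_value: str, reason_code: str) -> None:
    """Apply a source change using its (hypothetical) update reason code.

    'CHG' = genuine business change: preserve the old value (slowly-changing style).
    'COR' = correction of a mistake: overwrite with no history.
    """
    if reason_code == "CHG":
        row.history.append(row.value)      # track the previously correct value
    elif reason_code != "COR":
        raise ValueError(f"unknown reason code: {reason_code}")
    row.value = new_value
```

A customer genuinely moving city would arrive tagged ‘CHG’ and be preserved; fixing a misspelled city name would arrive tagged ‘COR’ and simply replace the bad value.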

When source database schemas are not yet available, ETL development can still proceed if ETL and OLTP designers can agree on flat file data extracts. Once the OLTP team has committed to providing the specified extracts on a schedule that meets BI needs, ETL transformation and load routines can be developed to match this source to the proactive data warehouse design target.

Challenges of Proactive Analysis for Data Warehousing


Proactive analysis takes place before data exists

While being proactive has great potential benefits for DW/BI, the late appearance of data on the Figure 1-6 timeline unfortunately heralds further analysis challenges for data warehouse designers: BI requirements gathering must take place before any real data is available. Under these circumstances proactive data modelers can rely even less upon traditional analysis techniques to provide BI data requirements to match their aggressive schedule.

Proactive Reporting-Driven Analysis Challenges


Reporting-driven analysis is difficult before data exists

Traditional interviewing techniques for gathering reporting requirements are problematic when stakeholders haven’t seen the data or applications that will fuel their BI imagination. With no existing reports to work from, business analysts can’t ask their preferred icebreaker question: “How can your favorite reports be improved?” and they have nothing to point at and ask: “How do you use this data to make decisions?”. Even more open questions such as “What decisions do you make and what information will help you to make them quicker/better?” can fall flat when a new operational system will shortly enable an entirely new business process that stakeholders have no prior experience of measuring or managing.

Proactive Data-Driven Analysis Challenges


Data-driven analysis is impossible with no data to profile

IT cannot fall back on data-driven analysis: data profiling tools and database remodeling skills are of little use when new source databases don’t exist, are still under development, or contain little or no representative data (only test data). Even when new operational systems are implemented using packaged applications with stable, (well) documented database schemas, they are often too complicated for untargeted data profiling: it would take too long and be of little value if only a small percentage of the database is currently used/populated and well understood by the available IT resources.

Data then Requirements: a ‘Chicken or the egg’ Conundrum


Proactive DW design requires a new approach to data analysis, modeling and design

Before there is data, and users have lived with it for a time (with less than perfect BI access), both IT and business stakeholders cannot define genuine BI requirements in sufficient detail. Without these early detailed requirements, proactive data warehouse designs routinely fail to provide the right information on time to avoid a BI backlog building up as soon as data is available. To solve this ‘data then requirements’/‘chicken or the egg’ conundrum, proactive data warehousing needs a new approach to database analysis and design: not your father’s data modeling, not even your father’s dimensional modeling!

Agile Data Warehouse Design


Traditional data warehousing follows a near-serial or waterfall approach to design and development

Traditional data warehousing projects follow some variant of waterfall development, as summarized on the Figure 1-7 timeline. The shape of this timeline and the term ‘waterfall’ might suggest that it’s ‘all downhill’ after enough detailed requirements have been gathered to complete the ‘Big Design Up Front’ (BDUF). Unfortunately for DW/BI, this approach relies on a preternatural ability to exhaustively capture requirements upfront. It also postpones all data access, and the hoped-for BI value it brings, until the (bitter) end of the waterfall (or rainbow!). For these reasons pure waterfall (analyze only once, design only once, develop only once, etc.) DW/BI development, whether by design or practice, is rare.

Figure 1-7
Waterfall DW
development
timeline

Dimensional modeling enables incremental development

Dimensional modeling can help reduce the risks of pure waterfall by allowing developers to release early incremental BI functionality one star schema at a time, get feedback and make adjustments. But even dimensional modeling, like most other forms of data modeling, takes a (near) serial approach to analysis and design (with ‘Big Requirements Up Front’ (BRUF) preceding BDUF data modeling) that is subject to the inherent limitations and initial delays described already.

Agile data warehousing is highly iterative and collaborative

Agile data warehousing seeks to further reduce the risks associated with upfront analysis and provide even more timely BI value by taking a highly iterative, incremental and collaborative approach to all aspects of DW design and development, as shown on the Figure 1-8 timeline.

Figure 1-8
Agile DW
development
timeline

Agile focuses on the early and frequent delivery of working software that adds value

By avoiding the BDUF and instead doing ‘Just Enough Design Upfront’ (JEDUF) in the initial iterations and ‘Just-In-Time’ (JIT) detailed design within each iteration, agile development concentrates on the early and frequent delivery of working software that adds value, rather than the production of exhaustive requirements and design documentation that describes what will be done in the future to add value.

For DW design, the minimum valuable working software is a star schema

For agile DW/BI, the working software that adds value is a combination of queryable database schemas, ETL processes and BI reports/dashboards. The minimum set of valuable working software that can be delivered per iteration is a star schema, the ETL processes that populate it and a BI tool or application configured to access it. The minimum amount of design is a star.
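To make that unit of delivery concrete: a single star is one fact table surrounded by its dimension tables. The sketch below builds a minimal, hypothetical order star using Python’s built-in sqlite3; the table and column names are illustrative, not schemas from this book.

```python
import sqlite3

# A minimal star: one fact table of order lines plus two dimensions.
# Dimension rows carry the descriptive attributes used for filtering and
# grouping; fact rows carry foreign keys plus additive numeric measures.
STAR_DDL = """
CREATE TABLE customer_dim (
    customer_key  INTEGER PRIMARY KEY,
    customer_name TEXT,
    customer_type TEXT                 -- e.g. individual, company
);
CREATE TABLE product_dim (
    product_key      INTEGER PRIMARY KEY,
    product_name     TEXT,
    product_category TEXT
);
CREATE TABLE order_fact (
    customer_key INTEGER REFERENCES customer_dim (customer_key),
    product_key  INTEGER REFERENCES product_dim (product_key),
    order_date   TEXT,
    quantity     INTEGER,              -- additive measure
    revenue      REAL                  -- additive measure
);
"""

def build_star(conn: sqlite3.Connection) -> None:
    """Create the star schema in the given database."""
    conn.executescript(STAR_DDL)
```

Populated by an ETL process and pointed at by a BI tool, even a star this small is queryable, deliverable software rather than documentation.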

Agile database development needs agile data modeling

To design any type of significant database schema to match the early and frequent delivery schedule of an agile timeline requires an equally agile alternative to the traditionally serial tasks of data requirements analysis and data modeling.

Agile Data Modeling


Agile data modeling is collaborative and evolutionary

Scott Ambler, author of several books on agile modeling and agile database techniques (www.agiledata.org), defines agile data modeling as follows: “Data modeling is the act of exploring data-oriented structures. Evolutionary data modeling is data modeling performed in an iterative and incremental manner. Agile data modeling is evolutionary data modeling done in a collaborative manner.”

Iterative, incremental and collaborative all have very specific meanings in an agile development context that bring with them significant benefits:

Collaborative modeling combines analysis and design and actively involves stakeholders

Collaborative data modeling obtains data requirements by modeling directly with stakeholders. It effectively combines analysis and design and ‘cuts to the chase’ of producing a data model (working software and documentation) rather than ‘the establishing shot’ of recording data requirements (only documentation).

Evolutionary modeling supports incremental development by capturing requirements when they grow and change

Incremental data modeling gives you more data requirements when they are better understood/needed by stakeholders, and when you are ready to implement them. Incremental modeling and development are scheduling strategies that support early and frequent software delivery.

Iterative data modeling helps you to understand existing data requirements better and improve existing database schemas through refactoring: correcting mistakes and adding missing attributes which have now become available or important. Iterative modeling and development are rework strategies that increase software value.

Agile Dimensional Modeling


DW/BI benefits from agile dimensional modeling

By taking advantage of dimensional modeling’s unit of discovery—a business process worth measuring—agile data modeling has arguably greater benefits for DW/BI than any other type of database project:

Agile dimensional modeling focuses on business processes rather than reports

Agile modeling avoids the ‘analysis paralysis’ caused by trying to discover the ‘right’ reports amongst the large (potentially infinite?) number of volatile, constantly re-prioritized requests in the BI backlog. Instead, agile dimensional modeling gets everyone to focus on the far smaller (finite) number of relatively stable business processes that stakeholders want to measure now or next.

Agile dimensional modeling creates flexible, report-neutral designs

Agile dimensional modeling avoids the need to decode detailed business events from current summary report definitions. Modeling business processes without the blinkers of specific report requests produces more flexible, report-neutral, enterprise-wide data warehouse designs.

Agile modeling enables proactive DW/BI to influence operational system development

Agile data modeling can break the “data then requirements” stalemate that exists for DW/BI just before a new operational system is implemented. Proactive agile dimensional modeling enables BI stakeholders to define new business processes from a measurement perspective and provide timely BI input to operational application development or package configuration.

Evolutionary modeling supports accretive BI requirements

Agile modeling’s evolutionary approach matches the accretive nature of genuine BI requirements. By following hands-on BI prototyping and/or real BI usage, iterative and incremental dimensional modeling allows stakeholders to (re)define their real data requirements.

Collaborative modeling teaches stakeholders to think dimensionally

Many of the stakeholders involved in collaborative modeling will become direct users of the finished dimensional data models. Doing some form of dimensional modeling with these future BI users is an opportunity to teach them to think dimensionally about their data and define common, conformed dimensions and facts from the outset.

Collaborative modeling creates stakeholder pride in the data warehouse

Collaborative modeling fully engages stakeholders in the design process, making them far more enthusiastic about the resultant data warehouse. It becomes their data warehouse; they feel invested in the data model and don’t need to be trained to understand what it means. It contains their consensus on data terms because it is designed directly by them: groups of relevant business experts rather than the distillation of many individual report requests interpreted by the IT department.

Never underestimate the affection stakeholders will have for data models that they
themselves (help) create.

Agile Dimensional Modeling and Traditional DW/BI Analysis


Agile dimensional modeling makes traditional analysis tasks agile

Agile dimensional modeling doesn’t completely replace traditional DW/BI analysis tasks, but by preceding both data-driven and reporting-driven analysis it can make them agile too: significantly reducing the work involved while improving the quality and value of the results.

Agile Data-Driven Analysis


Data-driven analysis becomes targeted data profiling

Agile data-driven analysis is streamlined by targeted data profiling. Only the data sources implicated by the agile data model need to be analyzed within each iteration. This targeted profiling supports the agile practice of test-driven development (TDD) by identifying the data sources that will be used to test the data warehouse design and ETL processes ahead of any detailed physical data modeling. If an ETL test can’t be defined because a source isn’t viable, agile data modelers don’t waste time physically modeling what can’t be tested, unless they are doing proactive data warehouse design. In this case the agile data warehouse model can assist the test-driven development of the new OLTP system.
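Targeted profiling of just the columns a model implicates needs very little machinery. A minimal sketch in standard Python, assuming rows arrive as dictionaries; the sample data and column name are invented for illustration:

```python
from collections import Counter

def profile_column(rows, column):
    """Profile one implicated column: the counts a targeted profiler reports."""
    values = [row.get(column) for row in rows]
    non_null = [v for v in values if v is not None]
    return {
        "rows": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "top_values": Counter(non_null).most_common(3),  # frequency check
    }

# Invented sample rows standing in for a source extract.
sample = [
    {"customer_type": "Company"},
    {"customer_type": "Individual"},
    {"customer_type": None},
    {"customer_type": "Company"},
]
```

Null counts tell you whether a ‘mandatory’ BI attribute is actually populated; distinct and top values hint at whether a column can serve as a usable dimensional attribute.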

Agile Reporting-Driven Analysis


Reporting-driven analysis becomes BI prototyping

Agile reporting-driven analysis takes the form of BI prototyping. The early delivery of dimensional database schemas enables the early extraction, transformation and loading (ETL) of real sample data so that better report requirements can be prototyped using the BI users’ actual BI toolset rather than mocked up with spreadsheets or word processors. It is intrinsically fairer to ask users to define their requirements, and developers to commit to them, once everyone has a sense of what their BI tools are capable of, given the available data.

Requirements for Agile Dimensional Modeling


Agile modeling requires both IT and business stakeholders to change their work
practices and adopt new tools and techniques:

Collaborative modelers require techniques that encourage interaction

Collaborative data modeling requires open-minded people. Data modelers must be prepared to meet regularly with stakeholders (take on a business analyst role) while business analysts and stakeholders must be willing to actively participate in some data modeling too. Everyone involved needs simple frameworks, checklists and guidelines that encourage interaction and prompt them through unfamiliar territory.

Collaborative data modeling must use simple, inclusive notation and tools

Business stakeholders have little appetite for traditional data models, even conceptual models (see Data Model Types, shortly) that are supposedly targeted at them. They find the ER diagrams and notation favored by data modelers (and generated by database modeling tools) too complex or too abstract. To engage stakeholders, agile modelers need to create less abstract, more inclusive data models using simple tools that are easy to use, and easy to share. These inclusive models must easily translate into the more technically detailed, logical and physical, star schemas used by database administrators (DBAs) and ETL/BI developers.

Data modeling sessions (modelstorms) need to be quick: hours rather than days

To encourage collaboration and support iteration, agile data modeling needs to be quick. If stakeholders are going to participate in multiple modeling sessions they don’t want each one to take days or weeks. Agile modelers want speed too. They don’t want to wear out their welcome with stakeholders. The best results are obtained by modeling with groups of stakeholders who have the experience and knowledge to define common business terms (conformed dimensions) and prioritize requirements. It is hard enough to schedule long meetings with these people individually, let alone in groups. Agile data modeling techniques must support modelstorming: impromptu stand-up modeling that is quicker, simpler, easier and more fun than traditional approaches.

Agile modelers must balance JIT and JEDUF modeling to reduce design rework

Stakeholders don’t want to feel that a design is constantly iterating (fixing what they have already paid for) when they want to be incrementing (adding functionality). They want to see obvious progress and visible results. Agile modelers need techniques that support JIT modeling of current data requirements in detail and JEDUF modeling of ‘the big picture’ to help anticipate future iterations and reduce the amount of design rework.

Evolutionary DW development benefits from ETL/BI tools that support automated testing

Developers need to embrace database change. They are used to working with (notionally) stable database designs, by-products of BDUF data modeling. It is support staff who are more familiar with coding around the database changes needed to match users’ real requirements. To respond efficiently to evolutionary data warehouse design, agile ETL and BI developers need tools that support database impact analysis and automated testing.

DW designers must embrace change and allow their models to evolve

Data warehouse designers also need to embrace data model change. They will naturally want to limit the amount of disruptive database refactoring required by evolutionary design, but they must avoid resorting to generic data model patterns which reduce understandability and query performance, and can alienate stakeholders. Agile data warehouse modelers need dimensional design patterns that they can trust to represent tomorrow’s BI requirements tomorrow, while they concentrate on today’s BI requirements now.

Agile dimensional modeling techniques exist for addressing these requirements

If agile dimensional modeling that is interactive, inclusive, quick, supports JIT and JEDUF, and enables DW teams to embrace change seems like a tall order, don’t worry; while there are no silver bullets that will make everyone or everything agile overnight, there are proven tools and techniques that can address the majority of these agile modeling prerequisites.

BEAM✲
BEAM✲ is an agile dimensional modeling method

BEAM✲ is an agile data modeling method for designing dimensional data warehouses and data marts. BEAM stands for Business Event Analysis & Modeling. As the name suggests, it combines analysis techniques for gathering business event related data requirements and data modeling techniques for database design. The trailing ✲ (six point open centre asterisk) represents its dimensional deliverables: star schemas and the dimensional position of each of the 7Ws it uses.

BEAM✲ is used to discover and document business event details

BEAM✲ consists of a set of repeatable, collaborative modeling techniques for rapidly discovering business event details and an inclusive modeling notation for documenting them in a tabular format that is easily understood by business stakeholders and readily translated into logical/physical dimensional models by IT developers.

Data Stories and the 7Ws Framework


BEAM✲ modelers and BI stakeholders use the 7Ws to tell data stories

BEAM✲ gets BI stakeholders to think beyond their current reporting requirements by asking them to describe data stories: narratives that tease out the dimensional details of the business activity they need to measure. To do this BEAM✲ modelers ask questions using a simple framework based on the 7Ws. By using the 7Ws (who, what, where, when, how many, why and how) BEAM✲ conditions everyone involved to think dimensionally. The questions that BEAM✲ modelers ask stakeholders are the same types of questions that the stakeholders themselves will ask of the data warehouse when they become BI users. When they do, they will be thinking of who, what, when, where, why and how question combinations that measure their business.
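The 7Ws map naturally onto the parts of a star schema, which is why they condition people to think dimensionally. The sketch below shows one plausible mapping; the role categories and the example story details are an illustrative paraphrase, not the book’s definitive treatment of each W.

```python
# How each W of a data story typically lands in a dimensional model
# (a sketch: e.g. 'who' -> customer dimension, 'how many' -> fact measures).
W_TO_STAR_ROLE = {
    "who":      "dimension",            # e.g. customer
    "what":     "dimension",            # e.g. product
    "when":     "date/time dimension",
    "where":    "dimension",            # e.g. delivery location
    "how many": "fact (measure)",       # e.g. quantity, revenue
    "why":      "dimension",            # e.g. promotion
    "how":      "dimension",            # e.g. order channel
}

def classify(story: dict) -> dict:
    """Group a data story's details by their likely star schema role."""
    roles = {}
    for w, detail in story.items():
        roles.setdefault(W_TO_STAR_ROLE[w], []).append(detail)
    return roles
```

For a story such as “a customer (who) orders a product (what) in some quantity (how many)”, `classify` sorts the details into future dimensions and facts.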

Diagrams and Notation


BEAM✲ tables support data modeling by example

Example data tables (or BEAM✲ tables) are the primary BEAM✲ modeling tool and diagram type. BEAM✲ tables are used to capture data stories in tabular form and describe data requirements using example data. By doing so they support collaborative data modeling by example rather than by abstraction. BEAM✲ tables are typically built up column by column on whiteboards from stakeholders’ responses to the 7Ws and are then documented permanently using spreadsheets. The resulting BEAM✲ models look more like tabular reports (see Figure 1-9) than traditional data models.

BEAM✲ (Example Data) Tables


BEAM✲ tables look like simple tabular reports

BEAM✲ tables help engage stakeholders who would rather define reports that answer their specific business questions than do data modeling. While example data tables are not reports, they are similar enough for stakeholders to see them as visible signs of progress. Stakeholders can easily imagine sorting and filtering the low-level detail columns of a business event using the higher-level dimensional attributes that they subsequently model.

Figure 1-9
Customer Orders
BEAM✲ table
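An example data table is simply rows of concrete stories under W-style column headings. The sketch below shows how such a table might be held and rendered in code; the column names and example rows are invented for illustration, not the actual Figure 1-9 data.

```python
# Columns loosely follow the 7Ws; each row is one data story told by a
# stakeholder, captured as concrete example values rather than abstractions.
COLUMNS = ["who (customer)", "what (product)", "when (order date)",
           "how many (quantity)"]

EXAMPLE_STORIES = [
    ["Jane Smith",   "POMplayer",   "2011-05-02", 1],
    ["Acme Ltd",     "POMbook Pro", "2011-05-02", 40],
    ["City Council", "POMpad",      "2011-05-03", 12],
]

def as_table(columns, rows):
    """Render the example data table the way it might appear on a whiteboard."""
    widths = [max(len(str(x)) for x in [c] + [r[i] for r in rows])
              for i, c in enumerate(columns)]
    lines = [" | ".join(str(c).ljust(w) for c, w in zip(columns, widths))]
    for r in rows:
        lines.append(" | ".join(str(v).ljust(w) for v, w in zip(r, widths)))
    return "\n".join(lines)
```

Because each row is a real-looking story, stakeholders can spot missing details (“we also sell in pounds”, “companies order in bulk”) just by reading it.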

BEAM✲ Short Codes


BEAM✲ uses short codes to capture technical data properties

BEAM✲ tables are simple enough not to get in the way when modeling with stakeholders, but expressive enough to capture real-world data complexities and ultimately document the dimensional modeling design patterns used to address them. To do this BEAM✲ models use short (alphanumeric) codes: (mainly) two-letter abbreviations of data properties that can be recorded in spreadsheet cells, rather than graphical notation that would require specialist modeling tools. By adding short codes, BEAM✲ tables can be used to:

Document dimensional attribute properties including history rules
Document fact properties including aggregation rules
Record data-profiling results and map data sources to requirements
Define physical dimensional models: fact and dimension tables
Generate star schemas

BEAM✲ short codes act as dimensional modeling shorthand

BEAM✲ short codes act as dimensional modelers’ shorthand for documenting generic data properties such as data type and nullability, and specific dimensional properties such as slowly changing dimensions and fact additivity. Short codes can be used to annotate any BEAM✲ diagram type for technical audiences but can easily be hidden or ignored when modeling with stakeholders who are uninterested in the more technical details. Short codes and other BEAM✲ notation conventions will be highlighted in the text in bold. Appendix B provides a reference list of short codes.
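Because short codes live in plain spreadsheet cells, they can be processed mechanically when generating physical designs. A hypothetical sketch that decodes two-letter codes from a BEAM-style column definition; the codes and their meanings here are invented stand-ins, not the actual Appendix B notation.

```python
# Hypothetical two-letter short codes, in the spirit of BEAM✲ notation.
SHORT_CODES = {
    "MD": "mandatory (NOT NULL)",
    "SC": "slowly changing: track history",
    "FA": "fully additive fact",
}

def describe(column_def: str):
    """Split 'column [CODE, CODE]' into the name and its decoded properties."""
    name, _, codes = column_def.partition("[")
    codes = codes.rstrip("]").replace(" ", "")
    decoded = [SHORT_CODES[c] for c in codes.split(",") if c]
    return name.strip(), decoded
```

A cell such as `customer_name [MD, SC]` would decode to a not-null attribute whose history must be tracked; the same annotations could equally drive DDL generation.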

Comparing BEAM✲ and Entity-Relationship Diagrams


We will use Pomegranate Corp. examples to illustrate BEAM✲

Throughout this book we will be illustrating BEAM✲ in action with worked examples featuring the fictional Pomegranate Corporation (POM). We begin now by comparing an ER diagram representation of Pomegranate’s order processing data model (Figure 1-10) with an equivalent BEAM✲ table for the Customer Orders event (Figure 1-9).

Figure 1-10
Order processing
ER Diagram

Example data models capture more business information than ER models

By looking at the ERD you can tell that customers may place orders for multiple products at a time. The BEAM✲ table records the same information, but the example data also reveals the following:

Customers can be individuals, companies, and government bodies.
Products were sold yesterday.
Products have been sold for 10 years.
Products vary considerably in price.
Products can be bundles (made up of 2 products).
Customers can order the same product again on the same day.
Orders are processed in both dollars and pounds.
Orders can be for a single product or bulk quantities.
Discounts are recorded as percentages and money.

Example data speaks volumes!

Additionally, by scanning the BEAM✲ table you may have already guessed the type of products that Pomegranate sells and come to some conclusions as to what sort of company it is. Example data speaks volumes—wait until you hear what it says about some of Pomegranate’s (fictional) staff!

Data Model Types


Conceptual, logical and physical data models provide progressively more technical detail for more technical audiences

Agile dimensional modelers need to work with different types of models depending on the level of technical detail they are trying to capture or communicate and the technical bias of their collaborators and target audience. Conceptual data models (CDM) contain the least technical detail and are intended for exploring data requirements with non-technical stakeholders. Logical data models (LDM) allow modelers to record more technical details without going down to the database-specific level, while physical data models (PDM) are used by DBAs to create database schemas for a specific DBMS. Table 1-2 shows the level of detail for each model type, its target audience on a DW/BI project, and the BEAM✲ diagram types that support that level of modeling.

Table 1-2 Data Model Types

DETAIL         CONCEPTUAL   LOGICAL    PHYSICAL
Entity Name    ✓            ✓
Relationship   ✓            ✓
Attribute      Optional     ✓
Cardinality    Optional     ✓
Primary Key                 ✓          ✓
Foreign Key                 ✓          ✓
Data Type                   Optional   ✓
Table Name                             ✓
Column Name                            ✓

DW/BI AUDIENCE
  Conceptual: Data Modelers, Business Analysts, Business Experts, Stakeholders, BI Users
  Logical:    Data Modelers, ETL Developers, BI Developers
  Physical:   Data Modelers, DBAs, DBMS, ETL Developers, BI Developers

BEAM✲ DIAGRAM
  Conceptual: Example Data Table, Hierarchy Chart, Timeline, Event Matrix
  Logical:    Conceptual Diagrams with Short Codes, Enhanced Star Schema
  Physical:   Enhanced Star Schema, Event Matrix

BEAM✲ and ER notation are jointly used to create collaborative models for different audiences

Based on the detail levels described in Table 1-2, the order processing ERD in Figure 1-10 is a logical data model as it shows primary keys, foreign keys and cardinality, while the BEAM✲ event in Figure 1-9 is a conceptual model (we prefer “business model”) as this information is missing. With additional columns and short codes it could be added to the BEAM✲ table, but each diagram type suits its target audience as is. BEAM✲ tables are more suitable for collaborative modeling with stakeholders than traditional ERD-based conceptual models, while other BEAM✲ diagram types and short codes complement and enhance ERDs for collaborating with developers on logical/physical star schema design.
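To make the conceptual/physical distinction concrete, here is a minimal sketch in Python with SQLite. All table and column names are our own invention for illustration, not Pomegranate’s actual design: the conceptual BEAM✲ view of Customer Orders is just example-data rows, while the physical model is DDL for a fact table referencing one dimension table per detail.

```python
import sqlite3

# Conceptual level: a BEAM✲ table is example data. Columns are event
# details, rows are event stories (values here are invented).
customer_orders_stories = [
    # CUSTOMER,       PRODUCT,   ORDER DATE,   QUANTITY
    ("A. Individual", "POMpad",  "2011-06-01", 1),
    ("BigCo Inc.",    "POMbook", "2011-06-01", 100),
]

# Physical level: the same event as star schema DDL. The fact table
# holds the numeric measures; each who/what/when detail becomes a
# dimension table referenced by a surrogate key.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_dim (customer_key INTEGER PRIMARY KEY, customer_name TEXT);
CREATE TABLE product_dim  (product_key  INTEGER PRIMARY KEY, product_name  TEXT);
CREATE TABLE date_dim     (date_key     INTEGER PRIMARY KEY, full_date     TEXT);
CREATE TABLE customer_orders_fact (
    customer_key   INTEGER REFERENCES customer_dim (customer_key),
    product_key    INTEGER REFERENCES product_dim (product_key),
    order_date_key INTEGER REFERENCES date_dim (date_key),
    quantity       INTEGER
);
""")
```

A modeling tool would generate richer DDL than this; the point is only that the conceptual and physical artifacts describe the same event at different levels of technical detail.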

BEAM✲ Diagram Types


BEAM✲ also uses event matrices, timelines, hierarchy charts and enhanced star schemas

Example data tables are not the only BEAM✲ modeling tools. BEAM✲ modelers also use event matrices, hierarchy charts, timelines and enhanced star schemas to collaborate on various aspects of the design at different levels of business and technical detail. Table 1-3 summarizes the usage of each of the BEAM✲ diagram types, and lists their model types, audience and the chapter where they are described in detail.

BEAM✲ supports the core agile values: “Individuals and interactions over processes and tools,” “Working software over comprehensive documentation,” and “Customer collaboration over contract negotiation.” BEAM✲ upholds these values and the agile principle of “maximizing the amount of work not done” by encouraging DW practitioners to work directly with stakeholders to produce compilable data models rather than requirements documents, and working BI prototypes of reports/dashboards rather than mockups.

Table 1-3 BEAM✲ Diagram Types

BEAM✲ (Example Data) Table — Chapter 2
  Usage: Modeling business events and dimensions one at a time, using example data to document their 7Ws details. Example data tables are also used to describe physical dimension and fact tables and explain dimensional design patterns.
  Data Model: Business, Logical, Physical
  Principal Audience: Data Modelers, Business Analysts, Business Experts, Stakeholders, BI Users

Hierarchy Chart — Chapter 3
  Usage: Discovering hierarchical relationships within dimensions and prompting stakeholders for dimensional attributes. Hierarchy charts are also used to help define BI drill-down settings and aggregation levels for report and OLAP cube definition.
  Data Model: Business
  Principal Audience: Data Modelers, Business Analysts, Business Experts, Stakeholders, BI Users

Timeline — Chapter 8
  Usage: Exploring time relationships between business events. Timelines are used to discover when details, process sequences and duration facts for measuring process efficiency.
  Data Model: Business
  Principal Audience: Data Modelers, Business Analysts, Business Experts, Stakeholders, BI Users

Event Matrix — Chapter 4
  Usage: Documents the relationships between all the events and dimensions within a model. Event matrices record events in value-chain sequences and promote the definition and reuse of conformed dimensions across dimensional models. They are used instead of high-level ERDs to provide readable overviews of entire data warehouses or multi-star schema data marts.
  Data Model: Business, Logical, Physical
  Principal Audience: Data Modelers, Business Analysts, Business Experts, Stakeholders, BI Users, ETL Developers, BI Developers

Enhanced Star Schema — Chapter 5
  Usage: Visualizing individual dimensional models and generating physical database schemas. Enhanced star schemas are standard stars embellished with BEAM✲ short codes to record dimensional properties and design techniques not directly supported by generic data modeling tools.
  Data Model: Logical, Physical
  Principal Audience: Data Modelers, DBAs, DBMS, ETL Developers, BI Developers, Testers

Summary
Data warehouses and operational systems are fundamentally different. They have radically
different database requirements and should be modeled using very different techniques.

Dimensional modeling is the appropriate technique for designing high-performance data


warehouses because it produces simpler data models—star schemas—that are optimized for
business process measurement, query performance and understandability.

Star schemas record and describe the measurable events of business processes as fact tables and
dimensions. These are not arbitrary denormalized data structures. Instead they represent the
combination of the 7Ws (who, what, when, where, how many, why and how) that fully describe
the details of each business event. In doing so, fact tables represent verbs, while the facts
(measures) they contain and the dimensions they reference represent nouns.

Dimensional modeling’s process-orientation supports agile development by creating database


designs that can be delivered in star schema/business process increments.

Even with the right database design techniques there are numerous analysis challenges in
gathering detailed data warehousing requirements in a timely manner.

Both data-driven and reporting-driven analysis are problematic, increasingly so, with DW/BI
development becoming more proactive and taking place in parallel with agile operational
application development.

Iterative, incremental and collaborative data modeling techniques are agile alternatives to
traditional BI data requirements gathering.

BEAM✲ is an agile data modeling method for engaging BI stakeholders in the design of their
own dimensional data warehouses.

BEAM✲ data stories use the 7Ws framework to discover, describe and document business
events dimensionally.

BEAM✲ modelers encourage collaboration by using simple modeling tools such as whiteboards
and spreadsheets to create inclusive data models.

BEAM✲ models use example data tables and alphanumeric short codes rather than ER data
abstractions and graphical notation to improve stakeholder communication. These models are
readily translated into star schemas.

BEAM✲ is an ideal tool for modelstorming a dimensional data warehouse design.


2
MODELING BUSINESS EVENTS
Think like a wise man but communicate in the language of the people.
— William Butler Yeats (1865–1939)

Business events are the measurable atomic details of business processes

Business events are the individual actions performed by people or organizations during the execution of business processes. When customers buy products or use services, brokers trade stocks, and suppliers deliver components, they leave behind a trail of business events within the operational databases of the organizations involved. These business events contain the atomic-level measurable details of the business processes that DW/BI systems are built to evaluate.

BEAM✲ modelers discover BI data requirements by telling data stories

BEAM✲ uses business events as incremental units of data discovery/data modeling. By prompting business stakeholders to tell their event data stories, BEAM✲ modelers rapidly gather the clear and concise BI data requirements they need to produce efficient dimensional designs.

This chapter is a step-by-step guide to using BEAM✲ tables and the 7Ws to describe event details

In this chapter we begin to describe the BEAM✲ collaborative approach to dimensional modeling, and provide a step-by-step guide to discovering a business event and documenting its data stories in a BEAM✲ table: a simple tabular format that is easily translated into a star schema. By following each step you will learn how to use the 7Ws (who, what, when, where, how many, why, and how) to get stakeholders thinking dimensionally about their business processes, and describing the information that will become the dimensions and facts of their data warehouse—one that they themselves helped to design!

Chapter 2 Topics At a Glance

Data stories and story types: discrete, recurring and evolving
Discovering business events: asking “Who does what?”
Documenting events: using BEAM✲ Tables
Describing event details: using the 7Ws and story themes
Modelstorming with whiteboards: practical collaborative data modeling

Data Stories

Data stories are to agile DW design as user stories are to agile software development

Data stories are comparable to user stories: agile software development’s lean requirements gathering technique. Both are written or told by business stakeholders. While user stories concentrate on functional requirements and are written on index cards, data stories concentrate on data requirements and are written on whiteboards and spreadsheets.

Event stories use the narrative of a business event to discover BI data requirements

Business events, because they represent activity (verbs), have strong narratives. BEAM✲ uses these event narratives to discover their details (nouns) by telling data stories. BEAM✲ events are the archetypes for many similar data stories. “Employee drives company car on appointment date” is an event. “James Bond drives an Aston Martin DB5 on the 17th September 1964” is a data story. Event stories, a specific type of data story, follow five story themes to succinctly clarify the meaning of each event detail and help elicit additional details.

Story Types

Event stories are discrete, evolving or recurring depending on how they represent time

BEAM✲ classifies business events into three story types: discrete, evolving, and recurring, based on how their stories play out with respect to time. Figure 2-1 shows example timelines for each type. Retail product purchases are examples of discrete events that happen at a point in time. They are (largely) unconnected with one another and occur unpredictably. Wholesale orders are evolving events that represent the irregular time spans it takes to fulfill orders. They too occur at unpredictable intervals. Interest charges are recurring events that represent the regular time spans over which interest is accrued. They occur in step with one another at predictable intervals.

Figure 2-1
Story type timelines

Discrete Events

Discrete events are “point-in-time” or short duration

Discrete events are “point-in-time” or short (duration) stories. They typically represent the atomic-level transactions recorded by operational systems. Example discrete events include:

A customer buys a product in a retail store
A visitor views a web page
An employee makes a phone call

Discrete event stories are “finished”. They do not change

Discrete events are completed either at the moment they occur or shortly thereafter. By “shortly”, we mean within the ETL refresh cycle of the data warehouse; i.e., they have “finished” or reached some end state by the time they are used for BI. Discrete event stories are generally associated with a single verb (e.g., “buys”, “views”, “calls”) and a single timestamp. There are exceptions to the one verb, one timestamp rule, but for an event story to be discrete none of its details must change over time, otherwise it is evolving.

Evolving Events

Evolving events represent irregular periods of time. Their stories may not have “finished”

Evolving events are longer-running stories (sagas) that can take several days, weeks, or months to complete. They are typically loaded into a data warehouse when their stories begin. Example evolving events include:

A customer orders a product online and waits for it to be delivered
A student applies for a place on a university course and is accepted
An employee processes an insurance claim

Multi-verb evolving events combine the verbs of discrete events to support process performance measurement

Evolving events often represent a series of discrete events (chapters if you like) that BI stakeholders view as milestones of a complex/time-consuming business process. In Figure 2-1 the arrows that follow each evolving order event mark the shipment, delivery, and payment milestones that have been reached. Each of the verbs: “order”, “ship”, “deliver” and “pay” can be modeled as separate discrete events, but from the stakeholders’ perspective the really important measures of the order fulfillment process only become visible when these events are combined to produce a multi-verb evolving event story.

Timelines (have you noticed how much we like them) are a great way to visualize evolving event stories and an invaluable tool for modeling milestones and interesting time intervals (duration measures). Modeling with timelines is covered in Chapter 8.

Recurring Events

Recurring events occur at predictable intervals

Recurring events are periodic measurement stories that occur at predictable intervals, such as daily, weekly, and monthly (serials). In Figure 2-1 the arrowed line preceding each recurring event represents the period of time that the event measures. Example recurring events include:

Nightly inventory for a product at a retail location


Monthly balance and interest charges/payments for a bank account
Minute-by-minute viewing figures by audience for a TV channel

Recurring events summarize discrete events but can also represent the atomic detail for “automatic” measurements

Recurring events are typically used to sample and summarize discrete events, especially when cumulative measures, such as stock levels or account balances, are required that would be expensive to derive from the discrete events. For example, calculating an account balance at any point in time would require all the transactions against the account from all prior periods to be aggregated. Recurring events can also represent atomic-level measurement events that “automatically” occur on a periodic basis; for example, the hourly recording of rainfall at weather stations.
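The aggregation cost described above can be seen in a few lines of Python (the account data and periods are invented for illustration): answering a balance question from the discrete transactions means re-summing all prior history every time, whereas a monthly periodic snapshot stores each cumulative balance once.

```python
# Discrete events: individual account transactions (invented data),
# as (month, amount) pairs.
transactions = [("2011-01", 100.0), ("2011-01", -30.0),
                ("2011-02",  50.0), ("2011-03", -20.0)]

# Deriving the end-of-March balance from discrete events requires
# aggregating every transaction from all prior periods.
balance_mar = sum(amount for month, amount in transactions if month <= "2011-03")

# A recurring event (periodic snapshot) records one cumulative balance
# row per month, so the same question becomes a single-row lookup.
snapshot = {}
running = 0.0
for month, amount in transactions:
    running += amount
    snapshot[month] = running        # month-end balance

assert snapshot["2011-03"] == balance_mar
```

As the transaction history grows, the first calculation gets more expensive with every period, while the snapshot lookup stays constant: exactly the trade-off that motivates periodic snapshot fact tables.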

Events and Fact Tables

Events are business models for physical fact tables

The three BEAM✲ event story types are business models (conceptual models) for the three physical fact table types found in the star schemas of dimensional data warehouses. Table 2-1 shows how story types and fact table types are related.

Table 2-1 Story types and their matching star schemas

BEAM✲ STORY TYPE    STAR SCHEMA TYPE/PHYSICAL DIMENSIONAL MODEL
Discrete            Transaction fact table
Recurring           Periodic snapshot
Evolving            Accumulating snapshot

Discrete events are user models for transaction fact tables

Discrete events are implemented as transaction fact tables. All the detail that there is to know about discrete events is known before they are loaded into a data warehouse. This means that each discrete event story (fact record) is inserted once and never updated, greatly simplifying the ETL process.

Recurring events represent periodic snapshots

Recurring events are implemented as periodic snapshot fact tables. Many of their interesting measures are semi-additive balances that must be carefully reported over multiple time periods.

Evolving events represent accumulating snapshots

Evolving events are implemented as accumulating snapshot fact tables. They are loaded into a data warehouse shortly after the first event in a predictable sequence, and are updated each time a milestone event occurs until the overall event story is completed.

Chapter 5 describes the basic steps involved in translating events into star schemas. Chapter 8 provides more detailed coverage on designing transaction fact tables, periodic snapshots and accumulating snapshots.
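The ETL consequences of Table 2-1 can be illustrated with a hypothetical SQLite sketch (the schema, order IDs, and milestone columns are our own invention): a transaction fact row is inserted once and never revisited, while an accumulating snapshot row is inserted at the first milestone and updated as each later fulfillment milestone occurs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Discrete event: transaction fact table, insert-only.
CREATE TABLE orders_fact (order_id TEXT, order_date TEXT, revenue REAL);

-- Evolving event: accumulating snapshot with one row per order
-- fulfillment story; milestone dates are filled in as they occur.
CREATE TABLE order_fulfillment_fact (
    order_id      TEXT PRIMARY KEY,
    order_date    TEXT,
    ship_date     TEXT,
    delivery_date TEXT,
    payment_date  TEXT
);
""")

# The discrete story is complete when loaded: one insert, no updates.
conn.execute("INSERT INTO orders_fact VALUES ('ORD1', '2011-06-01', 499.0)")

# The evolving story is inserted at its first milestone...
conn.execute("INSERT INTO order_fulfillment_fact (order_id, order_date) "
             "VALUES ('ORD1', '2011-06-01')")
# ...and updated by later ETL runs as further milestones are reached.
conn.execute("UPDATE order_fulfillment_fact SET ship_date = '2011-06-03' "
             "WHERE order_id = 'ORD1'")

row = conn.execute("SELECT order_date, ship_date, delivery_date "
                   "FROM order_fulfillment_fact").fetchone()
# delivery_date is still NULL: that chapter of the story hasn't happened yet.
```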

The 7Ws

Event stories are told using the 7Ws

BEAM✲ uses the 7Ws: who, what, when, where, how many, why, and how to discover and model data requirements as event details. Every event detail that stakeholders need falls into one of the 7 W-types. They are the nouns for people and organizations (who), things such as products and services (what), time (when), locations (where), reasons (why), event methods (how), and numeric measures (how many) that in combination form event stories.

Each of the 7Ws gives you a question to ask for story details

Each of the 7Ws is also an interrogative, a word or phrase that can be used to construct a question, and that is precisely what you do with them. By asking stakeholders a who question, you discover the people and organizations they want to analyze. By asking stakeholders a what question you discover the products and services they want to analyze. By asking these questions in the right combination and sequence you discover the business events they need to analyze.

As you capture event details you can record their dimension type in the type row of a BEAM✲ table. You will use this knowledge to help you model the details as dimensions and facts. Part 2 of this book has chapters dedicated to the 7Ws, covering common BI issues and dimensional modeling design patterns associated with each type.

Thinking Dimensionally

The 7Ws help stakeholders think dimensionally about their data and BI queries

The 7W questions you ask to discover event details mirror the questions that stakeholders will ask themselves when they define queries and reports. For example, a stakeholder will think about where, when, and how many to build a query that asks: “Which sales locations are performing better than last year?” and who, when, what, and why to ask: “Which customers are responding early to product promotions?” When stakeholders start using the 7Ws they are thinking about their data dimensionally, because the 7Ws represent how data is naturally modeled dimensionally. Table 2-2 shows the type of data that each of the 7Ws represent together with examples of matching physical dimensions or facts.

Table 2-2 7Ws data, dimensions and facts

7WS       DATA                               EXAMPLE DIMENSIONS (AND FACTS)
Who       People and organizations           Employee, Customer
What      Things                             Product, Service
When      Time                               Date/Calendar, Time of Day/Clock
Where     Locations                          Store, Hospital, Delivery Address
Why       Reasons and causality              Promotion, Weather
How       Transaction IDs and status codes   Order ID (Degenerate Dimension), Call Status
How Many  Measures and Key Performance       Sales Revenue, Quantity (Facts)
          Indicators (KPIs)
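The W-type to dimension/fact mapping in Table 2-2 is mechanical enough to express in code. In this hypothetical sketch (the detail names are illustrative, not a real Pomegranate model), each event detail is tagged with its W-type; the how many details become the facts of the fact table and every other W-type becomes a dimension.

```python
# Event details for "Customer orders product", tagged by W-type.
event_details = {
    "Customer":         "who",
    "Product":          "what",
    "Order Date":       "when",
    "Delivery Address": "where",
    "Promotion":        "why",
    "Order ID":         "how",
    "Quantity":         "how many",
    "Revenue":          "how many",
}

# how many details become facts; every other W-type becomes a dimension
# (a how detail like Order ID becomes a degenerate dimension).
facts      = [d for d, w in event_details.items() if w == "how many"]
dimensions = [d for d, w in event_details.items() if w != "how many"]
```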

Using the 7Ws: BEAM✲ Sequence

You ask 7W questions in a specific, repeatable order to discover event details

The flowchart in Figure 2-2 shows the order (BEAM✲ sequence) for using the 7Ws, along with the information that they give you at each stage. You start by using who and what to discover an event. From there you discover when the event happens and begin describing event stories using example data. After that, you ask as many who, what, when, and where questions as necessary to discover details for all of the people, organizations, products, services, timestamps, and locations related to the event. Then you ask how many, why, and how questions to discover the quantities, causes and other descriptive details needed to fully explain the event.

Figure 2-2
BEAM✲ sequence: 7Ws flowchart

7W questions and event detail answers naturally flow from one to another

Once you become familiar with using the 7Ws you will find they flow naturally from one to another; for example, quantity (how many) answers lead to why questions. If you discover a discount quantity this would naturally lead to the question: “Why do some orders have discounts?” Similarly, the why answer: “because of promotions” might lead to the how question: “How are promotions implemented?” and the answer: “with discount vouchers and codes.”

Discovering event details out of sequence is okay

There is no need to be a slave to the BEAM✲ Sequence. If stakeholders call out relevant event details at random (hopefully not all at once) or remember details out of sequence, that’s okay, but try to return to the flowchart as soon as possible to make sure all 7Ws are covered.

Put a simple version of the 7Ws flowchart up on the wall, so that who, what, when, where, how many, why, and how can start working on everyone’s dimensional imagination and stakeholders know your next question type.

BEAM✲ in Action: Telling Stories

Modeling an event is an alliterative three-step process: discover, document, describe. Think of these as the 3Ds to go along with the 7Ws. Table 2-3 shows the steps and their matching techniques.

Table 2-3 Event modeling steps

STEP                    BEAM✲ TECHNIQUE
1. Discover an event    Ask “Who does what?”
2. Document the event   BEAM✲ table
3. Describe the event   The 7Ws and event stories

Imagine you are modeling Pomegranate’s order process

The following sections describe each of the event modeling steps using an order processing example for Pomegranate Corp., our fictional multinational computer technology, consumer electronics, software, and consulting firm. In this initial worked example, order creation will be modeled in detail as a discrete event. In Chapter 4, shipments, deliveries and other related events will be modeled at a summary level. In Chapter 8, several of these events are combined as a single evolving event that allows stakeholders to more easily measure the performance of the entire order fulfillment process.

1. Discover an Event: Ask “Who Does What?”

BEAM✲ modelers discover business events by asking a deceptively simple question (using the first 2Ws):

Who does what?

You are asking for a subject, a verb, and an object

The answer to this blunt opening question is an event. An event is an action. An action means that a verb is involved. When a verb is involved, there is someone or something doing the action: the subject, and someone or something having the action done to it: the object. So linguistically, the answer will be a subject-verb-object combination: the simplest story possible.

You do this to discover interesting business activity that needs to be measured

“Who does what?” is really a mnemonic, a way of remembering to ask stakeholders to name the subject, verb, and object that identify an interesting event. It’s a short way of saying: “Think of an activity. Who (or what) does it? What do they do? Who or what do they do it to?” Whatever form of the question you use, what you actually want to discover is an interesting business activity (verb) that is in scope, so it may need some qualification to work well. You might begin with your version of:

Who does what, that we want to report on within the scope of the next iteration/release?

To which the stakeholders might reply:

Customer orders product?

The answer to “Who does what?” is the main clause of an event

You now have what you need to begin modeling: a subject: customer, a verb: orders, and an object: product. This subject-verb-object combination is called the main clause of the event, and you will reuse it to ask most of the follow-up questions for discovering the “whole” story.

Focus on One Event at a Time

Stakeholders are typically interested in multiple events. Many share the same subject or object

Not surprisingly, you will find when asking a question as open as “Who does what?” that stakeholders can describe many subject-verb-object combinations that are interesting to them. In fact, they will typically bombard you with the subjects, verbs, and objects of several business events right off the top of their heads. Even reframing the question with a single chosen subject, for example, “Doctors do what?” or “Drivers do what?” can trigger a cascade of information. Stakeholders typically have several events that they need to measure for any given subject, each with a different verb. For instance, doctors prescribe medicines, but they also diagnose conditions, perform procedures, and schedule appointments. Drivers may deliver packages, but they also depart from depots, accept payments, and collect returns. Each subject-verb-object combination represents a different event which needs to be documented in its own BEAM✲ table.

Keep stakeholders focused on one event (verb) at a time

Getting an eager group of stakeholders to slow down and take things one event (verb) at a time can take some discipline. Try to reassure them (and yourself) that there will be plenty of time to capture all of this data, but you need to focus on one event at a time until it is complete—with all of its details. Don’t worry about the stakeholders; they will not forget their other favorite events while you are documenting the current one.

Identifying the Responsible Subject

Responsible subjects (usually whos) help you discover atomic-level business events

Whenever possible, try to identify a responsible subject for the event’s main clause. A responsible subject is a person or organization that actually performs the activity that the verb describes. This is important as it helps you discover the detailed atomic-level events of a business process rather than less flexible summary events that may only address the current report requirements. For example, if stakeholders are thinking too much about product reports they may respond “Product generates revenue” but you want to get them thinking about the underlying business process(es) by asking: “How does a product generate revenue? Who makes that happen?” To which the stakeholders might respond: “Customer buys product” or “Salesperson sells product”. Both of these are better subject-verb-object combinations (main clauses) for an event that will help you identify the detailed transactions that should be modeled and loaded into the data warehouse.

Summary events can always be added later. You should initially concentrate on the atomic detail

Summary events can always be added afterwards for query efficiency, if necessary, so long as the details are there. You should make sure you initially model the most granular discrete events the stakeholders are interested in. You can then model recurring and evolving events from these (in subsequent iterations) to provide easier access to the performance measures that stakeholders need. Chapter 8 covers the event modeling and dimensional design techniques for doing this.

Subjects and objects are not always who and what. They can be any of the 7Ws

Asking: “Who does what?” does not always ensure that you will get actual who and what details. Objects especially can be any of the 7Ws, including how many. For automated recurring events there may simply not be a responsible who subject that triggers them. For example, “Store stocks product” or “Weather station records temperature” are both perfectly valid events, but neither has who details. “Store” and “weather station” are where-type subjects, and “temperature” is an example of a how many object rather than a what object.

As long as you have a verb you will find any whos or whats involved (if any) by asking more W questions shortly

There is no need to fret over this, or try to coax stakeholders into supplying actual people (who) and things (what) in every case, as this can get in the way of capturing their perspective on the event. The most important thing is that stakeholders supply a main clause containing a verb worth measuring. If their main clause doesn’t contain a who or what you will soon discover any that belong to the event as you use each of the 7Ws to discover more details.

Verbs

For verbs, use the third-person singular present tense

Verbs are one of the most difficult parts of any language. Because of numerous tenses, cases, and persons, the possible ways of expressing a verb can be confusing. For instance, the verb “to buy” can be written as “buy”, “bought”, “buying” and “buys”. To simplify events, use this last version “buys” which is the third-person singular present tense. This simply means it sounds right after “he”, “she”, or “it”. In English, this version of the verb always ends in “s”. For example, “to call” becomes “calls”, “reviewed” becomes “reviews”, “auditing” becomes “audits”, and “will sell” becomes “sells”. This standard form is intuitive and avoids awkward verb variations.

2. Document the Event: BEAM✲ Table

An event is documented as a BEAM✲ table on a whiteboard or in a spreadsheet

Now that the stakeholders have supplied an event, you need to document and display it on a whiteboard or screen where everyone can see it, using a BEAM✲ example data table. Figure 2-3 shows the initial table for “Customer orders product.” If this looks like it could be an “ordinary” spreadsheet table that’s a good thing; business stakeholders are the target audience for this model and they are usually very comfortable with spreadsheets.

Figure 2-3
Initial event table

Several important BEAM✲ notation conventions are indicated in Figure 2-3. The
subject (CUSTOMER) and object (PRODUCT) are capitalized, and have become
column headers. The verb (orders) is in lowercase, and is placed in its own row
above the object. This row will be used to hold other lowercase words shortly. The
capitalized column headers are the event details that will eventually become facts
or dimensions. The lowercase words will connect subsequent details to the event
and clarify their relationships with the main clause. They make event stories
readable, but are not components of the physical database design.

Draw data tables on whiteboards without borders between or below the example
rows. Fewer lines make freehand drawing quicker and neater, while at the same
time visually suggesting that the examples are open ended: that stakeholders can
add to them at any point to illustrate ever more interesting stories that help to
clarify exceptions.
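The notation can even be scripted. This hypothetical helper (our own sketch, not a BEAM✲ tool) renders the starting point of Figure 2-3 as plain text: capitalized subject and object headers, the lowercase verb in its own row above the object, and open-ended blank rows awaiting event stories.

```python
def initial_beam_table(subject, verb, obj, example_rows=3):
    """Render an initial BEAM✲ table: SUBJECT and OBJECT become
    capitalized column headers, with the lowercase verb in a row
    of its own above the object column."""
    width = max(len(subject), len(verb), len(obj)) + 2
    lines = [" " * width + verb.lower()]                       # verb row
    lines.append(subject.upper().ljust(width) + obj.upper())   # headers
    lines.extend("" for _ in range(example_rows))              # story rows
    return "\n".join(lines)

print(initial_beam_table("Customer", "orders", "Product"))
```

On a whiteboard you would simply draw this by hand; the code only makes the layout conventions explicit.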

Don’t name the event until it is complete

The rest of the table is left blank, with several empty rows for example data and space above for an event name. Don’t attempt to name the event just yet because this may prejudice the details that the stakeholders provide.

Leave some working space for detail about detail

The table is now ready to record event details and example data (event stories) to clarify the meaning of each detail. In Figure 2-3 there is also space reserved above the table as a scratchpad for recording detail about detail: important details that you may capture along the way that don’t belong directly to the event (see What? later in this chapter) but will need to be modeled as dimensional attributes after the event is complete.

You can download a copy of the BEAM✲Modelstormer spreadsheet from


modelstorming.com. It contains template BEAM✲ (example data) tables linked to
formulas for generating customizable SQL DDL and simple table/entity graphics.
You can use the DDL to define physical database tables or export your BEAM✲
model to other database modeling tools to produce star schemas.

3. Describe the Event: Using the 7Ws

Every event story needs a when detail

BEAM✲ obeys the maxim “show, don’t tell” to describe and model an event using event stories rather than lengthy descriptive text. But before you can ask for useful example stories you need one more detail. You need to discover when the event occurs. You find out by asking your second simple “W” question: a when question.
When?
You discover when details by asking a when question.

Every event story has at least one defining point in time. No meaningful BI analysis takes place without a time element. Therefore, immediately following the discovery of an event, you should ask for its when detail. You do so by repeating the main clause of the event to the stakeholders as a question, with a “when” appended or prepended:

CUSTOMER orders PRODUCT when?


or
When do CUSTOMERs order PRODUCTs?

To which the stakeholders might respond (if you’re lucky):

On order date.

You are looking for a preposition and a name for the when detail.

This is certainly what you are hoping for: a prepositional phrase containing a preposition: “on” followed by a noun: “order date.” If they respond with actual dates/times, ask what these should be called. You are looking for a noun to name this detail; after you have it you can then use the date/time values for example stories to help you understand the time nature of the event. The general form of a when question is: “Subject Verb Object when?” or “When does a Subject Verb Object? What do you call that date/time?” The required response is in the form: “on/at/every Time Stamp Name”.

When prepositions contain clues to the level of time detail available/needed.

The preposition on used with a when detail implies that the detail is recorded as a date, suggesting that the time of day is not available or is not important. An at preposition implies that time of day is recorded and is important. Whenever stakeholders give you example when values you should check that prepositions and examples match, so that event stories can be read correctly.
38 Chapter 2

Prepositions
Prepositions are the words that link nouns, pronouns and phrases in sentences
and describe their relationships. These relationships include time (when),
possession (who/what), proximity (where), quantity (how many), cause (why)
and manner (how). Examples of typical prepositions include: with, in, on, at, by,
to, from and for. BEAM✲ uses prepositions to:

Link details to the main clause of an event.
Construct event stories using natural language sentences made from main clause-preposition-detail combinations.
Clarify the relationship between an event and its details.
Discover event detail rules such as the time granularity of when details (on, at) and the direction implied by where details (from, to).

Add the when detail and preposition to the table.

After you have confirmed the prepositional phrase you add it to the event table, as shown in Figure 2-4, with the on preposition above the new detail ORDER DATE. Now that you have the subject, object, and initial when details you can begin to fill out the table with event stories.

Figure 2-4: Adding the first when detail

The preposition for a when detail is highly significant. “on order date,” “at call
time,” and “every sales quarter” each contain an important clue to the level of
time detail available (or necessary) for their respective events.

Collecting Event Stories


You ask the stakeholders to provide examples for every event detail you discover,
for the following reasons:

Asking for examples and getting useful answers is a clear indication that you
are being agile, that you are modeling with the right people: stakeholders who
know their own data.

Example data clarifies the meaning of each event detail as you discover it with
the minimum documentation.

Examples avoid abstraction. Stakeholders can start to visualize how their data
might appear on reports.

Examples demonstrate how events behave over time by illustrating typical, exceptional, old, new, minimum, and maximum values amongst other event stories.

Capturing examples quickly leads to an understanding of the story type, and eventually to a definition of the event granularity (the set of detail values that uniquely identify each event story).

Wait until you have at least one when detail before collecting example data.
Having a when detail helps you get more interesting examples that tell a story.

Event Story Themes


Useful event stories follow five themes.

To model events rapidly you want to describe each detail as fully as possible using the minimum number of example stories. You can discover and document most of what you need to know in five or six example rows by asking for stories that illustrate the following five themes:

Typical
Different
Repeat
Missing
Group

Themes help you discover data ranges for each of the 7Ws.

Figure 2-5 shows how the themes vary slightly across the 7Ws. The italic descriptions suggest the range of values that you want to illustrate for each “W” (by using the typical and different themes). Armed with this information you are now ready to start “modeling by example”: asking the stakeholders to tell you event stories for each theme.

Figure 2-5: Story theme template

Typical Stories
You start by asking for a typical event story.

Each event table should start with a typical event story that contains common/normal/representative values for each detail. For who data this could be a frequent CUSTOMER. For what details this might be a popular PRODUCT. Similarly, for how many details you are looking for average values that match the other typical values. To fill out this example story you simply ask the stakeholders for typical values for each detail.

Different Stories
Ask for different stories to explore value ranges.

Following the typical example event you ask the stakeholders for another example with different values for each detail. If you ask for two different examples you can use them to discover the range of values that the data warehouse will have to represent. This is particularly important for when details because they indicate how much history will be required and how urgently the data must be loaded into the data warehouse.

Use relative time descriptions to document ETL urgency and DW history requirements.

For when details, use relative time descriptions such as Today, Yesterday, This Month, and 5 Years Ago to capture the most recent and earliest values so that the event stories remain relevant long after the model is completed. If the latest when is Yesterday, then you know that the data warehouse will demand a daily refresh for this particular business process. In Figure 2-6, the fourth and fifth example events show that the data warehouse will need to support 10 years of history for this event, and that a daily refresh policy is required.

If the latest when event story is Today, the data warehouse will need to be refreshed more urgently than daily—perhaps in near real-time. Because this will significantly complicate the ETL processing and increase development costs, you should confirm that this is a vital requirement with budgetary approval. If it is, you need to find out if Today means “an hour ago” or “10 minutes ago”.
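The translation from relative when descriptions to load-frequency and history requirements is mechanical enough to sketch in code. The Python below is illustrative only; the description strings and the mapping to refresh policies are assumptions, not part of BEAM✲:

```python
# Sketch: derive refresh-frequency and history requirements from
# relative when descriptions captured in example stories.
# The policy mapping and description formats are illustrative.

def refresh_policy(latest_when):
    """Map the most recent when description to a refresh policy."""
    return {"Today": "near real-time", "Yesterday": "daily"}.get(
        latest_when, "weekly or less frequent")

def history_years(earliest_when):
    """Parse descriptions like '5 Years Ago' into a year count."""
    if earliest_when.endswith("Years Ago"):
        return int(earliest_when.split()[0])
    return 1  # default: at least the current year

print(refresh_policy("Yesterday"))    # daily
print(history_years("10 Years Ago"))  # 10
```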

Look for old and new values as well as high and low.

For who and what details, ask for old and new values—representing customers who have become inactive versus brand-new customers, or products that have been discontinued versus those just released.

Repeat Stories
Ask for a repeat story to find out what makes an event story unique.

Once you have collected a few different examples you ask for a repeat story—one that is as similar as possible to the typical story (the first row)—so you can discover what makes each event story unique. You do this by asking whether the typical values can appear again in the same combination; for example, you might ask:

Can this CUSTOMER order this PRODUCT again on the same day?

Repeat the typical story if it is not yet unique.

The third event story in Figure 2-6 shows that this is possible. Each time you add a new detail to the event you return to the repeat story to see if that detail can be used to differentiate the event, by adding it to the previous question; for example, “Can this CUSTOMER order this PRODUCT again on the same day, from the same SALES LOCATION?” If that’s possible repeat the typical story values.

You might have uniqueness with the subject, object, and initial when details
alone, or you might not have it until you discover a how detail with your very last
question.
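Once stories are captured electronically, the repeat-story test can be checked mechanically. This Python sketch (with invented example stories) tests whether a candidate set of details uniquely identifies every story, which is exactly the granularity question being asked above:

```python
# Sketch: test whether a candidate grain (subset of details)
# uniquely identifies every captured event story.
# The example stories and detail names are invented.

def is_unique_grain(stories, candidate_details):
    """stories: list of dicts mapping detail name -> value."""
    seen = set()
    for story in stories:
        key = tuple(story[d] for d in candidate_details)
        if key in seen:
            return False  # a repeat story shares these values
        seen.add(key)
    return True

stories = [
    {"CUSTOMER": "Alice", "PRODUCT": "POMEPHONE", "ORDER DATE": "Today"},
    {"CUSTOMER": "Bob",   "PRODUCT": "POMEPAD",   "ORDER DATE": "Yesterday"},
    {"CUSTOMER": "Alice", "PRODUCT": "POMEPHONE", "ORDER DATE": "Today"},  # repeat
]

print(is_unique_grain(stories, ["CUSTOMER", "PRODUCT", "ORDER DATE"]))  # False: not yet unique
```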

Figure 2-6: Adding event stories

Missing Stories
Missing stories document how missing values will be treated by BI applications. They also help to identify mandatory event details.

You ask for a missing story to discover which event details can have missing values (e.g. unknown, not applicable, or not available) and which are mandatory (always present). You use a missing story to document how stakeholders want to see missing values displayed on their reports. When you fill out a missing story (such as the fifth story in Figure 2-6) you use normal values for mandatory details and the stakeholders’ default missing value labels (e.g. “N/A”, “Unknown”) for the non-mandatory details. For quantities you must find out whether missing data should be treated as NULL (the arithmetically correct representation of missing) or replaced by zero, or some other default value. You document the mandatory details by also adding the short code MD to their column type.

MD: mandatory detail. Event detail is always present under normal circumstances (no data errors).

It’s OK for missing stories to be unrealistically empty.

Missing stories can be unrealistically sparse, containing missing values for any detail that might ever be missing. It’s okay if there are more missing values than would ever be seen in a single real event story.
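How these decisions might be applied downstream can be sketched as follows. The Python below substitutes stakeholder-chosen labels for missing non-mandatory details while keeping missing quantities as NULL; the labels, detail names, and type codes are illustrative assumptions:

```python
# Sketch: apply missing-value decisions from a missing-themed story.
# The labels ("N/A", "Unknown") and the keep-NULL-for-quantities rule
# are example stakeholder choices, not fixed BEAM rules.

MISSING_LABELS = {"who": "N/A", "what": "N/A", "where": "Unknown"}

def fill_missing(story, detail_types, mandatory):
    filled = {}
    for detail, value in story.items():
        dtype = detail_types[detail]
        if value is not None:
            filled[detail] = value
        elif detail in mandatory:
            raise ValueError(f"mandatory detail {detail} is missing")
        elif dtype == "how many":
            filled[detail] = None  # keep NULL: arithmetically correct
        else:
            filled[detail] = MISSING_LABELS.get(dtype, "Unknown")
    return filled

story = {"CUSTOMER": "Alice", "SALESPERSON": None, "DISCOUNT": None}
types = {"CUSTOMER": "who", "SALESPERSON": "who", "DISCOUNT": "how many"}
result = fill_missing(story, types, mandatory={"CUSTOMER"})
print(result)  # SALESPERSON -> 'N/A', DISCOUNT stays None
```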

Occasionally you may have an event subject that is consistently missing. For example, a retail sales event might be described as “CUSTOMER buys PRODUCT in STORE”, but the customer name is never recorded. When the event is implemented as a physical fact table this virtual detail will be dropped, but during event storytelling it focuses everyone on the atomic-level event. Perhaps the event stories should contain “Anon.” or “J. Doe”.

Evolving events will have many missing details. For discrete and recurring events this is a warning that they may be too generic.

Evolving events by their nature will have a number of validly missing details that are unknown when the event begins; for example, ACTUAL DELIVERY DATE for an order or FINAL PAID AMOUNT for an insurance claim. However, if you find discrete and recurring events with a lot of missing details it is often a clue that you are trying too hard to model a “one size fits all” generic event that is difficult for stakeholders to use and it may be better to model a number of more specific event tables where the details that really define distinct business events are always present.

Group Stories
Group stories highlight details that vary in meaning.

You ask for example events containing groups to expose any variations in the meaning of a detail. For example, a typical order event consists of an individual customer ordering a product. But is this always the case? You should ask the stakeholders:

Is a customer always an individual?

Can a product be something more complex, like a bundle of products?

Mixed business models (B2B, B2C, products and services) are often discovered by group stories.

The last two example events in Figure 2-6 are group themed. From these you learn that customers can be organizations as well as individuals and orders can be placed for multi-product bundles. The knowledge that there are different types of customer (B2B: business-to-business and B2C: business-to-consumer), and product/service bundles will make you think carefully about how you implement these details as dimensions. Chapter 6 covers mixed customer type dimensions, multi-level dimensions and hierarchy maps, while Chapter 9 covers multi-valued dimensions. These design patterns can be used to solve some of the more vexing modeling issues surfaced by group themed examples.

You should ask for just enough event stories so that everyone is clear about the
meaning of each event detail. Don’t get carried away trying to record every story—
that’s what the data warehouse is for—you want to concentrate on discovering all
the event details.

Additional When Details?


Ask for more when details to discover the story type.

After you have documented the initial when detail and collected the beginnings of several stories, you continue looking for when details. Discovering all the when details as early as possible is useful, because it helps you determine the story type which in turn helps you to ask more insightful questions as you look for further details. For now you ask:

Are there any other dates and times associated with a customer ordering a product?

to which the stakeholders might reply:

Yes, orders are due for delivery on a delivery due date.

Use when examples to describe long and short durations.

You add this new when detail to the event table, as shown in Figure 2-7. With each additional when you also capture examples before proceeding on to the next when detail. As you do this you may want to adjust some existing example date/times to illustrate interesting time intervals (exceptionally short and long stories) between milestones.

Figure 2-7: Adding a second when detail

If you have more than two when details, draw a simple timeline to help stakeholders describe the chronological sequence and name the most interesting durations between pairs of whens.

Determining the Story Type


Knowing the story type helps you to discover more details.

After you have identified all the when details and documented them with example data, you use this information to determine the story type, which in turn will give you strong clues about subsequent detail types you can expect, especially the how many details.

Recurring Event
Recurring events contain a when detail with an every preposition.

If the event contains a when detail with an every preposition and the example stories confirm that it occurs on a regular periodic basis then the event is recurring. If so, it will often contain balance measures. You should check for these when you ask your how many questions.

Evolving Event
If an event has at least one changeable when detail it is evolving.

If you have two or more when details you may have an evolving event. If any of the when details are initially unknown and/or can change after the event has been created (and loaded into the data warehouse) then it is definitely an evolving event. If so, you should look out for changeable duration measures that make use of the multiple whens.

If an event is evolving you should ask the stakeholders for example stories that
illustrate the initial and final states—the emptiest event story possible, and a fully
completed one—to help explain how the event evolves.

Imagine for a moment that the stakeholders had responded to your additional
when question with:

Orders are delivered on delivery date and paid on payment date.

This would make the event evolving if the actual delivery dates and payment dates
are unknown when orders are loaded into the data warehouse.

Evolving events contain additional verbs that may be modeled as discrete events in their own right.

Notice that the “on” prepositions for these when details are preceded with the verbs “deliver” and “pay”. These verbs are events in their own right that occur some time after the initial order event. However, if stakeholders respond in this way they view them primarily as order event milestones. Therefore, you should continue to model them as when details of an evolving order event but you may also want to model delivery and payment as separate discrete events too: You would ask “Who delivers what?” and “Who pays what?” to discover if there are important details that will be lost if deliveries and payments are only available at the order level.

If stakeholders provide multiple when details, pay attention to the verbs used in
prepositional phrases. The multiple verbs can identify a process sequence of
related milestone events. These events can be modeled as part of the current
evolving event and as discrete events in their own right if you suspect they have
more details. See Chapters 4 and 8 for more details on modeling evolving events.

Discrete Event
Discrete events contain details that do not change.

By a process of elimination, if an event is neither recurring nor evolving, it must be discrete. You reconfirm this each time you discover a new detail by asking if its example values can ever change. If details never change, or changes are handled as new adjustment events, then the event remains discrete. In Figure 2-7 both the ORDER DATE and the DELIVERY DUE DATE (if applicable) are known at the time of an order and do not change, so the order events, as modeled so far, are discrete.

Who?
Ask a who question to see if anyone else is involved.

Once you have identified the story type it’s time to double back (to the top of the 7Ws flowchart in Figure 2-2) and find out whether any other whos are associated with the event. The general form of a who question is “Subject Verb Object from/for/with who?” Using the current subject, verb and object you might ask:

CUSTOMER orders PRODUCT from whom?

To which the stakeholder might reply:


Salesperson

If so, you add the new who to the table and ask for example salespeople to match
the existing event stories. E.g., to continue a group themed story you might ask:

Is there always just one SALESPERSON responsible for the order?

In Figure 2-8, the event stories introduce you to some of Pomegranate’s finest sales personnel, but also show that orders can be made without a salesperson, and that some orders are handled by sales teams rather than individual employees (continuing the group story theme).

Figure 2-8: Adding a second who detail

Don’t use real employee names in event stories. You may have to model stories where employees underperform—you don’t want to point the finger at anyone in the room or elsewhere. Try using famous fictional characters instead. This sidesteps any legal problems, and can be mildly entertaining, but don’t overdo it: you don’t want to distract everyone from the real event stories and details.

What?
Ask a what question, especially if you don’t already have a what detail.

Next you ask for any additional whats associated with the event. The general form of the question is: “Subject Verb Object with/for what?” What questions are particularly useful when the main clause doesn’t already contain a what detail; for example:

CUSTOMER pays MAINTENANCE FEE for what?

might give you the what detail SOFTWARE PRODUCT, which would be added to the table with a “for” preposition. You can keep repeating variations on the what question to see if there are any more what details, but be careful not to collect “detail about detail” (see sidebar: Detail about Detail).

Where?
Ask for a where next.

The next detail type to look for is a where. You ask for this by using the event’s main clause with a where appended:

CUSTOMER orders PRODUCT, where?

You are trying to find out whether the event occurs at a specific geographic location (or website address). If the stakeholders respond:

Online, or at a retail outlet.

Online and retail outlets could be generalized as sales locations; generalizations should be clearly documented by examples.

Given a response like this, you would extend the table to record the website URL or retail store location as a where detail of the event. You might generalize the stakeholders’ response to: CUSTOMER orders PRODUCT at SALES LOCATION. Naming the detail SALES LOCATION enables you to record websites and retail stores in the same column. If you define a generalization detail like this you should make sure that its meaning is clearly documented by examples. In Figure 2-9 the examples for the new where detail SALES LOCATION show three different types of location: store, website and call center.

Detail About Detail


Check that each new detail belongs to the whole event and is not just detail about detail that only further describes a detail you already have.

A new event detail can sometimes turn out to be an additional characteristic of an existing detail, rather than a detail of the whole event itself. It is detail about detail instead of detail about the event. You can be given unnecessary details if you ask too many what questions as they can sound so open-ended. For example, the what question: “CUSTOMERS order PRODUCTS with what?” might give you an answer “with product type”, but is PRODUCT TYPE a detail of the event that would be lost if it was not recorded in the event or does it belong elsewhere?
Spotting detail about detail is usually intuitive, but if you have any doubts you can
test a detail to see if it is position sensitive. You do this by mentally swapping the
new detail into the middle of the main clause, and reading it both ways: before and
after swapping. If the event still makes sense, then the detail isn’t position sensitive,
and belongs to the event. However, if the event sounds like nonsense (even when
you change the detail preposition), then the detail is really about another detail and
will only make sense if placed directly to the right of the detail that it actually describes. For example:

“CUSTOMER orders PRODUCT with PRODUCT TYPE”

This sounds okay, but try placing the new detail after the subject:

“CUSTOMER with PRODUCT TYPE orders PRODUCT”

Oops, clearly this no longer makes sense. Customers don’t have product types,
products do. Product type only makes sense if it appears directly after (to the right
of) PRODUCT. It is position sensitive. This tells you that product type describes
PRODUCT, not the event itself, and is therefore detail about detail.

Detail about detail isn’t discarded. It is used to define dimensions.

Stripping out any details that are not directly related to the event is important, so that the event can be used to define an efficient fact table. However, you do not want to completely discard the important finding that PRODUCT TYPE is a detail about products. It’s obviously something that stakeholders want to report on. Instead of adding it to the table you can place it above the PRODUCT column in the space set aside for capturing detail about detail. You will use it shortly to define the PRODUCT dimension.

You can apply the same test to the SALESPERSON detail from the earlier who
question: swap it around the event main clause and listen to yourself saying:

“from SALESPERSON CUSTOMER orders PRODUCT” or
“CUSTOMER, from SALESPERSON, orders PRODUCT”

You sound strange (like Yoda in Star Wars) but it still makes sense. You can see
that the additional who detail can be placed anywhere in the main clause and its
meaning is not lost. Therefore SALESPERSON is not position sensitive, and this
tells you that it is a detail about the event.

If you find yourself generalizing several details you should ask questions about how
similar the event stories really are. If stories have very different details you will
probably want to model them in separate event tables, because highly generalized
events rapidly become meaningless to stakeholders.

Figure 2-9: Adding where details

Check that each where is a detail of “who does what”, not just who or what.

When you ask for additional where details emphasize that you are looking for locations that are specific to the whole event, not the existing who or what details. This helps avoid (for the moment) detail about detail—like customer address and product manufacturing address—that are not dependent on the event. These reference addresses will be modeled as dimensional attributes of CUSTOMER and PRODUCT once the event is complete (see Chapter 3).

Each time you finish collecting a W-type, it’s good practice to quickly scan
through the previous Ws to check for missing details. After you finish asking
where questions check to see if any of the where details remind the stakeholders
of additional whos, whats and whens.

Modelstorming with Whiteboards


Whiteboards are the agile practitioner’s favorite collaborative modeling tool. They
are ideal tools for modelstorming snippets of your design at a time but even the
most generous whiteboards can be challenged by the width of a full BEAM✲
table. Here’s some practical advice for using them and other tools for event
modeling:

Use “whiteboard on a roll” plastic sheets to extend/replace your finite whiteboard. Large post-it™ notes or flipchart paper and masking tape work too but are not so neat or forgiving for iterative design. Sheets with a 2.5cm grid work very well for event tables.

Put the primary details: the event main clause and initial when detail, on your
main whiteboard or first sheet. If you can’t fit those first three columns on your
whiteboard, it’s too small or your writing is too big. The primary details can stay
front and center while you add or remove extension sheets for blocks of the
other Ws. We suggest you divide up the details as we have the latter chapters
of this book, with at least one sheet each for who & what, when & where, how
many, and why & how.

Have a scribe recording the model as you go. With traditional interactive modeling efforts, scribes are usually members of the data modeling team because of the technical nature of the information they record and modeling tools they use. With BEAM✲, the scribe can be anyone who can use a spreadsheet. This is an ideal role for the on-site customer or product owner (one of the stakeholders) on an agile team.

If you are limited for whiteboard space and lacking a scribe because of the impromptu nature of your modelstorming, take pictures so you can erase as you go. The cameras in most smartphones and tablet devices are more than adequate for this and can take advantage of scanner apps that will automatically clean up whiteboard images (reduce glare, increase contrast, fix perspective) and email the results to your group. Don’t forget to turn off the flash.

If you have to erase as you go, leave the primary details and example data on
the board. If room permits (or on a separate flipchart) keep a visible “shopping
list” of the detail names you’ve had to erase.

Use any color you like as long as it’s black! If you’re going to take photos of your work, stick to black whiteboard markers to improve the results. BEAM✲ notation is deliberately non-color coded to help you here. Why do you see so many rainbow-colored whiteboard diagrams? Occasionally someone will have a well thought out color scheme (but did they remember 8% of the male population have color vision deficiency?). More often than not it’s because black is the missing/dried-up pen. Go out and buy a box of black dry-wipe markers now! Right now!

If you want to increase the level of interest, interactivity, contribution and energy when you’re modelstorming give everyone a (black) marker. Get stakeholders on their feet writing their own event stories on the board as soon as they’re used to BEAM✲. How well this works depends on your style, their style, everyone’s handwriting and the number of modelstormers. Having everyone edit the whiteboard model together can work well for small groups of peers but no one wants to feel they’re back at school being told to “Write that on the board”.

For more structured modeling sessions with larger groups of stakeholders you might want to use a data projector and model directly into a projected spreadsheet. If so, investigate the use of short throw, interactive projectors and annotation software. With these tools, modelstormers can huddle round a projected interactive whiteboard without casting shadows—but don’t let gadgets get in the way of modeling.

If you are using the BEAM✲Modelstormer spreadsheet you will find that the
primary details (the subject, object, and initial when detail) of each table are
frozen, so that you can scroll horizontally without losing the context of each
event story. This spreadsheet also draws a pivoted ER table diagram in sync
with the BEAM✲ table, so you can see a list of all the details at any time.

Appendix C provides recommendations for tools and further reading that will
improve your collaborative modeling efforts.

How Many?
Ask “how many?” to discover facts, measures, KPIs.

How many questions are used to discover the quantities associated with an event that will become facts in the physical data warehouse and the measures and key performance indicators (KPIs) on BI reports and dashboards. Again, you repeat
the main clause of the event as a question to the stakeholders, but this time with
“how many” and its variants: “how much”, “how long” etc. inserted to make
grammatical sense. For example:

CUSTOMER orders how many PRODUCTs?

How much are PRODUCT orders worth?

In both cases you want the name of the quantity. Likely answers to these questions—ORDER QUANTITY and REVENUE—have been added to the event table in Figure 2-10 along with examples that show a wide range of values. You should ask how much/many questions for each detail to see if it has any interesting quantities that should be associated with the event. So you could ask:

How many CUSTOMERs order PRODUCTs?

The stakeholders would probably like to answer “thousands” but for each order
event story it is always one customer. For details like this where the answer is
always one, or zero if the detail is missing (not mandatory), there isn’t a useful
additional quantity to name and add to the event. When you have checked all the
details for quantities, you should follow up with the general question:

How else would you measure this event?



Figure 2-10: Adding quantities

Unit of Measure
Ask for the standard unit of measure for each quantity.

When you ask the stakeholders for example quantity values, you should also discover their unit of measure. If you find that a quantity is captured in multiple units of measure it will need to be stored in a standard unit in the data warehouse to produce useful additive facts, so you should ask the stakeholders what that standard unit should be. (Chapter 8 provides details on designing additive facts.) You record the unit of measure in the quantity’s column type using square brackets type notation; e.g., [$] or [Kg]. The unit of measure is a more useful descriptive type for a quantity than [how many].

[ ]: Square brackets denote detail type (e.g. who, where) and unit of measure for
how many details.

Multiple units of measure can be listed in the column type.

If a quantity needs to be reported in multiple units of measure you can record them as a list with the standard unit of measure first. Figure 2-10 shows example events where REVENUE is captured in dollars and pounds. The column type [$, £, €, ¥] records that US Dollar is the standard unit for the data warehouse, but BI applications will also need to report REVENUE in Sterling, Euro and Yen.
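The standardization this implies can be sketched as a simple conversion to the standard unit before summing. The Python below is illustrative only; the exchange rates are placeholders, not real rates:

```python
# Sketch: convert quantities captured in several units of measure
# into the standard unit ([$]) so the resulting fact is additive.
# Exchange rates are invented placeholders for the example.

RATES_TO_USD = {"$": 1.0, "£": 1.25, "€": 1.10, "¥": 0.007}

def to_standard_unit(amount, unit):
    """Convert an amount in a listed unit to the standard unit [$]."""
    return round(amount * RATES_TO_USD[unit], 2)

revenues = [(100.0, "$"), (80.0, "£"), (200.0, "€")]
total_usd = sum(to_standard_unit(a, u) for a, u in revenues)
print(total_usd)  # 420.0
```

In practice the ETL process would use dated exchange rates so that historical facts remain consistent, but the additivity argument is the same.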

Durations
Ask “How long?” to discover durations and evolving events.

You discover durations by asking how long questions. For example, “How long does it take a CUSTOMER to order a PRODUCT?” If the stakeholders view the event as a point in time there will be no duration (not recorded or not significant). Asking for durations is another way of testing if the event should be modeled as evolving. Duration calculations can expose missing when details and highlight other events (verbs) that are so closely related to the current event that they should all be part of an evolving event.

Derived Quantities
Derived quantities help define BI reporting requirements
Some modelers may question the need for modeling duration quantities. If timestamps are present then durations can be calculated rather than stored. This is true, but BEAM✲ tables are BI requirements models for documenting data and reporting requirements, not physical storage models. By documenting durations and other derived measures as event details you have the opportunity to capture their business names and document their maximum and minimum values (in stories), which can be used as thresholds for dashboard alerts and other forms of conditional reporting.
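For instance, a duration such as the days between ORDER DATE and DELIVERY DUE DATE can be derived from two when details; this sketch (the function name and the alert threshold are assumed for illustration) shows how a documented maximum value could drive a dashboard alert:

```python
from datetime import date

# Sketch: derive a duration quantity from two when details and flag
# stories that exceed a documented maximum (captured from example stories).
MAX_EXPECTED_DAYS = 7  # illustrative threshold, not from the book

def delivery_lead_time(order_date: date, delivery_due_date: date) -> int:
    """ORDER TO DELIVERY DAYS: a derived how-many detail."""
    return (delivery_due_date - order_date).days

lead = delivery_lead_time(date(2011, 5, 18), date(2011, 5, 22))
alert = lead > MAX_EXPECTED_DAYS  # conditional-reporting style check
```

Documenting the duration as a named detail is what makes the threshold discussable with stakeholders, even if the value is ultimately computed rather than stored.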

BEAM✲ event tables do not translate column for column into physical fact tables.
When an event table is physically implemented as a star schema the majority of
its non numeric details will be replaced by dimensional foreign keys, and some of
its quantities can be replaced by BI tool calculations or database views. This
process is covered in Chapter 5.
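That translation can be sketched as a simple partitioning of columns by W-type. The column names come from this chapter's CUSTOMER ORDERS example, but the mapping rules here are a deliberate simplification for illustration, not the full design process of Chapter 5:

```python
# Sketch: partition event details into star schema roles by W-type.
# Simplified rules: how-many -> facts, how -> degenerate dimensions,
# everything else -> foreign keys to dimension tables.
event_columns = {
    "CUSTOMER": "who", "PRODUCT": "what", "ORDER DATE": "when",
    "DELIVERY DUE DATE": "when", "SALESPERSON": "who",
    "SALES LOCATION": "where", "DELIVERY ADDRESS": "where",
    "QUANTITY": "how many", "REVENUE": "how many",
    "PROMOTION": "why", "ORDER ID": "how",
}

def star_schema_roles(columns):
    roles = {"dimension_keys": [], "facts": [], "degenerate_dims": []}
    for name, w in columns.items():
        if w == "how many":
            roles["facts"].append(name)            # numeric measures
        elif w == "how":
            roles["degenerate_dims"].append(name)  # transaction IDs kept in the fact table
        else:
            roles["dimension_keys"].append(name)   # FK to a dimension table
    return roles

roles = star_schema_roles(event_columns)
```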

Why?
Time to ask “Why?”
Capturing why details is the next step in modeling the event. As with the other “W” questions you ask a why question using the main clause of the event:

Why do CUSTOMERs order PRODUCTs?

might be a little open ended but

Why do CUSTOMERs order PRODUCTs in these quantities on these dates at these locations?

will focus the stakeholders on identifying the causal factors that specifically explain variations in event quantities. The why details you are looking for can include promotions, campaigns, special events, external marketplace conditions, regulatory circumstances or even free-form text reasons for which data is readily available. If the stakeholders respond with:

Product promotions.

you would expand the event table as shown in Figure 2-11, and add example stories that illustrate typical and exceptional circumstances. Notice that the typical promotion is “No promotion” and that the why detail has prompted the stakeholders to supply an additional DISCOUNT quantity. Chapter 9 provides detailed coverage on modeling why details as causal dimensions.

If event stories show wide quantity variations, point this out as you model why details. Ask stakeholders if there are any reasons that would explain these variations. If causal descriptions are well recorded they may also lead you to discover additional quantities.
Modeling Business Events 53

Figure 2-11
Adding a why detail

How?
You finish with how questions
The final “W” questions discover any how details. How refers to the actual mechanism of the business event itself. You discover these details by asking a how question using the main clause of the event:

How does a CUSTOMER order a PRODUCT?

Often how details include transaction identifiers from the operational system(s)
that capture each event. If the stakeholders respond with:

A customer or salesperson creates an ORDER with an ORDER ID.

then you would add ORDER ID to the table as in Figure 2-12. ORDER ID might be an equally good answer to other how questions such as: “How do you know that a customer ordered a product; what evidence do you have?” or “How can you tell one similar order from another?” With these questions you are explicitly asking for operational evidence that these event stories exist and can be differentiated from one another.

Figure 2-12
Adding a how detail

How details can also be descriptive
You should ask further how questions to find out if there are any more descriptive how details. You are typically looking for methods and status descriptions. A suitably rephrased how question might be:

In what ways can a CUSTOMER order a PRODUCT?

to which the stakeholders might respond:

Using a credit card or a purchase order. We’ll call that PAYMENT METHOD.

Multi-verb evolving events can have multiple transaction ID and/or descriptive how details: one for each verb. In Chapter 5, transaction ID hows are modeled as degenerate dimensions within fact tables. In Chapter 9, more descriptive hows are modeled as separate how dimensions.

Event Granularity
Event granularity is the combination of event details that guarantees uniqueness
Each event story must be uniquely identifiable (otherwise there would be no way to identify duplicate errors). Therefore you must have enough details about an event so that each example story can be distinguished from all others by some combination of its values. This combination of detail is called the event granularity. Discovering the event granularity is the job of the repeat story theme. If every detail in the repeat story matches the typical story you do not have enough details to define the granularity.

Transaction IDs (how details) can be used to define granularity
In most cases, event granularity can be defined by a combination of who, what, when, and where details, but occasionally details stubbornly refuse to be unique through most of the 7Ws. While highly unlikely, perhaps the same customer really can order the same product at the same time at the same price from the same salesperson for delivery on the same date to the same location. In cases like this, the operational source system will have created a transaction identifier—such as Order ID (a how detail)—that can be used to differentiate these event instances.

Granular details are marked with the code GD
After you have discovered the event granularity you document it by adding the short code GD (granular detail) to the column type of each granular detail that in combination defines the granularity. Figure 2-12 shows that the order event granularity is defined by a combination of ORDER ID and PRODUCT. This would equate to order line items in the operational system.

GDn : Granular Detail (or Dimension). A detail that singularly or in combination with others defines the granularity/uniqueness of an event. Alternative combinations are numbered; e.g., GD1 and GD2 denote two alternative granular detail groups.
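The repeat-story test can also be run as a profiling check: the proposed granular details must uniquely identify every example story. A sketch using invented order-line stories (not the book's figures):

```python
# Sketch: verify that a proposed combination of granular details (GD)
# uniquely identifies every event story (no duplicate key combinations).
from collections import Counter

def defines_granularity(stories, granular_details):
    """True if the GD columns are unique across all example stories."""
    keys = [tuple(story[d] for d in granular_details) for story in stories]
    return all(count == 1 for count in Counter(keys).values())

stories = [
    {"ORDER ID": "ORD1234", "PRODUCT": "iPip Blue Suede", "QUANTITY": 1},
    {"ORDER ID": "ORD1234", "PRODUCT": "POMBook Air", "QUANTITY": 1},
    {"ORDER ID": "ORD007",  "PRODUCT": "POMBook Air", "QUANTITY": 1},
]
```

Here ORDER ID alone fails the test (ORD1234 repeats across line items) while ORDER ID plus PRODUCT passes, matching the order-line-item granularity described above.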

Sufficient Detail
Granularity is not enough, you want all the details
Just because you have enough details to define the event granularity does not mean you stop adding details. If the stakeholders are still providing relevant details, keep adding them. Event uniqueness is a minimum requirement. What you are aiming for is the complete set of details that tell the “full” story (or at least as much of it as is currently known).

Naming the Event
Event names often use the event subject and verb
It is now time to give the event a short descriptive name, one that the stakeholders will be comfortable with and that matches a recognized business process. The name can often be some variant of the event verb. If the verb is shared by other events (you may have to model other types of orders; e.g., wholesale orders or purchase orders) then a combination of subject and verb will be needed to make a unique name. By convention, event names are uppercase plural. In Figure 2-13 the completed event table, now named CUSTOMER ORDERS using the subject and a verb, has been transferred to a spreadsheet.

Figure 2-13
Documenting the
event

If the verb doesn’t sound right try a how detail
If the event verb doesn’t provide a good name try using one of the how details. For example, if the main clause was “Customer Buys Product” an event name of CUSTOMER BUYS might be replaced by CUSTOMER PURCHASES or CUSTOMER INVOICES using how details like PURCHASE ID or INVOICE NUMBER.

Event subjects are subjective!
The subject of an event is actually subjective: it is based on the stakeholders’ initial perspective. Different stakeholders can describe the very same event starting with a different subject. Once all its details have been discovered the initial event might be better described using one of those alternative points of view, by swapping around its details to establish a better subject and/or event name. For example, the event “SALESPERSON sells PRODUCT to DISTRIBUTOR” might be reordered as “DISTRIBUTOR orders PRODUCT from SALESPERSON” and named DISTRIBUTOR ORDERS rather than SALESPERSON SALES. The initial subject (SALESPERSON) has helped to tease out the event stories, but its work here as a subject is now done.

If the subject-verb combination doesn’t provide a good event name try using the
object and a how detail.

Completing the Event Documentation


Add the story type to the table heading after the event name
Finally, now that you have all the event details you can define the story type with confidence. To recap, if the event has a when detail with an every preposition it is recurring. If the event has multiple when details that can change it is evolving. Otherwise it is a discrete event. You record the story type using one of the table codes: DE, RE, or EE, which you place in the event name header. Table level codes follow the table name within square brackets as in Figure 2-13.

[DE] : Discrete event
[RE] : Recurring event
[EE] : Evolving event
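The recap above can be expressed as a small decision rule. This is a sketch under stated assumptions: each when detail is encoded as a (name, preposition, can-change) triple, which is an illustrative encoding of my own rather than BEAM✲ notation, and the rules are simplified:

```python
# Sketch: derive the BEAM story-type code from an event's when details.
# Each when detail is (name, preposition, can_change), per the recap above.

def story_type(when_details):
    """Return [DE], [RE] or [EE] for a list of when-detail triples."""
    if any(prep == "every" for _, prep, _ in when_details):
        return "[RE]"                     # recurring: an "every" when detail
    changeable = [w for w in when_details if w[2]]
    if len(when_details) > 1 and changeable:
        return "[EE]"                     # evolving: multiple, mutable whens
    return "[DE]"                         # otherwise discrete

snapshot = story_type([("SNAPSHOT DATE", "every", False)])
```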

Draw a double bar on the right edge to denote a completed table
When you finish documenting an event using a spreadsheet (as in Figure 2-13) draw a double bar on the right edge to signify that the table is complete (for now). This is a helpful visual clue because BEAM✲ tables grow wider than the screen or printed page. If you can’t see the double bar you know you should scroll right or look for a continuation page to see more details.

The finished stories can now be told
By scanning the completed table stakeholders can now read their finished event stories, such as:

Elvis Priestly (is that really his name?) orders 1 iPip Blue Suede on the 18th May
2011, for delivery due on the 22nd May 2011, from James Bond at the POMStore
NYC, for delivery to Memphis, TN, for $249, on no promotion with zero discount, using ORD1234.

Vespa Lynd orders a POMBook Air on the 29 June 2011, for delivery on the 4th
July 2011, from POMStore.com, for delivery to London, UK, for £1,400 with a
launch event 10% discount, using ORD007.

With a little tweaking of prepositions and reordering of details (mainly the how
manys) you can now construct an event story archetype for CUSTOMER ORDERS
from the examples. This final piece of documentation might say:

The generic customer orders story
Customer orders a quantity of products, on order date, for delivery on delivery due date, from salesperson at sales location, for delivery to delivery address, for revenue, on promotion with discount, using order ID.

The Next Event?
Having modeled the first event you should model its matching dimensions next
Having fully described the details of your initial event, you should model the dimensions they represent before moving on to other events. This is exactly what stakeholders will want you to do because although they may find the details discovered so far to be fascinating, they will want to know that they can analyze events using many more descriptive attributes. If stakeholders provide you with lots of detail about detail this is a sure sign that they are keen to define the dimensional attributes they will need to aggregate and filter the atomic-level business events to produce interesting reports—and that is exactly what you learn how to do in Chapter 3.

When you already have a library of common dimensions you can quickly move to the next event
In subsequent modelstorming iterations, when you have established a library of dimensions, stakeholders will want to move directly from event to event, rapidly telling event stories by reusing common details (dimensions) and examples. This is a habit that you want to encourage early on by using the event matrix techniques described in Chapter 4.

When you and your stakeholders get the hang of telling event stories, BEAM✲ can
proceed at a storming pace, describing many events in quick succession but you
should be careful not to model too many events at the story level. It is a balancing
act. Stakeholders need to model multiple events, to describe their cross-process
analytical requirements and be able to prioritize the most important event(s) for
the next release—not necessarily the first event(s) you discover when modelstorm-
ing. However, telling many more detailed event stories than can be implemented in
the next sprint is unnecessarily time consuming and can create unrealistic expecta-
tions of what will soon be available. It can also lead to the dreaded BDUF (Big
Design Up Front) that does not reflect the business realities, changed requirements
and available knowledge when it is eventually implemented.

You should reserve event stories for just-in-time (JIT) modeling of the detailed
data requirements for your next sprint/iteration/release depending on your agile
development schedule. Look to Chapter 4’s JEDUF (Just Enough Design Up Front)
techniques for modeling ahead to (even more rapidly) create higher-level models of
the events in future releases. These models provide just enough information to help
stakeholders decide the best events to model in detail now, and help you design
more flexible versions of those events: ones that will require less rework in future
DW iterations.

Summary
Business events represent the measurable business activity that gives rise to DW/BI data
requirements. BEAM✲ uses data storytelling techniques for discovering business events by
modelstorming with business stakeholders.

BEAM✲ defines three event story types: discrete, evolving and recurring. They match the three
fact table types: transaction fact tables, accumulating snapshots and periodic snapshots.

Each BEAM✲ event consists of a main clause, containing a subject, verb and object, followed by
prepositional phrases containing prepositions and detail nouns. Each subject, object and detail
is one of the 7Ws; i.e., a noun that potentially belongs in a dimensional database design.

BEAM✲ modelers use the 7Ws to discover a business event and document its type, granularity,
dimensionality and measures—everything needed to design a fact table.

BEAM✲ modelers avoid abstract data models. They “model by example”: ask stakeholders to
describe their BI data requirements by telling data stories. BEAM✲ modelers document these
requirements using example data tables.

Event stories are example data stories for business events.

Event stories are sentences made up of main clause and preposition-detail examples.

Event stories succinctly describe event details and clarify their meaning by providing examples
of each of the five themes: typical, different, missing, repeat and group.

Typical and different themed stories explore data ranges and exceptions.

Typical and repeat stories describe event uniqueness (granularity).

Missing stories help BEAM✲ modelers to discover mandatory details and document how BI
applications should display missing values.

Group stories uncover event complexities including mixed business models and multi-valued
relationships.

BEAM✲ short codes are used to document mandatory details, granular details and story type.
Other elements of the BEAM✲ notation document W-type, units of measure and completed
event models.
3
MODELING BUSINESS DIMENSIONS
I keep six honest serving-men (They taught me all I knew);
Their names are What and Why and When And How and Where and Who.
— Rudyard Kipling, The Elephant’s Child

Business events need dimensions to fully describe them for reporting purposes
Business events and their numeric measurements are only part of the agile dimensional modeling story. On their own, BEAM✲ event tables are not sufficient to design a data warehouse or even a data mart, because they do not contain all the descriptive attributes required for reporting purposes. For complete BI flexibility, stakeholders need both the atomic-level event details modeled so far and higher-level descriptions that allow those details to be analyzed in practical ways. The data structures that provide these descriptive attributes are dimensions.

BEAM✲ modelers draw hierarchy charts and tell change stories to define dimensions
In addition to the 7Ws and example data tables, BEAM✲ uses hierarchy charts and change stories to discover and define dimensional attributes. Hierarchy charts are used to explore the hierarchical relationships between attributes that support BI drill-down analysis, while change stories allow stakeholders to describe their business rules for handling slowly changing dimensions.

This chapter shows you how to model dimensions from event story details
In this chapter we describe how these BEAM✲ tools and techniques are used to model complete dimension definitions from individual event details. We will use the CUSTOMER and PRODUCT event story details from Chapter 2 for our example dimension modelstorming with stakeholders.

Chapter 3 Topics At a Glance
Modeling the dimensions of a business event
Using the 7Ws and BEAM✲ tables to define dimensional attributes
Drawing hierarchy charts to model dimensional hierarchies
Telling change stories to describe dimensional history

Dimensions
Dimensional attributes are the nouns and adjectives that describe events in familiar business terms
Dimensions are the significant nouns of a business or organization that form the subjects, objects, and supporting details of interesting business events. They are 6 of the 7Ws: the who, what, when, where, why, and how of every event story. Dimensional attributes further describe business events using terms that are familiar to the stakeholders. They represent the adjectives that make data stories more interesting. From a BI perspective, dimensions are the user interface to the data warehouse, the way into the data. Dimensional attributes provide all the interesting ways of rolling up and filtering the measures of business process performance. BI applications use dimensions to provide the row headers that group figures on reports and the lists of values used to filter reports. BI takes advantage of the hierarchical relationships between dimensional attributes to support drill-down analysis and efficient aggregation of atomic-detail measurements. The more descriptive dimensional attributes you can provide, the more powerful the data warehouse and BI applications appear. Consequently, good, richly descriptive customer and product dimensions can have 50+ attributes.

The data values of a dimension (or an individual dimensional attribute) are referred to as its members.

Dimension Stories
Dimension data stories have weak narratives: they are subject and object heavy but verb light
Dimensions, because they represent descriptive reference data (adjectives and nouns), lack the strong narrative of business events. Events (and event stories) are associated with “exciting”, active verbs such as “buy”, “sell”, and “drive” as used in: “customer buys product”, “employee sells service” and “James Bond drives Aston Martin DB5”. Dimensions, on the other hand, are associated with static verbs such as “has” and “is” that lead to weak narratives like “Customer has gender. Customer has nationality” and “Product has product type. Product has storage”. These are state of being events, archetypes for many “is/has” data stories such as “Vespa Lynd is female. She is British” and “iPOM Pro is a computer. It has a 500GB disk”. Important as these sentences are, we hardly think of them as stories because they lack the drama of “who does what, to whom or what, and when, and where” that propels you through all of the 7Ws to rapidly discover data requirements. Dimension stories are not exactly page-turners!

BEAM✲ modelers add drama and plot to help define dimensions
While data stories are highly effective at discovering dimensions and facts (as event details) and the 7Ws remain a powerful checklist at all times, additional techniques are needed to uncover the information hidden within the weaker narratives of dimension stories. BEAM✲ modelers have to add some drama to dimensions to help stakeholders tell more interesting stories that fully describe their attributes and business rules that affect ETL processing and BI reporting. To do this BEAM✲ modelers use two plot devices:

Hierarchy charts help you to ask questions about how dimensions are organized
Hierarchy charts are used to ask stakeholders questions about how they organize (an active verb) the members of a dimension (the values in event stories) into groups. These groups, because they generally have much lower cardinality than the individual members of the dimension, make good row headers and filters for reports—good starting points for telling fewer, higher-level event stories for BI analysis. Many of these groups have hierarchical relationships with one another. Exploring these hierarchies will prompt stakeholders for additional descriptive attributes and provides the information needed to configure BI drill-down and aggregation facilities.

Change stories describe how dimensional attributes change, and how they should reflect history for reporting purposes
Change stories are data stories that document how each dimensional attribute should handle historic values. By asking stakeholders how dimension members change (another active verb) they not only describe which attributes can and cannot change (but can be corrected), but also state their reporting preferences for using current values or historical values. Getting stakeholders to think about change reminds them of additional attributes that behave in similar ways to existing ones and can lead to the discovery of auxiliary business processes that capture changes. Some of these auxiliary processes can be significant enough that they need to be modeled as business events in their own right.

The cardinality of an attribute or relation refers to the number of unique values. High cardinality attributes have a large number of unique member values, whereas low cardinality attributes have a small number of unique members.

Discovering Dimensions
Dimensions are discovered as event details by telling event stories
Having modeled an event, there is no elaborate discovery technique for dimensions because they naturally come out of the event details that you already have. Each event detail that has additional descriptive attributes will become a dimension. For most details there is, at the very least, an identity attribute—a business key—that uniquely identifies each dimension member. This is typically the case for the who, what, when, where, and why details. The candidate dimensions from Chapter 2’s CUSTOMER ORDERS event are:

Event details with additional descriptions become dimensions
CUSTOMER [who]
PRODUCT [what]
ORDER DATE [when]
DELIVERY DUE DATE [when]
SALESPERSON [who]
SALES LOCATION [where]
DELIVERY ADDRESS [where]
PROMOTION [why]

Missing from this list are the event quantities and any how details, such as ORDER ID, which do not have any additional descriptions that need to be modeled in separate dimension tables. When the event is converted into a star schema these details translate directly into facts and degenerate dimensions (denoted as DD) within the fact table. Chapter 9 covers instances where physical how dimensions are needed.

DD : Degenerate Dimension. A dimensional attribute stored in a fact table.

Documenting Dimensions
Dimensions are modeled using BEAM✲ tables
Dimensions are documented by taking each event detail (that has additional attributes), one at a time, starting with the event subject and modeling it as a BEAM✲ (example data) table. Figure 3-1 shows how CUSTOMER—the subject of CUSTOMER ORDERS—is used to define a dimension of the same name. Note that dimensions are singular whereas events are plural.

Dimension Subject
The event detail becomes the subject of the dimension
The event detail also becomes the first attribute of the dimension table—its subject. This mandatory attribute of the dimension is denoted by the code MD. You should ask the stakeholders for a suitable subject name. Typically, a dimension’s subject is its “Name” attribute, such as CUSTOMER NAME, PRODUCT NAME, or EMPLOYEE NAME. After you have a name for the attribute you populate it with the unique examples from the event table: notice in Figure 3-1 that the repeated customer names have been removed.

Figure 3-1
Modeling the
Customer
dimension

Dimension Granularity and Business Keys


Ask for a business key to uniquely identify each dimension member
The next step is to define the granularity of the dimension so that it precisely matches the event(s) it describes. The event stories modeled so far have used customer and product names for readability but you must now discover the business keys that uniquely identify these details. Business keys are the primary keys of source system reference tables. You discover a business key by asking “What uniquely identifies each [dimension name]?” or “How do you distinguish one [dimension name] from another?” For example:

What uniquely identifies each customer?

To which stakeholders might reply:


A customer ID

You check that this is what you need (a single, universal, reliable identifier) by
confirming with the stakeholders that:

A business key must be unique and mandatory
CUSTOMER ID is mandatory. Every customer must have a value for this business key at all times. If this were not the case other business keys would be needed to augment this one.

CUSTOMER ID is unique. There are no duplicate values. New customers are not assigned old lapsed customer IDs.

CUSTOMER ID is stable. IDs are not changed or reassigned.

Add the identity attribute immediately after the dimension subject
Assuming CUSTOMER ID passes these stakeholder tests (which should be confirmed by data profiling as soon as possible) you add this identity attribute to the dimension table with examples to match the existing customer names as shown in Figure 3-2. You also mark it with the column code BK to denote that it is a “Business Key” and because it is the only business key for CUSTOMER you also mark it as mandatory using MD. You can leave out the “has” preposition as it adds little value. When the relationship between an attribute and the dimension subject is more complex (and not apparent from the attribute name) you can add a preposition to help you read the dimension members as stories.

Figure 3-2
Adding a business
key to a dimension
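The mandatory and unique tests map directly onto data-profiling checks. A minimal sketch with invented rows; the stability test (IDs never changed or reassigned) needs extracts from more than one point in time, so it is noted but not implemented:

```python
# Sketch: profile a candidate business key for the mandatory and unique
# tests. Stability can only be confirmed by comparing extracts over time,
# which is out of scope for a single-snapshot check like this one.

def profile_business_key(rows, key):
    values = [row.get(key) for row in rows]
    mandatory = all(v not in (None, "") for v in values)  # no missing IDs
    unique = len(set(values)) == len(values)              # no duplicates
    return {"mandatory": mandatory, "unique": unique}

customers = [
    {"CUSTOMER ID": "C001", "CUSTOMER NAME": "Elvis Priestly"},
    {"CUSTOMER ID": "C002", "CUSTOMER NAME": "Vespa Lynd"},
]
result = profile_business_key(customers, "CUSTOMER ID")
```

Running a check like this early is one way of doing the "confirmed by data profiling as soon as possible" step before the key is baked into the dimension design.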

BK : Business Key. A source system key that can uniquely identify a dimensional member.

Discovering a suitable customer identifier can be difficult; there may be more than one
Asking for a customer business key can be a vexing question. If customer data comes from multiple sources, a single business key that identifies all customers uniquely across all business processes may simply not exist. If you are lucky, customers will have a single major source, or a master data management (MDM) application will have created a master customer ID that the data warehouse can use. If not, be prepared for some interesting discussions with the stakeholders. You may have to model several alternate business keys to have a reliable identifier for each customer.

Dimensional Attributes
Use any detail about detail that you have, then ask for new attributes
Having defined the dimension’s granularity, you are now ready to discover the rest of its attributes. You may already have a short list of candidate attributes that were identified as detail about detail while modeling an event. Figure 3-3 shows the PRODUCT TYPE detail about detail being added to the PRODUCT dimension. These candidate attributes are a good place to start, because the stakeholders are obviously keen to use them. If you don’t have any candidate attributes then it’s time to ask the stakeholders:

What attributes of [dimension name] are interesting for reporting purposes?

What attributes would you like to be able to sort, group, or filter [dimension name]s on?

This usually produces a list of attributes that should be tested, one at a time, to
ensure they are in scope.

Attribute Scope
Check that the new attribute is in scope for your current project, iteration or release
Before you add an attribute to a dimension you need to check that it is within the scope of the dimension and the current project iteration. You check the initial feasibility of an attribute by asking stakeholders whether they believe that data for the new attribute will be readily available from the same sources as other attributes and event details. You should be wary of attributes that don’t exist yet, or greatly expand the number of data sources you have to deal with in an iteration. Attempting to collect examples is a good way of weeding out attributes that are “nice to have if we had an infinite budget.” If stakeholders struggle to provide any examples, or can’t agree on common examples, the attribute may be of limited value.

Figure 3-3
Product dimension
populated with detail
about detail

If you are modeling directly into (projected) spreadsheet BEAM✲ tables, as in the
figure examples in this chapter, freeze the dimension subject and business key
columns, as you would the primary details of events, to keep them visible as you
scroll horizontally to add new attributes.

Check that the new attribute has only one value for each dimension member
Assuming that most of the attributes that stakeholders suggest are within the current scope, your next task is to check that they belong in the current dimension. You are looking for attributes that have a single value for each dimension member at any one moment in time (including “now”): these are attributes that have a one-to-one (1:1) or a one-to-many (1:M) relationship with the dimension subject when you disregard history. You will consider historical values shortly when you ask for change stories. For now, you check each new attribute by asking the following “moment-in-time” question:

Can a [dimension name] have more than one [new attribute name] at any one moment in time?

The question is carefully phrased so that you are checking for the condition you
don’t want, so if the answer is NO then the attribute belongs. NO is good! It tells
you that the attribute does not have a M:M or M:1 relationship with the dimension,
both of which would rule it out. For example, you might ask:

Can a customer have more than one customer type at any one moment in time?

If the stakeholders answer NO, you add CUSTOMER TYPE to the dimension as in
Figure 3-4. Try this question on something which doesn’t belong to CUSTOMER,
like the detail about detail PRODUCT TYPE:

Can a customer have more than one product type at any one moment in time?

The short answer is YES; the stakeholder’s long answer might be:

Yes, a customer can buy or use several products with different product types at any moment.

Product type obviously doesn’t belong to CUSTOMER and the YES answer confirmed it. In reality, common sense would have prevented you or the stakeholders from considering this as a CUSTOMER attribute.
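The moment-in-time question also has a data-profiling analogue: check whether the candidate attribute ever takes two values for the same dimension member. A sketch with invented rows mirroring the CUSTOMER TYPE / PRODUCT TYPE discussion above:

```python
# Sketch: check that a candidate attribute has at most one value per
# dimension member at a moment in time (a 1:1 or M:1 relationship
# with the dimension subject). Rows are invented snapshot data.

def single_valued(rows, member_key, attribute):
    seen = {}
    for row in rows:
        member, value = row[member_key], row[attribute]
        if seen.setdefault(member, value) != value:
            return False  # two values for one member: attribute rejected
    return True

rows = [
    {"CUSTOMER ID": "C001", "CUSTOMER TYPE": "Consumer", "PRODUCT TYPE": "Phone"},
    {"CUSTOMER ID": "C001", "CUSTOMER TYPE": "Consumer", "PRODUCT TYPE": "Laptop"},
]
```

CUSTOMER TYPE passes (one value per customer) while PRODUCT TYPE fails, echoing the NO and YES answers the stakeholders gave.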

Figure 3-4
Adding
CUSTOMER TYPE
to CUSTOMER

Sometimes you will also have to exercise a little intuition to interpret a YES (multiple values are possible at any moment) answer. If you ask:

Can a customer have more than one customer address?

The answer could be YES but you intuitively feel that CUSTOMER ADDRESS belongs to CUSTOMER. How you solve this problem depends on just how many interesting addresses a customer has. You could ask:

Is there a single primary (easily identifiable) address for each customer that should be used for geographic analysis?

If the answer is:

Yes, billing address.


then you have found a single-valued attribute that belongs. You should also find out how many other address values a customer can have. If customers have only two additional addresses, for example, home address and work address, then you can define them as separate attributes in the dimension by being more precise about their meaning/name.

If a proposed attribute has multiple values it may represent a separate dimension of the business event.

If customers can have multiple addresses (some might have hundreds) then you may have discovered a missing where detail of CUSTOMER ORDERS. Addresses might be delivery addresses; customers are ordering gifts for their family and friends or resellers are ordering products for their clients. If that is the case, the addresses in question don’t belong to customers. They belong instead to the business event as a DELIVERY ADDRESS detail and subsequently as a separate dimension. This handles the genuine many-to-many (M:M) relationship between customers and these addresses.

A proposed attribute can be qualified (e.g. Main… or Primary…) to restrict it to a single value.

If the answer to the question: “Can a [dimension name] have more than one [attribute name] (at one moment in time)?” is a resounding NO then the attribute belongs in the dimension. If the answer is a resounding YES it most likely doesn’t belong but may warrant more investigation before you rule it out. If no one is happy to see the dimension without a particular attribute, you may have to qualify the attribute in some way (for example, Primary Address), or adjust the dimension’s granularity to accommodate it.

Alternatively the dimension granularity might need to be adjusted to match the vital attribute.

If the multiple addresses do belong to a customer—they are the offices or stores of a corporate customer—you might need to change the CUSTOMER dimension granularity to customer location, if event activity is tied to specific locations and stakeholders treat each location as an individual customer. In which case you would redefine the granularity as one row per customer per location and define a composite business key to uniquely identify each member.

The previous example treats customer address as a single attribute. In reality, address would be several dimensional attributes, such as Street, City, Region, Postal Code, and Country: attributes that represent a geographic hierarchy. If this is well understood by stakeholders, the individual attributes can be modeled later.

Attribute Examples
Ask for examples for each new attribute that match the dimension subject members.

After you have established that an attribute belongs in the dimension you add it to the BEAM✲ table and ask the stakeholders for example values that match the dimension subject. The dimension subject will already contain typical and exceptional (different and group themed) examples copied from an event (that you captured using the event story themes in Chapter 2). These usually prompt the stakeholders to provide interesting values for each new attribute, too. Stakeholders will typically give you values that they want to see on their reports.

If stakeholders cannot agree on examples, you may have homonyms: multiple attributes with similar names.

The goal of using examples is to ensure that everyone is in clear agreement about the definition of each attribute. If the meaning or use of an attribute is unclear or contentious ask for additional examples. If stakeholders can’t agree on example values for an attribute it can indicate that you have discovered homonyms: two or more attributes with the same name but different meanings. If all the possible meanings are valid, then each set needs to be uniquely named, and modeled as a separate attribute with its own examples.

If stakeholders struggle to provide examples for an attribute you should be alerted to the possibility that the attribute doesn’t exist yet, or is a “nice to have” attribute that isn’t currently well understood enough to be useful.

Descriptive Attributes
Check if codes such as business keys are smart keys that contain hidden meaning.

When documenting examples for business keys and any other cryptic codes check if the values contain any hidden meaning. For example, the data in Figure 3-2 suggests that CUSTOMER ID is a “smart key” that can be used to differentiate business and consumer customers. How does this tally with CUSTOMER TYPE? You would investigate this further via data profiling. It might prove useful as an additional quality assurance test during ETL processing. For every code you are given you should ask the stakeholders:

Do any existing reports or spreadsheets decode [business keys / other cryptic codes] into more descriptive labels?

If YES, you want to convert this report logic into ETL code and define descriptive attributes for these labels in your dimensions, where they will be consistently maintained and available to everyone. Your motto should be “No SQL decodes at query time!” If BI applications need to decode dimensional attributes you and the stakeholders have not done a good enough job of defining the dimensions.

Model descriptive attributes that decode all cryptic codes. No decodes in BI queries!
Beware of “smart keys” with embedded meaning. They seldom remain smart over
their lifetime, and often become overloaded with multiple meanings as business
processes evolve. BI applications should not attempt to decode smart keys and
other codes to provide descriptive labels. It is almost impossible for embedded
report logic to keep up with these codes as their meaning morphs over time. It
should be replaced by descriptive data in the data warehouse.

Codes provide consistent sort order for multi-language text.

If you find more BI-friendly descriptive attributes for codes you can remove or hide the codes in the final version of a dimension, as long as stakeholders have no use for them on reports. However, if you are designing a multinational data warehouse that will translate descriptive attributes into several national languages, these otherwise cryptic codes are useful for consistently sorting reports that are re-run internationally. Chapter 7 covers techniques for handling national languages.

Boolean flag attributes that contain "Y" or "N", (e.g., RECYCLABLE FLAG) can
usefully be augmented with matching report display-friendly attributes containing
descriptive values, (e.g., "Recyclable" and "Non-Recyclable").

Mandatory Attributes
Use MD to record that stakeholders believe an attribute to be mandatory.

While you are filling out example data for an attribute ask whether it is mandatory. If the stakeholders believe it is, you should add MD to its column type. MD does not necessarily define attributes as NOT NULL in the data warehouse. MD may just represent the stakeholders’ idealistic view, while the data warehouse has to cope with the actual operational system data quality. By documenting allegedly mandatory attributes you are capturing rules that the ETL process can test for, and identifying potentially useful attributes for defining dimensional hierarchies.

Missing Values
Every dimension needs a missing row to document missing display values.

One example you must fill in for every attribute is its “Missing” value. If the dimension subject has already been identified as possibly missing from an event story there will already be a missing subject copied from the event. If not you should add a missing row to the dimension just as you would to an event. You fill out this row by asking the stakeholders how they want “Missing” to be displayed for each attribute.

Even mandatory attributes need a missing value.

Paradoxically you need to ask for missing values even for mandatory attributes. For example, if CUSTOMER TYPE is a mandatory attribute of CUSTOMER then for all SALES events involving Customers you can rely on the Customer Type to be present. But if Customers are missing for certain SALES events (for example, anonymous cash transactions) then CUSTOMER TYPE will also be missing. Figure 3-4 shows that when CUSTOMER is missing, the stakeholders want CUSTOMER TYPE to be displayed as “Unknown.”

If there are different types of missing, you need multiple missing stories.

If stakeholders need the data warehouse to be able to differentiate between various types of “Missing” (e.g., “Not Applicable”, “Missing in error”, or “To be assigned later”) the dimension will need additional special case missing stories with different descriptive values and ETL processes will have to work a little harder to assign the correct “missing” version to the events. The implementation of this is discussed in Chapter 5.
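One common implementation pattern (a sketch only, not the Chapter 5 design) reserves a well-known surrogate key for each missing variant so that ETL can assign the right one to each event. All keys, labels, and event fields below are illustrative assumptions:

```python
# Sketch: special-case "missing" rows in a customer dimension, each with a
# reserved surrogate key. Keys, labels, and event fields are invented.
MISSING_ROWS = {
    -1: {"customer_id": "N/A", "customer_type": "Unknown"},
    -2: {"customer_id": "N/A", "customer_type": "Not Applicable"},
    -3: {"customer_id": "N/A", "customer_type": "To Be Assigned Later"},
}

def customer_key_for_event(event, key_lookup):
    """Resolve a sales event to a customer surrogate key or a missing key."""
    if event.get("anonymous_cash_sale"):
        return -2   # customer is not applicable for anonymous cash sales
    # fall back to the generic "Unknown" row when the lookup fails
    return key_lookup.get(event.get("customer_id"), -1)

key_lookup = {"C1": 101, "C2": 102}
key_known = customer_key_for_event({"customer_id": "C1"}, key_lookup)
key_cash = customer_key_for_event({"anonymous_cash_sale": True}, key_lookup)
```

The extra ETL effort is exactly this branching: deciding which missing story applies to each event.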

Don’t go overboard with examples. Dimensions usually have far more attributes
than events have details, and you want to discover as many dimensional attrib-
utes as possible rather than exhaustively capture examples for only a few.

Exclusive Attributes and Defining Characteristics


Dimensions with mixed types often contain mutually exclusive attributes.

As you explore missing and mandatory examples you may come across mutually exclusive attributes: attribute pairs or attribute groups that cannot both have values for the same member. Figure 3-5 shows examples of exclusive customer attributes. Customers can have a DATE OF BIRTH and GENDER or a SIC CODE and EMPLOYEE COUNT but not all 4 values. This is often an indication that you have discovered a heterogeneous dimension, one that contains a highly diverse set of members that are described in different ways (even if they are measured in similar ways by taking part in similar business events). In this case, the customer dimension contains businesses and consumers. Their mutually incompatible descriptions should be marked as exclusive attributes using the short code Xn where n is an exclusive group number. In Figure 3-5, consumer attributes are exclusive group 1, marked X1, and business attributes are group 2, marked X2. Exclusive attributes are often a sign of mixed business models; e.g., business-to-business (B2B) and business-to-consumer (B2C), offering products and services.

Figure 3-5
Exclusive customer
attributes

Exclusive attributes have at least one defining characteristic.

If you discover exclusive attributes, you need to find their defining characteristic(s): one or more attributes whose values control the validity/applicability of the exclusive attributes. For customer there is a single attribute, CUSTOMER TYPE, which dictates whether the attributes in X1 or X2 are valid. This is marked as a defining characteristic with the code DC. If a dimension contains multiple defining characteristics each DC code should be followed by a list of the exclusive group numbers controlled by the attribute. For example, if CUSTOMER TYPE was one of several DC attributes in CUSTOMER it would be marked DC1,2 because it selects between X1 and X2 group attributes only.

DCn,n : Defining Characteristic, dictates which exclusive attributes are valid.
Xn : Exclusive attribute or attribute group

Defining characteristics should be low cardinality and mandatory.

Good defining characteristics should be low cardinality, mandatory attributes. They should have a small number of unique values to match the small number of exclusive groups they control and they should always be present to provide controlling values. Because of these properties DC attributes are typically important levels in dimension hierarchies. If a DC attribute is optional and high cardinality and/or a dimension is awash with exclusive groups it may be a clue that you are struggling to model a ‘one size fits all’ generic dimension.

Exclusive attribute groups can be nested with other exclusive groups.

Exclusive attribute groups can be nested if required. For example, if “for profit” and “non profit” organizations were described differently, an additional defining characteristic BUSINESS TYPE marked DC3,4 would govern their descriptive attributes marked X3 and X4. As these are all business related attributes they would be nested within the “Business” exclusive group X2 using the code combinations X2, X3 and X2, X4. Their defining characteristic BUSINESS TYPE is also a business only exclusive attribute so it should be marked in full as X2, DC3,4.

An attribute can only be mandatory when its Xn group is valid.

Some of the exclusive attributes in Figure 3-5 are marked as mandatory (MD) but are not always present because they are exclusive to a subset of the dimension members. The code combination Xn, MD means exclusive mandatory attribute: the attribute is only mandatory when its exclusive group is valid.

Exclusive attribute subsets can be implemented as separate tables.

Defining characteristic and exclusive attribute groups allow you to model subsets within a single BEAM✲ table. Subsets can help you later to define restricted views (or swappable dimensions, see Chapter 6) to increase usability and query performance. They also provide important ETL processing rules and checks.
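Such an ETL check can be sketched directly from the DC and Xn codes. The following illustration uses attribute names loosely based on Figure 3-5; the group definitions are assumptions:

```python
# Sketch of a data quality check driven by a defining characteristic:
# CUSTOMER TYPE (DC) dictates which exclusive group (X1 consumer,
# X2 business) may hold values. Names follow Figure 3-5 loosely.
EXCLUSIVE_GROUPS = {
    "Consumer": ["date_of_birth", "gender"],        # X1
    "Business": ["sic_code", "employee_count"],     # X2
}

def exclusive_violations(row):
    """Return attributes populated outside the row's valid exclusive group."""
    valid = set(EXCLUSIVE_GROUPS.get(row["customer_type"], []))
    all_exclusive = {a for attrs in EXCLUSIVE_GROUPS.values() for a in attrs}
    return sorted(a for a in all_exclusive - valid if row.get(a) is not None)

ok = exclusive_violations(
    {"customer_type": "Consumer", "date_of_birth": "1970-01-01",
     "gender": "F", "sic_code": None, "employee_count": None})
bad = exclusive_violations(
    {"customer_type": "Consumer", "date_of_birth": "1970-01-01",
     "gender": "F", "sic_code": "7372", "employee_count": None})
```

A row flagged by this check has values in an exclusive group its defining characteristic rules out, which the ETL process can reject or quarantine.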

Using the 7Ws to Discover Attributes


Use the 7Ws as a checklist for discovering attributes.

Every dimensional attribute, just like every event detail, is one of the 7Ws. Therefore you can use the 7Ws as a question checklist to help you ask stakeholders for additional attributes when their initial flood of attributes starts to dry up. Not every W question makes sense for every W-type dimension so you don’t want to stick so rigidly to the BEAM✲ sequence you use to follow the narrative arc of an event. For who, what, when and where dimensions it is often useful to start with a question of the same W-type as the dimension. For who and what dimensions the answers are often example members for type attributes. For example, if you ask:

Who or what is a customer or a product?

Pomegranate stakeholders might reply with examples like:

Customers are consumers, businesses, charities…


Products are computers, software, accessories…
They can also be services.

Answers like these would lead you to attributes such as CUSTOMER TYPE and
PRODUCT TYPE (if you didn’t already have them). Table 3-1 illustrates other
7W-inspired questions and example answers for customer and products.

Table 3-1
Example 7W
attribute questions
and answers

BEAM✲ MODELER QUESTION / STAKEHOLDER ANSWERS

Who else is associated with a customer?
  Primary Contact, Spouse, Sponsor, Decision Maker, Owner (Parent Company), Referrer

Who is associated with a product?
  Manufacturer, Distributor, Supplier, Marketer, Promoter, Product Manager, Inventor, Designer, Developer, Author

What dates (whens) are important to know about a customer?
  Birth Date, Graduation Date, First Purchase Date, Last Purchase Date, Renewal Date

What milestone dates (whens) are there for a product?
  Launch Date, Arrival of First Competitor, Patent Expiration Date, Discontinued Date

Where are customers?
  Headquarters, Sales Region, Work Address, Home Address, Nearest Branch

What geographic (where) information describes a product or service?
  Country of Origin, Manufacturing Plant, Language, Market, Voltage

Are there any single-valued quantities (how many) that describe or group customers?
  Life Time Value, Loyalty Score, Current Balance, Number of Employees, Number of Dependents

What quantities (how many) describe products?
  Weight, Size, Capacity, List Price

Why or how do customers become customers?
  Channel, Prospect Source, Referral

When stakeholders give you examples (such as “Consumer” or “Business”) instead of a suitable attribute name (Customer Type) try adding these examples to the dimension table in a new column against their matching members and then ask what that column should be called.

Dimensional Hierarchies
Dimensional hierarchies support BI drill-down analysis.

Hierarchies provide a mechanism for dealing with details that are too numerous or small to work with individually. Dimensional hierarchies describe sequences of successively lower cardinality attributes. They allow individual business events to be consistently rolled up to higher reporting levels, and subsequently drilled-down on, to explore progressively more detail. Without hierarchies, BI reporting would be overwhelmed by detail.

When dealing with lots of things, we naturally tend to organize them hierarchically into fewer and fewer groups.

So how much detail is too much? In everyday life we all seem to agree that 365 (or 366) are too many! Too many days to always deal with individually—too short a period of time to complete large tasks and activities or see trends. So we group our days into longer periods: weeks, months, quarters, terms, semesters, seasons etc. so that we can plan bigger things and see patterns and trends in our lives. Figure 3-6 shows how the days in the first quarter of 2012 can be grouped hierarchically. Organizations naturally do this with the many fiscal time periods, geographic locations, people, products, and services that they work with. The clue is in the name: organizations organize things and if there are enough things (their 7Ws) they typically organize them hierarchically. The majority of the 7Ws that describe a business will have de-facto hierarchies in place, and it is vitally important that they are standardized and made available in the dimensions in the data warehouse.

Figure 3-6
The calendar: a
balanced hierarchy
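The grouping in Figure 3-6 can be sketched as flattened calendar dimension attributes, one per level. The level formats below (month as "YYYY-MM", quarter as "YYYY-Qn") are assumptions for illustration:

```python
# Sketch: flattening the balanced day -> month -> quarter -> year hierarchy
# into calendar dimension attributes (level formats are assumptions).
from datetime import date, timedelta

def calendar_row(d):
    return {
        "date": d.isoformat(),
        "month": d.strftime("%Y-%m"),
        "quarter": f"{d.year}-Q{(d.month - 1) // 3 + 1}",
        "year": d.year,
    }

# The 91 days of Q1 2012 (a leap year) all roll up to a single quarter.
q1_2012 = [calendar_row(date(2012, 1, 1) + timedelta(days=i))
           for i in range(91)]
quarters = {row["quarter"] for row in q1_2012}
```

Each row carries its full roll-up path, so BI tools can group by any level without recomputing it at query time.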

Why Are Hierarchies Important?


Agile dimensional modelers take advantage of hierarchies for a number of reasons.

Explicit hierarchies are not strictly necessary for drill-down or drill-up reporting. So long as dimensions contain the attributes that BI users want to drill on they can effectively drill-down by adding them to existing reports and drill-up by removing them. However, there are a number of significant modeling and implementation benefits in making at least one hierarchy in each dimension explicit:

Hierarchy definition provides a hook for catching additional attributes.

Hierarchies provide a necessary hook to catch dimensional attributes. When you ask stakeholders about hierarchies you are asking them how they (would like to) organize their data. Discussing this activity is one technique for adding some otherwise missing narrative to dimension stories and prompting stakeholders for the BI friendly attributes you need to model good dimensions. When stakeholders think about their 7Ws hierarchically, they describe low level attributes that can be used as discriminators for similar dimensional members, and higher level attributes that can group together many dimensional members.

Hierarchies help expose “informal” stakeholder maintained data sources that can greatly enrich dimensions.

When stakeholders describe their favorite hierarchy levels they will frequently provide you with additional “informal” data sources (spreadsheets, personal databases) they own that contain this categorical information. These stakeholder maintained sources often contain hierarchy definitions, vital to BI, that are missing from “formal” OLTP databases because they are nonessential for operational activity. Many operational applications happily perform their function at the bottom of each hierarchy with no knowledge of the higher level groupings that are imposed upon their raw transactions for reporting purposes. For example, orders can be processed day in, day out without the order processing system knowing how a single date is rolled into a fiscal period. Similarly, items can be shipped to the correct street number/postal code without knowing how the business currently organizes sales geographically (or how they might have been organized differently last year).

Hierarchies help you discover planning processes.

Hierarchies exist so that organizations can plan. Discussing hierarchies with stakeholders will get them thinking about their planning processes, and will likely help you discover additional events and data sources that represent budgets, targets or forecasts. You must make sure the dimensions you design contain the common levels needed to roll up actual event measures for comparison against these plans.

BI users and BI tools require default hierarchies to enable simple drill-down.

BI users like default hierarchies and BI tool “click to drill” functionality that allows them to quickly drill-down on an attribute without having to manually decide each time what to show next. For example, if users drill on “Quarter” they usually want to see monthly detailed data by default. Explicit hierarchies establish predictable analytical workflows that are very helpful to (new) BI users exploring the data for the first time. “Clicking to drill” is less laborious and error prone than manually adding and removing report row headers.

Hierarchies are used to optimize query performance.

Everyone wants common drill-down and drill-up requests to happen quickly. Explicit hierarchies are needed to define efficient data aggregation strategies in the data warehouse. On-Line Analytical Processing (OLAP) cubes in multidimensional databases, aggregate navigation/query rewrite optimization in relational databases and prefetched micro cubes in BI tools, all take advantage of hierarchy definitions to maximize query performance.

Hierarchy Types
There are three hierarchy types: balanced, ragged and variable depth.

Data warehouses and BI applications have to deal with three types of hierarchies: balanced, ragged (unbalanced) and variable depth (recursive). Each of these can come in two flavors: simple single parent and more complex multi-parent. Of the six varieties shown in Figure 3-7, single parent balanced hierarchies are the easiest to implement and use and should be the main focus of your initial modeling efforts.

Figure 3-7
Hierarchy types

Balanced Hierarchies
Balanced hierarchies have fixed numbers of levels.

Balanced hierarchies have a fixed (known) number of levels, each with a unique level name. Time (when) is an example of a balanced hierarchy, as the example calendar data in Figure 3-6 shows. This example has four levels: day, month, quarter, and year. The hierarchy is balanced because there are always four levels; days always roll up to months, months to quarters, and quarters to years—there are no exceptional dates that do not belong to a month and only belong to a quarter or a year.

The number of members at each level can vary.

Being balanced has nothing to do with the number of members (unique data values) at each hierarchy level. For example, even though the number of days in a month varies from 28 to 31, and days in a year can be 365 or 366, the calendar hierarchy is still balanced in depth. Figure 3-6 is not the only time hierarchy; alternative hierarchies of day → fiscal period → fiscal quarter → fiscal year and day → week → year may all exist in the same calendar dimension. Each of these is a separate balanced hierarchy.

Balanced hierarchy levels are mandatory attributes.

A hierarchy is implemented in a dimension by adding an attribute for each of its levels. For a balanced hierarchy each of its fixed levels must be a mandatory attribute with a strict M:1 relationship with the parent attribute one level above it and a 1:M relationship with the child levels below it.
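The strict M:1 rule between adjacent levels is easy to verify with a small data quality check. A sketch, using invented calendar rows:

```python
# Sketch: checking the strict M:1 rule between adjacent hierarchy levels;
# any child value with two parents unbalances the hierarchy.
from collections import defaultdict

def m_to_1_violations(rows, child, parent):
    """Return children that roll up to more than one parent value."""
    parents = defaultdict(set)
    for row in rows:
        parents[row[child]].add(row[parent])
    return {c: sorted(p) for c, p in parents.items() if len(p) > 1}

calendar = [
    {"month": "2012-01", "quarter": "2012-Q1"},
    {"month": "2012-02", "quarter": "2012-Q1"},
    {"month": "2012-04", "quarter": "2012-Q2"},
]
violations = m_to_1_violations(calendar, "month", "quarter")   # clean: {}

calendar.append({"month": "2012-04", "quarter": "2012-Q1"})    # bad source row
violations_after = m_to_1_violations(calendar, "month", "quarter")
```

Running the same check for every adjacent level pair confirms a hierarchy is safe to use for drill-down and aggregation.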

Ragged Hierarchies
Ragged hierarchies have missing levels (with zero members) that unbalance them.

Ragged (or unbalanced) hierarchies are similar to balanced hierarchies in that they have a known maximum number of levels and each level has a unique name, but not all levels are present (have values) for every path up or down the hierarchy—making some paths appear shorter than others. Figure 3-8 illustrates a ragged product hierarchy, where a product (POMServer) does not belong to a subcategory. This product is effectively a subcategory all of its own.

Figure 3-8
Ragged product
hierarchy

Try to balance slightly ragged hierarchies by removing levels or filling in missing values.

You can model ragged hierarchies in a dimension by using non-mandatory attributes for the missing levels, but these gaps (missing values) cause problems for BI drilling. If a hierarchy is only slightly ragged you can often redesign it with the stakeholders’ help as a balanced hierarchy, to improve reporting functionality. This can involve removing levels that are not consistently implemented for all members or creating new level values to fill in the gaps (e.g. a subcategory value of “Server” for the Figure 3-8 example). See Chapter 6 for more details on balancing ragged hierarchies.
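Gap filling of this kind can be sketched as a simple ETL transformation. The example mirrors the "Server" subcategory fix above; the category value "Computers" is an assumption:

```python
# Sketch: balancing a slightly ragged product hierarchy by filling the
# subcategory gap with a stakeholder-agreed value, falling back to the
# parent category (data values are illustrative assumptions).
NEW_SUBCATEGORIES = {"POMServer": "Server"}   # agreed gap-filling values

def balance(row):
    row = dict(row)
    if row.get("subcategory") is None:
        # use an explicit agreed value, else promote the parent category
        row["subcategory"] = NEW_SUBCATEGORIES.get(row["product"],
                                                   row["category"])
    return row

balanced = balance({"product": "POMServer", "subcategory": None,
                    "category": "Computers"})
```

After the fill, every product has a value at every level, so drill paths never hit a gap.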

Variable Depth Hierarchies


Variable depth hierarchies often represent recursive relationships.

Variable depth hierarchies have a variable (unknown) number of unnamed levels. The variable levels do not have unique names because they are typically all of the same type; for example, in human resource hierarchies that document the relationships between staff and managers, each level is an employee. Another example is the bill of material for a product comprised of components and subassemblies that are themselves decomposed into other components and subassemblies. Variable depth hierarchies are also known as recursive hierarchies because they are typically represented in source data by recursive relationships: tables that join to themselves.

Recursive relationships are used in operational database design for succinctly recording variable depth hierarchies but are very difficult to work with in a data warehouse. They are impractical for measuring business processes because they are impenetrable to stakeholders and BI tools alike. They must be “unfurled” for efficient reporting purposes using the hierarchy map technique described in Chapter 6.
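The idea of unfurling can be illustrated with a recursive SQL query that flattens a staff/manager table into explicit ancestor/descendant pairs. This is a sketch of the general technique, not the Chapter 6 hierarchy map design; the four employees are invented sample data:

```python
# Sketch: "unfurling" a recursive staff/manager relationship into explicit
# ancestor/descendant pairs with a recursive SQL CTE (sample data invented).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (id INTEGER PRIMARY KEY, manager_id INTEGER);
    INSERT INTO employee VALUES (1, NULL), (2, 1), (3, 1), (4, 2);
""")
rows = conn.execute("""
    WITH RECURSIVE map(ancestor, descendant, depth) AS (
        SELECT id, id, 0 FROM employee
        UNION ALL
        SELECT map.ancestor, e.id, map.depth + 1
        FROM map JOIN employee e ON e.manager_id = map.descendant
    )
    SELECT ancestor, descendant, depth FROM map
    ORDER BY ancestor, descendant
""").fetchall()
# Employee 1 now reaches employee 4 directly as the pair (1, 4, 2).
```

Once flattened, a BI query can roll events up to any manager with a plain join instead of traversing the recursion at query time.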

Multi-Parent Hierarchies
Multi-parent hierarchies contain members with two or more parents at the same level.

The time hierarchy in Figure 3-6 is a single parent hierarchy because each child level value rolls up to just one parent level value. In contrast, Figure 3-9 shows a multi-parent product hierarchy where a product (iPipPhone) belongs to more than one Product Type (it is part telephone, part media player). In a multi-parent hierarchy each child level can roll up to multiple parents. If a multi-parent product hierarchy is used to roll up sales to the Product Type level, something must be done to account for products that fall into multiple types. Their sales will need to be carefully allocated; otherwise revenue for products with two parents will be double-counted at the Product Type or Subcategory level.
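A common way to handle this (sketched below, ahead of the Chapter 6 hierarchy map treatment) is a bridge of child/parent pairs carrying allocation weights that sum to 1 per child. The products, types, and weights here are illustrative assumptions:

```python
# Sketch: weighted allocation across a multi-parent hierarchy so revenue
# is not double-counted. Each child's weights must sum to 1.
BRIDGE = [  # (child, parent, allocation weight)
    ("iPipPhone", "Telephone", 0.5),
    ("iPipPhone", "Media Player", 0.5),
    ("POMBook", "Computer", 1.0),
]

def allocate(product_sales):
    """Roll product revenue up to product type using allocation weights."""
    totals = {}
    for product, revenue in product_sales.items():
        for child, parent, weight in BRIDGE:
            if child == product:
                totals[parent] = totals.get(parent, 0.0) + revenue * weight
    return totals

totals = allocate({"iPipPhone": 100.0, "POMBook": 80.0})
```

Because the weights sum to 1 per product, the grand total at the Product Type level still equals the total at the product level.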

Figure 3-9
Multi-valued Product
hierarchy

Multi-parent, variable-depth hierarchies represent M:M recursive relationships.

Multi-parent hierarchies can also be ragged or variable depth. The latter are typically represented in source systems by M:M recursive relationships. Multi-parent hierarchies and variable depth hierarchies cannot be modeled directly in dimension tables. Chapter 6 covers additional structures (hierarchy maps) for coping with these complex hierarchies and handling fact allocation at query time across multiple parents. For the remainder of this chapter, assume hierarchies to be single parent hierarchies that are modeled within dimension tables.

A dimension can contain multiple hierarchies of different types. You should model
at least one balanced hierarchy for each dimension to help discover additional
attributes and common levels for comparisons across processes, and to enable
default BI drill-down facilities.

Hierarchy Charts
Hierarchy charts are a quick way to visualize hierarchies.

Hierarchy charts are simple, quick to draw diagrams used to model single or multiple hierarchies. On a hierarchy chart a dimensional hierarchy is represented by a vertical bar with the dimension name at the bottom and the highest-level attribute of the hierarchy at the top. The levels are represented as marks on the bar, in ascending order. Figure 3-10 shows three example hierarchy charts for Time and Product.

Hierarchy charts are based on Multidimensional Domain Structures (MDS) described in Microsoft OLAP Solutions, Erik Thomsen et al., Wiley, 1999.

Figure 3-10a, b, c
Hierarchy charts for
Time and Product

Levels can be spaced evenly or relative to the aggregation they provide.

When you draw a hierarchy chart you can space out the level tics evenly, as in Figure 3-10a and 3-10c, or in rough approximation of their relative aggregation, as in Figure 3-10b where levels that expose more details are placed further below their parent than levels that reveal fewer details. Relative spacing gives stakeholders a visual clue as to how much more detail they can expect to drill down to at each level, or how selective filters placed at various levels would be. You can also annotate levels with their approximate cardinalities, as in Figure 3-10a. Large gaps or jumps in cardinality on a hierarchy chart can prompt stakeholders for missing levels that would give them ‘finer grain’ drill-down and even more interesting descriptions.

Hierarchy charts can show single or multiple hierarchies.

In addition to providing a visual comparison of levels within a single hierarchy, a hierarchy chart can also be used to compare multiple hierarchies for a single dimension, as in Figure 3-10b, or all the dimensions associated with an event, as in Figure 3-11.

Figure 3-11
CUSTOMER
ORDERS
hierarchy chart

Modeling Hierarchy Types


Hierarchy charts can be annotated to model ragged, multi-parent and variable depth hierarchies.

Ideally, hierarchy charts are used to discover the attribute levels of simple balanced hierarchies, but they are capable of modeling ragged, multi-parent, and variable depth hierarchies when necessary. Figure 3-12a shows a hierarchy chart for the ragged product hierarchy of Figure 3-8. The missing/optional level is enclosed in brackets. In Figure 3-12b the multi-parent hierarchy (matching Figure 3-9) is denoted with a double bar between the product child level and its multiple product type parents. In Figure 3-12c the variable depth of a human resources (HR) hierarchy caused by the recursive relationship (see Chapter 6) between managers and employees is modeled by adding a circular path between the two levels. These annotations can be combined to model the most complex hierarchies.

Figure 3-12a, b, c
Ragged,
multi-parent and
variable depth
hierarchy charts

Modeling Queries
Event hierarchy charts can model the dimensionality of queries, OLAP cubes and aggregates.

Event hierarchy charts which combine multiple dimension hierarchy charts for the same event can be used to model query definitions for report and dashboard design. One or more queries can be defined on an event hierarchy chart as lines connecting the referenced levels, as shown in Figure 3-13 where X marks levels that are used to filter the query (WHERE clause), and O marks those used to aggregate it (GROUP BY clause). In this way, event hierarchy charts can also be used to model the dimensionality of OLAP cubes and aggregate fact tables.

Figure 3-13
Query definition
hierarchy chart
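The X/WHERE and O/GROUP BY mapping can be sketched mechanically. The table, column, and measure names below are invented for illustration, not taken from Figure 3-13:

```python
# Sketch: turning a query definition hierarchy chart into SQL. X-marked
# levels become WHERE filters, O-marked levels become GROUP BY columns.
# Table, column, and measure names are invented assumptions.
def build_query(fact_table, measure, x_filters, o_levels):
    # filter values would be bound as parameters at execution time
    where = " AND ".join(f"{col} = :{col}" for col in x_filters)
    group = ", ".join(o_levels)
    return (f"SELECT {group}, SUM({measure}) AS total "
            f"FROM {fact_table} WHERE {where} GROUP BY {group}")

sql = build_query("customer_orders", "revenue",
                  x_filters={"year": 2012, "country": "US"},
                  o_levels=["quarter", "product_category"])
```

Reading a query off the chart this way makes it easy to see which aggregate tables or cube dimensionality would satisfy it.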

Event hierarchy charts can be used while modeling events to help capture their
major dimensional attributes. By drawing a hierarchy chart above an event table
you can record detail about detail (dimensional attributes) in hierarchical order at
the same time that you model event details (dimensions and facts).

Discovering Hierarchical Attributes and Levels


You discover hierarchies by asking stakeholders how they organize the members of
a dimension. You could begin by drawing a hierarchy chart with PRODUCT at the
bottom and asking:

How do you organize products (hierarchically)?

It’s a leading question if you add “hierarchically”, but stakeholders generally have a
good idea about the hierarchies they need, and will usually offer you candidate
levels in hierarchical order—which helps. If any are new attributes check that they
belong in the dimension (have a M:1 relationship with it) and ask for examples
before considering them as candidates.

Introduce stakeholders to hierarchy charts by drawing a simple version of the time hierarchy (Figure 3-10) on the whiteboard. Try using relative spacing for this well known hierarchy to get stakeholders thinking generally about additional useful levels that fill large hierarchical gaps. At some point, you will want to use this chart to model any custom calendar levels and design an explicit CALENDAR dimension (see Chapter 7).

Ask stakeholders to add their new levels to the hierarchy chart

Ask stakeholders where they think a candidate attribute sits on the hierarchy chart bar and add it to your diagram where they suggest, or better still get them to add it. Figure 3-14 shows SUBCATEGORY added to a PRODUCT hierarchy chart between BRAND and CATEGORY.

Figure 3-14
Adding
SUBCATEGORY to
the PRODUCT
hierarchy at the
correct level

Modeling Business Dimensions 81

Check that each candidate level has the correct parent child relationships

As each new candidate is added, you need to check that it is in the right position relative to the existing levels. It must have a M:1 relationship with its parent (if present) and a 1:M relationship with its child. If you or any stakeholders are unsure, you can check the relationship by methodically asking the following “moment-in-time” questions and temporarily marking up the hierarchy chart with the cardinality results. Starting with the parent relationship, ask:

Can a [Candidate] belong to more than one [Parent] (at one moment in time)?

If the answer is YES put “M” just below the parent.


If the answer is NO put a “1” just below the parent.

Can a [Parent] have more than one [Candidate]?

If the answer is YES put “M” just above the candidate.


If the answer is NO put “1” just above the candidate.

For example, to test that SUBCATEGORY belongs below CATEGORY ask:

Can a SUBCATEGORY belong to more than one CATEGORY?

then:

Can a CATEGORY have more than one SUBCATEGORY?

If a new level is M:1 with its parent, check that it is 1:M with its child

If you finish with “M” above the candidate (SUBCATEGORY) and “1” below the parent (CATEGORY) (NO, YES answers) as in Figure 3-14 then you have the M:1 relationship you are looking for and the candidate belongs in the hierarchy below the parent. If the child below the new level is the dimension itself then the candidate is in the correct position (you already know that the new level has a 1:M relationship with the dimension). Otherwise you test that the child relationship is 1:M by asking a few more quick fire questions while pointing at the hierarchy chart (pointing always helps):

Can a [Candidate] have more than one [child]?

If the answer is YES then put “M” just above the child.
If the answer is NO put “1” just above the child.

Can a [child] belong to more than one [Candidate]?

If the answer is YES put “M” just below the candidate.


If the answer is NO put a “1” just below the candidate.

To test that SUBCATEGORY belongs above BRAND ask:

Can a SUBCATEGORY have more than one BRAND?

then:

Can a BRAND belong to more than one SUBCATEGORY?

If you finish with “M” above the child (BRAND) and “1” below the candidate
(SUBCATEGORY) (YES, NO answers) you have the right M:1 child relationship,
the candidate is in the correct position and you can move on to the search for
another level in the hierarchy.
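If profiled sample data is available, the same moment-in-time questions can be checked mechanically. This sketch classifies an observed child-parent relationship from a set of (child, parent) pairs; the data and function are illustrative assumptions, not BEAM✲ tooling:

```python
# Sketch: answering the cardinality questions from profiled sample data.
# pairs: (child_value, parent_value) tuples observed at one moment in time.

def classify(pairs):
    """Return 'M:1', '1:M', '1:1' or 'M:M' from child to parent."""
    parents_per_child = {}
    children_per_parent = {}
    for child, parent in set(pairs):
        parents_per_child.setdefault(child, set()).add(parent)
        children_per_parent.setdefault(parent, set()).add(child)
    many_parents = any(len(p) > 1 for p in parents_per_child.values())
    many_children = any(len(c) > 1 for c in children_per_parent.values())
    if many_parents and many_children:
        return "M:M"   # levels belong in separate parallel hierarchies
    if many_parents:
        return "1:M"   # reverse relationship: candidate is at the wrong level
    if many_children:
        return "M:1"   # what you want: candidate sits below the parent
    return "1:1"       # same level: candidate adds no drill-down step

# Subcategories observed rolling up to one category each:
print(classify([("Laptops", "Computers"), ("Tablets", "Computers")]))
```

The four return values correspond directly to the four outcomes discussed in the following sections.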

Hierarchy Attributes at the Same Level


A 1:1 relationship means that attributes are at the same level

If you have two “1”s between levels this indicates a one-to-one (1:1) relationship which means that the candidate is at the same level as the existing hierarchical attribute. As such it adds no additional drill-down functionality to the hierarchy, and you should not insert it. It may, however, replace the existing attribute, if it proves to be a better report label. For example, CATEGORY NAME is at the same level as CATEGORY CODE, but would be a more useful descriptive label on a drill-down report than the cryptic code. Remember that even if an attribute is not added to a hierarchy it will still be available in the dimension as an alternative for report formatting, so excluding CATEGORY CODE from the hierarchy does not preclude its use in custom drill-down reporting, just its use in default drilling.

Hierarchy Attributes that Don’t Belong


M:M indicates that attributes do not belong in the same hierarchy

If you get two “M”s between levels this indicates a M:M relationship which means that the two levels do not belong in the same (balanced) hierarchy. If stakeholders want to define a hierarchy containing the candidate it must be a separate parallel hierarchy to the current one, just as Week and Month are in different time hierarchies in Figure 3-10b.

Hierarchy Attributes at the Wrong Level


Reverse relationships mean that attributes need repositioning on the hierarchy chart

If you finish with the reverse relationship from the one you are looking for, the candidate is at the wrong level. If you get a 1:M with a parent (you’re looking for an M:1), the candidate is too low and you should move it up a level. If you get an M:1 with a child (you’re looking for a 1:M), the candidate is too high and you should move it down a level. After you move it, be sure to retest it against its new parent and child levels.

Follow each hierarchy level name with a few example values as a bracketed list as
in the Figure 3-14 chart. This is especially useful if you are modelstorming a
hierarchy chart with limited whiteboard space and stakeholders cannot see a copy
of the dimension example data table.

Completing a Hierarchy
Check for “hot” planning levels before finishing each hierarchy

After you have found the correct position for the new level—or discarded it if it does not belong—you continue to find more levels by pointing at the existing ones and asking stakeholders whether any other levels exist above or below them. When they have finished providing new levels you should ask one more hierarchy related question:

Do you have plans, budgets, forecasts, or targets associated with [dimension]s? If so at what level(s) are they set?

Mark hot levels with an * and be prepared to model their additional matching events and rollup dimensions

If stakeholders identify planning levels, mark these with an asterisk (*) to denote that they are “hot”, i.e., likely to be particularly important levels for many BI comparisons. You may need to design aggregates or OLAP cubes at these levels to improve query performance (see Chapter 8). You should definitely model the planning events themselves along with any additional hot level rollup dimensions they require. E.g., Month is typically a hot level in the when hierarchy that is implemented as a rollup dimension (a separate physical dimension derived from the base calendar dimension) to match the granularity of plan and aggregate facts. (See Chapters 4 and 8.)

Hot levels exist where different W-type hierarchies intersect

Hot levels often appear at the points where different W-type hierarchies logically or physically intersect. For example, Category and Department are hot levels in the Product and Employee hierarchies (as denoted in Figure 3-12) because this is notionally where a what hierarchy of things (products) intersects with a who hierarchy of people (employees). At that point, a de facto 1:1 relationship exists between the HR and product hierarchies: a single employee (a product sales manager) responsible for a single group of employees (a department) is also responsible for a single group of products (a category). He or she will want many reports summarized to these levels.

Check all levels are mandatory to avoid ragged hierarchies

When you have finished modeling a hierarchy, check that each level is a mandatory attribute. If some are not mandatory then you may have a ragged hierarchy instead of your preferred balanced one. If data profiling confirms that certain level attributes contain nulls, then update the hierarchy chart to document the missing levels by putting brackets around their names (as in Figure 3-12a) prior to resolving the issue with stakeholders. It is especially important that hot levels are mandatory for successful cross-process analysis.

When you have completed all the hierarchy charts for a dimension, rearrange the level attributes in hierarchy order in the dimension table with the low level attributes first followed by higher level attributes (reading left to right). Hierarchical column order increases readability and helps to roughly document the hierarchical relationships within physical dimension tables.

Dimensional History
You must define how dimensional attributes handle history

When you have discovered all the attributes of a dimension, described them using examples, and modeled their hierarchical relationships, there is one more piece of information that stakeholders must tell you about each one: how to handle its history. This information is also known as a dimension’s temporal properties or slowly changing dimension (SCD) policy.

Slowly changing dimensions dramatically affect how historical events are reported

Stakeholders instinctively feel the need to preserve event history—especially legally binding financial transactions—but may think of dimensions as (relatively static) reference data that simply has to be kept up to date when it does change. While it is true that dimensions are relatively static compared to dynamic business events, slowly changing dimension history, or rather the lack of it, can have a profound effect on a data warehouse’s ability to accurately compare events over time and meet stakeholders’ initial and longer-term needs. For example, Dean Moriarty, a customer who was based in New York, relocates to California at the beginning of this year. Should a BI query for “order totals by state, this year vs. last year” associate all of Moriarty’s orders to his current location: California? Or should Moriarty’s orders last year be associated with New York (last year’s location), and only this year’s with California? What if BI users want to look at the biggest spenders in California over the last two years: should their queries include Moriarty based on his high spending while he was still in New York last year, or exclude him because he hasn’t spent so much since moving to L.A.? Another way of asking these questions is “Should queries use the current or historical values of customer state?” Is there a simple answer?

Current Value Attributes


Operational systems generally default to current valued reference data

Operational systems generally concentrate on the “here and now” with an understandable bias towards current values for reference or master data. For example, an order processing system will make sure that only “this year’s model” products are available to be selected for shipping to customers’ current locations. Because of this bias, operational database applications will often overwrite reference data when it changes. If the same updates are applied to dimensional attributes they will contain current values only.

Current value (CV) attributes provide “as is” reporting that matches operational reporting results

Dimensional attributes that only contain the current value descriptions provide “as is” reporting; i.e., they roll up event history using today’s descriptions, making it seem as if everything has always been described as it is now. For the previous example, current value (CV) attributes would roll up all of Moriarty’s orders to California (his current location) regardless of where, on the road, he was living when he placed them. This is typically the style of reporting that stakeholders are used to from their attempts to analyze history directly from operational systems.

CV attributes enable the data warehouse to produce the same results as existing
operational reports—often an initial acceptance criteria for stakeholders.

With current values only, dimensional history is lost and many potentially important BI questions cannot be answered correctly

Unfortunately problems arise when current values are the only descriptions available for DW/BI systems, which, by definition, must support accurate historical comparisons. CV attributes may be capable of answering questions such as: “Where are the customers, now, who bought … in the last three years?” or “What are the top selling products this year vs. last year?” But they cannot answer: “Where were those customers and what were they like when they bought our products?” or “Exactly what were products like (how were they described and categorized) at the time they were purchased?” These questions cannot be answered because dimensional history is lost when CV attributes are updated (overwritten).

Current value only designs make it impossible to reproduce reports when dimensions change

Another limitation of current value only solutions is that they cannot reproduce previous historical analyses. The same report with exactly the same filters will often yield different results when run later—even though every detailed event remains unaltered—because the reference data used to group, sort and filter the events has changed when it should not have. This is a common bane of reporting from operational sources that stakeholders do not want repeated in the data warehouse.

Current value attributes can usefully recast history

It’s not all bad news for CV attributes. Even though they are historically incomplete/incorrect, they can be useful for certain types of historical comparisons that recast history: deliberately pretend everything was described as it is now. For example, a sales manager who wants to compare channel sales for this year versus last year may need to pretend that today’s channel structures also existed last year, in order to make the comparison. This is exactly what CV channel description attributes will do.

Corrections and Fixed Value Attributes


Current values are historically correct for fixed value (FV) attributes that do not change over time but can be corrected

Current values are also entirely appropriate when mistakes are corrected and previous erroneous values should never be used again. For example, when a customer or employee’s date of birth changes in an operational database, we know it hasn’t really changed (from one corrected date to another); it must be a correction. We assume that the most recent (current) value is correct as someone has gone to the trouble of making the update. Date of Birth is an example of a fixed value (FV) attribute that cannot change but can be corrected (fixed when it’s not right). FV attributes have a strict 1:M relationship over time with the dimension. All other attributes have potentially a M:M relationship over time with the dimension, which is ignored (treated as M:1) in the case of CV attributes.

Historic Value Attributes


Historical value (HV) attributes support “as was” reporting

Historic value (HV) attributes support “as was” reporting by providing the historically correct dimensional values to group and filter historical events and measures. Everything is reported “as it was” at the time of the event. Returning to the Moriarty example, an HV customer STATE attribute would roll last year’s orders up to New York, and this year’s up to California.
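To make the contrast concrete, here is a small sketch of the Moriarty rollups; the order data is invented to match the narrative:

```python
# Sketch: the same orders rolled up "as was" (HV: state at order time)
# versus "as is" (CV: current state only). All data is invented.

orders = [  # (customer, year, state_at_order_time, amount)
    ("Moriarty", 2010, "NY", 500),
    ("Moriarty", 2011, "CA", 120),
]
current_state = {"Moriarty": "CA"}  # CV attribute: overwritten on relocation

def totals_as_was(orders):
    out = {}
    for cust, year, state, amount in orders:
        out[state] = out.get(state, 0) + amount   # HV: state at event time
    return out

def totals_as_is(orders):
    out = {}
    for cust, year, state, amount in orders:
        now = current_state[cust]                  # CV: today's state
        out[now] = out.get(now, 0) + amount
    return out

print(totals_as_was(orders))  # {'NY': 500, 'CA': 120}
print(totals_as_is(orders))   # {'CA': 620}
```

With only a CV state column, the first rollup is unobtainable; with an HV column, both are.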

HV attributes require more ETL resources, but provide more flexible reporting

Preserving dimensional history requires more ETL work but data warehouses that are built using HV dimensions are more flexible. Not only can they correctly answer the “What were things really like when …?” questions by default, they can also be used with minimal effort to recast history to current values for “as is” reporting, or to a specific date, such as a financial year-end, for “as at” reporting. HV dimension techniques for supporting both “as is” and “as was” reporting are covered in Chapter 6.

Historic value dimensional attributes support the agile principle “Welcome chang-
ing requirements, even late in development.” Stakeholders are able to change
their mind about using current or historic values without ETL developers “tearing
their hair out” and having to reload the data warehouse.

Telling Change Stories


You discover temporal properties by telling change stories

To discover exactly how a dimension handles descriptive history, you add an extra row to the dimension table to hold a change story for each attribute and ask the stakeholders to help you fill it out. For each attribute, you start by asking if it can change; for example:

Can the PRODUCT NAME of a PRODUCT change?

If the answer is NO, label the attribute as fixed value with the short code FV and
copy its example value from the first row (the typical member) into its change story
as in Figure 3-15 to illustrate that it is unchanging over time. Then move on to the
next attribute.

If the answer is YES, the attribute’s values are not fixed and you need to ask a
follow-on question to discover if stakeholders want/need (not always the same
thing) historical values. For example, if PRODUCT TYPE is not FV ask:

If the PRODUCT TYPE of a PRODUCT changes will you need its historic values for grouping and filtering your reports?

Ask about historic values: you know current values are important already

Make sure you never ask: “Do you want current values or historic values?” The answer, which is invariably “current values”, tells you nothing you shouldn’t already know. Of course stakeholders want current values—that’s a given: they are incredibly interested in current business events and want those events to be described properly using current values just as they are in the operational systems. You want to discover if attribute history is equally important. It is also highly misleading to present historic values and current values as an either/or choice: HV attributes must include current values because current values are the historically correct descriptions for the most recent events.

Figure 3-15
Modeling
dimensional history
using a change
story

Double-check that history really is unnecessary before defining a CV attribute

If the answer to your “Do you need historic values?” question is NO, double-check with the stakeholders that they will only ever care about current values and are fully aware of the BI limitations they are settling for. The problems of misstated history and unrepeatable reports caused by CV only attributes are best explained using examples (see Documenting CV Change Stories shortly) to any stakeholder making this decision for the first time.

You may need to provide CV reporting behavior initially but store HV attributes to provide flexibility in the future

For many attributes it is a good idea to treat the stakeholders’ CV answers as reporting directives rather than storage directives. They, quite rightly, are telling you how they want their (initial) BI reports to behave: often exactly like existing operational reports they know (and love), but not how to store information in the data warehouse—that’s not their role. For all but the largest dimensions and most volatile attributes it is possible to efficiently store (compliance ready) HV attributes but provide CV versions by default for reporting purposes.

Documenting CV Change Stories


For CV attributes: copy their typical examples into the change story or better still (time permitting) ask for new values to fill out the change story row and then replace the typical example values

If stakeholders confirm that an attribute’s history is never needed—perhaps they want an explicitly current value attribute such as CURRENT STATUS or LIFETIME VALUE—you label the attribute as CV. To fill out its change story, you can either copy the attribute’s typical example (as you would if it was FV), for speed, or ask stakeholders for a changed value, for clarity. If you do the latter you must also change the typical example on the first row to this new value. Both the typical example and the change story of a CV attribute contain the same value, the new current value, to “show, don’t tell” that there is no history. Figure 3-16 illustrates this for CUSTOMER NAME: if customer Elvis Priestley changes his name to J.B. Priestley both the first and the last rows in the BEAM✲ dimension are updated to J.B.; Elvis has left the building! The graphic demonstration of existing examples being overwritten, during modelstorming, hammers home the point that history is lost and may cause stakeholders to reconsider some of their CV attribute decisions.

Documenting HV Change Stories


For HV attributes ask for new examples for their change stories but leave their typical examples unaltered

If stakeholders say they do want the historical values of an attribute such as PRODUCT TYPE, label the attribute as HV, and ask for a new example value to place in the change story, as shown in Figure 3-15, but don’t update the original typical example. Leaving different values in the first and last rows clearly documents the attributes that can change and will hold different historical values over time for the same dimension member: product code IPPBS16G in the Figure 3-15 example.

You can document hybrid temporal requirements by combining HV and CV short codes

If you, or the stakeholders, decide that both HV and CV data is needed for an attribute you can label it as an HV/CV or CV/HV hybrid attribute, with its default reporting behavior listed first. In Figure 3-15, SUBCATEGORY and CATEGORY default to HV but CV reporting will also be available. For both attributes their change stories reflect the more complex HV behavior.

Unless you are using specialist temporal database technology, the CV and HV
values of a hybrid attribute will need to be stored as separate physical columns in
the same dimension or in separate hot swappable dimensions. See Chapter 6 for
the hybrid slowly changing dimension design pattern to implement this.
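One common physical layout for a hybrid attribute keeps an HV column, fixed per version row, alongside a CV column that is overwritten on every version row of the member. The sketch below assumes that two-column layout for illustration; Chapter 6 describes the actual design pattern:

```python
# Sketch: a hybrid HV/CV attribute as two physical columns. Each row is
# one historic version of a member; subcategory_hv keeps the value in
# force for that version, subcategory_cv is overwritten everywhere so
# "as is" grouping needs no extra work. Data and layout are invented.

def overwrite_cv(rows, key, cv_col, new_value):
    """Type 1 overwrite of the CV column across all versions of a member."""
    for row in rows:
        if row["product_code"] == key:
            row[cv_col] = new_value
    return rows

versions = [
    {"product_code": "IPPBS16G", "subcategory_hv": "Handhelds",
     "subcategory_cv": "Handhelds", "current": False},
    {"product_code": "IPPBS16G", "subcategory_hv": "Tablets",
     "subcategory_cv": "Handhelds", "current": True},
]
# Member moves to the Tablets subcategory: CV recast, HV history intact.
overwrite_cv(versions, "IPPBS16G", "subcategory_cv", "Tablets")
```

"As was" reports group by the HV column; "as is" reports group by the CV column on the very same rows.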

Use CV/PV to document requirements for previous values

For certain attributes that change very infrequently, stakeholders may be content with the current value and one previous value: the value before the last change. These attributes can be labeled CV/PV in the BEAM✲ model and implemented as separate columns in a physical star schema. This design pattern is known as a type 3 slowly changing dimension.

FV : Fixed Value or type 0 slowly changing dimension.
CV : Current Value only or type 1 slowly changing dimension.
HV : Historic Value or type 2 slowly changing dimension.
PV : Previous Value or type 3 slowly changing dimension.
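The four short codes map to four different physical update behaviors. A rough sketch of those behaviors (the list-of-version-rows layout and the "prior" column are illustrative assumptions, not star schema design):

```python
# Sketch: temporal short codes as update behaviors on one attribute.
# versions: list of row dicts for one member, last row is current.

def update_attribute(versions, code, new_value):
    current = versions[-1]
    if code in ("FV", "CV"):          # types 0/1: overwrite, keep no history
        current["value"] = new_value
    elif code == "PV":                # type 3: remember exactly one prior value
        current["prior"] = current["value"]
        current["value"] = new_value
    elif code == "HV":                # type 2: new version row preserves history
        versions.append({"value": new_value, "prior": None})
    return versions

rows = [{"value": "Handheld", "prior": None}]
update_attribute(rows, "HV", "Tablet")   # two rows now: history preserved
update_attribute(rows, "PV", "Slate")    # current row remembers "Tablet"
```

FV and CV apply the same overwrite; the difference is intent (FV updates are always corrections).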

Business Keys and Change Stories


Business keys (BK) should be FV by default

As a general rule, business keys that uniquely identify dimension members and define dimensional granularity should not change over time; they should be defined as fixed value (FV) and you should simply copy their typical values into the change story. If this is not the case and business keys have been known to change you will need to model their CV or HV rules and help design more complex ETL processes to identify these awkward changes and handle them appropriately when they occur.

Detecting Corrections: Group Change Rules


HV ETL processes must differentiate corrections from changes

One of the ETL complexities of handling HV updates is how to distinguish between corrections that should overwrite errors and genuine changes that should preserve history. This is a non-issue for FV and CV attributes because they handle correction and change alike (FV attribute updates are always corrections). Being able to tell the difference between corrections, minor changes and major changes is especially important when designing the ETL for large, highly volatile dimensions such as customer because tracking every change may not even be possible, let alone necessary.

Source systems rarely provide update reasons to help detect corrections

Ideally, source systems should provide reason codes for the most important HV attribute updates. Unfortunately, explicit reason descriptions are not often available. One of the many benefits of proactive dimensional modeling is that ETL designers can take advantage of preemptive HV definitions to request that not only update notification but update reason notification is built into a new operational system while it is still in the early stages of development.

Group change rules can help detect corrections and minor changes

If update reasons are not available, the next best thing is to define business rules that identify important changes based on groups of attributes that should all change at the same time. An example group change rule, which tracks only “large” changes affecting several attributes at once, might be:

“If customer STREET Address changes but ZIP CODE is unchanged, then handle
the update as a correction (or minor move in the same Zip code area): do not
preserve the existing address. If STREET and ZIP CODE change together, track
the customer’s address history prior to this major relocation”

To discover these rules ask for attributes that change together

You can discover these rules by asking general questions like “What attributes change together?” or specific questions for each attribute such as “What other attributes must change when this changes?”. You can also tie your questions to some activity; you might ask:

When customers really move—rather than just correct their address—which attributes should change?

Asking for change groups can help find missed attributes

Questions like this not only expose change dependencies between existing attributes, they can help uncover missed attributes too. They are another one of the BEAM✲ modeler’s secret weapons for attribute discovery. Discussing the activity of change is another way of adding narrative to an otherwise static dimension and will get stakeholders thinking. They might respond:

STREET, ZIP CODE and CITY should all change together. If only one or two of those change, it’s probably a correction or a move within the same zip or city. If customers move locally—in the same city—we don’t need their old addresses. But if they move city, we will want to use those previous addresses.

Use HVn to define a conditional HV group of attributes that must change together

You can model a group change rule like this one, very concisely, by using numbered HV codes to define conditional HV groups: attributes that only act as HV when every member of the same numbered group changes at the same time. In Figure 3-16, the stakeholders’ rule has been documented by marking STREET, ZIP CODE, and CITY as CV, HV1. They are each CV by default (the first temporal short code in the list) so that individual changes will be treated as corrections. Additionally, they are all members of the conditional group HV1 and will act as HV to preserve address history when a customer moves city (when all three attributes change) unless, that is, an exceptional customer manages to move to the very same street address in a different city. Perhaps ZIP CODE and CITY should be in a group of their own (HV3) to safeguard against missing this rare type of relocation.

Figure 3-16
Modeling a CV
change story and
group change rules
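An ETL change handler can evaluate a group change rule like the stakeholders' address rule by checking whether a whole numbered group changed together. The sketch below is illustrative only; the attribute names follow the example but the rule-checking code itself is invented:

```python
# Sketch: applying a conditional HV group rule. STREET, ZIP CODE and CITY
# are CV by default but act as HV only when every member of group HV1
# changes in the same update.

HV_GROUPS = {"HV1": {"street", "zip_code", "city"}}

def classify_update(old, new):
    """Compare old and new attribute values and decide how to handle them."""
    changed = {k for k in old if old[k] != new.get(k, old[k])}
    for group, members in HV_GROUPS.items():
        if members <= changed:          # the whole group changed together
            return "track history"      # genuine relocation: preserve rows
    return "correction"                 # partial change: overwrite in place

old = {"street": "1 Main St", "zip_code": "10001", "city": "New York"}
print(classify_update(old, {**old, "street": "2 Main St"}))  # correction
print(classify_update(old, {"street": "9 Sunset Blvd",
                            "zip_code": "90210", "city": "Beverly Hills"}))
```

Extending HV_GROUPS with further numbered groups (e.g., an HV2 set that adds COUNTRY) handles the layered rules described next.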

An attribute can belong to multiple HVn groups and be HV by default

Notice that the three HV1 attributes in Figure 3-16 are also in group HV2 along with COUNTRY. This means that their history is tracked when all three (group HV1) change even if COUNTRY does not change, but COUNTRY itself will only be tracked when all four address attributes (group HV2) change. If an attribute is always HV but also triggers a conditional HV attribute: HV3, it would be marked HV, HV3 rather than CV, HV3.

Effective Dating
Add administrative attributes to each dimension for effective dating

When you have captured change stories and temporal business rules for each attribute, add three more attributes to the dimension table: EFFECTIVE DATE, END DATE, and CURRENT, as in Figure 3-17. These additional administrative attributes enable ETL processes to track changes and flag the current version of each member. They effectively convert HV dimensions into minor event tables capable of recording numerous small events.

Figure 3-17
Effective dating a
dimension table
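A sketch of how the three administrative attributes behave when an HV attribute changes: the current version row is end-dated and un-flagged, and a new current version is inserted. Column names follow Figure 3-17, but the row layout and data are invented for illustration:

```python
# Sketch: effective dating an HV change. Rows are plain dicts standing in
# for dimension table rows; EFFECTIVE DATE / END DATE / CURRENT become
# effective_date / end_date / current.
from datetime import date

def apply_hv_change(rows, business_key, changes, change_date):
    """Expire the current version of business_key and insert a new one."""
    for row in rows:
        if row["product_code"] == business_key and row["current"]:
            row["end_date"] = change_date          # expire old version
            row["current"] = False
            new_row = {**row, **changes,           # new current version
                       "effective_date": change_date,
                       "end_date": None, "current": True}
            rows.append(new_row)
            return rows
    raise KeyError(business_key)

dim = [{"product_code": "IPPBS16G", "product_type": "Handheld",
        "effective_date": date(2010, 1, 1), "end_date": None, "current": True}]
apply_hv_change(dim, "IPPBS16G", {"product_type": "Tablet"}, date(2011, 6, 1))
```

Each version row's effective and end dates bracket the events it correctly describes, which is what makes "as was" and "as at" joins possible.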

Change stories demonstrate type 2 slowly changing dimension behavior but don’t use this ETL jargon with stakeholders

With the addition of effective dating, readers who are familiar with how type 2 slowly changing dimensions are implemented will notice how closely the change story row in a BEAM✲ dimension matches this ETL technique. This is intentional, as BEAM✲ models are designed to be translated into physical dimensional models with minimal changes. It is also important that BEAM✲ modelers do not refer to HV attributes as type 2 SCDs or attempt to modelstorm the final piece of their puzzle: surrogate keys, with business stakeholders. Type n SCD terminology and surrogate keys (covered in Chapter 5) are appropriate star schema-level topics for discussion with ETL and BI developers, not stakeholders.

Documenting the Dimension Type


Mark dimensions that contain no HV attributes as [CV]

When you have completed a dimension (for now), add a double bar to the end of the table just as you would for an event, and add its dimension type to the table header. To do this you use one of the temporal short codes. If the dimension contains at least one HV attribute mark the dimension as [HV], otherwise mark it as [CV] to denote that it contains only CV, FV or PV attributes, i.e., its members will not include multiple historic versions.
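Deriving the dimension type from the attributes' temporal short codes is mechanical. A small sketch, where the dict of code lists is an assumed representation of a BEAM✲ dimension's markings:

```python
# Sketch: deriving the dimension-type header from attribute short codes.
# Any HV (or numbered HVn) code makes the whole dimension [HV].

def dimension_type(attributes):
    """attributes: {attribute_name: [temporal short codes]}."""
    codes = {code for code_list in attributes.values() for code in code_list}
    return "[HV]" if any(c.startswith("HV") for c in codes) else "[CV]"

product = {"product_code": ["FV"], "product_type": ["HV"],
           "subcategory": ["HV", "CV"]}
customer = {"customer_name": ["CV"], "date_of_birth": ["FV"]}
print(dimension_type(product))   # [HV]
print(dimension_type(customer))  # [CV]
```

The `startswith("HV")` test also picks up conditional group codes such as HV1 or HV2.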

Minor Events
Not every event is a significant business process

Occasionally, you will discover events that do not seem to have enough details or occur frequently enough to represent significant business processes in their own right; they seem more like dimensions. For example, imagine the answer to your next “Who does what?” event discovery question is:

Customer moves (to a new) address.

Minor events have few details. They often represent external activity

You model several event stories and end up with the CUSTOMER MOVES event table in Figure 3-18. This is a perfectly acceptable event, with a subject-verb-object main clause, containing a who subject (CUSTOMER), an active verb (“moves”), a where object (ADDRESS), and a when detail (MOVE DATE) but that’s all. Despite asking all the 7Ws questions, it lacks any other who, what, why, how, or how many details. Why customers move, how much it costs them, or who helps them are unknowns because the event is external to Pomegranate’s business. In BEAM✲ terms CUSTOMER MOVES is a minor event (despite being quite a major event for the customer). Minor events represent activities that are not always interesting or detailed enough for standalone analysis. But the data values arising from them are important for correctly labeling, grouping, and filtering the other, far more interesting, major events of the organization.

Figure 3-18
Minor CUSTOMER
MOVE event

HV Attributes: Dimension-Only Minor Events


Minor events can be modeled as HV attributes if they occur infrequently

If the verb provided by the stakeholders can easily be replaced with “has” without losing important information, this often indicates that the subject and object can be attributes of the same dimension. For example, “customer moves to address on move date” can be replaced with “customer has address on effective date” if the act of moving, itself, is unimportant and all that stakeholders care about is a history of customer locations. A customer dimension can model simple “has” events as HV attributes.

If the subject and the object of an event both describe the same thing (e.g.,
customers) and there are no other details except when, you can handle the event
object as an HV attribute of the subject dimension, as long as the change repre-
sented by the event does not occur too often. Daily or monthly change would
make it a rapidly changing dimension—better handled as an event.
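The “customer has address” pattern above can be sketched in a few lines of Python. This is an illustrative sketch only (the field names and addresses are invented, and a real HV dimension would use surrogate keys and SQL): each move expires the current row and appends a new one, preserving the customer’s location history.

```python
from datetime import date

# Hypothetical customer dimension holding address as an HV (historic value)
# attribute: one row per address, with effective and expiry dates.
customer_dim = [
    {"customer_bk": "C1", "address": "12 Elm St",
     "effective": date(2010, 1, 1), "expiry": None},
]

def record_move(rows, customer_bk, new_address, move_date):
    """Treat 'customer moves to address' as 'customer has address':
    expire the current row and append a new current row."""
    for row in rows:
        if row["customer_bk"] == customer_bk and row["expiry"] is None:
            row["expiry"] = move_date  # close the old address
    rows.append({"customer_bk": customer_bk, "address": new_address,
                 "effective": move_date, "expiry": None})

record_move(customer_dim, "C1", "9 Oak Ave", date(2012, 6, 15))

# The dimension now holds the full location history for the customer...
history = [(r["address"], r["effective"], r["expiry"]) for r in customer_dim]
# ...and the current address is simply the unexpired row.
current = next(r["address"] for r in customer_dim if r["expiry"] is None)
```

The act of moving itself is never stored as an event; only the resulting address values survive, which is exactly the trade-off the text describes.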

Minor Events within Major Events


If minor events occur frequently they can be modeled as additional details of other, more interesting, major events.

If no explicit business process captures customer relocation, how does Pomegranate know that it occurs? It must be recorded as a byproduct of some other significant business event(s). Customers typically inform you about new addresses when they order products. If customers move infrequently, these changes can be captured by an HV dimension. If moves are frequent—for example, some (perhaps undesirable) consumers provide a new address every time they apply for credit—then the new address is a detail of the credit application event and “moving” is also part of that event—a minor event within a major event. For Orders, a new or different customer address may, in fact, be a third party delivery address for gift purchases: a separate where detail of the event and therefore a separate who/where dimension rather than an attribute of the customer who dimension.

If you discover a minor event with a small number of details (typically three Ws,
including when), ask how and when these details are captured. You may be able
to model their capture within a far more interesting major event.

One organization’s minor event may be another organization’s major event.

Figure 3-19 shows a very different version of the CUSTOMER MOVES event compared to the minor version of Figure 3-18. By reading the event stories in both tables you can see that these are actually the same events happening to the same people—but in Figure 3-19 they have been modeled in far greater detail for the data warehouse of a relocation company. This CUSTOMER MOVES is clearly a major event—for that company.

Figure 3-19
Major CUSTOMER
MOVES event

Sufficient Attributes?
How do you know when you have sufficient attributes for a dimension or levels in
a hierarchy? There is no magic test; stakeholders will simply run out of ideas. If you
feel you have not discovered every possible attribute: don’t worry, be agile, press
on. As long as you have the major hierarchies and HV attributes, and a clear
definition of granularity for each dimension, additional attributes can be added
with relative ease in future iterations. That said, the great benefit of modelstorming
with stakeholders is the ability to define common (conformed, see next chapter)
dimensional attributes early on, so don’t miss your opportunity while you have
their initial attention.

Summary
A dimensional data warehouse is only as good as its dimensions. Good dimensions contain
dimensional attributes that describe business events using terms that are familiar and
meaningful to the BI stakeholders. This is the best reason for asking stakeholders to modelstorm
the dimensions they need using examples to clarify the terms they use.

Dimensions themselves are discovered by event modeling; most who, what, when, where and
why event details become dimensions. Dimensional attributes are discovered by modeling each
of these details as the subject of its own BEAM✲ dimension table. How details can be dimensionalized in this way too, but they typically do not have additional descriptive attributes.
Non-descriptive how details become degenerate dimensions, stored in fact tables along with the
how many details which become facts.

The first additional attribute that must be modeled for a dimension subject is its identity. This is
the business key (BK) which uniquely identifies each dimension member and defines the
dimension granularity. If a dimension is created from multiple source systems there may be
more than one BK for a dimension member. The unique BK for a dimension can be a composite of multiple source system keys.

Further dimension attribute requirements are gathered by asking stakeholders to provide example descriptive data for each dimension (using the 7Ws as an attribute type checklist). Examples help to discover mandatory attributes (MD), and exclusive attribute combinations (Xn) together with their defining characteristics (DCn,n).

Hierarchy charts describe how stakeholders (want to) organize dimensional members
hierarchically to support drill-down analysis and plan vs. actual reporting. Drawing these charts
helps to prompt stakeholders for additional hierarchical attributes and data sources. Hot
hierarchy levels, that represent popular levels of summarization, help to identify additional
planning events, rollup dimensions and aggregation opportunities.

Hierarchies can be balanced, ragged or variable depth. Each type can be single or multi-parent.
Single parent, balanced hierarchies are the easiest hierarchies to implement dimensionally and
the simplest to work with for BI. Additional techniques are needed to balance ragged
hierarchies and represent multi-parent and variable depth hierarchies (See Chapter 6).

Change stories describe how dimensional history is handled. The short codes CV, HV, FV, PV
are used to document the temporal properties of each attribute. These temporal codes can be
numbered to define group change rules involving multiple attributes.

Minor events are events that occur infrequently and contain few details. They typically do not
represent significant business processes that warrant modeling as separate events. Often they
can be modeled as HV dimensional attributes, or as additional details of other major events
(recognizable business processes).
4

MODELING BUSINESS PROCESSES


The only reason for time is so that everything doesn't happen at once
— Albert Einstein

BI stakeholders need multiple events for process measurement.

Designing a data warehouse or data mart for business process measurement demands that you quickly move beyond modeling single business events. All but the simplest business processes are made up of multiple business events and BI stakeholders invariably want to do cross-process analysis. When you modelstorm these multi-event requirements you soon notice two crucial things:

Stakeholders model events chronologically. As you complete one event, stakeholders naturally think of related events that immediately follow or precede it. These event sequences represent business processes and value chains that need to be measured end-to-end.

Stakeholders describe different events using many of the same 7Ws. When you define an event in terms of its 7Ws, stakeholders start thinking of other events with the same details, especially events that share its subject or object. These shared details, known to dimensional modelers as conformed dimensions, are the basis for cross-process analysis.

The event matrix is an agile tool for modeling multiple events.

In this chapter we describe how an event matrix, the single most powerful BEAM✲ artifact, is used to storyboard the data warehouse: rapidly model multiple events, identify significant business processes and conformed dimensions, and prioritize their development.

Chapter 4 Topics At a Glance

The importance of conformed dimensions for agile DW design
Modelstorming event sequences with an event matrix
Prioritizing event and dimension development using Scrum
Modeling event stories with conformed dimensions and examples

Modeling Multiple Events With Agility


Deploying an agile data warehouse that will eventually handle the multiple business events required for enterprise BI is especially challenging. To meet the agile principles of early and frequent delivery of valuable working software, agile designers may be tempted to limit their modeling scope per release in terms of business processes and/or stakeholder departments. Unfortunately this can quickly lead to the silo data mart anti-pattern of Figure 4-1.

This can result in silo data marts that are unable to support cross-process analysis.

With a tightly-controlled initial scope, BI users can receive their agile data marts early and obtain valuable insight from them individually on a department by department basis. So far so good, but when users want to step up to cross-department, cross-process analysis they find they cannot make the necessary comparisons due to incompatible or missing descriptions and measures. Rebuilding each data mart from scratch is unthinkable so data is re-extracted from the source systems so that each department can look at it “their way”. The cost of this extra work and the inconsistent or conflicting answers that emanate from these “multiple versions of the truth” drive BI stakeholders crazy!

Figure 4-1
Silo data marts that
cannot be shared:
a data warehouse
anti-pattern

With too limited a scope, data warehouse design incurs heavy technical debt.

Silo data marts are examples of technical debt. Agile software development intentionally takes on technical debt when “just barely good enough” code is released. This makes good sense when the business value of early working software outweighs the interest on the debt: the extra effort involved in refactoring the code in future iterations. However, for DW/BI projects, the cost of servicing high-interest technical debt (refactoring terabytes of incorrectly represented historical data) can be ruinously high.

Traditional BDUF does not match agile BI requirements.

The especially high interest of database technical debt could be argued as a good reason for taking a traditional, non-agile approach to data requirements gathering and data warehouse design itself, postponing agile practices for ETL and BI development. But the problem is that this fallback to “big design upfront” (BDUF) simply does not match the evolutionary nature of modern BI requirements nor their delivery timescales. Plus it is incredibly hard for a DW/BI project to become agile when it does not start off agile.

Instead, agile data warehouse modelers should stay agile (and dimensional), but lower their technical debt by balancing “just in time” (JIT) detailed modeling of business events for the next development sprint with “just enough design up front” (JEDUF) for cross-process BI in the future. To do so, modelers need to rapidly model ahead in just enough detail to discover which of the dimensions needed for the next sprint should also be conformed dimensions that will help to future proof their designs for enterprise BI.

Conformed Dimensions
Conformed dimensions allow measures from different events to be combined and compared.

Figure 4-2 shows a Promotion Analysis Report that combines information from two events: CUSTOMER ORDERS and PRODUCT CAMPAIGNS to explore the connection between campaign activity and sales revenue. The report is possible because the two different events have identical descriptions of PRODUCT and PROMOTION. These conformed dimensions allow measures from both events to be aggregated to a compatible level and lined up next to one another on the report. Lining up the answers or drilling-across like this appears obvious but if the events are handled by different operational systems (an Oracle-based order processing application and a SQL Server-based customer relationship management system) then this report might be the first time that the two sets of data have actually met. If each source system describes products and promotions differently and the individual star schemas use these non-conformed descriptions, the analysis would not be possible because the measures would not align.

Figure 4-2
Conformed
dimensions
enable
cross-process
analysis
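The drill-across mechanics behind a report like this can be sketched in Python. All names and figures below are invented for illustration: each fact set is aggregated separately to the conformed PRODUCT level, and the two answer sets are then lined up on the shared row headers.

```python
from collections import defaultdict

# Two fact tables from different source systems, sharing a conformed
# PRODUCT dimension value (product names and figures are invented).
orders = [  # CUSTOMER ORDERS facts
    {"product": "POMpad", "revenue": 500},
    {"product": "POMpad", "revenue": 300},
    {"product": "POMphone", "revenue": 200},
]
campaigns = [  # PRODUCT CAMPAIGNS facts
    {"product": "POMpad", "cost": 120},
    {"product": "POMphone", "cost": 80},
]

def aggregate(facts, measure):
    """Aggregate one fact table to the conformed PRODUCT level."""
    totals = defaultdict(float)
    for row in facts:
        totals[row["product"]] += row[measure]
    return totals

# Drill across: query each star separately, then line up the answers
# on the conformed dimension's common row headers.
revenue = aggregate(orders, "revenue")
cost = aggregate(campaigns, "cost")
report = {p: {"revenue": revenue.get(p, 0.0), "campaign_cost": cost.get(p, 0.0)}
          for p in sorted(set(revenue) | set(cost))}
```

If the two systems spelled products differently (say “POM Pad” versus “POMpad”), the row headers would not match and the two measure columns could never be joined, which is exactly the failure the text describes.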

Conformed dimensions are shared by multiple distinct fact tables.

The simplest technical definition of a conformed dimension is a single physical dimension table shared by multiple fact tables, or exact replicated copies of a master dimension if fact tables are distributed on multiple database servers. Separate dimensions can also be conformed at the attribute level if they contain conformed attributes with identical business meanings and identical contents that line up as common report row headers. Three types of dimension are conformed at the attribute level:

Swappable, rollup and role-playing dimensions are conformed at the dimensional attribute level.

Swappable dimensions [SD] that are subsets of conformed dimensions. For example, a CUSTOMER dimension (1M people) and a subset EXTENDED WARRANTY CUSTOMER dimension (100K people) are conformed if they describe the same customer in exactly the same way. These two dimensions would allow product sales and extended warranty claims to be compared for all customers or just warranty holding customers. Swappable dimensions are covered in Chapter 6.

Rollup dimensions [RU] with conformed attributes in common with their base dimensions. Figure 4-3 shows an example of the conformed when dimensions CALENDAR and MONTH. These two dimensions can be used to compare daily and monthly granularity measures at the Month, Quarter, or Year level. Rollup dimensions are typically used to describe planning events and aggregate fact tables.

Role-playing dimensions [RP]: single physical dimensions used to play multiple logical roles. For example, a CALENDAR [RP] dimension used to play the role of ORDER DATE and PAYMENT DATE.

Figure 4-3
When dimensions,
conformed at the
attribute level
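Attribute-level conformance with a rollup dimension can be illustrated with a hypothetical sketch: daily-grain sales facts roll up to the same Month values used by a monthly-grain plan, so plan and actual line up on a common row header (data and names invented).

```python
from collections import defaultdict

# Daily-grain sales facts and monthly-grain plan facts (invented data).
daily_sales = [
    {"date": "2011-01-05", "sales": 100},
    {"date": "2011-01-20", "sales": 150},
    {"date": "2011-02-03", "sales": 90},
]
monthly_plan = [
    {"month": "2011-01", "target": 240},
    {"month": "2011-02", "target": 100},
]

def to_month(date_str):
    """The conformed attribute: a CALENDAR date rolls up to a Month value
    that is identical, character for character, in the MONTH dimension."""
    return date_str[:7]

# Aggregate the daily facts up to the shared Month level.
actual_by_month = defaultdict(int)
for row in daily_sales:
    actual_by_month[to_month(row["date"])] += row["sales"]

# Plan vs. actual lines up because both sides share the Month row header.
plan_vs_actual = {
    row["month"]: {"target": row["target"],
                   "actual": actual_by_month.get(row["month"], 0)}
    for row in monthly_plan
}
```

The comparison only works at Month level or above (Quarter, Year); the daily facts can rise to meet the plan, but the plan can never be split back down to days.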

Conformed measures rely on compatible facts with common units of measure.

While dimensions are frequently shared across many business processes, facts are typically specific to a single process or event. However, they can be used to create conformed measures if they have compatible calculation methods and common units of measure that allow totaling and comparison across processes; for example, if sales revenue and support revenue are both pre-tax dollar figures they can be combined to produce region totals.

Conforming data is not so much a technical challenge as a political one, requiring consensus on data definitions across many departments within an organization as well as operational systems. By modelstorming with stakeholders you highlight the value of conformed dimensions to the very people who can make them happen. Modeling multiple events by example, as BEAM✲ encourages you to do, quickly reveals inconsistencies that would otherwise thwart conformance. Stakeholders will work to address these inconsistencies and conform dimensions when they see the potential business value they provide.

Homonyms are data terms with the same name but different meanings. They are the opposite of conformed dimensions and attributes. For example, both Pomegranate’s Sales and Finance departments use the term “Customer Type” but Sales has five types of customer and Finance only three. If stakeholders cannot agree on a conformed customer type then you would have to define two uniquely named details: SALES CUSTOMER TYPE and FINANCE CUSTOMER TYPE. However, taking this approach for every homonym perpetuates incompatible reporting and weakens the analytical power of the data warehouse. Perhaps by discovering this problem through modelstorming examples, Sales and Finance stakeholders could agree on a new conformed version of Customer Type with four descriptive values.

Synonyms are data terms with the same meaning but different names. Organizations will often use different names across different departments/business processes for what could be the same conformed dimension or attribute. For example, an insurance company might use the terms Customer Enrollee, Subscriber, Policy Holder and Claimant interchangeably, while a pharmaceutical company may refer to the same person as a Physician, Doctor, Healthcare Provider or Practitioner.

The value of modeling with examples, to help define conformed dimensions, cannot be overstated. Stakeholders often think they fully understand the meaning of their data terms, until ambiguities and differences of opinion are quickly exposed when they provide examples to their peers.

The Data Warehouse Bus


Conformed dimensions define a data warehouse bus standard for plug-in data marts.

Figure 4-4 presents a very different data mart architecture to the silo data marts of Figure 4-1. This time, data marts are shareable by departments and do support cross-process analysis because they have been implemented using conformed dimensions. These valuable dimensions define a data mart integration standard referred to as the data warehouse bus because each data mart “plugs into the bus” of conformed dimensions, much like a USB (universal serial bus) device plugs into a computer.

Compared to standalone data mart projects or the silo data mart anti-pattern, the data warehouse bus requires some more initial work to:

Model enough different business processes/events to identify potentially valuable conformed dimensions and expose conformance issues.

Face, up front, the political challenges of getting stakeholders to conform inconsistent business terms.

Build more robust ETL processes that actively conform dimensional attributes, from multiple operational data sources, not just the event source(s) currently in scope.

Establish a conformed dimension (master data) management regime that promotes the use of conformed dimensions, not just by enforcing reuse but also refactoring (improving) the conformed dimensions on a regular basis. This removes the need for individual BI projects to develop their own “better” versions, that would inevitably dilute conformance.

The reward for conforming is less technical debt and rework, and greater agility in the long run. Once the initial conformed dimensions have been defined, self-governing agile teams, that promise to use them, can work in parallel to develop data marts for individual business events or processes, becoming experts in their data sources and measurement.

Data mart teams can develop additional local (non-conformed) dimensions so long as they adhere to the data warehouse bus for conformed dimensions. Local dimensions will always be necessary to describe what is unique about an event. They are used in addition to conformed dimensions—never in place of.

While the inception costs of conforming are higher, the data warehouse bus is still
an agile JEDUF technique: once the bus has been defined, only the conformed
dimensions for the current development sprint need to be modeled in detail and
actively conformed; i.e. you can conform incrementally.

Figure 4-4
Data warehouse
bus design pattern

The most useful tool for planning conformance and designing a data warehouse bus is a dimensional matrix. This is a grid of rows representing business processes and columns representing dimensions with tick marks at the intersections where a dimension is a candidate detail of a process. Figure 4-5 shows an example dimensional matrix for Pomegranate’s manufacturing process. The simplicity of this diagram belies the power of the single page overview it provides (even for a complex real-world design as opposed to this text book example). The clarity of this model, in a format readily understood by stakeholders and IT alike, compared to individual tables or data warehouse level ER diagrams, can be truly inspiring! Start scanning the matrix and see.

Figure 4-5
A dimensional
matrix

Scanning down the dimension columns reveals the potential for dimensional conformance. Conformed dimensions that could form a data warehouse bus show up with multiple ticks. The contrast between these valuable dimensions that support cross-process analysis (Hurray!) and the non-conformed dimensions that do not (Boo!) should encourage everyone to work towards conformance.

Scanning across the process rows helps to estimate the complexity of a business process: generally, the more dimension ticks, the more complex a process is likely to be and the more resource needed to define its business events and implement them.

It’s a good idea to start your agile DW/BI project by creating a high-level matrix to help you plan your data warehouse design from a conformed dimensional process measurement perspective from the outset. You may want to add to it some of the additional features of the event matrix described below.

Use a high-level dimensional matrix to gain support from senior business and IT
management for conforming dimensions.

The Event Matrix


An event matrix is a more detailed version of the dimensional matrix. It is a business event-level modelstorming tool designed to be filled in by/with stakeholders, using the 7Ws framework. Figure 4-6 shows an event matrix version of the Figure 4-5 manufacturing processes. The additional details on this matrix include:

Event Sequences: Business events, including their main clause short stories, recorded in time/value/process order.

Dimensions in BEAM✲ story sequence (who, what, where, why, and how). This helps you fill in the matrix using the 7Ws, read summary event stories, spot opportunities to reuse dimensions of the same W-type and focus on conforming the most important who and what dimensions: typically customer, employee and product.

Stakeholder Group columns for recording event interest and ownership. Ticks can be linked to attendee lists of who was involved in modelstorming the event details, or should be.

Importance and Estimate rows and columns for prioritizing events and dimensions on a Scrum product backlog and estimating their ETL tasks for a sprint backlog.

Figure 4-6
An event matrix

Event Sequences
Look back at the event rows on the matrix in Figure 4-6 and you will notice that events are not listed alphabetically. Instead, they are listed in value sequence beginning with MANUFACTURING PLANS and ending with WAREHOUSE SHIPMENTS. This sequence orders the events by the increasing value of their outputs. In this example, the sequence starts with potentially valuable planning followed by the procurement of lower value components, and proceeds through the building and shipment of higher value products. When business activity is ordered in this way it is often referred to as a value chain.

Time/Value Sequence
Value sequence can also represent time sequence. Generally low value output activity occurs before high value output activity, or at least that is how most of us think of business activity at a macro-level. For example, in manufacturing, procurement happens before product assembly, shipping, and sales. Similarly, in service industries, time and money is spent acquiring low value (high cost) prospects before converting them into potentially valuable customers and then into high value (low cost) repeat customers. In reality, value sequencing may not be a strict chronology because many of the micro-level business events described in a value chain occur simultaneously and asynchronously—not waiting for one another. However, time/value sequencing is highly intuitive and by documenting events in this way, the matrix helps stakeholders to think of next or previous events, and spot gaps (missing links) in their value chains.

Add events to an event matrix in the order in which they increase business value
by asking “Who does what next that adds value?”

Process Sequence
Within the flexible chronology of value chains there will be stricter chronological sequences of events that must occur sequentially to complete a significant time consuming process such as order fulfillment or insurance claim settlement. These process sequences—which begin with a process (initiating) event and continue serially through a number of milestone events—are denoted on an event matrix by indentation.

Milestone events are indented beneath the event that triggers them. A * (degenerate dimension creator) often indicates the start of a process sequence.

Figure 4-6 shows a process sequence of PURCHASE ORDERS to SUPPLIER PAYMENTS. This documents that a delivery only occurs after a purchase order (PO) has been processed and a payment is only made after a delivery has been received. Notice that these events share a conformed PO dimension. This may only be a degenerate PO NUMBER dimension in each event table but it ties these events together at the atomic detail level and allows stakeholders to track the progress of each PO item through delivery and payment. Notice also that POs are created by PURCHASE ORDER events (denoted by a * on the matrix): PO numbers are generated when an employee raises a purchase order. This confirms the strict process sequence of events: payments and deliveries reference PO numbers; they must occur after the event that creates them.

Modeling Process Sequences as Evolving Events


Identifying a process sequence highlights an opportunity to model an evolving event that will bring together all the individual milestone events of a process, allowing them to be easily compared at a detail level. For example, PURCHASE ORDERS to SUPPLIER PAYMENTS could be modeled as an evolving event containing order date, order quantity, order value, plus actual delivery time and quantity from COMPONENT DELIVERIES, and payment date and amount. This single evolving event would give stakeholders easy access to supplier performance measures such as: late deliveries, average delivery time, and outstanding order quantities.
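A minimal sketch of such an evolving event row, with invented field names: milestone details are filled in as they occur, and duration and outstanding-quantity measures fall straight out of the accumulated columns.

```python
from datetime import date

# A hypothetical evolving event row that accumulates milestone details as a
# PO line progresses from order through delivery to payment.
po_event = {
    "po_number": "PO-1001",
    "order_date": date(2011, 3, 1), "order_qty": 50,
    "delivery_date": None, "delivered_qty": None,
    "payment_date": None, "payment_amount": None,
}

def record_delivery(event, delivery_date, qty):
    """Update the evolving event when the delivery milestone occurs."""
    event["delivery_date"] = delivery_date
    event["delivered_qty"] = qty

record_delivery(po_event, date(2011, 3, 10), 50)

# Duration and status measures fall straight out of the milestones.
delivery_days = (po_event["delivery_date"] - po_event["order_date"]).days
outstanding_qty = po_event["order_qty"] - (po_event["delivered_qty"] or 0)
payment_pending = po_event["payment_date"] is None
```

Unlike a normal fact row, this row is updated in place at each milestone, which is why evolving events need different ETL handling from insert-only transaction events.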

Using Process Sequences to Enrich Events


Process sequences also help to add missing details to milestone events. The matrix will often reveal dimensions on an initial triggering event that can be added to the subsequent milestone events. For example, in Figure 4-6 the CONTRACT dimension of the PURCHASE ORDERS event could be added to the COMPONENT DELIVERIES and SUPPLIER PAYMENTS milestone events. It is possible to add this dimension because of the strict chronology of the process sequence: everything about the originating purchase order is known at the time of a delivery or payment.

Modelstorming with an Event Matrix


Modelstorming is a three act play: opening, exploring and closing.

In their book “Gamestorming” (O’Reilly 2010), Dave Gray, Sunni Brown and James Macanufo describe the “shape” of every useful brainstorming game as a stubby pencil sharpened at both ends, representing the acts of opening discussions, exploring alternative ideas and closing with decisions. Modelstorming, with BEAM✲ tables, hierarchy charts and an event matrix, maps to this shape as in Figure 4-7.

Figure 4-7
The “shape” of
modelstorming
from A to B

Time-box modelstorming meetings to four hours (maximum).

Like most agile activity, modelstorming should be time-boxed. For an initial modelstorm use four hours as a guideline. Reserve two hours for modeling the first (most important?) event table and its dimension tables, one hour for modeling related events on a matrix, and a further hour for prioritizing events and making sure the most important event(s) and dimensions are modeled in detail. Not enough time? Don’t overrun. Schedule another.

So far, we have covered how to open a modelstorm, at point A, with the question “Who does what?”, and use BEAM✲ tables and 7W data story telling techniques to model the answer as a single event and matching dimensions in great detail. Now we describe how you get to point B’s implementation decisions using an event matrix to rapidly storyboard several more events, in just enough detail, to identify the most important events and conformed dimensions for the next sprint. To show how the matrix gets you there we shall continue modeling Pomegranate’s order processing BI requirements.

Adding the First Event to the Matrix


Start with a blank matrix—download the template from modelstorming.com—and add the initial CUSTOMER ORDERS event along with its main clause, leaving several rows above it for previous events and planning (importance and task estimate). Add its dimensions as columns in BEAM✲ sequence (who, what, where, why, and how order), leaving blank columns between W-types for additional dimensions. As you do so, explain to stakeholders that you are now modeling when events occur down the page (hence no when columns), and how events are described across the page but not how they are measured (hence no how many columns).

Include degenerate dimensions: they can be conformed too.

Don’t forget to add any degenerate how dimensions, such as ORDER ID. Even though these dimensions are not modeled as tables (because they have no additional descriptive attributes) they still need to be recorded on the matrix because they can be conformed degenerate dimensions appearing in multiple events. You will see shortly how important they are for identifying process sequences.

Tick off the dimensions referenced by the event. As you do so, ask stakeholders if the event can create new values for any of its dimensions. For example, you might ask:

When a customer orders a product can a new customer be created?

Mark any dimensions that can have new members created by the event (e.g. Customer, Delivery Address and Order ID) with a * rather than a tick to record this significant dependency. When you have finished, the matrix should look like Figure 4-8.

Figure 4-8
Adding CUSTOMER
ORDERS to the
event matrix

Modeling the Next Event


Having safely documented the first event on the matrix, you now want to model as Ask for a new event
many related events as you can in the time available. You discover the next event, to add to the matrix.
in exactly the same way as the first event, by asking: “Who does what?”. If, at any Or ask for the next
point, you sense hesitation, or you are intent on discovering events in when order, verb in sequence
you might want to direct the stakeholders’ attention to the last event on the matrix
(so far, the only one) by pointing at it and asking a more leading question:

What happens next?

What happens next depends on the stakeholders' departmental perspective

Stakeholders might say that "Packing follows Orders" or "Shipments follow Orders." If you were given both of these verbs at once, the next one in time sequence would be obvious, but when you are modeling less familiar events the sequence may not be so apparent to everyone, in which case you can draw a simple timeline to help sort them chronologically. With a mixed group of stakeholders, the answer to "what happens next?" can vary depending on their individual departmental perspectives.

Watch out for verb synonyms that represent the same event

Watch out for instances where multiple verbs refer to the same event. Stakeholders may use several verbs for the same activity, or multiple activities may be indivisible: captured as a single transaction by the source system. For example, if products are packed and shipped by the same person within a short period of time, the two tasks may be recorded as a single shipment event. If you have any doubts, model each verb as a separate event but if you uncover no extra details, or later discover they represent a single transaction, you can merge the events with no loss of information. Once you have a new verb (assume it is "ship") you can use it to ask a more focused "Who does what?" question to get the next event's subject-verb-object main clause:

Who ships what?



Remember, this question is designed to focus the stakeholders on identifying the responsible subject (who) and object (what) of the event, to model its atomic-level detail. If stakeholders respond with:

Warehouse worker ships product.

Add the next event and any new dimensions. Tick off its conformed dimensions

You add this new main clause to the matrix below CUSTOMER ORDERS, as in Figure 4-9, leaving enough room to add an event name later. You then check the dimension columns to see if the new subject (WAREHOUSE WORKER) and object (PRODUCT) are potential conformed dimensions. PRODUCT is already on the matrix so you should tick its use on the new event row, once you have confirmed that stakeholders are talking about the same products described in the same way as before. Though it seems unlikely, you should also confirm that shipping does not create new products, otherwise you would use a * instead of a tick.

Figure 4-9
Adding “warehouse
worker ships
product” to the
event matrix

Role-Playing Dimensions
Check each new dimension for synonyms among the dimensions already on the matrix

WAREHOUSE WORKER looks like a new dimension but before you add it you should check if it is a synonym for an existing dimension; W-type can help you here. WAREHOUSE WORKER is a who, and there are already two other whos: CUSTOMER and SALESPERSON. Are either of these similar to warehouse workers? Customers obviously aren't, but warehouse workers and salespeople, while they're not the same people, may be a specific type of who: employees of the same organization. If so, they would share many common attributes (e.g., Employee ID, Department, Hire Date, etc.) and could be modeled as a single role-playing conformed dimension. You should confirm with stakeholders:

Are warehouse workers and salespeople employees?



Use [RP] to identify role-playing dimensions

If the answer is NO, logistics could be handled by a contractor and warehouse workers are not Pomegranate employees. However, if the answer is YES then you have discovered two different roles for a conformed EMPLOYEE dimension. You record this by renaming the SALESPERSON dimension to EMPLOYEE and adding the dimension type code [RP] to denote that it is a role-playing dimension. This change needs to be made to the matrix as in Figure 4-9 and the dimension table as in Figure 4-10. However, you should leave the subject of the new shipping event on the matrix as "warehouse worker" to record the specific employee role that stakeholders used to describe the event.

Roles are documented as event details with a [ ] type identifying their RP dimension

Role names such as WAREHOUSE WORKER and SALESPERSON are used as detail column headers in event tables, so the SALESPERSON column in CUSTOMER ORDERS does not need to be renamed, but you do need to associate it with the EMPLOYEE dimension. You document an event detail, such as SALESPERSON, as a role of an existing dimension by adding the role-playing dimension name to its column type using the [ ] type notation, as in Figure 4-10.

Figure 4-10
A role-playing
dimension and an
event detail role

[ ] type notation is used to record W-type, unit of measure, flag values and RP dimension names

[ ] type notation can be used to qualify the type of any event detail or dimensional attribute. Initially it can be useful to type every event detail with its W-type, such as [who], [what] or [where], to help everyone think dimensionally using the 7Ws. Details that are dimension roles use this notation instead to document their role-playing dimension name; for example, [employee] or [calendar]. As other details are named after their dimension, they don't need this qualification. For quantities their type is their unit of measure, for example, [£], [$], or [miles] as described in Chapter 2, while Yes/No flags can be documented with a type of [Y/N] showing their permissible values.

RP dimensions can play multiple roles in the same event

A role-playing dimension can play multiple roles in the same event. For example, EMPLOYEE could appear twice in an evolving event containing both order and shipment details, as both SALESPERSON and WAREHOUSE WORKER. Similarly, CALENDAR—the most commonly occurring role-playing dimension—would play the roles of ORDER DATE and SHIP DATE.

When using [ ] notation to document an event role you can drop its generic W-type (e.g., [who] or [what]) to save space, because this is already documented within the dimension table and on the matrix.

Define conformed role-playing dimensions as early as possible

Changing the name of a dimension (and its attributes) to make it more reusable at the design stage is painless compared to the refactoring and testing involved if the dimension had already been deployed. Hence the importance of modeling multiple events to identify conformed dimensions and their role-playing opportunities before the first star schema is deployed.

Don’t implement any dimension until you have used an event matrix to check
whether it should be conformed across multiple events.

Role-playing dimensions, while more conformed, may not initially appeal to stakeholders

Is a role-playing employee dimension the right approach? Stakeholders can often feel uncomfortable with generalization like this (see "Generalization: Model Spoiler Alert" below), if they cannot see any business benefit, i.e., cannot imagine ever wanting to group together the activities of salespeople and warehouse staff. If stakeholders do voice concerns, you should try to encourage them to see the "bigger picture" benefit of a conformed dimension beyond the current scope. You can also assure them that when they query sales or logistics they will have filtered lists of salespeople or warehouse staff to choose from, and will never have to search through all employees.

Use the new event’s Once you have added the new event to the matrix, you ask for the rest of its details
main clause with the almost exactly as you would if filling out an event table: by turning its main clause
7Ws to ask for into a series of questions using the 7Ws. The only difference being that you ignore
further details when and how many questions as you don’t need that level of detail for the matrix.
Using the who, what and where column headings on the matrix as a checklist, you
might ask:

Warehouse worker ships product for/to/using whom?


Warehouse worker ships product with what/in what way?
Warehouse worker ships product from/to where?

Add any new dimensions to the matrix and then tick off all dimensions used by the event

In response to these who, what, where questions the stakeholder might identify CUSTOMER (who) and DELIVERY ADDRESS (where) as potential conformed dimensions, and introduce CARRIER (who), SHIP MODE (more of a how than a what) and WAREHOUSE (where) as new dimensions. When you, or better still the stakeholders themselves, have added these to the matrix, it should look like Figure 4-11.

Figure 4-11
Adding shipment
dimensions to the
matrix
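If you keep the matrix in a spreadsheet or script, conformance candidates can be spotted mechanically: any dimension referenced by two or more events. Below is a sketch assuming a simple dict-of-sets representation; the `conformed` helper and the exact dimension assignments are illustrative, not part of BEAM✲:

```python
from collections import Counter

# Event matrix: each event maps to the dimensions it references.
# A "*" prefix marks dimensions the event can create new members for.
matrix = {
    "CUSTOMER ORDERS": {"*CUSTOMER", "EMPLOYEE", "PRODUCT",
                        "*DELIVERY ADDRESS", "*ORDER ID"},
    "PRODUCT SHIPMENTS": {"EMPLOYEE", "PRODUCT", "CUSTOMER",
                          "DELIVERY ADDRESS", "CARRIER",
                          "SHIP MODE", "WAREHOUSE", "ORDER ID"},
}

def conformed(matrix):
    """Return dimensions referenced by two or more events."""
    counts = Counter(d.lstrip("*")
                     for dims in matrix.values() for d in dims)
    return {dim for dim, n in counts.items() if n > 1}

print(sorted(conformed(matrix)))
# CARRIER, SHIP MODE and WAREHOUSE are (so far) single-event dimensions
```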

Generalization: Model Spoiler Alert


Generalization creates smaller, more flexible data models that work well for packaged applications

Role-playing dimensions, such as EMPLOYEE [RP], are examples of generalization: a technique frequently used in data modeling to increase the flexibility of a model to represent more varied things, and in database design to reduce the number of database objects that need to be created and maintained.

Generalized data models work well for packaged application vendors, because they want to create databases that do not need to be changed for each new customer or industry. Generalization removes customer or industry-specific meanings and business rules from the data model and places them in reference data-driven application logic.

Party and Party Role are common examples of generalized entities used to model all types of people and organizations

A common generalization design pattern is the use of a single Party entity to represent all who details (persons and organizations), with an associative entity Party Role to represent their various types, positions, titles, and responsibilities (e.g., customer, employee, supplier, etc.). This database pattern is capable of recording the multiple positions that people might hold throughout their lives or the multiple responsibilities they might have simultaneously, but is it a good generalization to make when modeling a data warehouse? If BI stakeholders are explicitly looking for people who change roles—such as spies and criminals who change identities, or government regulators who become political lobbyists—then a generic who dimension that plays multiple roles might be exactly what they need.
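For readers who have not met the pattern, a minimal Party/Party Role sketch might look like the following (generic DDL run via SQLite; names and dates are illustrative, not taken from any particular packaged application):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- One generic entity for every person and organization...
    CREATE TABLE party (
        party_id   INTEGER PRIMARY KEY,
        party_name TEXT,
        party_type TEXT            -- 'PERSON' or 'ORGANIZATION'
    );
    -- ...with roles held as reference data, not as separate entities.
    CREATE TABLE party_role (
        party_id  INTEGER REFERENCES party,
        role_type TEXT,            -- 'CUSTOMER', 'EMPLOYEE', 'SUPPLIER', ...
        from_date TEXT,
        to_date   TEXT
    );
""")
# The same party can hold several roles over time or simultaneously
con.execute("INSERT INTO party VALUES (1, 'J. Smith', 'PERSON')")
con.executemany("INSERT INTO party_role VALUES (?, ?, ?, ?)",
                [(1, "EMPLOYEE", "2010-01-01", "2012-06-30"),
                 (1, "CUSTOMER", "2011-03-15", None)])

roles = [r[0] for r in con.execute(
    "SELECT role_type FROM party_role WHERE party_id = 1 ORDER BY role_type")]
print(roles)
```

Note how customer-specific and employee-specific attributes have nowhere natural to live: they must either bloat the generic Party or be pushed into further abstraction, which is exactly the BI usability problem discussed next.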

Stakeholders may not see any obvious benefit in generalization

However, if stakeholders are not terribly concerned about role switchers, or the available data sources simply lack any reliable means of capturing role changes, then this design flexibility is wasted. Worse still, it can get in the way of what BI stakeholders really want to do. For example, a single dimension representing Customers and Employees containing every possible who-related attribute would be very confusing to use compared to separate dimensions containing customer- and employee-specific attributes.

Generalization produces data models that are difficult for BI users to understand and query

Agile data warehouse modelers must use generalization carefully. Data models that value flexibility over simplicity are notoriously difficult to understand and use for BI. They can work for transactional software products because their data structures are completely hidden from the users by application interfaces. But "universal data models" that rely on high levels of generalization or abstraction do not work so well for BI users who—despite the semantic layers provided by BI tools—need far simpler data warehouse designs to be able to construct and run ad-hoc queries efficiently.

Modelstorming data requirements specifically rather than generally promotes stakeholder design ownership

One of the great benefits of modelstorming is that stakeholders feel a sense of ownership in the resulting design. If they have abstractions forced upon them they start to lose that feeling: it's no longer their model, their data—it could be anyone's. The only Party Roles most stakeholders recognize are Host, Guest, or Gatecrasher—or maybe political ones if that's their specialist field. In extreme cases where generalization is taken too far, to the point where the data model can be used to represent almost anything, it will actually mean nothing to stakeholders. This defeats the goal of modelstorming, which is not to design data structures that merely store data but to design ones that stakeholders will use and cherish. Modeling each interesting who, what, when, where, why and how as specifically as possible helps to promote the data model understanding needed to construct meaningful queries and interpret their results.

Postpone 'technical benefit only' generalization until star schema design

Stakeholders are happy with "reasonable" levels of generalization if they can see an obvious business benefit, such as a better understanding of the commonalities (conformance) between business processes that improves analysis. But if the benefits are purely technical—to cut down database administration or streamline ETL—then you should postpone generalization until you design your star schemas and ETL processes.

Discovering Process Sequences


Conformed why and how dimensions often indicate a process sequence

The last two Ws, why and how, are grouped together on the matrix because of their similarities and close relationships within processes. Whys and hows are the most common types of non-conformed dimension, but when they are conformed they can often change type, from how to why and vice versa. This happens when events have a cause and effect relationship that often represents a process sequence. You discover just such a sequence if you ask:

Why does a warehouse worker ship a product?

and get the answer:

Because a customer ordered the product.



This sounds a lot like the main clause of the previous order event. The stakeholders have told you that orders are the reason for shipments. You can find the evidence for this (the conformed dimension that ties the two events together) by turning their answer into a how question:

How do you know that a customer ordered the product for shipment (what data tells you so)?

The answer:

There is an order ID on each shipment.

reveals that the order how detail (ORDER ID [DD]) is effectively a why detail of shipment, indicating that the two events are likely to be part of a process sequence with shipments treated as milestones of CUSTOMER ORDERS. You can check how strict this sequence is by asking, "What about free samples or replacement products and parts; how are shipments for these processed?" If the answer is, "We don't want to consider these when measuring our sales fulfillment processes," then perhaps they are events for a marketing or product support data mart; another sprint, another day. Alternatively, stakeholders might tell you "Pseudo orders are generated when we ship samples or replacements". Either way, all the in-scope shipments are milestone events of orders and ORDER ID is a conformed degenerate how/why dimension.

Conformed degenerate dimensions represent how details of process events and why details of milestone events

Document process sequences by indenting milestone events

Figure 4-12 shows the completed shipment event, now named PRODUCT SHIPMENTS, with its final how detail a SHIPMENT NUMBER degenerate dimension. The new event name and main clause have also been indented under CUSTOMER ORDERS to document the process sequence. Note that this sequence (shipments follow orders and not the reverse) is confirmed by CUSTOMER ORDERS being an ORDER ID creator (denoted by the * in the ORDER ID column). CUSTOMER ORDERS must occur first to create the ORDER IDs referenced by PRODUCT SHIPMENTS.

Figure 4-12
Adding why and
how dimensions
to shipments

The presence of common degenerate dimensions (transaction IDs) often signifies that events are milestones in a process sequence.
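This heuristic can even be applied mechanically to a machine-readable matrix: an event that creates a transaction ID, paired with events that merely reference it, suggests a process sequence. A sketch with illustrative dimension assignments (the `likely_sequences` helper is hypothetical, not part of BEAM✲):

```python
# Events mapped to their degenerate (transaction ID) dimensions;
# "*" marks the event that creates the IDs: the sequence start.
events = {
    "CUSTOMER ORDERS":   {"*ORDER ID"},
    "PRODUCT SHIPMENTS": {"ORDER ID", "SHIPMENT NUMBER"},
    "PRODUCT RETURNS":   {"ORDER ID"},
}

def likely_sequences(events):
    """Pair each ID-creating event with the events that reference its IDs."""
    pairs = []
    for first, dims in events.items():
        created = {d[1:] for d in dims if d.startswith("*")}
        for later, later_dims in events.items():
            if later != first and created & later_dims:
                pairs.append((first, later))
    return pairs

print(likely_sequences(events))
# both shipments and returns are milestones of orders
```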

Ask for the next event but don't worry about strict chronology

After completing shipments, your search for the next event begins anew. This can be the next one in sequence, or simply the next one the stakeholders think of when they see popular dimensions like CUSTOMER and PRODUCT on the matrix. If their next event doesn't sound like the very next one chronologically, don't worry, just go with their train of thought—don't try and derail it. Missing 'next' events are much easier to spot as gaps on the matrix once you add the events you are freely given. Imagine that the Pomegranate stakeholders respond to your next "Who does what?" question with:

Customer returns product.

Exceptional steps within a process are documented by bracketing their event names

Figure 4-13 shows the matrix after PRODUCT RETURNS has been added, along with its new PROBLEM REASON dimension. PRODUCT RETURNS is dependent on PRODUCT SHIPMENTS because customers have to order and receive their products to be able to return them, but this sequence of events is exceptional: only a small percentage of orders are returned. You can document an optional or exceptional event within a process by bracketing it. This acts as a visual clue that you might want to handle the event separately from mandatory/unexceptional process milestones. For example, order and shipment could be combined in a worthwhile evolving order event because almost every order leads to a shipment, but the much smaller number of returns might be better treated as part of a separate customer support process rather than complicate orders. Exceptional events often indicate that there may be missing events and other processes that need to be considered.

Figure 4-13
Adding PRODUCT
RETURNS to the
matrix

Stakeholders occasionally find it difficult to decide which of two events happens first. Problematic "chicken or egg" events can occur simultaneously, loop each other, or be mutually exclusive (e.g., payments or returns). Don't get hung up on perfect sequencing; just put them next to one another on the matrix, above or below the events that everyone can agree they precede or follow.

Using the Matrix to Find Missing Events


Check for missing events by looking for gaps on the matrix

When PRODUCT RETURNS has been added to the matrix, you can check for a missing event in the sequence by asking the obvious question: "Does anything happen after PRODUCT SHIPMENTS but before PRODUCT RETURNS?" Alternatively, you can get stakeholders thinking about what might belong in the gap, if there is one, by asking:

Does anything difficult, costly, valuable, or time-consuming happen between shipments and returns?

Look for large time gaps or value changes. They often represent the start of a new process

This might prompt stakeholders to think of CUSTOMER COMPLAINTS. If they agree that this event represents the start of a new process you would add it to the matrix as in Figure 4-14, which now shows PRODUCT RETURNS indented as a milestone in that process. Notice that the exceptional step brackets have been removed from PRODUCT RETURNS. Not every complaint leads to a return; for example, a complaint might be that a product hasn't been delivered yet (another event, "Carrier delivers Product", to add to the matrix), but a high enough percentage of complaints do result in returns, enough for stakeholders to view return date as a standard milestone of this new customer support process.

Figure 4-14
CARRIER
DELIVERIES,
CUSTOMER
COMPLAINTS and
SALES TARGETS
added to the matrix

Model the first and last events in a process. They are the basis for almost all process performance measurement

Trying to find the correct position for an event within a process sequence can often help to expose additional events that represent the end of one process and the start of another. In our example, deliveries are the final milestones for most orders. Complaints and returns, on the other hand, are thankfully not part of many orders. The indentation in Figure 4-14 shows how CARRIER DELIVERIES completes the order fulfillment process and CUSTOMER COMPLAINTS begins a new customer support process. Documenting the first and last events of a process is particularly important. They represent cause and effect, origin and outcome, and are the most basic events needed to measure process performance—stakeholders will not be satisfied until you have modeled at least some of these events.

Add rollup (RU) dimensions next to their base dimensions and tick all the events that can be rolled up to their level

Figure 4-14 shows another new event: SALES TARGETS. It is not part of the order or customer support processes, hence no indentation, but stakeholders believe that sales targets drive orders so they have placed the event before CUSTOMER ORDERS in time/value sequence. From its main clause "Salesperson has product type target" it is immediately obvious that it should take advantage of the conformed role-playing EMPLOYEE dimension. But it cannot reuse the conformed PRODUCT dimension because stakeholders have stated that targets are set for product types, not for individual products. The good news is that the event can still be conformed with PRODUCT at the product type level because this is a conformed PRODUCT attribute. You record this by adding a rollup dimension PRODUCT TYPE [RU] (immediately to the right of PRODUCT, if possible, to denote that it is derived from it) and ticking it for each PRODUCT-related event to denote that they can be compared to SALES TARGETS at the PRODUCT TYPE level. There is no need to model the rollup any further at the moment, because it will not contain any new attributes: just PRODUCT TYPE and any other conformed product attributes above it in the product hierarchy, such as SUBCATEGORY and CATEGORY, already defined in PRODUCT.

Using the Matrix to Find Missing Event Details


Use your final set of dimensions to recheck events for missing details

Once you have added all the events that the stakeholders are currently interested in (or as many as time permits), it is well worth making one more quick pass of each event, now that you have built up a collection of potential conformed dimensions, to see if you can get any more reuse from them. For each event, point at each dimension it doesn't reference and ask:

Why isn't this dimension a detail of this event?

Simply pointing at each empty cell in turn like this takes full advantage of the physical proximity of all the events and dimensions on the same spreadsheet or whiteboard that the matrix provides, and can often jolt someone into spotting a valuable missing conformed detail at the last minute.

Playing the Event Rating Game


Ask stakeholders to rate the importance of each event

You now need the stakeholders' help to decide which event(s) to implement in your next release. To do this, add an extra column to the matrix (as in Figure 4-15) to record event importance and ask the stakeholders to rate each event based on a few simple rules:

Event Rating Rules

Higher rated events are more important and should be implemented sooner—if possible.

Every event gets an importance rating.

Every event gets a different importance rating, except…

Events that have been completed have an importance of 0.

Events that are truly unimportant (currently) can all have an importance of 100.

Events are rated in 100 importance point increments, e.g. 100, 200 (you'll see why the gaps are useful shortly).

The importance rating is only used to sort events by importance, not measure their relative business value. If Event A has importance 100 and Event B has importance 500, B is simply more important than A, not five times more important.

Figure 4-15
Event importance
rating

If you are using the downloadable BEAM✲ matrix template, you can hide and unhide the built-in planning columns (Figure 4-15), which include event importance, and planning rows (Figure 4-16), which in turn include dimension importance. Before you use the importance column to actually sort the event rows, make sure you fill in the event sequence column first, so that you can re-sort events back into time/value sequence when you have finished.

Start by rating the initial event highly. Then rate other events relative to it

As soon as the importance rules are understood, start by rating the initial event that the stakeholders modeled. Theoretically, this should be the most important event so you might suggest an importance based on that starting position; for example, if the matrix describes 10 events that haven't been implemented yet, suggest an importance of 1,000. This event may not stay the most important; stakeholders can easily give a higher importance to an event that was modeled in less detail at the last minute, but this opening gambit gets the rating game going. In Figure 4-15 the initial CUSTOMER ORDERS event has remained the most important and is followed by PRODUCT SHIPMENTS rather than CARRIER DELIVERIES (perhaps stakeholders realize that data will not be readily available from carriers). Stakeholders have also rated the customer support events as currently unimportant!

You may not wish to ask all the modelstorming stakeholders to vote on impor-
tance. Arguments may ensue! If you are using Scrum to manage your agile DW/BI
development, prioritizing requirements is the role of the product owner who
manages the product backlog. A subset of the stakeholders can act as a proxy for
the product owner and provide input to the product backlog prior to release and
sprint planning meetings. At these meetings, event importance will be adjusted by
the product owner in the cold light of source data profiling results and the DW
team’s ETL task estimates.

Add a dimension importance row and rate each dimension higher than its most important event

When all events have been rated and any tied positions resolved—remember, important events should have unique importance ratings—add a dimension importance row to the matrix (as in Figure 4-16). Dimensions are rated after events because dimensions are only important if the events that use them are important. Now that the stakeholders have decided which events are important, you rate each dimension higher than its most important event, because it must be implemented before any fact table based on the event can be implemented (due to foreign key dependencies).

Dimension Rating Rules

Dimension importance rating follows the same rules as event rating, with a few additions/variations:

A dimension should be rated higher than its highest importance event, unless it has already been implemented, in which case it should be 0. E.g., if Event B with importance 500 is the highest rated event using Dimension C then C must have an importance between 505 and 595.

A dimension should be rated lower than any higher importance events that do not use it.

Dimensions are rated in 5 or 10 importance point increments, so they fit between events and BI functional requirements (report user stories). Hence the reason for the 100 point gaps between events.

Rate conformed dimensions higher than non-conformed dimensions.

If an event's importance is changed its dimensions must be re-rated.



Dimensions and events are rated so they sort correctly on a single product backlog

In Figure 4-16, dimensions have been rated by first sorting events by importance. CUSTOMER ORDERS and PRODUCT SHIPMENTS have come top so their importance points (600 or 500) are copied to their dimensions. Stakeholders have then rated order dimensions 610-670 and shipment dimensions 510-540. This numbering scheme allows dimensions and events to be sorted correctly when transferred to a single product backlog.

Figure 4-16
Dimension
importance
rating

BI reporting requirements must be rated lower than the events they measure

In subsequent sprints, the stakeholders/product owner will need to prioritize BI reporting requirements too. While these "report user stories" are more important to stakeholders than data models, they must be rated below the star schema backlog items they are used to query, as in Figure 4-17, which shows a product backlog containing prioritized reporting, dimension and event requirements.

Figure 4-17
A DW/BI product
backlog
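The point of the numbering scheme is that a single descending sort of the combined backlog yields a dependency-safe build order: every dimension lands above the fact tables (events) and reports that need it. A sketch with illustrative ratings that follow the rules above:

```python
# (item, type, importance) tuples: events in 100s, their dimensions
# 5-10 points higher, report user stories below the events they query.
backlog = [
    ("CUSTOMER ORDERS",          "event",     600),
    ("PRODUCT SHIPMENTS",        "event",     500),
    ("CUSTOMER",                 "dimension", 610),
    ("PRODUCT",                  "dimension", 620),
    ("CARRIER",                  "dimension", 510),
    ("Orders by product report", "report",    595),
]

# One descending sort interleaves everything correctly
for name, kind, importance in sorted(backlog, key=lambda item: -item[2]):
    print(f"{importance:>4}  {kind:<9}  {name}")
```

PRODUCT and CUSTOMER print before CUSTOMER ORDERS, which prints before its report; CARRIER precedes PRODUCT SHIPMENTS at the bottom.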

For more advice on Scrum, sprint planning and time-boxing read Scrum and XP
from the Trenches, Henrik Kniberg (InfoQ.com 2007).

When you have modeled the most important events and dimensions with examples, the modelstorm is complete. If not…

When you have finished rating all the events and dimensions on the matrix, if the most important events (top 1 or 2 usually) and all their dimensions have been modeled with examples, your modeling work is done, for now, and you can bring the modelstorm to an end. You have reached point B with enough information. However, if matrix-only events have been rated highly important you may have one or two more events to model in detail before you can proceed to star schema design and sprint planning.

Modeling the Next Detailed Event


Create a BEAM✲ table for the next important event and ask for its when detail

When you discover an event on the matrix, such as PRODUCT SHIPMENTS (with an importance of 500), which needs to be modeled in detail, you begin by creating a new BEAM✲ table and copying the event main clause to it. But before you copy any further details, ask for a new when detail to help tell interesting event stories. For PRODUCT SHIPMENTS you ask:

When does a warehouse worker ship product?

and the stakeholders reply:

On a shipment date.

Reuse conformed dimensions and examples wherever possible

Add this to the table, as in Figure 4-18, and ask for event stories just as you did with the initial CUSTOMER ORDERS event. The only difference this time is that you will be using candidate conformed dimensions that already have examples. You want to re-use these examples where possible, to illustrate the conformance.

Figure 4-18
New PRODUCT
SHIPMENTS
event table

Reusing Conformed Dimensions and Examples


Conformed examples relate new event stories to existing stories

When you use conformed dimensions to describe new events, you should reuse their existing examples where applicable to help relate new event stories to existing ones. Once you get into the habit of using conformed examples to show that the same customers, products, employees, etc. are involved, you will soon want to minimize the drudgery of duplicating the same examples again and again.

Don't use cryptic business keys to speed up event story telling

You might be tempted to speed up the process by using shorter business keys rather than writing dimension subjects out in longhand. This may even appear to be good data modeling practice because event tables will then contain foreign key references to the dimension business keys that will surely make them easier to translate into physical tables. But this rush to physically model events is counterproductive. Business keys are mostly cryptic codes that will rob event tables and stories of their readability and descriptive power, and—as you will see in Chapter 5—business keys do not make the best foreign keys (or primary keys) in a dimensional data warehouse.

Using Abbreviations in Event Stories


Use abbreviated examples to keep stories brief and readable

Instead of business keys, use abbreviated examples: abbreviations or shortenings of previously modeled examples, as in Figure 4-19, which shows shipment stories with abbreviated employee and product examples. The advantage of abbreviations over business keys is that they keep event stories readable while saving whiteboard space and speeding up story telling. You can also expand them to full examples later in your documented model relatively easily with a "replace all".

Figure 4-19
Using
abbreviations to
tell event stories

Keep abbreviations unique by adding a sequence number if necessary

To avoid confusing abbreviated stories, keep abbreviations unique within dimensions. If an abbreviation is not unique, just add a sequence number to it. For example, if your two favorite employees are James Bond and Jed Bartlet, they can appear in stories as JB1 and JB2. Of course employee James Bond is exceptional; his business key, employee ID 007, is so well known that it is more descriptive than his initials, and could be used very successfully in many eventful stories.
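The initials-plus-sequence-number rule is easy to mechanize if you keep your examples in spreadsheets. A minimal sketch; the helper function and the third employee name are invented for illustration:

```python
def abbreviate(names):
    """Map each example name to its initials, appending a sequence
    number whenever the same initials occur more than once."""
    initials = ["".join(word[0] for word in name.split()).upper()
                for name in names]
    abbrevs = []
    for i, ab in enumerate(initials):
        if initials.count(ab) > 1:
            # e.g. James Bond -> JB1, Jed Bartlet -> JB2
            ab += str(initials[: i + 1].count(ab))
        abbrevs.append(ab)
    return dict(zip(names, abbrevs))

print(abbreviate(["James Bond", "Jed Bartlet", "Harry Lime"]))
# → {'James Bond': 'JB1', 'Jed Bartlet': 'JB2', 'Harry Lime': 'HL'}
```

Inverting the returned dictionary gives you the lookup needed for the “replace all” expansion back to full examples in the documented model.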

Provide stakeholders with handouts of previously modeled BEAM✲ dimension tables and an up-to-date event matrix to help them reuse conformed examples and decode abbreviations. The dimension tables need not show every attribute; just the example dimension subjects (which get abbreviated) and one or two defining characteristics that identify role subtypes, such as warehouse employees and sales employees.

Adding New Examples to Conformed Dimensions


Ask for any new examples needed to cover the five event story themes and describe new dimensional roles

You not only want stories to relate new events to existing events, you also want them to tell you as much as possible about each new event detail. You make sure that they do by asking for typical, different, repeat, missing, and group themed stories, as described in Chapter 2. To cover all these themes and illustrate new roles for role-playing dimensions you will need examples that are not present in any BEAM✲ tables modeled so far.

Test conformed dimensions by adding new examples

When stakeholders give you new examples try adding them to the appropriate dimension before using them in event stories. Apart from allowing you to use the examples by abbreviation, filling out their dimensional attributes is also a great test of conformance that helps you to spot missing or non-conformed attributes. For role-playing dimensions, you may have to adjust some existing attributes to match new roles; e.g., COMMISSION is a mandatory (MD) attribute of SALESPERSON, but would be a non-mandatory exclusive (X) attribute of a conformed EMPLOYEE [RP] dimension that must play the role of warehouse worker as well as salesperson.

Asking for examples encourages everyone to define and use conformed dimensions. Why make up new example values when you can copy them from a conformed BEAM✲ dimension table?

Modeling New Details and Dimensions


Ask for additional when details before copying more dimensions from the matrix

After you have filled in the themed examples for who (WAREHOUSE WORKER), what (PRODUCT), and when (SHIP DATE), proceed through the 7Ws in BEAM✲ order (see Figure 2-2) by asking for any other when details before moving on to who, and what, and where. For each of these W-types copy the relevant dimensions, one at a time from the matrix, and ask for examples.

Don’t forget to check for additional who, what, where, why and how details too and add them to the matrix

Before you move on to the next W-type always check for additional details of the current type. Seeing the event stories build up will often prompt stakeholders to suggest additional details they couldn’t think of when modeling at the matrix summary level. As soon as stakeholders confirm any additional who, what, where, why or how detail with relevant examples, add them to the matrix where they too might become conformed dimensions.

Mark new details with a [?] type as a reminder to model their dimensional attributes

Figure 4-20 shows four who and where details added to the shipping event. CUSTOMER and DELIVERY ADDRESS, with their highly abbreviated examples, are conformed dimensions, while CARRIER and WAREHOUSE are new and have not been modeled as dimension tables yet. Any new details/dimensions, like these, can be marked as type [?] as a reminder that, while they may be on the matrix, they still need to be modeled at the attribute level, with examples, when the event is completed.

Figure 4-20
Adding details to the
shipment event

Ask how many? to discover the event measures not modeled on the matrix

Following the where details, you ask for the how manys. These quantitative details do not feature on the event matrix, just like the when details, and are the main reason for modeling the event in table form; the matrix shows how events are described using (conformed) dimensions. The how many examples show how events can be measured.

Add any existing why dimensions from the matrix and ask additional why questions to explain story variations in the measures

In Figure 4-21, two new quantity details: SHIPPED QUANTITY and SHIPMENT COST have been added, along with the ORDER ID why detail. The quantity examples are new, supplied by stakeholders, but the order ids are copied from the previously modeled CUSTOMER ORDERS event because ORDER ID was identified, on the matrix, as a conformed degenerate dimension linking the events. With that existing why filled in, you ask for additional whys, remembering to ask why quantities vary. If you know that an event, such as shipment, is a process milestone you should ask why similar details vary (or do not vary) within the process; for example, you might ask:

Why can SHIPPED QUANTITY differ from ORDER QUANTITY?

and get the answer:

Several partial order shipments can be made when product stock is too low to completely fulfil an order line.

Why answers can represent the need for additional examples as well as new why details

This tells you there is a 1:M relationship between orders and shipments. It also tells you that you haven’t yet found a combination of details that would make a shipment event unique. You record this by adding a new repeat story to the table, as in Figure 4-21, which demonstrates there can be multiple identical shipment events for the same order line item by duplicating the granular details (ORDER ID, PRODUCT) of an original order (ORD5466).

Figure 4-21
Adding new
quantities, a
why/how detail
from a previous
event and an
additional repeat
story

Concentrate on completing the event with the when, how many and granularity details not recorded elsewhere

Thanks to its ORDER ID why detail you have the option to embellish PRODUCT SHIPMENTS with additional order dimensions, but because these dimensions are already well documented in the CUSTOMER ORDERS table and on the matrix, you can, if pressed for time, add them later without stakeholder involvement. Just make sure you let the stakeholders know you will be doing this. While you have their attention now you should concentrate the modelstorm on capturing brand new shipment details, especially the when, how many and granularity details not recorded on the matrix. You also need to allow time for modelstorming the attributes of any new dimensions (the details you temporarily marked [?]).

You can add any useful order dimensions to shipments but you should avoid the how many details, such as ORDER QUANTITY or REVENUE, because the 1:M relationship between orders and shipments would cause these measures to be overstated when summarized; e.g., 2 partial shipment events that both record (i.e. duplicate) the original ORDER QUANTITY of 10 units will produce a total of 20 units with double the correct order REVENUE.
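The overstatement is easy to demonstrate with a toy query. This sketch uses Python’s built-in sqlite3 module; the table layout and values are invented for illustration, not taken from the book’s figures:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE shipments "
            "(order_id TEXT, product TEXT, shipped_qty INT, order_qty INT)")
# Two partial shipments for one order line that asked for 10 units;
# each shipment row duplicates the original ORDER QUANTITY
con.executemany("INSERT INTO shipments VALUES (?,?,?,?)",
                [("ORD5466", "PROD1", 6, 10),
                 ("ORD5466", "PROD1", 4, 10)])

shipped, ordered = con.execute(
    "SELECT SUM(shipped_qty), SUM(order_qty) FROM shipments "
    "WHERE order_id = 'ORD5466'").fetchone()
print(shipped, ordered)  # shipped sums correctly to 10; ordered doubles to 20
```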

Don’t copy measures from one event to another. This can lead to facts that double count

Order measures are better left in the order event and its subsequent order fact table at their true granularity (the order line item) rather than also stored at the shipped line item. In Chapter 8, we cover how you would instead combine shipments with orders to produce a single evolving order event and model the additional measures that provides. For now, we will press on and complete the shipment event with how details.

Completing the Event


Complete the event table by recording event granularity and type

Copy the how details SHIP MODE and SHIPMENT NUMBER from the matrix to the event table and ask for examples—as neither has been modeled in table form before. The SHIPMENT NUMBER examples should confirm that you have finally found a detail that can differentiate two partial order shipments of the same product to the same customer on the same day. With this final piece of the puzzle you can define the granularity and type of PRODUCT SHIPMENTS. The granularity is a combination of SHIPMENT NUMBER, ORDER ID, and PRODUCT, and is recorded by marking these details as GD (Granular Details/Dimensions). From a business perspective this granularity can be described as “Shipment note line items”. This granularity makes the story type DE (discrete event). Figure 4-22 shows this information added to the final spreadsheet version of the event table.
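A granularity claim like this can also be sanity-checked mechanically: no two event stories may share the same combination of granular details. A hypothetical sketch, with made-up shipment rows:

```python
from collections import Counter

# Sample stories as (shipment_number, order_id, product) tuples
stories = [
    ("SHP001", "ORD5466", "PRODUCT A"),
    ("SHP002", "ORD5466", "PRODUCT A"),  # second partial shipment
    ("SHP003", "ORD5467", "PRODUCT B"),
]

def is_granular(rows, key_positions):
    """True if the chosen detail columns uniquely identify every row."""
    keys = [tuple(row[i] for i in key_positions) for row in rows]
    return max(Counter(keys).values()) == 1

print(is_granular(stories, (1, 2)))     # ORDER ID + PRODUCT alone: False
print(is_granular(stories, (0, 1, 2)))  # adding SHIPMENT NUMBER: True
```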

When you finish modeling an event table don’t forget to model dimension tables for any details that you have marked as [?]. You still need to define some dimensional attributes for these details, before ending the modelstorm.

Figure 4-22
Completed
PRODUCT
SHIPMENTS event

You might be tempted to start a modelstorming session by using a matrix to model and rapidly prioritize multiple events. While this can work well with stakeholders who are already familiar with the process, it can be too abstract for some brand new modelstormers. Remember many BI stakeholders would prefer to define reports rather than data models. Starting with an event table and example data (even if it is not for the most important event) that looks like a report can help stakeholders get the matrix and appreciate its value.

Sufficient Events?
Merge subject area matrices to provide a DW-wide overview of conformance

After the earlier manufacturing events in Figure 4-6 are added to the sales targets, order processing and customer support events of Figure 4-14, the matrix should look like Figure 4-23. If this matrix inspires you to reuse more dimensions, particularly dimensions from process initiating events such as PURCHASE ORDERS or CUSTOMER ORDERS that could be carried over to their dependent milestone events, then the matrix is doing its job. It should encourage you to maximize dimensional reuse to make each event as descriptive as possible. In addition, if the similarities of the dimensions of PRODUCT SHIPMENTS and WAREHOUSE SHIPMENTS make you think that they might actually be the same type of event, then the matrix is also doing its job. It may turn out that wholesale shipments to resellers are quite different to retail shipments to consumers: these events might be handled by completely different systems. Even so, the matrix is again doing its job in highlighting an opportunity to conform the dimensions of both events, just in case there is business value in doing so.

The event matrix is a great technique for upholding the agile value of working software over comprehensive documentation. The event matrix is enough comprehensive documentation to help you create working software based on conformed dimensions, but if you do need more documentation, link to it from your event matrix spreadsheet cells; e.g., events and dimensions can be hyperlinked to their BEAM✲ tables or star schema models.

A matrix may never contain every event and not every event it does contain will be implemented

Although the event matrix in Figure 4-23 might be complete enough for several DW/BI development sprints, is it the complete matrix for the Pomegranate Data Warehouse? What about customer invoicing and payment events after orders, or product configuration prior to shipments? What about events in other subject areas such as HR, finance, R&D? Many of these events may be out of scope for some time, or will never capture sufficiently interesting additional details to be worth measuring.

The role of the matrix is to identify the conformed dimensions for the next release

Rather than initially modeling every possible event on a matrix, agile DW designers concentrate on making the matrix complete enough for the next release. When a matrix contains enough detail to help prioritize the right events for the next release and understand their conformed dimensions that will be used again in future releases, its job, for now, is done.

Put a large version of the event matrix on the wall where everyone can see it. Regardless of your preferred methods for modeling events and dimensions: BEAM✲ tables or ER notation, flipcharts, whiteboards, or projected spreadsheets, viewing more than a few details at once is impossible. When event and dimension tables cover all your walls, or are buried in spreadsheets, a matrix enables stakeholders and the DW team to see the entire design at a glance.

Figure 4-23
A complete event matrix?

Keep the event matrix up to date! It’s not an initial planning tool or a one-time
modeling technique. Use it to document the ongoing data warehouse design.
Refer to it and update it whenever you are modelstorming. A well-maintained
matrix acts as a constant reminder to everyone to reuse and enhance conformed
dimensions.

Summary
“Just barely good enough” dimensional modeling can lead to the early and frequent deployment
of data marts that answer current departmental reporting requirements, but it also stores up
technical debt, in the form of incompatible data silos that cannot support cross-process analysis
and enterprise level BI. Due to the large data volumes associated with DW/BI, repaying this
debt can be ruinous.

To avoid silo data marts and reduce technical debt, agile DW designers need to model ahead of
the current development sprints and release plans, just enough to identify and define conformed
dimensions. These reusable components of a dimensional model enable drill-across reporting by
providing the consistent row headers and filters needed to combine and compare measures
from multiple business processes. A well documented, well publicized and well maintained set
of conformed dimensions forms a data warehouse bus architecture that supports the incremental
development of truly agile data marts.

Conformed dimensions are single dimension tables or synchronized copies shared by multiple
star schemas. They can also be swappable [SD] subsets or rollups [RU], derived from a base
dimension, conformed at the attribute level with identical business meaning and contents.
Generalized conformed dimensions that play multiple roles in the same or different events are
referred to as role-playing [RP] dimensions.

Agile dimensional modelers define conformed dimensions by modeling with examples, with
business stakeholders. BEAM✲ example data stories highlight the value of conformance to the
very people who can make it happen politically. Examples can quickly expose the inconsistent
business terms that would hinder conformance.

The event matrix is a modeling and planning tool that documents the relationship between
events and dimensions. It acts as a storyboard for an entire data warehouse design showing just
enough detail to help identify the most valuable conformed dimensions and prioritize their
development.

Listing events in time/value sequence on an event matrix helps you discover missing events by
highlighting large time gaps or value jumps in process workflows. It also helps you identify
strict chronological process sequences: candidate evolving events that combine all the milestone
events of a business process to support end-to-end process performance measurement.

When modeling new events, abbreviated examples allow you to quickly tell stories by reusing
conformed examples where applicable. Unlike codes, abbreviations help to keep stories brief
and readable for stakeholders. They also support the validation, reuse and enhancement of
conformed dimensions.
5
MODELING STAR SCHEMAS
We are all in the gutter, but some of us are looking at the stars.
— Oscar Wilde

In this chapter we describe the star schema design process for converting BEAM✲ models into flexible and efficient dimensional data warehouse models. This chapter is a guide to:

Verifying BEAM✲ models against available data sources
Converting BEAM✲ models into star schemas
Validating DW designs by prototyping

The agile approach that we take begins with test-first design, by using data profiling techniques to verify the BEAM✲ model against the data available in source systems. This results in an annotated model which documents source data characteristics and issues. This is used for model review with stakeholders and development sprint planning with the DW/BI team.

Next, the revised BEAM✲ model is translated into a logical dimensional model by adding surrogate keys. The resulting facts and dimensions are documented by drawing enhanced star schemas using a combination of BEAM✲ and ER notation.

Finally, the star schemas are used to generate physical data warehouse schemas which are validated by BI prototyping and documented by creating a physical dimensional matrix.

Chapter 5 Topics At a Glance

Data profiling to verify stakeholder data requirements
Annotating BEAM✲ models with data sources and profile metrics
Reviewing annotated models and planning development sprints
Converting BEAM✲ models into logical/physical dimensional models
The importance of data warehouse surrogate keys
Designing for slowly changing dimensions
Defining additive facts
Drawing enhanced star schema diagrams and creating physical schemas
BI Prototyping to validate dimensional models
Creating a physical dimensional matrix


Agile Data Profiling


Profile candidate data sources for the prioritized events and dimensions, to discover their data structure, content and quality

Agile data profiling is done early as a modeling activity – before a target DW schema is created

The first step in translating a BEAM✲ model into a viable data warehouse design is to use agile data profiling to identify candidate data sources for the model’s prioritized events and dimensions. Data profiling is the process of examining data sources to learn about their structure, content, and data quality. Agile data profiling (see Figure 5-1) is also:

Targeted to the candidate data sources for the business events and conformed dimensions that the stakeholders have prioritized for the next release, rather than all available data sources.

Done early, as a data modeling task to help define the dimensional model.

Done frequently, to ensure that the model responds to change; this is especially important for new data sources that are being developed in parallel with the data warehouse.

Done by DW/BI team members who will load the data, to give them a feel for its complexity that will help them with their ETL task estimates.

Recorded in the business model so that data profiles can be used to review that BEAM✲ BI data requirements model with the stakeholders, before any technical data models are proposed.

Figure 5-1
Agile data profiling

The most expensive and painful way to discover the data profile of an operational
source is to create an idealized target schema, attempt to ETL the source into the
target and record all the errors. Don’t make this extremely late/non-existent data
profiling mistake. Agile data warehouse designers never create a detailed physical
model before profiling a source, unless they are deliberately doing proactive
DW/BI design to help define a brand new source.

Think of agile data profiling as a form of test-driven design

Agile data profiling is a form of test-driven (or test-first) design (TDD). Profiling the source data provides you with metrics that can be used to test the fit of a data warehouse model and the content of a data warehouse database before you develop your SQL DDL (data definition language) and ETL code. When profiling isn’t possible yet, a proactive DW/BI design can be viewed as an advanced test specification for the new operational system; a test that asks “can the system supply this data required for BI to this specification?”

Identifying Candidate Data Sources


Find the original system-of-record (SoR). Avoid downstream copies that introduce latency and data quality issues

Data warehouses should be sourced from the most current and accurate data available, which, in practice, means identifying the system-of-record (SoR) for each fact and dimension. The system-of-record is the authoritative source for a particular type of data, such as The Payroll System for employee salary data. Data should be extracted directly from the system-of-record rather than downstream copies—from the original image of the data, rather than a “photocopy of a photocopy”—to reduce latency, system dependencies and data quality issues. The only exceptions should be where downstream systems explicitly improve data quality or unlock proprietary data formats.

Events/facts often have unique sources

For business events and the facts they provide, finding the system-of-record that creates and maintains the original transactions is relatively straightforward as there is often only one source system for a specific event type. For example, the claims processing system would be the obvious and possibly only choice for sourcing claim submission events. But where should claiming customers or insurance products be sourced from? Identifying the system-of-record for dimensions like these can be far more challenging.

Conformed dimensions can have multiple sources that should be profiled to identify conformance conflicts

Conformed dimensions are common to multiple business processes, which may themselves be implemented using a mixture of purchased enterprise software packages and bespoke in-house applications. It is not uncommon for several operational systems to independently maintain common reference data (sometimes called master data), such as Customers, Products, and Employees: the most valuable candidate conformed dimensions. You may need to profile systems that are outside of your present prioritized event scope to find the best source for a conformed dimension and spot any conflicts that would hamper conformance and reuse in the future. There may be no single best source for a dimension!

Conforming ETL processes may have to merge sources to obtain all the necessary keys and attributes

If you are fortunate, one system will have been declared as the system-of-record for each conformed dimension. But even then, facts (events) from other systems may use alternate business keys and carry additional dimensional attributes. If so, conforming ETL processes will need to match the keys from the systems-of-record and these other sources to create the “perfect” set of conformed dimensional attributes for the next sprint and ultimately to be able to load facts from these sources in the future.
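The matching job can be pictured with a toy cross-reference. Every system name, key and attribute below is invented for illustration:

```python
# System-of-record customers, keyed by the CRM business key
crm = {"C001": "ACME Ltd", "C002": "Bond Industries"}

# A second source uses its own alternate keys plus extra attributes
billing = {"9001": {"crm_id": "C001", "credit_limit": 5000},
           "9002": {"crm_id": "C003", "credit_limit": 1000}}

# Conforming step: resolve each billing row to the system-of-record key,
# flagging anything that cannot be matched for data quality follow-up
conformed, unmatched = {}, []
for bill_key, row in billing.items():
    if row["crm_id"] in crm:
        conformed[row["crm_id"]] = {"name": crm[row["crm_id"]],
                                    "credit_limit": row["credit_limit"]}
    else:
        unmatched.append(bill_key)

print(conformed["C001"])  # merged "perfect" attribute set
print(unmatched)          # alternate keys that need investigation
```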

Master data management systems help dimensional conformance

If you are extremely fortunate, you may have a Master Data Management (MDM) system that can help you identify the sources to profile for the most important conformed dimensions. MDM captures, cleanses, and synchronizes reference data across operational systems and can provide the cross-referenced business keys that ETL needs to conform multiple sources.

Profile early, before the data warehouse model exists or is updated; then any data quality issues you discover must be inherent in the established system-of-record, not a problem with the newcomer database or ETL process used to build it. Do it the other way round and see who the stakeholders (subconsciously) blame.

Data Profiling Techniques


Dedicated data profiling tools are incredibly powerful but useful profiling can be done using SQL scripts and BI tools

Data profiling has become common practice, and sophisticated profiling tools that can graphically visualize data sources are readily available as standalone applications and as modules of data modeling and ETL tools. But even without specialized tools, useful data profiling can be performed with simple SQL scripts, BI tools, and spreadsheets. A full discussion of data profiling techniques is beyond the scope of this book, but here are three basic checks for quickly assessing whether a data source is fit for purpose in the data warehouse:

Missing Values
The first, best test for any source is to count missing values and calculate the percentage missing

Nothing (literally) illustrates the value of data profiling more than the early discovery of missing data that the stakeholders have deemed mandatory. Profile for missing values by counting the occurrence of Nulls or blanks in each candidate source column/field and calculating the percentage missing. Knowing how often the source data is Null is essential for any column—but especially for columns that have been identified as mandatory (MD) by the stakeholders. The SQL for counting Null values in a column is:

SELECT COUNT(*) FROM [SourceTable]
WHERE [SourceColumn] IS NULL
/* For character columns you should add: */
   OR [SourceColumn] = ''

When you are working with non-database sources, such as flat files, you can map the source data as external tables, or perform basic ETL, with minimal transformations, to move it into DBMS tables, so that it can be profiled using SQL queries and BI tools.
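As an illustration, here is the same missing-values check driven from Python over a SQLite snapshot (sqlite3 ships with Python; the PRODUCT table and its rows are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE product (product_id TEXT, category TEXT)")
con.executemany("INSERT INTO product VALUES (?,?)",
                [("P1", "Tablets"), ("P2", None), ("P3", ""), ("P4", "Phones")])

# Count NULLs and blanks together, as recommended for character columns
missing, total = con.execute(
    "SELECT SUM(CASE WHEN category IS NULL OR category = '' "
    "THEN 1 ELSE 0 END), COUNT(*) FROM product").fetchone()
print(missing, total, round(100.0 * missing / total, 1))  # 2 4 50.0
```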

Unique Values and Frequency


Check source columns for uniqueness to identify candidate keys and hierarchy levels

Another vital property of each candidate source is the number of unique values that it contains, and the frequency of each value. The SQL for counting unique values and calculating percentage unique is:

SELECT COUNT(DISTINCT [SourceColumn]),
       COUNT(DISTINCT [SourceColumn]) / COUNT(*) * 100
FROM [SourceTable]

A source column with 100% unique values may be a good candidate for a business key while progressively lower percentage uniqueness can suggest that a set of columns represents a viable hierarchy. The SQL for ranking each value in a column by its frequency is:

SELECT [SourceColumn], COUNT(*)
FROM [SourceTable]
GROUP BY [SourceColumn]
ORDER BY 2

Graph source column values by frequency to discover poor quality content

Source column value frequency can be graphed to help you spot columns that have no informational content in spite of not being Null. For example:

Columns where values are (almost) all the same (equal to the default)
Empty or spaces only strings: the logical equivalent of Null
Favorite dates for lazy data entry staff such as “1/1/01”
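The uniqueness and frequency queries can be demonstrated the same way. A sketch over an invented PRODUCT snapshot in SQLite:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE product (product_id TEXT, category TEXT)")
con.executemany("INSERT INTO product VALUES (?,?)",
                [("P1", "Tablets"), ("P2", "Tablets"),
                 ("P3", "Phones"), ("P4", "Tablets")])

# Percentage uniqueness: 100% suggests a business key, lower values a level
unique, total = con.execute(
    "SELECT COUNT(DISTINCT category), COUNT(*) FROM product").fetchone()
print(round(100.0 * unique / total))  # 50

# Value frequency, ready for graphing
for value, freq in con.execute("SELECT category, COUNT(*) FROM product "
                               "GROUP BY category ORDER BY 2"):
    print(value, freq)
```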

Data profiling requires full table scans, making some of the queries involved very resource intensive. You should avoid profiling a live operational system directly, because transactional performance can be adversely affected. This is clearly not the ideal first impression that any data warehouse team wants to make on operational support! Instead use snapshots (off-line copies) of the candidate data sources held on your own server or wait until after-hours.

Data Ranges and Lengths


Query data ranges to help define data types and spot erroneous outlier values

The third category of simple data profiling tests identifies source data ranges by querying the minimum, maximum and average values for numeric columns, the earliest and latest dates for datetime columns, and the shortest and longest strings for character columns. As well as helping you to define data types and set date ranges for the data warehouse, these queries help you to spot outliers that often represent errors.
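A sketch of the range checks, again over invented rows in SQLite:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_date TEXT, quantity INT)")
con.executemany("INSERT INTO orders VALUES (?,?)",
                [("2011-01-05", 10), ("2011-02-14", 900), ("2011-03-20", 4)])

lo, hi, avg = con.execute(
    "SELECT MIN(quantity), MAX(quantity), AVG(quantity) FROM orders").fetchone()
first, last = con.execute(
    "SELECT MIN(order_date), MAX(order_date) FROM orders").fetchone()
print(lo, hi, round(avg, 1))  # the 900 outlier stands out against the average
print(first, last)            # candidate date range for the warehouse
```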

If source data is reliably time-stamped, try grouping your data profiling queries by the Month, Quarter, or Year that the data was inserted/updated to see how data quality changes over time. The worst quality issues may be older than the historical scope of the warehouse—if you’re lucky.

Automating Your Own Data Profiling Checks


Use SQL scripts to generate data profiling queries that write their results to a table

If you don’t have a data profiling tool but you do have hundreds or thousands of source columns to check, you can use SQL-generated SQL to create data profiling tests that write their results to a table for analysis and presentation with BI tools. For example, the following SQL generates a set of INSERT statements that count Nulls for all columns in a schema, and write the results to a PROFILING_RESULTS table:

SELECT
'INSERT INTO PROFILING_RESULTS(TABLE_NAME, COLUMN_NAME,
MISSING_COUNT) SELECT '''
|| Table_Name
|| ''', '''
|| Column_Name
|| ''', COUNT(*) FROM '
|| Table_Name
|| ' WHERE '
|| Column_Name
|| ' IS NULL;'
FROM SYS.All_Tab_Columns
WHERE …

Search online for ready-made profiling scripts

Search online for “SQL data profiling script” and you should be able to find ready-made scripts that you can adapt for your database platform that will create all the tests recommended above and more and store the results in table form.
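The same generate-and-run pattern is also straightforward from a scripting language. A sketch driven by SQLite’s catalog (catalog queries differ per DBMS, and the tables here are invented; identifiers cannot be bound as query parameters, which is why the SQL is generated):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE product (product_id TEXT, category TEXT)")
con.executemany("INSERT INTO product VALUES (?,?)", [("P1", None), ("P2", "A")])
con.execute("CREATE TABLE profiling_results "
            "(table_name TEXT, column_name TEXT, missing_count INT)")

# One generated null-count INSERT per column, driven by the catalog
for (col,) in con.execute("SELECT name FROM pragma_table_info('product')"):
    con.execute(f"INSERT INTO profiling_results "
                f"SELECT 'product', '{col}', COUNT(*) FROM product "
                f"WHERE {col} IS NULL")

print(con.execute("SELECT * FROM profiling_results").fetchall())
```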

For in-depth coverage of data profiling, data quality measurement, and ETL techniques for continuously addressing data quality read:

Data Quality: The Accuracy Dimension, Jack E. Olson (Morgan Kaufmann, 2003)
The Data Warehouse ETL Toolkit, Ralph Kimball, Joe Caserta (Wiley, 2004), Chapter 4, pages 113–147

No Source Yet: Proactive DW/BI Design


Proactive DW/BI designers have to cope without stable data sources to profile (yet)

What if there is no source to profile? This might not be a disaster, just a timing issue that needs to be anticipated. When agile BI systems are developed in parallel with new operational systems, a proactive data warehouse design can preempt operational system development or the installation of a packaged solution. Initially, there will be no indicative source data, possibly not even a source data model. Even when there is a well documented data model, as in the case of a packaged source, it can provide little useful information until the system has been configured and real data migrated to it.

Use no source as an opportunity to define the perfect BI data source

ETL development is especially challenging when source data definitions are still fluid (non-existent), but this does present an opportunity for the agile data warehouse team to negotiate a better “data deal.” The BEAM✲ model can be used to provide a detailed specification of business intelligence data requirements to the operational development team—while they are still in design mode. An agile operational team will welcome your early input to ensure their system will capture crucial business intelligence information needed by the data warehouse. If you have this level of cooperation you can press on with your dimensional design and ETL development.

When source database development lags behind data warehouse design, you can
avoid delaying ETL development by defining extract file layouts, based on your
BEAM✲ tables, and getting the operational development team to agree to their
scheduled delivery. The agile ETL team can then get on with mapping these
initially empty files to their star schema targets.

Profile sources as soon as they are available

Once data take-on has begun for a new operational system you should profile the initial data and the previously agreed-upon extract files as early as possible to help the operational team keep to their promises. Trust no one!

Annotating the Model with Data Profiling Results


Present data profiles to stakeholders using the BEAM✲ format familiar to them

The results of data profiling need to be presented to the stakeholders, so that they can review the data issues, decide on next steps, and if necessary reprioritize development based on the data realities. While data profiling tools can provide many useful graphical reports for the warehouse team, the profiling results for a BEAM✲ model are best delivered to the modelstorming stakeholders in a format that they are familiar with: the BEAM✲ model itself.

BEAM✲ tables are extended to hold profiling metrics and annotated to highlight source data issues

In Figure 5-2, the PRODUCT dimension has been extended with data profiling
results showing counts and percentages for missing, unique, minimum and
maximum values for each column. These simple profiling measures are a great
start for highlighting potential issues, and can be augmented with more
sophisticated measures and graphics generated by data profiling tools. The
Figure 5-2 table has also been annotated to show data sources, unavailable details,
new attributes and definition mismatches. The following sections describe the
model review notation used.
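The simple profiling measures described above need very little code. The following Python sketch (the dataset, column names and values are invented for illustration; real profiling tools add value frequencies, patterns and much more) derives the missing, unique, minimum and maximum figures for each column of a candidate source:

```python
def profile_columns(rows):
    """Compute the simple profiling measures used to annotate a BEAM model:
    missing count/percentage, distinct count, minimum and maximum per column."""
    total = len(rows)
    profile = {}
    for col in rows[0]:
        present = [r[col] for r in rows if r[col] is not None]
        profile[col] = {
            "missing": total - len(present),
            "missing_pct": round(100.0 * (total - len(present)) / total, 1),
            "unique": len(set(present)),
            "min": min(present) if present else None,
            "max": max(present) if present else None,
        }
    return profile

# Hypothetical extract of a PRODUCT source
products = [
    {"PRODUCT_CODE": "P1", "SUBCATEGORY": "Gadgets", "WEIGHT": None},
    {"PRODUCT_CODE": "P2", "SUBCATEGORY": None,      "WEIGHT": None},
    {"PRODUCT_CODE": "P3", "SUBCATEGORY": "Gadgets", "WEIGHT": None},
    {"PRODUCT_CODE": "P4", "SUBCATEGORY": "Books",   "WEIGHT": None},
    {"PRODUCT_CODE": "P5", "SUBCATEGORY": "Books",   "WEIGHT": None},
]
results = profile_columns(products)
# SUBCATEGORY is missing for 20% of products; WEIGHT is 100% missing (unavailable)
```

Results like these map directly onto the annotations that follow: a 100% missing column gets strikethrough, and a partially missing mandatory column gets its MD code flagged.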

Data Sources and Data Types


For each column add its source name, data type and length

For each dimensional attribute and event detail, record its best candidate data
source within braces ({ }); for example, {ERP.Employee.Grade} identifies the
source system ERP, Employee table or file, and Grade column or field. If a single
source table or file is the source for all columns in a BEAM✲ table you can add its
name once to the table header and only name the individual source column or
field in each BEAM✲ column. For example, in Figure 5-2 the table header shows
that the source for all PRODUCT attributes is the ERP system table PRD, and the
SUBCATEGORY source column is PRD_SCAT. If a table or column will be derived
from multiple sources, you can comma delimit them or use source reference
numbers within the braces and expand upon the mapping rules in hyperlinked
supporting documentation or footnotes. If there are conflicting sources for the
same data, slash (/) delimit the choices. As well as identifying column sources you
should also record their data type and length using the codes in Table 5-1.

Figure 5-2: Dimension table annotated with data sources and profiling results

Place data source references on new rows in the table header and column type (as
in Figure 5-2) so they can be hidden when not needed; e.g., during a model review,
if the source names are not meaningful to the stakeholders.

Table 5-1: Data type codes

CODE     DATA TYPE
C(n)     Character( length )
DT       Date and Time
D        Date
N(n.n)   Numeric( digits, precision )
T        Text. Long character data used to hold free format text
B        Blob. Binary object used to hold documents, images, sound etc.
Modeling Star Schemas 137

Additional Data
Use italics to highlight additional data

While profiling the candidate sources, it is extremely likely that you will discover
relevant data that the stakeholder didn’t request. If any of it looks like potential
facts, or dimensional attributes for the currently prioritized events, you should add
them to the model for review. Additional business keys that represent further reuse
of conformed dimensions are especially interesting. Use bold italics to highlight
new columns.

Unavailable Data
Use strikethrough to highlight unavailable data. If an entire table is unavailable, highlight this on the matrix too

If you cannot find a data source, or the only available source conveys little or no
information, use bold strikethrough on the unavailable column and its examples.
Figure 5-2 shows that PRODUCT WEIGHT is unavailable. If an entire event or
dimension is unavailable you should strikethrough the whole table and the
appropriate row or column on the matrix (and inform the stakeholders as soon as
possible). Figure 5-3 shows the (thankfully unlikely) situation that there is no
reliable source for a product dimension. If this really were the case you would also
strikethrough all PRODUCT details in event tables—making them non-viable.

Figure 5-3: BEAM✲ diagram showing missing data source for an entire dimension

Nulls and Mismatched Attribute Descriptions


Highlight missing mandatory data using MD. Strikethrough other mismatched column codes

If you discover an attribute definition mismatch you should highlight it by using
bold strikethrough on the appropriate column code. For example, PRODUCT
SUBCATEGORY in Figure 5-2 was defined by the stakeholders as a mandatory
(MD) description of all products (and would therefore be a good level in the
default product hierarchy). However, data profiling shows that it is missing for
20% of products, so it has been marked as MD.

Highlight mandatory source data as NN: Not Null

It can be very useful to point out ‘not Null’ sources for any event details and
dimensional attributes that the stakeholder did not explicitly identify as
mandatory, by highlighting them as NN. These rare cases, where data is more
reliably available than stakeholders thought, may open up new areas of analysis
that they previously didn’t consider.

Use the following notation to annotate a model with source definitions:


{source} : Data source system, table, column, file or field name
Value : Unavailable or incorrect data or conflicting definition
NN : Not Null. Column cannot contain null values

Model Review and Sprint Planning


Use the data profiling results to rank the data issues and estimate the ETL tasks

Once the profiling results are in and have been added to the model it’s time to hold
an initial planning meeting (Figure 5-4) with the DW/BI team prior to stakeholder
model review. Armed with their new-found knowledge from running the data
profile checks, the team should rank the data issues by severity (see Table 5-2) and
provide ETL task estimates, in man-days, for loading the viable events.

Figure 5-4: Initial planning meeting

Team Estimating
Estimating is an agile DW/BI team activity; every team member should be involved
to bring them up to speed with the emerging design. Everyone can usefully
contribute; e.g., BI developers can often help with ETL estimates if they are familiar
with the data sources, from having had to report directly off them in the past.

Play planning poker to get unbiased team estimates

A downside of team estimating is that one person, who “knows best” or has the
loudest voice, can influence everyone’s estimate. A great way to avoid this is to play
planning poker: using a special deck of cards, everyone reveals their estimate for a
task simultaneously, and the team learns a lot from the differing opinions.

Dimension and event estimates are added to the event matrix

When task estimates have been agreed, the totals for each table are added to the
event matrix so that star schema estimates can be calculated by summing the
relevant dimension and event totals. These estimates, used in conjunction with the
team’s velocity (work delivered per iteration), will give stakeholders an idea of what
could be prototyped after the next sprint or delivered in the next release.

For information on calculating team velocity and estimating by playing planning
poker with agile teams, read Scrum and XP from the Trenches, Henrik Kniberg
(InfoQ.com, 2007), Chapter 4: How we do Sprint Planning.

Review the annotated data model and task estimates with stakeholders as soon as
possible. Delaying the review can allow unrealistic expectations for the data
warehouse to grow. Don’t let the stakeholders dream for too long!

Running a Model Review


Review the annotated model with stakeholders to make them aware of what is possible in the next release

With the aid of the annotated tables and the event matrix estimates you are ready
for the model review (Figure 5-5) with stakeholders. The purpose of this meeting is
to make stakeholders aware of what could be achieved in the next release, based on
the currently available data sources and task estimates. If the data sources are in
good shape and the business priorities have not changed then it should be a short
meeting. If not, you need to concentrate on the serious data issues and large task
estimates that need the stakeholders’ attention most.

Figure 5-5: Model review

Concentrate on severe data issues: missing sources and conflicting sources for conformed dimensions

Start with the most severe issues (see Table 5-2) and work your way down,
reviewing any major missing sources first. Completely missing event or dimension
sources—strikethrough tables like Figure 5-3—may cause a serious rethink of
priorities. Missing individual details are generally less disruptive, but some may be
indispensable. Problems with conflicting conformed dimension sources are highly
significant as they can have a knock-on effect on future iterations and have the
potential to build up the greatest technical debt.

Table 5-2: Data source issues ranked by severity (1=highest, 12=lowest)

SEVERITY   ISSUE                                                  OUTCOME
1          Missing conformed dimension                            Stop
1          Missing event                                          Stop
3          Missing or incorrect business key                      Stop
3          Conflicting data for conformed dimension               Stop
5          Event granularity is different                         Stop/Pause
6          Missing non-conformed dimension                        Pause
6          Missing (or poorly populated) event detail             Pause
8          Missing mandatory values                               Pause
8          Incorrect hierarchical relationship                    Pause
10         Missing (or poorly populated) dimensional attribute    Go
10         Mismatched detail and attribute values                 Go
12         Additional event details or dimensional attributes     Go

Stop : a major rethink or reprioritization is necessary.
Pause : provide feedback before you develop the physical database.
Go : proceed (with caution) but still provide feedback on the gaps in
the model or the additional BI opportunities that may exist.

Revise the model with the help of the stakeholders

As you go through each table or column issue, update the model (the individual
tables and the matrix) with the stakeholders’ assistance. Ask them to help you
decide:

Should we include, exclude, add or adjust this item?
Which of these conflicting sources should we choose?

If stakeholders want to reprioritize events, revise the matrix accordingly

You should finish the review by asking the stakeholders if they want to reprioritize
events in light of the data issues and task estimates—bearing in mind that the task
estimates may also need to be adjusted based on the changes they have just agreed
to. If the stakeholders do want to alter their priorities, revise the matrix by
replaying the event rating game described in Chapter 4.

Sprint Planning
Use the revised model, estimates and priorities to define the sprint backlog

Following the model review, you hold a sprint planning meeting (Figure 5-6)
where the DW/BI team will revise their estimates based on the model amendments
and the product owner will decide on the data items that will make their way onto
the sprint backlog: the list of data and user stories (tables and BI
reports/dashboards) to be implemented in the next sprint. To help the team revise
their estimates you may need to draw some quick star schemas. It is at this point
you would introduce some of the design patterns described later in this book.

Figure 5-6: Sprint planning meeting

Dimensions that have already been implemented should be given an estimate of
zero. The estimate for all non-viable or low priority tables (that have not been
profiled) should be left blank. The estimate for degenerate dimensions should also
be blank—their development effort is included in fact table estimates. The total
estimate for two star schemas that share conformed dimensions should not
double count the conformed dimension estimates—the conformed dimension
estimates should be high enough individually to include all the tasks involved in
merging and conforming multiple attribute sources.
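The rollup arithmetic can be sketched in a few lines of Python (all table names and man-day figures below are invented for illustration): each star schema total is its fact (event) estimate plus the estimate of every dimension it needs, with shared conformed dimensions counted exactly once.

```python
def release_estimate(event_estimates, dimension_estimates, matrix):
    """Sum event (fact table) estimates plus each required dimension
    estimate, counting shared conformed dimensions only once.
    Already-implemented dimensions carry a zero estimate."""
    dims_needed = set()
    total = 0
    for event, estimate in event_estimates.items():
        total += estimate
        dims_needed.update(matrix[event])       # dimensions this star uses
    return total + sum(dimension_estimates[d] for d in dims_needed)

# Hypothetical matrix: two stars sharing conformed CUSTOMER and PRODUCT
events = {"ORDERS": 10, "SHIPMENTS": 8}                        # man-days
dims = {"CUSTOMER": 5, "PRODUCT": 4, "CARRIER": 3, "DATE": 0}  # DATE already built
matrix = {"ORDERS":    ["DATE", "CUSTOMER", "PRODUCT"],
          "SHIPMENTS": ["DATE", "CUSTOMER", "PRODUCT", "CARRIER"]}

total = release_estimate(events, dims, matrix)  # 10 + 8 + 5 + 4 + 3 + 0 = 30
```

Compared with the team’s velocity (say, 15 man-days per sprint), a 30 man-day total suggests roughly two sprints for both stars.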

Star Schema Design


The model is now ready to be drawn as star schemas

After you have updated the BEAM✲ model to reflect the data realities and altered
priorities, you are ready to create a (logical) dimensional model and draw the star
schemas that will be used by the DW/BI team to generate the physical data
warehouse and design ETL and BI applications. This involves the purely technical
steps shown in Figure 5-7, none of which require stakeholder input or
participation.

Figure 5-7: Creating a (logical) dimensional model

If you are using the BEAM✲Modelstormer spreadsheet, copy your BEAM✲ model
to a graphical modeling tool for star schema layout by using the customizable
SQL DDL it generates. If you haven’t done so already, download the spreadsheet
template and find full instructions for using it at modelstorming.com.

Adding Keys to a Dimensional Model


To convert the BEAM✲ model into a dimensional model just add keys

The major difference between the BEAM✲ business model and a dimensional
model that can be used to create a physical database schema is the addition of
primary and foreign keys that define the relationships between the dimension and
fact tables. There is no need to discuss these keys with the business stakeholders,
because data warehouse key definition is purely a technical activity.

Choosing Primary Keys: Business Keys vs. Surrogate Keys


Do not use business keys as dimension primary keys. Use (data warehouse) surrogate keys instead

During modelstorming you defined at least one attribute in each dimension as a
business key (denoted by the code BK) to uniquely identify each dimension
member. Business keys, such as PRODUCT CODE or CUSTOMER ID, are the
unique primary keys of source system reference tables. They may appear the
obvious candidate keys for similar-looking dimension tables, but source system
business keys never turn out to be as unique, stable, minimal or omnipresent
across multiple business processes as a data warehouse needs them to be. Instead,
dimensional modelers use (data warehouse) surrogate keys (SK) as the primary
keys for dimensions. These are integer sequence numbers assigned uniquely to
each dimension table row, by ETL processes, and used by BI applications to join
dimensions to fact tables—where they act as foreign keys.

Database Key Definitions


Primary key (PK): A column or combination of columns that uniquely identifies
each row in a table. In addition to being unique, a primary key should ideally be:

Stable: not change value over time.


Minimal: be short, use as few columns as possible (ideally 1).
Not Null: be present, have a value for all rows in the table.

Foreign key (FK): A column or combination of columns in one (child or
referencing) table that relate to the primary key of another (parent or referenced)
table. In a dimensional model, foreign keys within a fact table relate to the
primary keys of its matching dimensions.

Composite Key: A key made up of two or more columns. Identified in a BEAM✲
model by numbering a group of key codes alike; e.g., two columns in the same
table marked PK1 represent a two-part composite primary key.

Alternate Key: A column or combination of columns that can be used in place of
a primary key. Identified by numbering alternatives differently; e.g., three
columns in the same table marked PK1, PK2, PK2 represent a primary key and a
composite alternate key.

Candidate Key: A column or combination of columns that could act as a key.

Natural key (NK): A key that is used to uniquely identify something in the “real-
world” outside of a database; e.g., a barcode printed on a product package or a
Social Security number on an ID card. Natural key values are sometimes known
by stakeholders and used directly in reports and queries. The Employee ID 007
belonging to our favorite salesperson James Bond, has taken on a life of its own
beyond the HR system and become a natural key.

Surrogate key (SK): A key with a “meaningless” or artificial value, typically a
sequence number, generated by a database or application that is used instead
of a natural key.

Business key (BK): A primary key from a source system. This can be a mean-
ingful natural key or a meaningless system-generated surrogate within the
source system, but by the time it reaches the data warehouse it has meaning to
the business outside of the warehouse and so is referred to as a business key.

Benefits of Data Warehouse Surrogate Keys

Surrogate keys = big DW/BI benefits

Data warehouse surrogate keys, referred to simply as surrogate keys, have the
following benefits over source system business keys:

Insulate the Data Warehouse from Business Key Change


When business keys are changed or reused, surrogate keys prevent facts from being affected

Surrogate keys protect the data warehouse from changes or glitches in the way
business keys are administered. For example, if business keys change when a
business process is migrated to a new packaged application they can be updated in
a dimension without affecting millions or billions of facts. If business keys are
reused when products are discontinued or customers depart, new dimensional
rows with new surrogate keys can be assigned to these reused codes so they remain
distinct from their previous use. Business key instability like this is often hidden
because it may not cause a problem in the operational systems if older transactions
are archived, but the problem surfaces when the data warehouse has to take a
longer-term view.

Cope with Multiple Business Keys for a Dimension


Surrogate keys provide a single primary key for conformed dimensions that have multiple business keys originating from multiple source systems

Surrogate keys allow the data warehouse to integrate events from multiple
operational sources that store information about the same conformed dimensions
using different business keys. By using a surrogate key, you can sidestep the
question of which business key is best for a dimension—a politically sensitive
issue within organizations that have grown by merger or acquisition. The safest
answer is “there is no best business key.” They are all important non-key attributes
that should be stored in the dimension. The multiple business keys will be used by
ETL processes to assign surrogate foreign keys to the facts derived from the
multiple sources. They may also all have analytical value to various stakeholder
groups, especially if some are natural keys that the stakeholders work with outside
of their databases.

Track Dimensional History Efficiently


Surrogate keys allow dimensional history to be tracked and joined efficiently to facts

The data warehouse must provide history for the descriptive attributes of slowly
changing dimensions (SCDs) as well as rapidly changing facts. Surrogate keys
provide a simple mechanism for storing this history directly in dimension tables
and efficiently joining the correct historical descriptions to the historical facts at
query time.

Handle Missing Dimensional Values


Surrogate keys can represent special missing dimension values for which there are no business keys; e.g., “No Customer”, “Missing Date”

Every dimension needs at least one special record that represents ‘missing’ or ‘not
applicable’ to cope with errors or minor variations in business events. For example,
in-store and telephone CUSTOMER ORDERS are handled by a SALESPERSON
whereas online orders are not; they do not naturally have a SALESPERSON
dimension. When online orders are loaded into the same fact table as other orders
their Null or missing SALESPERSON IDs are replaced with a special surrogate key
value that points to a “No Salesperson” record in the SALESPERSON dimension.
This allows order queries that group by SALESPERSON to still include online
orders but display them using “Missing” labels defined by stakeholders. If the
SALESPERSON foreign key was left as Null, online orders would always be
excluded. Having a special missing row in every dimension simplifies query joins;
all joins can be inner joins as every fact will always find a matching dimension
record.

Reserve the surrogate key value zero for the default Missing row in each
dimension. You use this row zero to hold the stakeholders’ Missing labels
recorded in the BEAM✲ dimension table example data. If different types of
missing are needed you can add additional special rows, using negative surrogate
keys, to represent “Unknown”, “Not Applicable”, “Error” etc., leaving the normal
dimension values to use positive integers. Being consistent in the use of special
value surrogate keys can greatly simplify ETL processing.
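A minimal sketch of this convention in ETL key-resolution code (the labels, lookup contents and the unmatched-key policy are illustrative assumptions, not taken from the book’s examples):

```python
# Reserved surrogate key values for special rows: zero and negatives,
# leaving positive integers for normal dimension members
MISSING, UNKNOWN, NOT_APPLICABLE = 0, -1, -2

def salesperson_surrogate_key(business_key, lookup):
    """Resolve a source Salesperson ID to a dimension surrogate key.
    Null IDs (e.g. online orders) map to the Missing row rather than
    Null, so every fact joins to a dimension row via an inner join.
    Mapping unmatched IDs to Unknown is one possible error policy."""
    if business_key is None:
        return MISSING
    return lookup.get(business_key, UNKNOWN)

lookup = {"007": 1010, "006": 1011}   # business key -> surrogate key
keys = [salesperson_surrogate_key(bk, lookup)
        for bk in ["007", None, "999"]]
# -> [1010, 0, -1]
```

Because every incoming business key, including Null, resolves to some dimension row, queries never need outer joins to avoid losing facts.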

Support Multi-Level Dimensions


Surrogate keys enable multi-level dimensions to describe variable-level business processes

Some business events have variable-level dimensional details. For example,
telesales orders are normally attributed to individual salespeople, but occasionally
an order is attributed to a sales team, when the salesperson is on probation or has
left before the order is processed. Orders are a variable-level who event. By using
an extension of the missing value SK technique, a number of additional special
value rows can be added to the SALESPERSON dimension to represent teams,
branches, regions, or other levels in the sales organization hierarchy, creating a
multi-level dimension to which both normal and exceptional facts can be attached.
The multi-level design pattern is covered in Chapter 6.

Protect Sensitive Information


Surrogate keys keep sensitive data anonymous. Unlike business keys they cannot be used to join sensitive data to source systems that might provide full disclosure

For data protection or security reasons you might need to create anonymous
customer or employee dimensions for analyzing sensitive purchase habits or salary
payment facts. Anonymized dimensions obviously must not contain name, exact
address, date of birth, or other descriptive attributes that could be used in
combination to identify individuals. However, if they use business keys (such as
Customer ID or Employee Number) as primary keys, then fact tables will contain
business key foreign keys that can be cross-referenced with other systems to
provide full disclosure. You can prevent this by replacing business keys with
surrogate keys, which do not exist outside of the data warehouse, and limiting
access to the disclosing business key to surrogate key mapping tables to secure
ETL processes only.

Reduce Fact Table Size


Use integer surrogate keys instead of alphanumeric business keys to reduce the size of fact tables and indexes

Integer surrogate keys are more compact than datetime keys and most
alphanumeric business keys—especially so-called ‘smart keys’ that have embedded
business logic. This can lead to significant reductions in the size of fact tables. For
example, a typical detailed fact table has 10 dimensions and 5 measures. If the
average length of a business key is 10 characters (bytes), replacing each foreign key
with a 4-byte integer would halve the size of the fact table and its corresponding
indexes. Even if savings can only be made on a few foreign keys, every byte counts
when it comes to fact tables.

Don’t worry about the size of dimension tables…

Adding surrogate keys to dimensions in addition to their business keys slightly
increases the size of the dimensions, but this is insignificant. In general do not
worry about the size of dimensions. Although dimension rows can be long with
dozens of descriptive attributes, dimensions typically only contain hundreds or
thousands of rows in total, whereas fact tables can record millions of rows per day.
If you want to see where you should concentrate on saving storage take a look at
the Figure 5-8 “scale” diagram of a star schema. In a dimensional data warehouse,
fact tables along with their indexes and their staging tables account for 80%–90% of
the storage requirements.

…except customer dimensions – they can be big!

Of course not every dimension is small. Customer dimensions that contain
individual consumers can contain tens or hundreds of millions of rows. These
need to be treated with the same respect as fact tables and will benefit too from a
primary key index based on a compact 4-byte integer rather than a longer
“smarter” alphanumeric Customer ID that contains a check digit. Chapter 6 covers
specific techniques for handling very large dimensions (VLDs), also known as
“monster dimensions”.

Figure 5-8: Scale diagram of a star schema by space used

For fixed length surrogate keys, a 4-byte integer is suitable for most dimension
populations. If you are expecting a crowd (more than 2.1 billion whos or whats) or
have specialized calendar dimension requirements (discussed in Chapter 7), you
should use an 8-byte integer surrogate key.

Improve Query Performance


Integer SKs join faster than date and character keys

Because integer surrogate keys reduce the size of fact tables and indexes, they help
to fit more records into each read operation. This in turn leads to improved star
join processing and dramatic improvements in query performance.

Enforce Referential Integrity Efficiently


SK lookups slow down fact loads but enforce referential integrity (RI)

Surrogate keys do add an extra layer of complexity to ETL. It is true that fact
records can be loaded quicker if business keys are not replaced by surrogates. But
this overhead is more than outweighed by the benefits listed above, and a
byproduct of surrogate keys is efficient referential integrity (RI).

RI prevents bad keys getting into good fact tables

Referential integrity means that every foreign key has a matching primary key.
Without RI checks, facts could be loaded into fact tables with corrupt dimension
foreign keys that do not match any existing dimensional values. If this happens,
any query that uses these bad keys will fail to include those facts because they will
not join to the appropriate dimensions. If these “bent needles” find their way into
the giant haystack of fact tables, the (SQL “NOT IN”) queries needed to find them
are prohibitively expensive.

DBMS constraints can enforce RI but ETL SK lookups can often do this more efficiently

RI can be enforced by defining foreign key constraints in the database. However, in
practice, DBMSs can be too slow at loading data warehousing quantities of data
with RI switched on. In contrast, ETL processes are optimized for performing the
type of in-memory lookups required to check foreign keys against primary keys—
this is exactly what ETL does when translating business keys into surrogate keys
prior to loading fact tables. Effectively, the surrogate key processing provides “free”
procedural referential integrity, which allows DBMS RI checking to be safely
disabled.

DBMS query optimizers often benefit from having fact table RI constraints defined.
You can retain these optimization clues but still boost ETL performance by setting
the constraints to unenforced (you might call this “trust me I know what my ETL
process is doing” mode). This tells the optimizer what it needs to know about the
relationships between facts and dimensions to speed up queries, but avoids
unnecessary insert and update checks that would slow down ETL.

Enable DBMS-enforced RI for fact tables during ETL development and initial data
take-on to provide “belt and braces” data integrity assurance and test ETL
surrogate key lookups. If no DBMS RI errors are raised, the ETL processes are
assigning valid surrogate keys to facts and the additional DBMS checks are
unnecessary. You can then disable the DBMS RI (drop the constraints or set them
to unenforced), if it is having an adverse effect on load times.
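The “belt and braces” approach can be illustrated with SQLite, used here purely as a stand-in for a warehouse DBMS: the table names are invented, and SQLite’s `PRAGMA foreign_keys` plays the role of switching constraint enforcement on during development and off for production loads.

```python
import sqlite3

# Autocommit mode so the PRAGMA statements take effect immediately
con = sqlite3.connect(":memory:", isolation_level=None)
con.executescript("""
    CREATE TABLE salesperson_dim (
        salesperson_key INTEGER PRIMARY KEY,  -- surrogate key
        salesperson_id  TEXT,                 -- business key, now a non-key attribute
        name            TEXT);
    CREATE TABLE sales_fact (
        salesperson_key INTEGER REFERENCES salesperson_dim,
        amount          REAL);
""")
con.execute("INSERT INTO salesperson_dim VALUES (0, NULL, 'No Salesperson')")
con.execute("INSERT INTO salesperson_dim VALUES (1010, '007', 'James Bond')")

# Development / initial take-on: enforce RI to test the ETL key lookups
con.execute("PRAGMA foreign_keys = ON")
try:
    con.execute("INSERT INTO sales_fact VALUES (9999, 100.0)")  # bad surrogate key
    caught = False
except sqlite3.IntegrityError:
    caught = True   # the "bent needle" is rejected at load time

# Production: once the lookups are trusted, drop the checks for load speed
con.execute("PRAGMA foreign_keys = OFF")
con.execute("INSERT INTO sales_fact VALUES (1010, 250.0)")  # loads without RI overhead
```

In a real warehouse DBMS the equivalent of the final step is dropping the constraints or marking them unenforced, which preserves the optimizer hints without the per-row checks.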

Slowly Changing Dimensions


Change stories match the behavior of slowly changing dimensions

When you model dimensions with stakeholders you ask them how each
dimensional attribute should handle change. You record their answers as change
stories which illustrate the behavior of fixed value (FV) attributes that can only be
corrected and have no valid history, current value (CV) attributes with no required
history, and historic value (HV) attributes that preserve their history. Figure 5-9
shows two Employee change stories for James Bond. The historical values of his
MARITAL STATUS and CITY (his HV attributes) have been carefully tracked on
each of his stories while his CV and FV attributes DEPARTMENT and DATE OF
BIRTH show only the current or corrected single values for all his stories. This
example data matches exactly how slowly changing dimensions (SCDs) are
implemented by ETL processes in a dimensional data warehouse using surrogate
keys.

Figure 5-9: Slowly changing EMPLOYEE dimension

Overwriting History: Type 1 Slowly Changing Dimensions


CV and FV attributes are implemented as Type 1 SCDs. Changes are handled as updates and history is overwritten

CV and FV dimensional attributes are implemented as Type 1 slowly changing
dimensions. When they are updated in the source system they are similarly
updated in a dimension. For FV attributes such as DATE OF BIRTH this is
entirely appropriate because an employee or customer can have only one date of
birth, the update must be a correction, and nothing of value is lost. For CV
attributes, the stakeholders have decided that historically correct values are
unimportant and current values are the only ones that matter. For both CV and FV
attributes history is lost, making reports that use them “unrepeatable”. They will
give different answers if re-run because the reports will be grouped or filtered
using “as is now” not “as was then” descriptions.

Tracking History: Type 2 Slowly Changing Dimensions


HV attributes are implemented as Type 2 SCDs

HV dimensional attributes are implemented as Type 2 slowly changing dimensions.
They are not overwritten when they change in the source system. Instead new rows
are inserted with the new values, just like the change stories in Figure 5-9 that show
Bond’s various statuses and locations.

Tracking history within a dimension means you cannot rely on the business key alone as a primary key

Creating new rows to track change presents an issue for uniquely keying a
dimension. For example, Bond’s business key “007” no longer uniquely identifies a
single Employee row and must be combined with the effective date of the change
to provide a valid primary key. Unfortunately a composite key such as EMPLOYEE
ID, EFFECTIVE DATE would ruinously complicate the joining of historical facts
to the correct historical descriptions. Prior to tracking history a simple equi-join
would locate Bond’s expenses:

Employee.Employee_ID = Expenses_Fact.Employee_ID

Composite keys involving effective dates require complex joins to fact tables

With a composite key involving the effective date this becomes a far more difficult
to optimize complex (or theta) join:

Employee.Employee_ID = Expenses_Fact.Employee_ID and
Expenses_Fact.Expense_Date
between Employee.Effective_date and Employee.End_date

Without the between join on the dates, all of Bond’s expenses would be joined to
each historical version of him, triple counting his total based on the three versions
of Bond in Figure 5-9. If the above join looks complex, imagine now that
EMPLOYEE isn’t the only HV dimension that must be joined to the facts; each
join would be just as complex. This would not be a viable query strategy against
typical data warehousing quantities of facts.

A Type 2 SCD surrogate key partitions history by using a simple equi-join

Instead, Type 2 SCDs use a surrogate key as an efficient minimal primary key that
uniquely identifies each historical version of a dimension member. Figure 5-9
shows the surrogate key EMPLOYEE KEY being added to the dimension. This
would become a foreign key in all employee-related fact tables. For Bond, his
earliest expense claims and sales transactions would have an EMPLOYEE KEY of
1010 while his most recent will have 2120. A Type 2 SCD surrogate key guarantees
that efficient equi-joins will automatically join historical facts to the correct
historical descriptions and the most recent facts to current descriptions. They also
have the effect of making reports “repeatable”; for example, Bond’s 1968 expenses
will always be reported as incurred by a single man, never a widower.
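The ETL handling of an HV change can be sketched as follows. This is a simplified in-memory illustration, not the book’s implementation: the surrogate key values and attribute names are hypothetical, and real ETL would maintain the versioned rows and effective/end dates in the database.

```python
def apply_type2_change(dimension, new_surrogate_key, business_key,
                       changes, effective_date):
    """Type 2 handling: expire the current row for the business key and
    insert a new version, under a new surrogate key, carrying the changed
    HV attribute values. Facts loaded before the change keep the old key
    and later facts get the new one, so a simple equi-join on the
    surrogate key always returns the 'as was' description."""
    current = next(r for r in dimension
                   if r["employee_id"] == business_key and r["end_date"] is None)
    current["end_date"] = effective_date              # expire the old version
    new_row = {**current, **changes,
               "employee_key": new_surrogate_key,
               "effective_date": effective_date, "end_date": None}
    dimension.append(new_row)

employee_dim = [{"employee_key": 1010, "employee_id": "007",
                 "marital_status": "Single",
                 "effective_date": "1962-01-01", "end_date": None}]

apply_type2_change(employee_dim, 1650, "007",
                   {"marital_status": "Married"}, "1969-03-01")
# Facts dated before 1969-03-01 carry EMPLOYEE_KEY 1010, later ones 1650;
# joining on EMPLOYEE_KEY alone attaches each fact to the correct version
```

Note that the business key “007” now appears on every version row as an ordinary attribute; only the surrogate key is unique.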

Surrogate keys should be hidden from stakeholders wherever possible. BI tools
use them as the mechanical way of joining facts and dimensions only. Surrogate
keys are never to be used for sorting, grouping or filtering data. For example, you
cannot rely on the highest surrogate key for an employee being the most recent
version. A late-arriving employee change will be assigned a higher sequence
number surrogate key than the current version. BI tools also have to be careful to
count distinct employee IDs rather than employee version rows, and to show
distinct lists of employees, so that stakeholders don't see "Bond, Bond, Bond, …"
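The two points above can be sketched with a tiny in-memory database. This is an illustrative sketch only: the table and column names, surrogate key values (loosely based on the Bond example of Figure 5-9) and figures are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE employee (
    employee_key   INTEGER PRIMARY KEY,  -- Type 2 surrogate key
    employee_id    TEXT NOT NULL,        -- business key
    marital_status TEXT NOT NULL         -- HV attribute
);
CREATE TABLE expenses_fact (
    employee_key   INTEGER NOT NULL REFERENCES employee,
    expense_amount NUMERIC NOT NULL
);
-- Three historical versions of the same employee (business key '007')
INSERT INTO employee VALUES
  (1010, '007', 'Single'),
  (1650, '007', 'Married'),
  (2120, '007', 'Widowed');
-- Each fact carries the surrogate key of the version current when it occurred
INSERT INTO expenses_fact VALUES (1010, 100), (1010, 50), (2120, 75);
""")

# A simple equi-join attributes each fact to the historically correct version:
rows = con.execute("""
    SELECT e.marital_status, SUM(f.expense_amount)
    FROM expenses_fact f
    JOIN employee e ON e.employee_key = f.employee_key
    GROUP BY e.marital_status ORDER BY 2 DESC
""").fetchall()
print(rows)        # [('Single', 150), ('Widowed', 75)]

# Count distinct employee IDs, not version rows, for a correct head count:
head_count = con.execute(
    "SELECT COUNT(DISTINCT employee_id) FROM employee").fetchone()[0]
print(head_count)  # 1, not 3
```

Note that no between-join on dates is needed: the ETL process resolved the correct version once, at load time, by stamping each fact with the right surrogate key.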

Current Values or Historical Values? Why Not Both?


Define attributes as HV if you think their full history will be needed in the
future. Refactoring "late arriving" history is expensive

You should carefully consider current value only (CV) dimension definitions. If
you suspect that historical values might be needed at some point you should define
these attributes as HV and design your dimensions and ETL processes accordingly
to record historical values from the outset. Refactoring an attribute as HV and
adding its "late arriving" dimensional history to an established warehouse is
complex and expensive, and can involve updating the foreign keys of hundreds of
millions of existing facts.

Treat CV as a reporting default rather than an ETL instruction

As discussed in Chapter 3, you should often treat CV codes added to the model by
stakeholders as reporting directives rather than storage decisions. CV tells you that
the stakeholders would prefer their reports to initially default to current values
(because that is what they are used to). When their analysis requirements become
more sophisticated they may change their minds.

Modeling Star Schemas 149

Store dimensional history if possible. Enable CV reporting by providing a
hybrid SCD

With modern DW/BI hardware you have the luxury of storing and processing
dimensional history for most dimensions as standard practice. And just because
you track history you don't have to give it to BI users who don't want it (yet).
Chapter 6 covers the hybrid SCD pattern for providing both current value ("as is")
reporting and historic value ("as was") reporting without further complicating the
model or ETL processes.

The Data Warehouse ETL Toolkit, Ralph Kimball, Joe Caserta (Wiley, 2004),
Chapter 5, pages 183–196 provides information on the ETL processing needed to
support slowly changing dimensions. Pages 194–196 describe the complexities of
handling late-arriving dimensional history.

Updating the Dimension Definitions


You complete a dimension by adding a primary key and audit columns

You complete each dimension by adding a surrogate key as its primary key, and
additional ETL administrative attributes to support SCD processing and audit
requirements. Because you are only adding columns there is no need to create
separate spreadsheet versions of dimensions. The extra columns can easily be
hidden when you use the tables again for modelstorming with stakeholders.

Adding Surrogate Keys


Add a surrogate key (SK) with examples including zero for the missing row

Add a leading surrogate key column marked SK to each dimension, using a
naming convention of [Dimension] KEY; e.g., PRODUCT KEY. This usually works
well as the suffix "KEY" is seldom used for business keys (ID, CODE, and NUM
are far more common). Fill in the example missing row in each dimension with a
zero surrogate and use simple sequential integers for other examples to make it
obvious that they are surrogate keys. Figure 5-10 shows a surrogate key added to
the PRODUCT dimension.

Figure 5-10
Updated PRODUCT
dimension

Use unique example ranges for the most common surrogate keys; e.g., 1–1000 for
customers, 2000–3000 for products. This can help the DW team read the foreign
key examples in fact tables (stakeholders would never look at these values). This
convention is just for human readability; reserving value ranges for specific
dimension keys in the physical database is not recommended.

ETL and Audit Attributes


Add effective dating attributes for managing SCDs

If you have already modeled how the dimension should track history and
discovered its CV and HV attributes, you may already have the following SCD
ETL administrative attributes in your dimensions (if not, you should add them now):

EFFECTIVE DATE
END DATE
CURRENT

Effective dating attributes support point in time dimension queries

EFFECTIVE DATE and END DATE define the valid date range for each dimension
row. For example, in Figure 5-9 employee Bond's three MARITAL STATUS (HV)
changes have unique effective date ranges—with no overlaps or gaps. For the
current version of each EMPLOYEE there is no END DATE. But rather than
leaving END DATE as Null, make sure ETL processes set it to the maximum date
supported by the database. This allows query tools to use simple BETWEEN logic
when asking questions about the dimension population at a specific point in time.
For example, a query to count the number of employees in each city at the close of
2011 would be:

SELECT city, count(*)
FROM employee
WHERE TO_DATE('31/12/2011','DD/MM/YYYY') BETWEEN
      effective_date AND end_date
GROUP BY city
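The same point-in-time pattern can be exercised end to end with sqlite3 (the Oracle-style TO_DATE above becomes a plain ISO date string here). Everything in this sketch is illustrative: hypothetical employee rows, and a '9999-12-31' stand-in for the maximum database date, defaulted exactly as recommended so END DATE is never Null.

```python
import sqlite3

MAX_DATE = '9999-12-31'   # illustrative stand-in for the DBMS maximum date

con = sqlite3.connect(":memory:")
con.execute("""
CREATE TABLE employee (
    employee_key   INTEGER PRIMARY KEY,
    employee_id    TEXT NOT NULL,
    city           TEXT NOT NULL,
    effective_date TEXT NOT NULL,
    end_date       TEXT NOT NULL DEFAULT '9999-12-31',  -- never Null
    current_flag   TEXT NOT NULL DEFAULT 'Y'
)""")
con.executemany(
    "INSERT INTO employee VALUES (?,?,?,?,?,?)",
    [(1, 'E1', 'London', '2010-01-01', '2011-06-30', 'N'),  # expired version
     (2, 'E1', 'Paris',  '2011-07-01', MAX_DATE,     'Y'),
     (3, 'E2', 'London', '2011-03-01', MAX_DATE,     'Y')])

# Who was where at the close of 2011? Simple BETWEEN works precisely
# because END DATE is the maximum date rather than Null.
rows = con.execute("""
    SELECT city, COUNT(*)
    FROM employee
    WHERE '2011-12-31' BETWEEN effective_date AND end_date
    GROUP BY city ORDER BY city
""").fetchall()
print(rows)   # [('London', 1), ('Paris', 1)]
```

Employee E1's expired London version (end date 2011-06-30) is correctly excluded; only the versions in force on 2011-12-31 are counted.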

Queries should use a CURRENT flag rather than the max DBMS date

The CURRENT flag indicates whether a row version is current (Y). This could be
inferred from the value of END DATE, but the flag saves stakeholders and query
tools from remembering the otherwise meaningless maximum date value, which
can vary by DBMS.

SCD administrative attributes should be Not Null

SCD administrative attributes should all be defined as Not Null. END DATE
should have a default value of the maximum database date, and CURRENT should
default to "Y".

EFFECTIVE DATE and END DATE in Figure 5-10 are shown as dates. This would
allow the dimension to track one set of changes per day because the minimum
effective range for a historical version is one day. Multiple changes on the same
day (if they could be detected from the source system feed) would have to be
batched into a single update to the dimension. This is a reasonable approach if
multiple changes to the same attribute on the same day are corrections. If intra-day
changes are more significant and must be tracked to match intra-day facts,
EFFECTIVE DATE and END DATE need to be stored as full timestamps.

If a dimension contains only CV and FV attributes, effective dating attributes are
unnecessary. Implement them anyway and you will be ready for the day when a
new HV attribute is added, or stakeholders finally realize they needed historically
correct attributes all the time. The cost of redundantly maintaining these attributes
initially is insignificant compared to the cost of refactoring ETL processes.

Additional ETL attributes can be useful…

There are five additional administrative attributes in Figure 5-10 that you should
also consider adding to every dimension:

MISSING
CREATED DATE
CREATED BY
UPDATED DATE
UPDATED BY

…for identifying special "missing" rows…

The MISSING flag "Y" indicates that the row is a special "Missing" dimensional
record (usually with a zero or negative surrogate key). "N" indicates a normal
dimension record. This can be useful for filtering out all forms of missing (e.g.
"N/A" and "Error") without exposing their surrogate key values to BI users.

…and providing dimension audit information

The CREATED and UPDATED attributes provide basic ETL audit information
on the date/time and ETL version used to create and update the dimension.
Chapter 9 provides more details on audit techniques.

When you present existing BEAM✲ dimension tables as spreadsheets to
stakeholders, hide surrogate keys and other ETL-only columns to limit the
discussion to business attributes.

Time Dimensions

Model time dimensionally as separate CALENDAR and CLOCK dimensions

If you haven't already done so you should model when dimensionally—just like
any other W-type. A CALENDAR dimension is essential to the data warehouse
because it provides the conformed time hierarchies (discussed in Chapter 3) and
descriptions that stakeholders need to analyze every business process. You should
also model time of day to discover if stakeholders have custom descriptions for
periods during a day, such as peak/off peak or shift names. If so, these should be
implemented as attributes of a separate minute granularity CLOCK dimension to
avoid a single monster TIME dimension that would contain 2.6 million minutes
for every 5 years of warehouse history.

Datetime details become separate date and time surrogate keys

Figure 5-11 shows how a single time of day granularity when event detail, CALL
TIME, is replaced by two surrogate keys: CALL DATE KEY and CALL TIME KEY
in a fact table. Both CALENDAR and CLOCK are role-playing (RP) dimensions
that will be used to replace the when details of every event. Chapter 7 provides full
details on time dimensions and their special property surrogate date keys.
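The size argument for the split is easy to verify: a minute-grain CLOCK dimension needs only 1,440 rows, regardless of how many years of history the warehouse holds. The sketch below is hypothetical; the peak/off-peak rule and attribute names are invented for illustration.

```python
# Build a minute-grain CLOCK dimension: one row per minute of the day.
def build_clock_rows():
    rows = []
    for minute_of_day in range(24 * 60):          # 0 .. 1439
        hour, minute = divmod(minute_of_day, 60)
        # Illustrative custom period rule: 07:00-18:59 is "Peak"
        period = "Peak" if 7 <= hour < 19 else "Off Peak"
        rows.append({
            "time_key": minute_of_day,            # surrogate key
            "time_of_day": f"{hour:02d}:{minute:02d}",
            "am_pm": "AM" if hour < 12 else "PM",
            "period": period,
        })
    return rows

clock = build_clock_rows()
print(len(clock))                        # 1440 rows, full stop

# Versus one combined TIME dimension holding every minute of 5 years:
five_years_of_minutes = 5 * 365 * 24 * 60
print(five_years_of_minutes)             # 2,628,000 — the "2.6 million" monster
```

The CALENDAR dimension grows by only 365 or so rows a year, and the CLOCK dimension never grows at all.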

Figure 5-11
Splitting a when
detail into separate
Calendar and
Clock dimensions

Modeling Fact Tables


Save copies of event tables before you convert them into fact tables

Once all the dimensions have been updated with surrogate keys, you can then
convert the event tables into fact tables by replacing their who, what, when, where,
and why details with dimension foreign keys, while leaving quantities (how many)
and degenerate dimension (DD) how details in place. Because you are changing
columns, not just adding new ones, you should save copies of the original event
tables for future modelstorming. The fact table versions will be used for creating
star schemas and communicating design techniques within the DW/BI team.

Rename fact tables and record their fact type

With the event table copies saved you can replace event names with fact table
names and change story types to fact table types. In Figure 5-12 the CUSTOMER
ORDERS discrete event has been renamed ORDERS FACT and its story type DE
replaced with the fact table type TF for transaction fact. Chapter 8 describes each
of the fact table types in detail.

The following table codes are used to identify fact table type:
[TF] : Transaction Fact table, the physical version of discrete events
[PS] : Periodic Snapshot, the physical version of recurring events
[AS] : Accumulating Snapshot, the physical version of evolving events

Replace Event Details with Dimension Foreign Keys


Change dimensional details to surrogate keys by marking them as SK

Replace all of the dimensional details with surrogate keys by renaming the columns
and changing their type to SK. In Figure 5-12 CUSTOMER, PRODUCT, ORDER
DATE, SALESPERSON and PROMOTION have been replaced by the
appropriately named surrogate keys, and their examples changed to surrogate key
integer values.

Leave descriptive examples for readability or change them to integers to
explain SK techniques

Replacing the examples is an optional step; you might change them to integer
sequence numbers, as we have here, if you are using a BEAM✲ table to explain a
surrogate key technique to the team. Alternatively, you can leave the descriptive
examples from the original event modeling unaltered so that you don't have to
keep referring to the separate dimension tables to understand the event stories
behind the facts. Regardless of what you do to the examples, a column type of SK
documents that a fact column is an integer dimension foreign key in the physical
database schema.

Figure 5-12
Creating the
ORDERS FACT
table

Modeling Degenerate Dimensions


Degenerate dimensions (DD) remain in fact tables to tie facts back to source
transactions and provide unique counts

Degenerate dimensions (DD) such as ORDER ID are not replaced by a surrogate
key because they have no additional descriptive attributes that need to be
referenced. Degenerate transaction IDs allow stakeholders and ETL processes to
tie facts back to their original operational transactions. They also provide useful
ways of uniquely counting business events, especially in the presence of multi-valued
dimensions (covered in Chapter 9). If a fact table contains a large collection of
degenerate dimensions you should consider moving these to a new how dimension
(sometimes referred to as a junk dimension) to reduce fact table size. How
dimensions are also covered in Chapter 9.
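The unique-count role of a degenerate dimension is worth a small sketch. With a line-item-grain fact table, counting rows counts order lines; counting distinct degenerate ORDER IDs counts orders, with no dimension join at all. The table, column names and figures below are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE orders_fact (
    order_id    TEXT NOT NULL,     -- degenerate dimension (DD): no join needed
    product_key INTEGER NOT NULL,  -- surrogate key into PRODUCT
    revenue     NUMERIC NOT NULL)""")
con.executemany("INSERT INTO orders_fact VALUES (?,?,?)",
                [('ORD-1', 2001, 10.0),   # two lines on the same order
                 ('ORD-1', 2002, 20.0),
                 ('ORD-2', 2001, 15.0)])

line_count, order_count = con.execute(
    "SELECT COUNT(*), COUNT(DISTINCT order_id) FROM orders_fact"
).fetchone()
print(line_count, order_count)   # 3 lines, but only 2 orders
```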

Modeling Facts

The remaining how many details are converted into facts that can be aggregated

The remaining quantity columns in the fact table are defined as facts. Facts should
be modeled in their most additive form, so that they can be easily aggregated at
query time. Additivity describes how easy or possible it is to sum up a fact and get a
meaningful result. The ideal facts are fully additive (FA) ones that can be summed
using any of their dimensions.

Fully additive (FA) facts are ideal because they can be summed using any
dimension

The three order facts in Figure 5-12 have all been defined as fully additive (FA). To
convert the raw how many details to (fully) additive facts they must be stored using
consistent additive units of measure (UOM). ORDER QUANTITY can use the
product units from the original business events, but REVENUE and DISCOUNT,
which originally showed examples in numerous currencies, must be transformed
into dollars during ETL; otherwise they would be non-additive (NA) facts. In the
case of DISCOUNT some of the source figures were percentages. You could create
a consistent UOM by transforming all discounts into percentages but that would
not be an additive UOM. Additive fact design is covered in detail in Chapter 8.

[FA] : Fully Additive fact, can be summed by any dimension
[NA] : Non-Additive fact, cannot be summed
[SA] : Semi-Additive fact, can be summed by certain dimensions only

Percentages make great measures and key performance indicators (KPIs) on
reports and BI dashboards but they make for poor inflexible NA facts in fact
tables. You should define facts that represent the additive components of
percentage measures and calculate the percentages in BI applications.
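A quick sketch shows why stored percentages mislead while their additive components do not (the two order lines and their figures are invented for illustration):

```python
# Two order lines (illustrative figures):
lines = [
    {"revenue": 100.0, "discount": 50.0},   # a 50% discount
    {"revenue": 900.0, "discount": 90.0},   # a 10% discount
]

# Wrong: treat the percentage as the fact and average it across lines
naive_pct = sum(100 * l["discount"] / l["revenue"] for l in lines) / len(lines)
print(naive_pct)        # 30.0 -- misleading

# Right: store the additive components, derive the ratio at report time
total_rev = sum(l["revenue"] for l in lines)
total_disc = sum(l["discount"] for l in lines)
overall_pct = 100 * total_disc / total_rev
print(overall_pct)      # 14.0 -- the true overall discount rate
```

Summing revenue and discount works along any dimension; the percentage is then one division in the BI layer, always weighted correctly.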

Drawing Enhanced Star Schema Diagrams

Star schemas represent facts and dimensions using ER notation

When all the dimension surrogate keys are in place and the facts defined it is time
to draw star schemas: ER diagram versions of the dimensional model that ETL and
BI developers will find useful and familiar. The best way to create and maintain
star schemas is to use a graphical data modeling tool that will also generate physical
schema definitions for your chosen data warehouse platform.

The SQL DDL generated by the BEAM✲Modelstormer spreadsheet allows you to
transfer your model directly into graphical modeling tools that support SQL
import. Alternatively, the DDL can be used to create default physical database
tables, which can then be reverse engineered by many modeling tools.

Create a Separate Diagram for Each Fact Table

Display one fact table per star schema diagram

Once you have imported the dimensional model into a graphical modeling tool,
arranging the tables into readable star schemas is usually straightforward. Most
ER modeling tools support multiple ER diagram views for a single model. Use this
feature to create one diagram for each fact table and add their relevant dimensions,
making sure you are not duplicating the underlying dimensions as you do.

Don’t attempt to create one single ER diagram showing all the fact tables and
dimensions in the data warehouse. Even for a small subset of stars this quickly
becomes a mess of overlapping lines. ER notation is best restricted to viewing
one star at a time. Instead, develop a data warehouse matrix (covered shortly) to
provide a more useful overview of multiple stars or the entire model.

Enhanced star schema = star + consistent layout + BEAM✲ codes

You can do two simple things to turn a standard star schema into an enhanced star
schema. The first is to consistently place dimensions based on their W-type. The
second is to add BEAM✲ short codes to the tables and columns to describe their
dimensional properties.

Use a Consistent Star Schema Layout

Consistent star schema layout helps developers to speed read multiple stars

Drawing star schemas with a consistently dimensional layout may seem trivial but
as the number of stars grows with every release, the DW/BI team will find it
tremendously helpful when they scan multiple stars every day. Figure 5-13 shows
the recommended layout for star schemas designed using the 7Ws framework. The
four corners are reserved for the four major W-types: who, what, when and where,
with top left reserved for the most common W: when. Think of this as the
dimensional modeling equivalent of drawing maps "north up".

Figure 5-13
Consistent dimensional layout based on W-type

Discover the BI Model Canvas at modelstorming.com to help you model
collaboratively using this layout

Display BEAM✲ Short Codes on Star Schemas

Star schema BEAM✲ codes document dimensional properties not supported by
standard ER notation

Enhancing a star schema with BEAM✲ codes allows you to document dimensional
properties and design decisions not supported by standard ER modeling tools.
Figure 5-14 shows an enhanced star schema for the CUSTOMERS ORDERS event
(as described in Chapter 2). BEAM✲ codes are used to describe ORDERS FACT as
a transaction fact (TF) table and CALENDAR and EMPLOYEE as role-playing
(RP) dimensions. Table level codes (FV, CV, HV) describe the default slowly
changing policy for each dimension. Column level codes identify surrogate keys
(SK), degenerate dimensions (DD), and fully additive facts (FA).

If your modeling tool allows you to include table and column comments or
extended attributes on ER diagrams you can use these to display BEAM✲ codes.
Alternatively, if this feature isn't available you may be able to display the codes by
appending them to the business or logical table and column name in your model
and setting the tool's model view to conceptual or logical. The BEAM✲Modelstormer
spreadsheet contains configurable options for exporting names and codes
as comments or extended database attributes that can be imported by many
modeling tools.

Figure 5-14
Enhanced star
schema for
customer orders

Avoid the Snowflake Schema Anti-pattern

A snowflake schema is a star with normalized dimensions

A snowflake schema is a dimensional model where one or more dimensions have
been normalized, producing additional lookup tables known as outriggers. If this is
done to each of the dimensions that surround a fact table, the simple star begins to
look more like a snowflake, as in Figure 5-15.

Resist the urge to snowflake. For most dimensions there are no advantages

Once the model is in a familiar ER modeling tool you (or the DBAs) may be
tempted to introduce snowflaking to reduce data redundancy and simplify
dimension maintenance. However, snowflake schemas are not generally
recommended. They are too complex for user presentation (if required by your BI
tool), offer no significant space savings (see Figure 5-8), exhibit poor dimension
browsing performance, and negate the advantages of bitmap indices. There are
legitimate reasons for snowflaking very large dimensions, covered in Chapter 6,
but resist any 3NF (third normal form) urges brought on solely by using an ER
modeling tool.

Figure 5-15
Snowflake schema

Use hierarchy charts rather than snowflakes to define hierarchies

Do not use snowflake schema outriggers to document hierarchies if your database
or BI toolset doesn't need them explicitly defined as 1:M relationships—most
don't. Draw hierarchy charts instead. Do be pragmatic: if your toolset works better
with snowflake schemas, create them as a physical optimization.

Do Create Rollup Dimensions

Rollup dimensions (RU) look similar to outriggers but do not normalize their
base dimensions

Conformed rollup dimensions (RU), such as the product rollup PRODUCT TYPE
[RU] or the calendar rollup MONTH [RU] described in Chapter 4, can look
similar to outriggers and have similar relationships with their base dimensions
(their surrogate keys are often carried in the base dimension). The important
difference is that rollup attributes are not normalized out of the base dimensions.

Define rollups by copying and editing their base dimensions

Rollup dimensions are often not explicitly modeled as BEAM✲ tables because they
do not contain any attributes or values that are not present in their base
dimensions. If a rollup dimension is as yet undefined you should create it at the
star schema level by copying its base dimension and removing all the attributes
below the rollup level in the base dimension hierarchy. This is analogous to the
ETL process that should build the rollup from its base dimension data, rather than
source data, to keep the two in sync and guarantee conformance.
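Building the rollup from the base dimension can be as simple as selecting the distinct attributes at or above the rollup level. A minimal sqlite3 sketch, with an invented three-row CALENDAR and hypothetical column names:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE calendar (
    date_key   INTEGER PRIMARY KEY,
    full_date  TEXT, month_key INTEGER, month_name TEXT,
    quarter    TEXT, year INTEGER);
INSERT INTO calendar VALUES
  (20110101, '2011-01-01', 201101, 'Jan 2011', 'Q1 2011', 2011),
  (20110102, '2011-01-02', 201101, 'Jan 2011', 'Q1 2011', 2011),
  (20110201, '2011-02-01', 201102, 'Feb 2011', 'Q1 2011', 2011);

-- The MONTH rollup keeps only attributes at or above month level,
-- sourced from the base dimension so the two always conform.
CREATE TABLE month_rollup AS
SELECT DISTINCT month_key, month_name, quarter, year
FROM calendar;
""")

rows = con.execute(
    "SELECT month_key, month_name FROM month_rollup ORDER BY month_key"
).fetchall()
print(rows)   # [(201101, 'Jan 2011'), (201102, 'Feb 2011')]
```

Because every rollup value comes from CALENDAR itself, any attribute change made to the base dimension flows into the rollup on the next build.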

Creating Physical Schemas

Logical and physical star schemas are very similar

Dimensional modeling does not make a strong distinction between logical and
physical modeling. Aside from the addition of DBMS-specific storage and indexing
options there is very little difference between logical and physical star schemas.
These database-specific additions are best defined in a data modeling tool that can
apply them consistently to each table and column type, eliminating the need to
directly edit Data Definition Language (DDL) scripts by hand.

If you are using the BEAM✲Modelstormer spreadsheet, you can edit its DDL
template to generate custom SQL for your DBMS.

Choose BI-Friendly Naming Conventions

Use business friendly names for physical facts and dimensions to reduce BI
tool metadata

What's in a name? Database object naming is a strangely emotive subject and
naming conventions for facts and dimensions vary greatly. The convention used in
this book is singular nouns for dimensions (for example, CUSTOMER and
PRODUCT) and plural nouns with a FACT suffix for fact tables (for example,
SALES FACT and ORDERS FACT). Doubtless you will have your own table name
standards, but before you adopt any semi-cryptic standard that exists for
traditional database application development consider the BI users who will use
these tables and how they will interact with them. Adapt naming standards to work
well with your BI tools. The closer you can name dimensions and facts to the labels
stakeholders want to see on their reports—the terms they used during
modelstorming—the less BI tool metadata the DW/BI team will have to maintain.

A common naming convention is to prefix all dimension tables with DIM_ so that
they sort together. What do stakeholders and developers (subconsciously) think
every time they see DIM_CUSTOMER or DIM_EMPLOYEE? Instead, reserve a
common schema or owner name, perhaps “DIMENSION”, for creating dimensions,
to achieve the same grouping and avoid such pejorative table names.

Use Data Domains

Data domains enable consistent translation into database-specific data types

Many database modeling tools allow you to create data domains (or user-defined
data types) to help standardize physical column properties for similar column
types. If your modeling tool supports domains take advantage of them to translate
the default data types imported from the BEAM✲ model into database-specific
data types with appropriate constraints. Data domains are especially useful for
making a data warehouse design portable across different database management
systems. Table 5-3 shows a starter list of recommended domains for a dimensional
model.

Table 5-3
Example data domains for a dimensional model

DOMAIN         USAGE                     DATA TYPE    NULLS     DEFAULT VALUE
Surrogate Key  Dimension primary keys,   Integer      Not Null  0
               fact table foreign keys
Flag           Yes/No flags              Char(1)      Not Null  "?"
Code           Short codes,              Varchar(20)  Not Null  "Unknown"
               business keys
Name           Longer names              Varchar(60)  Not Null  "Unknown"
Description    Full descriptions         Varchar(99)  Not Null  "Unknown"
Count          Count facts               Integer      Nullable
Amount         Monetary facts            Currency or  Nullable
                                         Number
Duration       Duration facts            Integer      Nullable

Prototyping the DW/BI Design

Validate the design by prototyping with real data, real BI tools, and real
stakeholders

You cannot know how well your data warehouse design matches the available data
until you try to load it, nor how well it matches the stakeholders' actual BI
requirements until they use it. That is why the agile principle of early delivery of
working software is vital for reducing DW/BI risk. So, as soon as you have a
physical schema (some working software)—don't postpone the moment of reality
any longer—validate the design by prototyping the reports and dashboards that
stakeholders have wanted to talk about all along.

Stakeholders will be ready to define their reports using the 7Ws

Turn end of sprint demos into prototyping workshops; have BI developers help the
original modelstormers (real stakeholders) get their "hands dirty" using their
design with real data and real BI tools, as in Figure 5-16. These workshops can be
remarkably productive because the stakeholders—having used the 7Ws to
modelstorm their data requirements—will already be thinking about their business
questions and report layouts in terms of these 7W dimensional interrogatives.

Figure 5-16
DW/BI Prototyping

You should value working software over comprehensive documentation and
maximize the work you don't have to do: Don't waste time mocking up report or
dashboard specifications using spreadsheets or word-processors when you have a
database schema, sample data and the stakeholders' BI tools of choice.

Load prototype stars with 10,000 recent facts and similar samples from
previous time periods

For prototypes, avoid test data generation—it proves nothing. Instead, validate the
ETL process by sampling small amounts of real data, extracted from the actual
sources documented in the model. 10,000 recent facts with matching dimensional
descriptions plus similar samples from one or two previous years is usually just
enough representative data for stakeholders to get a true feel for what the final
solution will be like. Use data profiling to set realistic expectations of the prototype
before any queries are run. Make sure stakeholders understand that counts and
totals will be low because a small percentage of the data has been sampled.

Speed up ETL prototyping by not indexing the data. BI prototyping with
unindexed sample data on modest hardware will also help to set realistic
expectations for query performance against complete data, fully-indexed on
specialist DW/BI hardware.

The Data Warehouse Matrix

A data warehouse matrix documents the actual relationships between fact
and dimension tables

A data warehouse matrix is a version of the dimensional matrix that documents the
relationships between physical fact tables (or OLAP cubes) and physical dimension
tables. You can create an initial physical matrix by copying your event matrix and
editing its rows and columns to show the actual physical tables. When you do, you
should also add additional technical details that will be useful to the DW/BI team.
Figure 5-17 shows an example matrix with additional columns for data sources,
fact table type, primary time dimension granularity and fact volumetrics.

The event matrix is for planning. The physical matrix documents the current
live model

This physical matrix and the event matrix should be kept in sync as much as
possible but will diverge at times because of their distinct functions. The event
matrix is a modeling and planning tool that reflects the stakeholders' requirements,
whereas the data warehouse matrix is a management tool that reflects the current
state of the data warehouse—including any conformance failures.

Document conformance failure on the matrix by using dimension version
numbers

If you have to compromise within a sprint and postpone conforming a dimension,
or you inherit a warehouse that has evolved without conformed dimensions, you
should record these conformance failures on the data warehouse matrix by using
dimension version numbers. Rather than create a separate column for each non-
conformed version of a dimension, continue to use a single column for each
planned conformed dimension but number each different version in use, rather
than just tick usage. Reserve the highest number for the best version of each
dimension (usually the most recently developed). For example, Figure 5-17 shows
that Pomegranate has failed to conform product across manufacturing, sales and
customer support; instead there are three different versions of PRODUCT
(perhaps it really should be called DIM_PRODUCT). Thankfully, PRODUCT is
partially conformed; the best version is already the most widely used and only two
stars (Customer Orders and Customer Complaints) need to be refactored.

If you need to document conformance failure, hyperlink each dimension version
number to its non-conformed dimension table definition.

Update the event matrix with any conformance issues and address them
with stakeholders

After each sprint, the event matrix should be updated with conformance failures
(planned conformance that did not happen) and non-conformed realities (planned
conformance that could not happen because it was wrong) so that these issues can
be addressed with the stakeholders during the next modelstorm. Use the updated
event matrix to plan the refactoring of older stars with newer, more conformed
dimensions as part of your iterative development approach.

A live version of the matrix, showing up-to-date volumetrics and the current ETL
status for each star, is the ideal dashboard for a DW/BI team. You could develop
one by using BI tools to summarize ETL and DBMS catalogue metadata.
Figure 5-17
Data warehouse matrix

Summary
Agile data profiling targets the data sources implicated by the BEAM✲ model. It is done early as
a data-driven modeling activity to validate the stakeholders' data requirements before detailed
star schemas are designed. When data sources don't yet exist, proactive DW/BI designs based
on the BEAM✲ model can help define better BI data feeds from new operational systems.

Annotated models present data profiling results in a format stakeholders are familiar with. An
annotated table contains source names, data types and summary data profiling metrics. Data
source issues such as missing data and mismatched definitions are highlighted using
strikethrough. Additional data is highlighted using italics.

The DW/BI team uses the annotated model and detailed data profiling results to provide initial
task estimates for building and loading the proposed facts and dimensions. These ETL estimates
are added to the event matrix for use during model review and sprint planning.

During a model review, stakeholders use the annotated model and the DW/BI team estimates to
agree amendments to the design and reprioritize their requirements in light of the data realities
and available development resources.

BEAM✲ models are easily translated into logical dimensional models and star schemas.
Dimension tables are updated by adding primary keys and administrative attributes. Event
tables are converted into fact tables by replacing dimensional details with foreign keys and
changing quantities (how many details) into fully-additive (FA), semi-additive (SA), or non-
additive (NA) facts with standardized (conformed) units of measure.

(Data warehouse) surrogate keys are used as dimension primary keys to insulate the data
warehouse from business keys, provide dimensional flexibility (manage SCDs, missing values,
multi-levels, etc.) and improve query efficiency.

Enhanced star schemas convey additional dimensional information. Consistent dimensional
layout documents dimensions by W-type and increases multi-star schema model readability.
BEAM✲ short codes document table and column level dimensional properties not handled by
standard ER notation.

In addition to the standard documentation provided by modeling tools, a data warehouse
matrix provides an overview of all the star schemas and OLAP cubes in the data warehouse.
Similar in layout to an event modeling and planning matrix, this physical matrix provides
additional information about the actual warehouse for a technical audience. It is a vital tool for
managing the warehouse and must be kept up to date along with the event matrix and star
schema diagrams as the data warehouse design evolves.
PART II: DIMENSIONAL DESIGN
PATTERNS
DIMENSIONAL MODELING TECHNIQUES FOR PERFORMANCE, FLEXIBILITY, AND USABILITY

Computers are to design as microwaves are to cooking.


— Milton Glaser

Chapter 6: Who and What: People and Organizations, Products and Services

Chapter 7: When and Where: Time and Location

Chapter 8: How Many: Facts and Measures

Chapter 9: Why and How: Cause and Effect


6
WHO AND WHAT
Dimensional Design Patterns for People and Organizations, Products and Services

Who’s on first?
— Bud Abbott and Lou Costello

What’s next?
— President Jed Bartlet, “The West Wing”

Who and what are the most important dimensions

Who and what dimensions such as CUSTOMER, EMPLOYEE and PRODUCT represent some of the most interesting, highly scrutinized, and complex dimensions of a data warehouse. Modeling these dimensions and their inherent hierarchies presents a number of challenges that can be addressed by design patterns.

This chapter describes design patterns for defining flexible, high performance who and what dimensions

In the first of our W-themed design pattern chapters we begin by describing mini-dimensions and snowflaking for handling large, volatile customer dimensions, swappable dimensions for mixed customer type models and hierarchy maps for recursive customer relationships. We then move on to employee dimensions to cover hybrid SCD views for current value/historic value (CV/HV) reporting requirements and multi-valued hierarchy maps for multi-parent HR hierarchies with dotted-line relationships. We finish by looking at product and service dimension issues and introduce multi-level dimensions for variable detail facts and reverse hierarchy maps for component analysis.

Chapter 6 Design Challenges At a Glance

Large, rapidly changing customer populations
Mixed business models: businesses and consumers, products and services
Simultaneous current and historic value reporting requirements
Variable-depth hierarchies, recursive relationships
Multi-valued hierarchies
Business processes with variable levels of dimensional detail
Product bill of materials

Customer Dimensions
Customer dimensions are typically very large dimensions (VLDs)

Customer dimensions are particularly challenging because of their size. Business-to-consumer (B2C) customer dimensions can be deep (millions of customers), wide (many interesting descriptive attributes), and volatile (people are volatile). This combination of high data volumes and high volatility is the reason customer dimensions are often referred to as “monster dimensions”—they’re a little scary.

How large is a very large dimension (or table of any type)? Everything is relative.
Any absolute figure we quote will be trumped by future hardware and that
trumped again by unimagined requirements for capturing big data. The only
definition that stands the test of time is: “a very large table is one that does not
perform as well as you wish it to.”

Mini-Dimension Pattern
Problem/Requirement
Customer attributes can be too volatile to track using the Type 2 SCD technique

Stakeholders are very interested in tracking descriptive changes to the customer base to help to explain changes in customer behavior. So they have defined many historic value (HV) customer attributes. Unfortunately using the Type 2 SCD technique for each HV attribute is likely to cause explosive growth in the customer dimension; for example, a 10 million row CUSTOMER [HV] dimension with an AGE [HV] attribute will grow by 10 million rows per year. Obviously AGE is a poor choice as an HV attribute; it alone would turn CUSTOMER into a rapidly changing dimension. This issue is quickly solved by replacing AGE in the dimension with the fixed value (FV) attribute DATE OF BIRTH [FV] and calculating the historically correct age, at the time of the facts, in the BI query layer. Sadly, very few customer dimension historical value requirements are as easy to solve as age.

If AGE is in constant use—perhaps with medical data—it can be treated as a non-additive fact (NA) and stored with other facts in a fact table.
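A minimal sketch of the age calculation just described, using SQLite date functions in place of a BI tool's query layer. Table and column names are illustrative assumptions:

```python
import sqlite3

# Derive historically correct age at query time from the fixed value
# DATE OF BIRTH attribute, rather than storing a volatile AGE [HV] column.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customer (customer_key INTEGER PRIMARY KEY,
                       date_of_birth TEXT);        -- FV attribute
CREATE TABLE sales_fact (customer_key INTEGER,
                         sale_date TEXT,
                         revenue REAL);
INSERT INTO customer VALUES (1, '1980-06-15');
INSERT INTO sales_fact VALUES (1, '2010-06-14', 100.0),
                              (1, '2010-06-16', 150.0);
""")
# Age at the time of each fact, not the customer's age today:
rows = con.execute("""
SELECT f.sale_date,
       CAST((julianday(f.sale_date) - julianday(c.date_of_birth)) / 365.25
            AS INTEGER) AS age_at_sale
FROM sales_fact f JOIN customer c ON c.customer_key = f.customer_key
ORDER BY f.sale_date
""").fetchall()
print(rows)
```

The two facts straddle the customer's 30th birthday, so the same customer correctly appears at two different ages.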

Customer HV attributes must be carefully chosen. Not every change should be tracked, can be tracked, or is worth tracking

It only takes 5 HV attributes (that cannot be calculated) that change on average once every two years for each customer, for an initial 10M row CUSTOMER dimension to grow by up to 25 million rows per year. With growth like that, you will have to be careful about which attributes you define as HV, and what types of change you track. You don’t want to track attributes that have no historical significance—they should be defined as current value (CV). Nor do you want to track a history of corrections that should be handled as simple updates. Corrections are easy to spot for FV attributes, such as date of birth (cannot change, can only be corrected), but may require group change rules (described in Chapter 3) that look for combinations of HV and CV attributes changing together to detect genuine change. You may also want to avoid tracking macro-level changes.

Micro and Macro-Level Change


Dimension members can experience two different types of change which will
impact how well ETL processes handle HV attributes:

Most Type 2 SCDs can cope well with individual micro-level changes

Micro-level change occurs when individual dimension members experience change that is unique to them; for example, a customer changes CUSTOMER CATEGORY and goes from being a “Good Customer" to a "Great Customer". If CUSTOMER CATEGORY is an HV attribute, one row will be updated to give the old value an end date and one row will be inserted with the new value. Hierarchically, this is “change from below”.

Macro-level, global changes are more challenging. Should they be handled as changes or corrections?

Macro-level change occurs when many dimension members are changed at once; for example, every "Great Customer" becomes a "Wonderful Customer". Rows affected: 1,000,000 updated, 1,000,000 inserted. Hierarchically, this is “change from above”: it’s not customers who have changed but CUSTOMER CATEGORY itself. The category "Great Customer" has changed to "Wonderful Customer".

For most dimensions with moderately volatile HV attributes, micro-level changes can be easily and usefully tracked, but it is much harder to justify macro change tracking. A single macro-level change can cause millions of historical versions of customers to be created for little or no analytical value. In the case of every “great customer” becoming a “wonderful customer”, should this be tracked using normal HV attribute behavior? You may need to define separate ETL processes that treat certain macro-level changes as one-time corrections.

Solution
Volatile HV attributes have a high cardinality M:M with their dimension

Rapidly changing HV customer attributes have a high cardinality many-to-many (M:M) relationship with customer. One possible solution for tracking these attributes is to model them just as you would model other customer M:M relationships, such as the products they consume, or the sales locations they visit. Products and locations are of course modeled as separate dimensions and related to customers through fact tables. The same can be done with volatile customer attributes by moving them to their own mini-dimension.

They can be stored in a separate mini-dimension with its own surrogate key and related to the dimension through fact tables

Figure 6-1 shows CUSTOMER DEMOGRAPHICS, a customer mini-dimension formed by relocating the volatile HV CUSTOMER attributes relating to location, family size, income, and credit score. This mini-dimension has its own surrogate key (DEMO KEY) which is added to customer-related fact tables to describe the historic demographic values at the time of each fact. With fact relationships used to track history, all the problematic HV attributes can be removed from CUSTOMER, or changed to CV only. This would leave you with an entirely CV CUSTOMER dimension that only grows as new customers are acquired.

Figure 6-1: Removing volatile attributes from CUSTOMER

Poorly designed mini-dimensions can be almost as large and volatile as the original dimension

So, problem solved? Unfortunately, it might just be a case of problem moved. If the mini-dimension contains several high cardinality attributes, the number of unique demographic profiles may be almost as high as the number of customers and customer changes will create new profiles rather than reuse existing ones because they are too specific. The CV customer dimension might not grow but the so-called “mini-dimension” will, to become the new “monster dimension”.

Create stable mini-dimensions by removing high cardinality attributes or reducing their cardinality by banding

Mini-dimensions need to be mini and stay mini if they are to solve the VLD HV problem. Figure 6-2 shows a redesign of CUSTOMER DEMOGRAPHICS where some of the original high cardinality attributes (CITY, POST CODE) have been removed and the continuously valued attributes (INCOME, CREDIT SCORE) have been converted into low cardinality discrete bands. This dramatically reduces the number of unique profiles and increases the chances of reusing them when customers change.

When you have defined a small stable customer mini-dimension you may be able
to add additional frequently queried, low cardinality (GENDER, AGE BAND) attrib-
utes without significantly increasing its size. These would increase the filtering
power of the mini-dimension and reduce the need for many queries to access the
much larger CUSTOMER dimension at all.

Add mini-dimension keys to their main dimensions, to support efficient ETL processing

Figure 6-2 also shows that CURRENT DEMO KEY, a CV foreign key to CUSTOMER DEMOGRAPHICS, has been added to CUSTOMER. This creates a single table containing the customer business key (CUSTOMER ID) and the two customer surrogate keys (CUSTOMER KEY and CURRENT DEMO KEY) needed to load customer facts. ETL processes would use this to build an efficient lookup.

Figure 6-2: Creating a mini-dimension

A mini-dimension foreign key (CV, FK) allows the mini-dimension to be used to answer current value questions

The CURRENT DEMO KEY also allows queries to ask questions using current demographic descriptions; for example, the stakeholder question “how many high income customers are there?” (with no further qualification it must mean currently high income) can be answered by joining CUSTOMER and CUSTOMER DEMOGRAPHICS directly without having to go through the fact table. This uses a shortcut join which is not compatible with fact related queries. For BI tools that do not support shortcut joins, or for queries that need both current and historic demographics, a view can be created on the mini-dimension to play the role of CURRENT DEMOGRAPHICS as in Figure 6-2. This customer dimension outrigger could be used to answer interesting questions like “How many customers who are currently married, last purchased from us when they were single?”

A mini-dimension surrogate key should be added to all fact tables associated with
the main dimension, where it represents a historical value foreign key (HV, FK).
The mini-dimension surrogate key should also be added to the main dimension as
a current value foreign key (CV, FK) to support ETL processing and CV “as is”
reporting.
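The CURRENT DEMOGRAPHICS outrigger view and a current-value query might look like the following sketch. All table, view, and column names here are illustrative assumptions:

```python
import sqlite3

# A view over the mini-dimension plays the role of a current-value
# outrigger, joined to CUSTOMER on its CURRENT DEMO KEY.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customer_demographics (
    demo_key INTEGER PRIMARY KEY,
    income_band TEXT,
    marital_status TEXT);
CREATE TABLE customer (
    customer_key INTEGER PRIMARY KEY,
    customer_id  TEXT,
    current_demo_key INTEGER REFERENCES customer_demographics(demo_key));
-- The mini-dimension renamed to play its current-value role:
CREATE VIEW current_demographics AS
    SELECT demo_key AS current_demo_key, income_band, marital_status
    FROM customer_demographics;
INSERT INTO customer_demographics VALUES (1, 'High', 'Married'),
                                         (2, 'Low',  'Single');
INSERT INTO customer VALUES (10, 'C1', 1), (11, 'C2', 1), (12, 'C3', 2);
""")
# "How many (currently) high income customers are there?"
# Answered without touching any fact table:
(n,) = con.execute("""
    SELECT COUNT(*) FROM customer c
    JOIN current_demographics d USING (current_demo_key)
    WHERE d.income_band = 'High'
""").fetchone()
print(n)
```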

Consequences
Mini-dimensions increase the size of fact tables by adding foreign keys. If many
high cardinality HV attributes must be tracked, they may need to be separated into
multiple mini-dimensions, to control both main and mini-dimension size. Each
mini-dimension that you create will contribute an extra foreign key and index to
the fact tables.

Mini-dimensions create additional fact table keys

Banding causes some loss of dimensional detail but creates better attributes for report grouping and filtering

Banding continuous valued attributes, such as INCOME or CREDIT SCORE, causes historical detail to be lost. Having said that, these high cardinality details do not make good dimensional attributes. Remember, the role of a dimensional attribute is to provide good report row headers and filters. A high cardinality INCOME attribute is a poor report row header. Precise income figures (if you actually knew them) are virtually unique to customers and would group very little data, resulting in unreadably long reports—more like database dumps than BI reports. Low cardinality banded attributes, such as INCOME BAND, make far better row headers, just so long as the bandings are carefully designed with the stakeholders. When stakeholders give you examples of high cardinality dimensional attributes you should ask them for the rollup descriptions or bands they would like to see on their reports. When they define their favorite numeric bandings make sure the bands are contiguous—no gaps and no overlaps.
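Banding might be applied during ETL with a simple CASE expression. The band names and boundaries below are illustrative assumptions, not stakeholder-defined bands:

```python
import sqlite3

# Convert a continuous INCOME value into contiguous, non-overlapping bands.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE staging_customer (customer_id TEXT, income REAL)")
con.executemany("INSERT INTO staging_customer VALUES (?, ?)",
                [("C1", 12000), ("C2", 47500), ("C3", 92000)])
rows = con.execute("""
SELECT customer_id,
       CASE                                       -- bands are contiguous:
         WHEN income < 20000  THEN 'Under 20K'    -- no gaps, no overlaps
         WHEN income < 50000  THEN '20K-50K'
         WHEN income < 100000 THEN '50K-100K'
         ELSE '100K and over'
       END AS income_band
FROM staging_customer ORDER BY customer_id
""").fetchall()
print(rows)
```

Because each WHEN clause's lower bound is the previous clause's upper bound, every possible income value lands in exactly one band.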

If stakeholders need access to the history of continuous valued customer details, you should model these not as dimensional attributes, but as facts in a customer profile fact table.

Sensible Snowflaking Pattern


Problem
Large numbers of low cardinality attributes can waste space in very large dimensions

Stakeholders want to use customer first purchase date in conjunction with commercially available geodemographic information to segment the customer base for marketing purposes. Adding all the necessary descriptive attributes related to first purchase date (first purchase quarter, first purchase on a holiday indicator, etc.) and geodemographic code would greatly increase the size of an already very large customer dimension. Stakeholders are worried that existing queries that don’t need the new marketing attributes will be adversely affected.

Solution
“Snowflake” the CUSTOMER dimension. Move large collections of low cardinality attributes to outriggers

Generally, it’s a good idea to denormalize as many descriptive attributes as possible into a dimension to simplify the model and improve query performance. But CUSTOMER dimensions, because of their size, are exceptional and can often benefit from sensible normalization or “snowflaking”. The FIRST PURCHASE DATE and GEODEMOGRAPHICS outriggers, shown in Figure 6-3, represent sensible snowflaking because they avoid a large number of much lower cardinality date and geodemographic attributes being embedded in the CUSTOMER dimension. Keeping these attributes separate will make a worthwhile storage saving that will improve query performance—especially for all the queries that are not interested in the first purchase dates or geodemographics. In this specific case the use of an outrigger for FIRST PURCHASE DATE makes even more sense as it can be implemented as a role-playing view of the standard CALENDAR dimension.

Attributes that are administered differently may need to be snowflaked

For commercially supplied geodemographic information there may be additional administrative or legal reasons for snowflaking. It may be supplied on a periodic basis and updated independently of customer data, or there may be licensing restrictions on the number of users who can access it, therefore it cannot be held in a customer dimension available to the entire BI user community.

Figure 6-3: Useful customer “snowflaking”

The outriggers in Figure 6-3 do not track history. This is not a problem for first
purchase attributes as they are fixed values that can only be corrected. For
geodemographic attributes, history could be tracked by defining CUSTOMER as
HV but this would lead to uncontrolled growth in the dimension. Alternatively, the
GEODEMOGRAPHICS outrigger could be used as a mini-dimension by adding
GEOCODE KEY to existing fact tables or a newly created customer demographics
fact table.

Consequences
Outriggers complicate dimensional models and are generally unnecessary for most
dimensions. Once you have introduced useful outriggers to one dimension, your
colleagues, especially those with a 3NF bias, may be tempted to define less useful
outriggers in other dimensions that might not have such a positive effect.

You should only model outriggers that have far fewer records than the monster
dimensions they are associated with. If any attributes of a proposed outrigger
have a cardinality that approaches that of the dimension, leave them there.

Swappable Dimension Patterns


Problem
Mixed business model customer and product dimensions can contain many sparsely populated exclusive attributes

For organizations with mixed business models, selling heterogeneous products and services to both consumers and businesses, customer and product dimensions can quickly become complex and unwieldy. Each different type of customer or product can have its own set of exclusive attributes (X) that are not valid/relevant for the other types. This can lead to wide, sparsely populated dimension records. For very large customer populations this can lead to performance and usability issues. Although product populations rarely approach the size of customer populations, for heterogeneous products and services the number of product type specific attributes can be very large indeed, causing similar problems.

Solution
Dimensions that contain large groups of exclusive attributes (based on one or more defining characteristic (DC) attributes) can be modeled as sets of swappable dimensions to improve usability and performance. Swappable dimensions are so named because they can be swapped into a query in place of (or in addition to) another swappable dimension that shares the same surrogate key. Figure 6-4 shows swappable sets of customer and product dimensions that would be useful for a mixed business model data warehouse. The main CUSTOMER dimension contains attributes that are common across the entire customer population; this includes the defining characteristic CUSTOMER TYPE [DC1,2] which identifies which of two exclusive groups of attributes are relevant: X1 consumer attributes or X2 business attributes. The swappable CONSUMER and BUSINESS CUSTOMER dimensions contain these common attributes and the exclusive attributes relevant to just their customer type.

More efficient swappable subset dimensions can be created based on defining characteristics

Swappable subset dimensions are easier for many BI users to navigate because they contain only the rows and columns that are relevant to them. For example, BI users working in corporate sales would use the BUSINESS CUSTOMER version of CUSTOMER—using database synonyms, it can be renamed to be their default CUSTOMER dimension. Because BUSINESS CUSTOMER only contains corporate customers they would see only the business attributes they want, and would not have to add where Customer_Type=‘Business’ to every query. If businesses made up only 10% of the customer base, the corporate sales analysts would see a significant performance boost too.

Hot swappable dimensions can be used in place of each other without rewriting queries

Swappable dimensions that have identical column names are referred to as hot swappable, because they can be used in place of each other without rewriting any queries. Hot swappable dimensions can be used to implement restricted access (row-level security), study groups, sample populations, national language translation and alternative CV/HV reporting views (see the Hybrid SCD pattern covered later in this chapter).

Figure 6-4: Swappable dimensions

Swappable Dimension Consequences


Swappable dimensions increase the number of database objects that need to be
maintained. You should only create them as physical tables if they improve per-
formance. For databases that support variable length column types, sparsely
populated exclusive attributes can take up very little space and cause less of a
performance issue. In which case, swappable dimensions that are needed only to
improve usability can be implemented as database views.
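A usability-only swappable dimension implemented as a view might look like this sketch. The schema and attribute names are invented for illustration:

```python
import sqlite3

# A swappable subset dimension as a view over the common CUSTOMER
# dimension, filtered on the defining characteristic CUSTOMER_TYPE.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customer (
    customer_key  INTEGER PRIMARY KEY,
    customer_name TEXT,
    customer_type TEXT,   -- defining characteristic [DC]
    industry      TEXT,   -- exclusive business-only attribute
    gender        TEXT);  -- exclusive consumer-only attribute
INSERT INTO customer VALUES
    (1, 'Jane Doe',  'Consumer', NULL,     'F'),
    (2, 'Acme Ltd',  'Business', 'Retail', NULL),
    (3, 'Bob Smith', 'Consumer', NULL,     'M');
-- Corporate sales analysts see only business rows and business columns,
-- with no need to repeat WHERE customer_type='Business' in every query:
CREATE VIEW business_customer AS
    SELECT customer_key, customer_name, industry
    FROM customer WHERE customer_type = 'Business';
""")
rows = con.execute(
    "SELECT customer_name, industry FROM business_customer").fetchall()
print(rows)
```

Because the view shares the CUSTOMER surrogate key, it joins to the same fact tables as the full dimension.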

Customer Relationships: Embedded Whos


Who dimensions can often contain references to other whos that need to be fully described

The stakeholder’s specification for the CUSTOMER dimension, in Figure 6-5, contains attributes that relate each business customer to two other whos: an employee playing the role of ACCOUNT MANAGER [HV], and another customer playing the role of PARENT COMPANY [CV]. Both of these embedded whos need to be fully described with HV or CV attributes for reporting purposes.

Figure 6-5: CUSTOMER dimension with embedded who attributes

Embedded whos can be remodeled as separate who dimensions

ACCOUNT MANAGER should be modeled as a separate dimension using a combination of the mini-dimension and sensible snowflaking patterns, as shown in Figure 6-6. The ACCOUNT MANAGER attribute is implemented as a foreign key to the EMPLOYEE dimension. This makes all the account manager’s descriptive attributes available without bloating the CUSTOMER dimension. To prevent this potentially volatile relationship between account managers and customers from turning CUSTOMER into a rapidly changing dimension, the foreign key is defined as CV, so that it represents the current account manager. The stakeholder’s requirement for reporting by the HV account manager is handled by a fact table relationship. As with the mini-dimension pattern, the direct relationship can be defined as a shortcut join, or a second view can be created on EMPLOYEE to act as an outrigger playing the role of CURRENT ACCOUNT MANAGER.

Figure 6-6: Modeling an embedded who as a separate HV dimension and a CV outrigger

Recursive Relationship
A who within a who of the same type is a recursive relationship

Looking at the PARENT COMPANY examples in Figure 6-5, you can see that it contains companies that are present as customers in the CUSTOMER dimension. This represents a recursive relationship which would be drawn in ER notation, as in Figure 6-7, with a M:1 relationship between the customer entity and itself. The relationship documents that each customer may own one or more customers and each customer may be owned by one customer.

Figure 6-7: M:1 recursive relationship or “head scratcher”

This relationship can be implemented in CUSTOMER by replacing the remaining embedded who with another foreign key. But this time the new foreign key, PARENT KEY, will not refer to a different dimension; instead it will point back to the primary key of CUSTOMER itself. This makes PARENT KEY a recursive foreign key, denoted by the code RK in Figure 6-8.

Figure 6-8: BEAM✲ recursive relationship

Recursive relationships are often used to represent variable-depth hierarchies

Recursion is a very efficient way of storing parent-child relationships. However, merely storing the information is only a fraction of the DW/BI story—stakeholders have to be able to use it for analysis. Because a parent customer can in turn be owned by another customer and that in turn owned by another, the recursive relationship can represent a variable-depth hierarchy. Wherever a hierarchy exists stakeholders will invariably want to use it to explore the facts. An ownership hierarchy offers the possibility of rolling up individual customer activity to the topmost owners to see who becomes significant when all their indirect sales revenue is consolidated. This would be quickly followed by drill-down analysis on ownership to see a breakdown of sales revenue by subsidiaries.

Variable-Depth Hierarchies
Customer ownership is a classic example of a variable-depth hierarchy, best illustrated by an organization chart

Customer ownership is a classic example of a variable-depth hierarchy. Some business customers will be self-contained or privately-owned companies, representing a hierarchy of only a single level. But other customers may be the top, middle, or bottom of a deep hierarchy of corporate ownership (stretching all the way to Liechtenstein or Delaware!). The Figure 6-9 organization chart reveals that all the Figure 6-8 example customers are ultimately owned by Pomegranate.

If you have difficulty giving meaningful names to the levels in a hierarchy, it is a strong clue that you need to model it as a variable-depth hierarchy. The levels in a balanced or ragged hierarchy normally have distinct names because they represent different things or concepts; for example, day and month in a balanced time hierarchy or street and country in a ragged geographic hierarchy. The levels in a variable-depth hierarchy do not have distinct names, because they typically represent things of the same type; for example, each level in a variable-depth customer ownership hierarchy is a customer.

Figure 6-9: Variable-depth hierarchy

Hierarchy Map Pattern


Problem/Requirement
Recursive relationships are difficult to query for BI purposes

A consulting firm that has billed each one of the customers in Figure 6-9 separately wants to use company ownership information to aggregate all billing facts by parent companies. The data is available in the billing system, stored as a recursive relationship, referred to as a “pig's ear” or “head scratcher” by its designers—because of its shape (see Figure 6-7). Unfortunately, both terms are also very accurate descriptions of how difficult it is to query recursive relationships. The problem is, recursive relationships cannot be navigated using the standard non-procedural SQL generated by most BI tools. Although some databases have recursive extensions to SQL (for example, Oracle’s CONNECT BY), they are not supported by BI tools, and seldom perform adequately against data warehousing volumes of fact data.
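For databases that support ANSI recursive common table expressions (the standard alternative to Oracle's CONNECT BY), navigating a pig's ear looks like the following sketch. The three-company ownership chain is invented for illustration, as is the schema:

```python
import sqlite3

# Navigating a recursive parent-child relationship with WITH RECURSIVE
# (SQLite syntax). This is exactly the kind of procedural traversal
# that most BI tools cannot generate for ad-hoc queries.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customer (
    customer_key  INTEGER PRIMARY KEY,
    customer_name TEXT,
    parent_key    INTEGER REFERENCES customer(customer_key));
INSERT INTO customer VALUES
    (1, 'Pomegranate', NULL),
    (2, 'PicCzar Movies', 1),
    (3, 'PicCzar Distribution', 2);
""")
# Walk down from the topmost owner, tracking each company's depth:
rows = con.execute("""
WITH RECURSIVE subsidiaries(customer_key, customer_name, level) AS (
    SELECT customer_key, customer_name, 1
    FROM customer WHERE parent_key IS NULL
    UNION ALL
    SELECT c.customer_key, c.customer_name, s.level + 1
    FROM customer c JOIN subsidiaries s ON c.parent_key = s.customer_key
)
SELECT customer_name, level FROM subsidiaries ORDER BY level
""").fetchall()
print(rows)
```

Each level of ownership requires another pass of the recursive member, which is why this approach rarely performs well when joined to warehouse-scale fact tables at query time.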

Profile the data to see if it represents a simple balanced hierarchy

Because of the challenges involved in making recursive relationships report-friendly, the first thing to do, when you spot one, is to profile the data to see whether it actually represents a variable-depth hierarchy or not. If the data itself represents a balanced hierarchy with a fixed number of levels, or a hierarchy that is only “slightly” ragged, then the design can be kept simple by flattening (denormalizing) the data into a fixed number of well-named hierarchical attributes within a standard dimension.

Check that variable depth is necessary and the hierarchy cannot be simplified, before implementing a complex solution

If data profiling confirms that there is a variable-depth hierarchy, it is worth double-checking that the variable depth is truly required for analysis purposes. If it is and it cannot be simplified then the following hierarchy map techniques will help, but they should also motivate you to, whenever possible, balance and fix the depth of all hierarchies that are under your control! For customer ownership analysis, there is no opportunity to simplify the hierarchies involved. You cannot tell a customer like Pomegranate that its ownership hierarchy is more complex than your other clients and ask it to sort itself out. This hierarchy is external—beyond your control—and must be represented as is.

Solution
Hierarchy maps store variable-depth hierarchies in a BI-friendly format

A hierarchy map is an additional table that resolves a recursive relationship by storing all the distant parent-descendant relationships it represents. Recursive relationships record only immediate parent-child relationships, whereas a hierarchy map stores every parent-parent, parent-child, parent-grandchild, parent-great-grandchild relationship, and so on, no matter how distant. Its structure is best explained by looking at the BEAM✲ diagram in Figure 6-10 which shows COMPANY STRUCTURE, a hierarchy map for the Figure 6-9 customer ownership hierarchy. It is documented as CV, HM to denote that it is a current value hierarchy map: it records only the current ownership hierarchy because it is based on the CV definition of PARENT COMPANY in Figure 6-5.

Figure 6-10: Hierarchy map table

Hierarchy maps explode all the hierarchical relationships (it’s not a big bang)

The first thing you notice about COMPANY STRUCTURE is that it contains far more rows than the original CUSTOMER dimension. This may explain why the technique is sometimes referred to as a hierarchy explosion. But don’t worry—it’s not a very big bang! The row count is rarely an order of magnitude higher, and hierarchy maps are quite narrow—made up of a pair of surrogate keys and just a few useful counters and flags. Table 6-1 describes these attributes for COMPANY STRUCTURE.

A hierarchy map treats each dimension member as a parent and records all its child, grandchild, etc. relationships

COMPANY STRUCTURE contains 11 rows where Pomegranate is the parent: one for each subsidiary customer on the organization chart in Figure 6-9. Explicitly storing a relationship between all Pomegranate subsidiaries and their topmost parent makes it easy to answer any Pomegranate-related parent questions. If they were the only questions, these would be the only rows needed in the map, but to support fully flexible ad-hoc reporting to any level of ownership the map needs to contain additional rows where each of the subsidiary customers is treated as the parent of its own small hierarchy.

Table 6-1: COMPANY STRUCTURE Attributes

Parent Key (Surrogate Key): Foreign key to the CUSTOMER dimension playing the role of parent company. Part of the primary key.

Subsidiary Key (Surrogate Key): Foreign key to the CUSTOMER dimension playing the role of subsidiary company. Part of the primary key.

Company Level (Integer): The level number of the subsidiary company. Level 1 is the highest company in a hierarchy.

Sequence Number (Integer): Sort order used to display subsidiaries in the correct hierarchical order.

Lowest Subsidiary ([Y/N] Flag): Y indicates that the Subsidiary Key is the lowest company in an ownership hierarchy; it is not the owner of any other customer.

Highest Parent ([Y/N] Flag): Y indicates that the Parent Key is the highest company in an ownership hierarchy; it is not owned by any other customer.

PARENT KEY and SUBSIDIARY KEY in Figure 6-11 are documented as SK. They
contain company names for model readability (in true BEAM✲ fashion). The
physical database columns will contain integer surrogate keys.
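One way to "explode" a recursive relationship into a hierarchy map is a recursive CTE that pairs every member with all of its descendants, including itself as its own level-1 parent. This sketch uses invented keys and a tiny four-member hierarchy, and omits the sequence number and flag columns:

```python
import sqlite3

# Populate a hierarchy map from a recursive parent-child relationship.
# company_level here is relative to each map row's parent (1 = the
# parent itself), matching the small-hierarchy-per-parent idea above.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customer (customer_key INTEGER PRIMARY KEY,
                       parent_key INTEGER);
-- 1 owns 2 and 3; 2 owns 4:
INSERT INTO customer VALUES (1, NULL), (2, 1), (3, 1), (4, 2);

CREATE TABLE company_structure (
    parent_key     INTEGER,
    subsidiary_key INTEGER,
    company_level  INTEGER,
    PRIMARY KEY (parent_key, subsidiary_key));

INSERT INTO company_structure
WITH RECURSIVE walk(parent_key, subsidiary_key, company_level) AS (
    SELECT customer_key, customer_key, 1 FROM customer  -- self rows
    UNION ALL
    SELECT w.parent_key, c.customer_key, w.company_level + 1
    FROM customer c JOIN walk w ON c.parent_key = w.subsidiary_key
)
SELECT * FROM walk;
""")
(n,) = con.execute("SELECT COUNT(*) FROM company_structure").fetchone()
print(n)
```

For this data the members-times-level sum is 1×1 + 2×2 + 1×3 = 8 rows, which is what the explosion produces.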

If you know how many members there are at each level you can calculate the size of a hierarchy

You can calculate the number of hierarchy map rows needed for a complete hierarchy by summing the number of members at each level times their level. For the data shown on the organization chart in Figure 6-9 that would be 1×1 + 3×2 + 3×3 + 2×4 = 24 rows. COMPANY STRUCTURE has three more rows to handle slowly changing customer descriptions for customers (Pomegranate and PicCzar Movies) in the HV CUSTOMER dimension (Figure 6-8). They make the calculation 2×1 + 4×2 + 3×3 + 2×4 = 27 rows.

A quick estimate of the number of rows in a hierarchy map is:

dimension members × (max levels – 1)

For the 11 Pomegranate related dimension members (only 9 customers but there
are 2 additional versions of the slowly changed customers) with 4 levels the
estimate would be 33. This simple formula always gives you an overestimate,
which is good! You will be pleasantly surprised when the map is populated.
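The exact count and the quick overestimate above can be checked with a short calculation. This is an illustrative sketch in Python; the function names are mine, and the level populations are taken from the Pomegranate example:

```python
def hierarchy_map_rows(members_per_level):
    """Exact row count for a complete hierarchy map: a member at level n
    appears n times (once for itself and once per ancestor level above it),
    so sum the member count at each level times the level number."""
    return sum(level * count
               for level, count in enumerate(members_per_level, start=1))

def estimate_rows(total_members, max_levels):
    """Quick overestimate: dimension members x (max levels - 1)."""
    return total_members * (max_levels - 1)

# Organization chart in Figure 6-10: 1 member at level 1, 3 at level 2, etc.
print(hierarchy_map_rows([1, 3, 3, 2]))   # 24 rows
print(hierarchy_map_rows([2, 4, 3, 2]))   # 27 rows, with the extra SCD versions
print(estimate_rows(11, 4))               # 33, the quick overestimate
```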
Dimensional Design Patterns for People and Organizations, Products and Services 179

Hierarchy Maps and Type 2 Slowly Changing Dimensions


CV hierarchy maps must contain all the HV surrogate key values to join to every historic fact, even though they do not track hierarchy history.

The observant reader may have noticed three seemingly duplicate records in COMPANY STRUCTURE (Figure 6-10) relating to Pomegranate and PicCzar Movies as subsidiaries. These are revealed to be subtly distinct when you see the surrogate key values in Figure 6-11 which show that the hierarchy map contains each current parent key to historic subsidiary key combination. This is necessary even though the hierarchy map is defined as CV (records the current hierarchy shape only) so that the current hierarchy can still be used to roll up all the fact history. To do this it must contain every historical subsidiary surrogate key so that it can be joined to all the historical facts just like the Type 2 SCD Customer dimension that it is built from. This design requirement is documented by modeling the SUBSIDIARY KEY as HV—even though the PARENT KEY and all other hierarchy map attributes are CV.

Figure 6-11
Hierarchy map with
Type 2 SCD
surrogate keys

When HV customer attributes change, their new surrogate key values must also
be inserted into the company ownership hierarchy map, as new SUBSIDIARY KEY
values even if their ownership remains unchanged.

Using a Hierarchy Map


Parent totals are queried by joining a dimension to the facts through a hierarchy map.

Joining the customer dimension directly to a fact table in the normal way allows queries to report the facts by the customers directly involved. To report the same facts rolled up to parent customer levels you insert the hierarchy map between the customer dimension and the fact table, and join on the parent and subsidiary keys as shown in Figure 6-12. To make the business meaning of joining through the hierarchy map more explicit you should create a role-playing view of the CUSTOMER dimension called PARENT CUSTOMER and use that to define the join path in a BI tool.

Figure 6-12
Using a hierarchy
map table to rollup
revenue to the
parent customer
level
180 Chapter 6 Who and What

Once PARENT CUSTOMER, COMPANY STRUCTURE and BILLING FACT are correctly joined, it becomes simple for stakeholders to ask:

What is the total revenue for Pomegranate, including all of its subsidiaries?
Rolling up descendent facts is straightforward once the hierarchy map is joined correctly.

By constraining a query on Parent_Customer = ‘Pomegranate’, one row from PARENT CUSTOMER joins to the hierarchy map, finding 11 matching subsidiary records. These 11 subsidiary keys are then presented to the fact table, and their revenue is aggregated accordingly. Thanks to the parent-parent rows, where both the parent and subsidiary keys represent Pomegranate, the total revenue automatically includes any work done for Pomegranate directly. To exclude this from the total you simply add Company_Level <> 1 to the query constraint, and only nine keys will be presented to the fact table.
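The rollup query pattern can be sketched with an in-memory SQLite database. The table and column names follow the book's example, but the keys and revenue figures here are invented for illustration:

```python
import sqlite3

# Miniature version of the Figure 6-12 join: the hierarchy map sits between
# the (parent) customer dimension and the fact table.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE company_structure (
    parent_key INT, subsidiary_key INT, company_level INT);
CREATE TABLE billing_fact (customer_key INT, revenue INT);

-- Parent 1 owns 2 and 3; row (1, 1, 1) is the parent-parent row
INSERT INTO company_structure VALUES (1, 1, 1), (1, 2, 2), (1, 3, 2);
INSERT INTO billing_fact VALUES (1, 100), (2, 40), (3, 60);
""")

# Total revenue for parent 1, including work done for the parent directly
total = con.execute("""
    SELECT SUM(f.revenue)
    FROM company_structure h
    JOIN billing_fact f ON f.customer_key = h.subsidiary_key
    WHERE h.parent_key = 1""").fetchone()[0]

# Exclude the parent's own facts by filtering out the parent-parent row
subs_only = con.execute("""
    SELECT SUM(f.revenue)
    FROM company_structure h
    JOIN billing_fact f ON f.customer_key = h.subsidiary_key
    WHERE h.parent_key = 1 AND h.company_level <> 1""").fetchone()[0]

print(total, subs_only)  # 200 100
```

Note how the parent-parent row makes the parent's own revenue part of the grand total by default, exactly as described above.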

Descendent levels can be filtered using the LEVEL and LOWEST columns in the hierarchy map.

Queries can be further refined using COMPANY LEVEL and LOWEST SUBSIDIARY. For example:

To get the total revenue just for customers that are directly owned by Pomegranate, change the constraint to Company_Level = 2

To get the total revenue for only the Pomegranate companies that do not own other customers, add Lowest_Subsidiary = ‘Y’

One of the strengths of the hierarchy map is that all of these questions can be
answered without knowing (or caring) how many subsidiaries or levels there are.

A CV hierarchy map such as COMPANY STRUCTURE that does not track parent
history is not symmetrical for query purposes if its matching dimension contains
Type 2 SCD surrogate keys. You cannot reverse the joins in Figure 6-12 and use it
to roll up all the historical revenue for the parents of a selected subsidiary, be-
cause the map only contains the current surrogate key values for each parent. If
there is a requirement to roll up historical parent facts using current subsidiary
descriptions a different version of the hierarchy map must be built that contains
the full history of parent surrogate keys.

Displaying a Hierarchy
A hierarchy map can be used to display a hierarchy by joining a parent view of a dimension to the dimension.

The example queries described so far use the hierarchy map to aggregate facts to the parent level. But the hierarchy map can also be used to display all the levels of a hierarchy on a report. To do this you join the customer dimension to the parent customer view through the hierarchy map, as shown in Figure 6-13. This gives you both a parent customer name and a (subsidiary) customer name to group on and display in your reports—allowing reports to display facts for each level in the hierarchy. However, to make sense of the hierarchy itself on such reports, the subsidiaries have to be displayed in the correct hierarchy sequence.

Figure 6-13
Using a hierarchy
map to browse a
customer hierarchy
and report facts at
the subsidiary level

Hierarchy Sequence
To display a hierarchy correctly the hierarchy map must contain a hierarchy sequence number that sorts top to bottom before left to right.

Sorting on company name would destroy the hierarchical order. But sorting by hierarchy level is no better, because this would display all the level 1 customers, followed by all level 2 customers, then all level 3 customers, and so on. You would not be able to tell which level 2 customer owns which level 3 customers. To solve this problem the hierarchy map needs a Sequence Number attribute that sorts the nodes in the hierarchy correctly “top to bottom before left to right” as shown in Figure 6-14. The Sequence Number can then be used to sort the descendants of a customer (top to bottom) ahead of the next customer (left to right) at the same level; i.e., it ensures that all the level 3 subsidiaries of a level 2 customer will be displayed before the next level 2 customer is displayed.

Figure 6-14
Hierarchy
sequence
numbers

Sort hierarchy reports by sequence number and indent using level.

The report in Figure 6-15 shows how you use SEQUENCE NUMBER with COMPANY LEVEL to display the hierarchy, by sorting down the page on SEQUENCE NUMBER, and indenting across the page on COMPANY LEVEL. The following snippet of Oracle SQL shows how an indenting Company Name could be defined in a BI tool:

LPAD( ' ', 3*(Company_Level-1)) || Customer_Name

This will indent each level 2 customer name by three spaces, each level 3 by six
spaces, and so on. A level 1 customer would display on the left margin (indented by
zero spaces).
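The same indentation rule can be sketched outside of SQL. This is a hypothetical Python equivalent of the LPAD expression above, using the book's example names:

```python
def indented_name(customer_name, company_level):
    """Indent three spaces per level below the top: level 1 sits on the
    left margin, level 2 is indented 3 spaces, level 3 is indented 6."""
    return " " * (3 * (company_level - 1)) + customer_name

# Rows already sorted by sequence number; levels follow Figure 6-15's style
rows = [("Pomegranate", 1), ("EyeBeeM", 2), ("PicCzar Movies", 3)]
for name, level in rows:
    print(indented_name(name, level))
```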

To allow new nodes to be added to a hierarchy without updating the existing


sequence numbers you can create the initial sequence numbers in increments of
10 or 100 depending on the growth you expect.

Figure 6-15
Hierarchy report,
indented to show
subsidiary level

Consequences
To populate the sequence number column correctly you have to build the hierar-
chy map in hierarchy sequence order. This precludes the use of SQL techniques
that populate the table “a whole level at a time”. It also means that maintenance is
more complicated—if nodes are moved their sequence numbers and the sequence
numbers of many others around them need to be updated. Because this involves
complex coding it is often easier (time permitting) to rebuild (truncate and reload)
the hierarchy map than update it.

Drilling Down on Hierarchy Maps


Drill down analysis on a hierarchy map can be implemented using recursive hyperlinks within browser-based reporting interfaces.

The default drilling features of most BI tools have difficulty working with hierarchy maps, because they expect a fixed number of levels to drill to. However, drilldown can still be achieved by using report hyperlink features available in browser-based BI interfaces. You could create a report similar to Figure 6-15 that only shows the immediate level 2 subsidiaries for a selected owner with the subsidiary names formatted as hyperlinks. When BI users click on a link the same report is called again, passing the selected subsidiary as a parameter to the report. The newly invoked report will then show the next level down in the hierarchy—the subsidiary’s level 2 subsidiaries. Even though hierarchy maps remove recursion from the data model to keep queries simple, you can still take advantage of procedural recursion to implement efficient variable-depth drilling by recursively calling the same report.

Querying Multiple Parents


Queries that do not constrain to a single parent have to be careful not to over-count descendent facts. Use the HIGHEST PARENT flag to filter out partial hierarchies and avoid over-counting.

Care must be taken when queries do not constrain to a single parent company. A revenue query that simply groups by Parent Customer and returns Pomegranate £30M and PicCzar £5M must not sum these to a grand total of £35M. As you can see from the report in Figure 6-16 the Pomegranate total of £30M already includes the £5M PicCzar revenue because it’s a subsidiary. Perhaps what the consulting firm would prefer is a report that showed Pomegranate £30M, EyeBeeM £15M, MegaHard £27M, and so on; i.e., a report showing total revenue for all the top level clients without listing any of their subsidiaries. This is where the HIGHEST PARENT flag (see Figure 6-10) is useful. By constraining on Highest_Parent = ‘Y’ a query will include only the full hierarchy for each top-most customer, and the revenue figures for each of its subsidiaries will be summarized only once.

Queries that include multiple parent customers without constraining on highest parents only, must fetch subsidiaries distinctly to avoid overstating the facts. For example, a query that asks for the total revenue for all customers with any parents in California must present a distinct list of SUBSIDIARY KEYs to the fact table before summing revenue, otherwise the revenue will be double counted for any subsidiary that has both a parent and grandparent in California.

For example SQL that handles subsidiaries distinctly while querying multiple
parents, see: The Data Warehouse Toolkit, Second Edition, Ralph Kimball, Margy
Ross (Wiley, 2002) page 166.
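The double-counting problem, and the distinct-subsidiary fix, can be sketched in SQLite. The data here is invented: customer 3 has both a parent (2) and a grandparent (1) matching the filter, so a naive join counts its revenue twice:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE company_structure (parent_key INT, subsidiary_key INT);
CREATE TABLE billing_fact (customer_key INT, revenue INT);

-- Customer 3 has two matching ancestors: parent 2 and grandparent 1
INSERT INTO company_structure VALUES (1, 2), (1, 3), (2, 3);
INSERT INTO billing_fact VALUES (2, 40), (3, 60);
""")

# Naive join: customer 3's revenue is reached through both ancestors
naive = con.execute("""
    SELECT SUM(f.revenue)
    FROM company_structure h
    JOIN billing_fact f ON f.customer_key = h.subsidiary_key
    WHERE h.parent_key IN (1, 2)""").fetchone()[0]   # double counts 60

# Correct: present a DISTINCT list of subsidiary keys to the fact table
correct = con.execute("""
    SELECT SUM(f.revenue) FROM billing_fact f
    WHERE f.customer_key IN (
        SELECT DISTINCT subsidiary_key
        FROM company_structure
        WHERE parent_key IN (1, 2))""").fetchone()[0]

print(naive, correct)  # 160 100
```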

Building Hierarchy Maps


Recursive SQL can be used by ETL to load hierarchy maps.

Earlier we stated that proprietary recursive extensions to SQL are not suitable for ad-hoc BI queries against recursive relationships, but they can be used very successfully during ETL processing to unravel recursive relationships and populate hierarchy maps.
You can download an Oracle PL/SQL stored procedure for loading the COMPANY
STRUCTURE hierarchy map from modelstorming.com. You will also find SQL for
creating the map and a CUSTOMER dimension populated with the Pomegranate
example data which you can use to test it or your own ETL hierarchy map loader.
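The downloadable loader is Oracle PL/SQL; as an illustration only, the unravelling step can also be sketched with a standard-SQL recursive common table expression, here run through SQLite. The three-customer chain and the relative level numbering (depth within each parent's own sub-hierarchy) are my assumptions for the sketch:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customer (customer_key INT, parent_key INT);
-- A three-level chain: 1 owns 2, and 2 owns 3
INSERT INTO customer VALUES (1, NULL), (2, 1), (3, 2);
""")

rows = con.execute("""
WITH RECURSIVE map(parent_key, subsidiary_key, company_level) AS (
    -- every customer is the level-1 'parent' of its own sub-hierarchy
    -- (these seed rows become the parent-parent rows)
    SELECT customer_key, customer_key, 1 FROM customer
    UNION ALL
    -- walk down: extend each map row to the children of its subsidiary
    SELECT m.parent_key, c.customer_key, m.company_level + 1
    FROM map m JOIN customer c ON c.parent_key = m.subsidiary_key
)
SELECT parent_key, subsidiary_key, company_level
FROM map ORDER BY parent_key, company_level, subsidiary_key
""").fetchall()

for r in rows:
    print(r)
```

For this chain the map holds six rows, which agrees with the row-count formula given earlier (one member per level: 1×1 + 1×2 + 1×3 = 6).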

Tracking History for Variable-Depth Hierarchies


When you are designing a hierarchy map to record history, you have to decide
which of three histories you are going to track inside the map:

Parent history is tracked by adding every HV parent key value to the hierarchy map.

Parent History: tracking changes to the HV dimensional attributes of a parent level; for example, when a parent company is reclassified or a manager’s salary grade changes. This involves populating the hierarchy map with every surrogate key value for a parent and adding new rows with the new parent key value for every level in its hierarchy every time a parent HV attribute changes.

Hierarchy history can be tracked by adding effective dates to each hierarchy map record.

Hierarchy History: tracking changes to the hierarchical relationships; for example, a parent company sells a subsidiary or an employee starts reporting to a new manager. This involves the effective dating of all the rows in the hierarchy map. New rows are added to the hierarchy map with the appropriate effective date when new children are added to a hierarchy and end dates are adjusted on existing hierarchy relationships when they are changed or deleted. A change/move—for example, when an employee reports to a different manager—is handled as a logical deletion and a new relationship.

Child history is tracked by adding every HV child key value to the hierarchy map. This should be the default for most hierarchy maps.

Child History: tracking changes to HV attributes of a child level; for example, the location of a subsidiary company or an employee’s marital status changes. This involves populating the hierarchy map with every surrogate key value for a child and adding new rows with the new child key value for every parent level above it, every time a child’s HV attribute changes. For a hierarchy map built from an HV dimension, such as COMPANY STRUCTURE, it must at least track child history to correctly join to child level facts and rollup all their history.

Tracking the historical version (HV) of a variable-depth hierarchy is a particularly vexing design challenge, not to be undertaken simply because all other attributes of a dimension have been defined as HV.

Historical Value Recursive Keys


To track full history a recursive key in the dimension must be defined as HV.

If you want to expand COMPANY STRUCTURE to track full hierarchy history so that the historically correct ownership hierarchies could be rolled up or filtered using historically correct parent company values, the PARENT KEY [RK] in the CUSTOMER dimension must first be redefined as HV and the dimension populated accordingly with the historically correct recursive data.

The Recursive Key Ripple Effect


HV RK attributes cause a ripple effect. A change to a parent HV attribute will cause all its descendants to be changed and new records to be added to the hierarchy map.

Figure 6-16 shows what happens to the CUSTOMER dimension when a change occurs to a high level parent company, such as Pomegranate, if PARENT KEY is defined HV RK. When Pomegranate is upgraded from “Good” to “Great” a new record—with a surrogate key value of 106—is created to record the change to the HV attribute CUSTOMER CATEGORY. This new CUSTOMER KEY value of 106 must be reflected in the PARENT KEY of all of Pomegranate’s children. As PARENT KEY is also HV, new child customer records must also be created to preserve its history. This causes a ripple effect as each new CUSTOMER KEY must in turn be reflected in the PARENT KEY of its children right down to the bottom of the ownership hierarchy. What would have been a micro-level change to a single customer (Pomegranate) becomes a macro-level change to 9 customers in total—Pomegranate and all of its 8 subsidiaries. This is startling enough, but these 9 new rows in the Customer dimension translate to 25 new rows in the COMPANY STRUCTURE hierarchy map.

Ripple effect growth can be manageable if a dimension contains a small number of small hierarchies.

Using an HV recursive key to track every parent or child change will cause a dimension to grow more quickly, but the technique is still viable if hierarchies make up a small amount of the data. For example, if only a small percentage of customers are owned by another customer (PARENT KEY is mainly NULL) and ownership hierarchies are typically only a couple of levels deep, the resulting additional growth would be manageable.

Figure 6-16
Recursive key
ripple effect

Ripple Effect Benefits


The surrogate key ripple effect keeps the correct HV joins between parent, hierarchy map and fact table simple and efficient.

Using surrogate keys to track every type of change ensures that HV hierarchy maps will correctly join the correct historical facts to the correct historical hierarchies using simple SQL. Each new surrogate key will automatically join the historical facts to the correct historical parent version through the correct hierarchical path using simple inner joins with no additional date logic—just like a normal slowly changing dimension. So although the HV hierarchy map requires additional rows to record each historical version of a hierarchy, thanks to the ripple effect, its structure and usage remain the same.

Small and relatively stable variable-depth hierarchies can be tracked using an HV recursive key, just like any other HV attribute. While the HV recursive key will cause some additional growth in the dimension, it keeps the joins between hierarchy maps and fact tables simple and efficient.

Ripple Effect Problems


HR hierarchies are too large and volatile for their HV recursive keys to be stored in employee dimensions.

Unfortunately some variable-depth hierarchies are too large or too volatile to be tracked using an HV recursive key. A human resources hierarchy is a classic example of this, because it is one single hierarchy which contains all the employees in the employee dimension. A minor HV change at the highest level would result in the entire active employee population being issued with new recursive keys as the change ripples down to the ‘shop floor’.

Avoid tracking history for large or volatile variable-depth hierarchies by using an HV recursive key, because it will cause explosive growth in the dimension. Instead, track hierarchy history outside of the dimension by adding effective dating attributes to hierarchy maps. See the HV MV HM pattern shortly.

Employee Dimensions
HV Employee dimensions are typically Type 2 SCDs.

After customers, employees are the next most interesting who for BI. Thankfully, because there are usually far fewer employees than customers, the Type 2 SCD technique can work well for tracking the majority of employee HV attributes. But employee dimensions are not without their challenges. More descriptive information may be known and recorded about employees and the departments they work in. If that information is tracked historically it can lead to additional BI requirements to analyze the organization, as represented by its employees, using current, previous, historical and year-end descriptions.

In The Data Warehouse Toolkit, Second Edition, Ralph Kimball, Margy Ross
(Wiley, 2002) Chapter 8, “Human Resource Management” covers many of the
basic issues involved in supporting Type 2 SCD Employee dimensions.

Hybrid SCD View Pattern


Problem/Requirement
HV/CV attribute requirements.

Employee attributes have been defined as HV/CV so that stakeholders can use the historically correct values for “as was” reporting by default but can also use current values for “as is” reporting; for example, the stakeholder question

What were the annual expenses by employee location for the last 5 years?

requires the HV location where employees were based when they incurred the expenses. Whereas the question

What are the total expenses over the last 5 years for every employee currently based in the London office?

doesn’t care where employees were based in the last 5 years; it only needs their CV location to filter on.

Solution
Create separate HV and CV swappable dimensions.

Create and maintain an HV version of EMPLOYEE using Type 2 SCD ETL processing to satisfy the default reporting requirements. Create a separate current value swappable dimension (CV SD) from the HV employee dimension. Figure 6-17 shows how a CV swappable version of EMPLOYEE can be defined by joining the HV EMPLOYEE dimension to a copy of itself constrained to current employee definitions. In the example every version of James Bond is joined to the current James Bond. The resulting CV SD dimension initially appears rather wasteful, containing 3 identical Bonds but on closer examination you notice that each Bond has a different surrogate key value.

A CV hot swappable dimension can be built as a self-join view of an HV dimension.

The self-join, in Figure 6-17, picks the historical EMPLOYEE KEYs and the current descriptive values. When this identical-looking EMPLOYEE dimension is used instead of the original HV version it will roll up all of Bond’s facts from his 3 different eras (EMPLOYEE KEY 1010, 2099 and 2120) to a single location of London or a single status of “Widowed”. Because the CV and HV swappable dimensions are identically described they can be “hot swapped” for each other to change the temporal focus of a query without rewriting any SQL.

Figure 6-17
Defining a CV
swappable
dimension

CV swappable dimensions can be built as views but for better query performance
store them as real tables or materialized views. The small amount of “wasted”
space avoids having to do the self-join inside every query.
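The self-join view can be sketched in SQLite. The employee versions follow the James Bond example in Figure 6-17; the column names and key values are invented for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE employee (
    employee_key INT, employee_id TEXT, city TEXT, current_flag TEXT);
-- Three Type 2 SCD versions of the same employee; only the last is current
INSERT INTO employee VALUES
    (1010, '007', 'Kingston', 'N'),
    (2099, '007', 'Istanbul', 'N'),
    (2120, '007', 'London',   'Y');

-- CV swappable dimension: historical surrogate keys, current descriptions
CREATE VIEW employee_cv AS
SELECT hv.employee_key, cv.employee_id, cv.city
FROM employee hv
JOIN employee cv ON cv.employee_id = hv.employee_id
                AND cv.current_flag = 'Y';
""")

rows = con.execute(
    "SELECT employee_key, city FROM employee_cv ORDER BY employee_key"
).fetchall()
print(rows)  # every historical key now carries the current city
```

Because every historical surrogate key survives in the view, it joins to all the historical facts, yet groups them under the single current description: swapping this view for the HV dimension changes the temporal focus of a query without touching its SQL.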

HV and CV swappable dimensions can be used in the same query to provide current and historical values.

The HV and CV swappable dimensions are not mutually exclusive: both can be joined to a fact table in the same query, to group or filter on current and historical values simultaneously. If this is a common requirement, you can build a more query efficient hybrid HV/CV dimension by selecting both the CV and HV versions of attributes and co-locating them in the same swappable dimension to provide side by side attributes for easy comparisons and more ambitious queries; for example, a hybrid EMPLOYEE dimension would allow a query to group by HISTORICAL CITY while filtering on CURRENT CITY.

The self-join pattern can be used to create Year-End dimensions for “as at” reporting.

As well as creating CV swappable dimensions the self-join view technique can be used to create Year-End dimensions for “as at” reporting; for example, a Year-End dimension for 4 April 2011 can be created by replacing the CV.Current = ‘Y’ constraint in the view definition with:

‘4/4/2011’ between CV.Effective_Date and CV.End_Date

Consequences
The CV swappable dimension is a “gold star” agile design pattern. As long as you
have tracked history for a dimension from day one, a CV view can be added at any
time, when CV reporting requirements emerge, without increasing ETL program-
ming effort (if implemented as a materialized view) or rewriting any existing
queries, because the view is hot swappable.

Don’t trust CV only requirements. Build HV dimensions and deliver CV views.

Having said that, it is often the case that CV reporting is the first choice. Stakeholders will define attributes as CV/HV rather than HV/CV (with the CV default first) because they initially want BI solutions to mimic the CV only perspective of existing operational reporting systems. Or perhaps, stakeholders just define CV attributes because they simply cannot see a need for history—yet. Either way, unless you are dealing with a very large dimension (customer), you should default to HV ETL processing, hide the HV dimension and make a CV view available. That way, when stakeholders finally demand HV reports, you can simply swap views rather than reload the entire warehouse.

Previous Value Attribute Pattern


Problem/Requirement
CV/PV attribute requirements.

Occasionally, BI users will simply need the previous value of a dimensional attribute rather than its full history. This can be sufficient when there is a “one off” macro-level change such as the renaming/relocating of branch offices. Previous values can also be necessary when BI users want to look at “alternative realities”. By running “as previously” reports they can see what things would be like if a change had not occurred. Stakeholders can define previous value requirements by documenting an attribute as CV/PV.

Solution
Implement separate CV and PV attribute columns.

CV/PV attributes are implemented by defining additional PV columns (also known as Type 3 SCDs). Figure 6-18 shows PREVIOUS TERRITORY PV1 added to the EMPLOYEE dimension. This is marked PV1 to link it to the current value TERRITORY attribute marked CV1. During ETL processing, whenever TERRITORY is updated its existing value is saved into PREVIOUS TERRITORY prior to storing the new value. PV attributes work well for small numbers of attributes that must be tracked but do not change frequently, because they only allow users to go back one version. Multiple PV attributes would allow for more versions; for example TERRITORY LAST YEAR PV1, TERRITORY 2YR AGO PV1, all linked to the current TERRITORY CV1, but can soon become unwieldy.

CV/PV : Current and previous value requirement.


CVn : Current Value attribute linked to a PVn previous value attribute.
PVn : Previous Value attribute. Always linked to a CVn attribute.

Figure 6-18
Implementing a
previous value
attribute

PV attributes can be used to hold initial or “as at specific date” values; for exam-
ple, INITIAL TERRITORY PV1 or YE2011 TERRITORY PV1.

Previous Value Attribute Consequences


Defining a small number of hard-coded PV attributes can work well but maintain-
ing large sets of hard-coded PV attributes within a dimension is cumbersome for
both ETL and BI. Instead, define an HV only dimension and provide PV (and CV)
attributes through hot swappable dimension views.

Human Resources Hierarchies


HR hierarchy maps can be challenging because employees are highly interconnected.

Organization reporting structures are another example of variable-depth hierarchies. These human resources (HR) hierarchies can be even more challenging than customer ownership hierarchies due to their high level of interconnection. Employees are far more related to one another than customers are: all employees ultimately work for the same parent, the CEO. This results in an HR hierarchy map containing a single large volatile hierarchy, rather than thousands or millions of small relatively stable ones. This, coupled with greater availability of data, and requirements to track history more precisely, can make HR hierarchies the most difficult hierarchies to implement in the data warehouse.

Multi-Valued Hierarchy Map Pattern


Problem/Requirement
HR hierarchies with dotted-line relationships must roll up employee activity to multiple managers.

The Pomegranate organization chart, in Figure 6-19, shows one of the main complexities of HR hierarchies: employees can report to more than one manager. When employee activity is rolled up to a manager or department level these multiple relationships need to be taken into account. The problem is illustrated by James Bond, who reports to M, but also has a dotted-line relationship with George Smiley. This dotted-line represents a temporary or part-time posting with a full-time equivalence (FTE) of 20%. One must therefore assume, as Smiley might say, that M receives 80% of Bond’s efforts. The dotted-line makes this a multi-parent variable-depth hierarchy as defined in Chapter 3, with the possibility of multiple immediate parents (managers) for any child (employee).

Figure 6-19
HR hierarchy
with a dotted line
relationship

A multi-parent hierarchy is represented by an M:M recursive relationship.

A multi-parent hierarchy is another example of a variable-depth hierarchy that can be represented in a source system by a recursive relationship, only this time it is a many-to-many (M:M) recursive relationship, as shown in Figure 6-20. The M:M relationship requires an additional association table containing a pair of employee foreign keys.

Figure 6-20
M:M recursive
relationship

Solution
A multi-valued hierarchy map (MV, HM) is used to represent a multi-parent hierarchy.

M:M recursive relationships can be recorded in a multi-valued hierarchy map (MV, HM) simply by storing additional rows for the multiple parent relationships but will require additional attributes (Role Type and FTE) to describe the meaning and value of each parent-descendent relationship correctly. Figure 6-21 shows the multi-valued hierarchy map REPORTING STRUCTURE [CV, MV, HM] populated with all the employee relationships documented on the Figure 6-19 organization chart. The first notable thing about this hierarchy map is the number of Bond records. Hierarchy maps always contain more records for the lowest levels because they are repeated for all the parent levels above them. But in the case of Bond the number is inflated by his dual roles. The easiest way to understand why this occurs is to imagine that there are two Bonds, one directly under each of his managers. This gives you an idea of how the hierarchy map for a large organization with highly interconnected staff and a deep reporting structure can grow (especially if you were to track its history).

MV: Multi-Valued dimension or multi-valued hierarchy map (when used in conjunction with HM). Typically contains a weighting factor.

Figure 6-21
HR hierarchy map

Additional Multi-Valued Hierarchy Map Attributes


MV hierarchy maps contain additional rows to represent the multiple parent relationships and additional columns to describe their type and value.

REPORTING STRUCTURE contains two additional attributes to cope with the dotted-line HR relationships:

Role Type allows the HR map to record whether a manager-to-employee relationship is permanent line management (solid line) or temporary project management (dotted line).

Weighting Factor allows queries to allocate an employee’s activity to his multiple managers based on the FTE of his role. For example, a weighted revenue measure would be defined as: Sum(Revenue × Weighting_Factor).

Multi-parent who hierarchies are by no means limited to HR. The earlier customer
relationship hierarchy would also be multi-parent if it had to support fractional
company ownership or joint ventures. Family trees are multi-parent hierarchies!

All multi-parent hierarchies—even balanced ones with a fixed number of named levels—require hierarchy maps to store their multiple parent relationships, together with their appropriate weighting factors.

Handling Multiple Weighting Factors


Weighting factors become more complex when there are multiple dotted-line relationships.

The weighting factors in Figure 6-21 are relatively straightforward, because there is only one part-time employee (Bond) and he does not manage anyone else. However, if circumstances were to change, the allocations become more complicated. For example, if Moneypenny permanently divides her time equally between M and Bond, a new solid-line needs to be drawn between her and Bond worth 50% FTE, as shown in Figure 6-22.

Figure 6-22
Rolling up
weighting factors

Distant relationship weighting factors are calculated by multiplying the weighting factors of the intermediate direct relationships.

In the recursive source data this change would create one new association record in the EMPLOYEE_EMPLOYEE table (Figure 6-20) for Bond-Moneypenny, and update the existing M-Moneypenny record to 50% FTE. In the unraveled hierarchy map more work is required, because records for each new distant relationship need to be inserted and all the existing distant relationships need to be updated with the appropriate weighting factors. Figure 6-22 shows how the new direct Bond-Moneypenny permanent relationship also creates a distant temporary relationship between Smiley and Moneypenny, with a weighting factor of 10%. This is calculated by multiplying all the weighting factors of the direct relationships between the two distant employees: 0.2 × 0.5 = 0.1 or 10%.

Updating a Hierarchy Map


If hierarchy history is not required, it is often easier to drop and rebuild hierarchy maps rather than update them

Figure 6-23 shows 6 new rows and 9 updated rows in the REPORTING STRUCTURE hierarchy map after all the necessary processing has been performed. Given that the table was initially 18 rows, dealing with only one new relationship has updated 50% of the original table, and caused it to grow by 33%. Because seemingly small changes can cause a large number of complex updates to a hierarchy map, dropping and rebuilding the map is often easier than updating it—if you do not need to track history.
Dimensional Design Patterns for People and Organizations, Products and Services 193

Figure 6-23 Updating a hierarchy map
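The drop-and-rebuild approach amounts to recomputing the map's transitive closure from the direct (manager, employee, FTE factor) associations, multiplying weighting factors along each path. A minimal sketch in Python—the function and data names are illustrative, and the Smiley-Bond 20% FTE relationship is an assumption suggested by the 0.2 × 0.5 calculation above:

```python
from collections import defaultdict

# Rebuild a multi-parent hierarchy map from recursive source associations.
# Distant relationship weighting factors are the product of the weighting
# factors of the intermediate direct relationships.

def rebuild_hierarchy_map(direct):
    """direct: list of (manager, employee, weighting_factor) records.
    Returns (ancestor, descendant, combined_factor) rows for every path."""
    reports = defaultdict(list)
    for manager, employee, factor in direct:
        reports[manager].append((employee, factor))

    rows = []

    def walk(ancestor, node, factor_so_far):
        for employee, factor in reports.get(node, []):
            combined = factor_so_far * factor
            rows.append((ancestor, employee, combined))
            walk(ancestor, employee, combined)  # extend to distant levels

    for manager in list(reports):
        walk(manager, manager, 1.0)
    return rows

direct = [("Smiley", "Bond", 0.2),      # assumed: Bond reports to Smiley at 20% FTE
          ("Bond", "Moneypenny", 0.5),  # the new 50% FTE relationship
          ("M", "Moneypenny", 0.5)]     # Moneypenny's remaining 50% to M
rows = rebuild_hierarchy_map(direct)
# The distant Smiley-Moneypenny row gets 0.2 * 0.5 = 0.1 (10%)
```

Rebuilding from scratch like this sidesteps the insert/update bookkeeping described above, at the cost of losing history—which is why it only suits maps that do not need to track change.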

Historical Multi-Valued Hierarchy Maps


To track history, large volatile, multi-valued hierarchy maps cannot rely on HV recursive keys. Instead effective dating must be used

Earlier, we described how the customer ownership hierarchy map was able to track full hierarchy history without schema modification if it was populated from an HV version of its recursive key held in the customer dimension. This is not an option for the multi-parent HR hierarchy because multiple MANAGER KEYs cannot be stored in an employee dimension. Even if HR was a single parent hierarchy, it is too large and too volatile for an HV recursive key to be viable—the ripple effect would cause uncontrollable growth in the dimension. Instead the HR hierarchy map must be modified to use effective dating to track history.

Effective date, end date and a current flag should be added to HV, MV, HM tables

Figure 6-24 shows an HV version of the REPORTING STRUCTURE hierarchy map that includes the effective dating attributes EFFECTIVE DATE, END DATE and CURRENT typically found in HV dimensions. These attributes allow BI users to browse both the current hierarchy (Where CURRENT = ‘Y’) and any point-in-time hierarchy (e.g., where ‘31/3/2011’ Between EFFECTIVE_DATE and END_DATE). To understand how this hierarchy map records changes to an employee, a manager and their relationship, take a look at the timelines in Figure 6-25 for Bond, M and Bond’s HR relationships during the first six months of 2011.

Consequences
With its effective dated relationships, the REPORTING STRUCTURE [HV] map can be used to roll up employee facts using historically correct hierarchy descriptions, but its join to fact tables is more complex than before. When the hierarchy map did not contain historical relationships, the join to SALES FACT was simply:

Where Reporting_Structure.Employee_Key = Sales_Fact.Employee_Key



Figure 6-24 HV hierarchy map with effective dating

Effective dating must be used to correctly join the hierarchy map to fact tables

Now the multiple historically correct versions of the hierarchical relationships must be joined to the correct historical facts to avoid over-counting. This requires a complex (or theta) join involving the hierarchy map effective dates and the primary time dimension of the fact table:
Where Reporting_Structure.Employee_Key = Sales_Fact.Employee_Key
and Sales_Date Between Reporting_Structure.Effective_Date
and Reporting_Structure.End_Date

To avoid this, the hierarchy map can be given its own surrogate key which is then added to HR fact tables

This is likely to be a very expensive join. To get round this problem for HR fact tables that must be constantly joined to the hierarchy map, a surrogate key must be added to the hierarchy map, such as HR HM KEY in Figure 6-24. This surrogate key works like any other dimensional surrogate key to avoid effective dated joins. It can be added to any specialist HR fact tables to simplify the join to:
Where Reporting_Structure.HR_HM_Key = Salary_Fact.HR_HM_Key

This creates a new dependency: the hierarchy map must be built and updated before any HR fact tables

Implementing this surrogate key would require the REPORTING STRUCTURE table to be built and updated ahead of the SALARY FACT table, like any normal HR dimension, because the HR HM KEY values must be ready before the fact table load begins. The simpler CV version of REPORTING STRUCTURE without its specialist surrogate key does not require this dependency and can be maintained independently of the facts.

Alternatively, large HR hierarchies could be split into a number of smaller hierarchies that can be tracked using surrogate keys

An alternative approach, to avoid effective dating joins or a hierarchy map surrogate key, is to break the single organization hierarchy into a number of far smaller departmental hierarchies by removing the executive level(s) from the hierarchy map; in this case: Eve Tasks. The smaller hierarchies would be less susceptible to macro change from above and its resulting ripple effect, which would enable the employee surrogate key to be used to track all HV changes to employees and their HR relationships.

Figure 6-25 HR timelines

Product and Service Dimensions


Product dimensions are complex because stakeholders know too much about their products and services

Product dimensions have their own unique design challenges. Though not as large or volatile as who dimensions, these what dimensions can be just as complex to model because products can be described in so many different ways, by so many different stakeholder groups. While stakeholders never know enough about their customers, they know almost too much about their products and services!

Product hierarchies are typically the most important hierarchies but they can often be ragged and difficult to conform

Product hierarchies are important for BI reporting because businesses are often closely organized around them. Yet despite their importance, they may not be well designed from a BI perspective. Thankfully, product hierarchies are fixed-depth rather than variable-depth, but they can still be difficult to define. Established product hierarchies often represent the single biggest conformance issue for agile data warehouse design, because they have become ragged and full of conflicting definitions, through years of ad hoc growth and redefinition by many different departments.

Product “bill-of-materials” is another type of variable-depth hierarchy that the data warehouse may need to support

Another challenge unique to what dimensions is the need to ask BI questions about what is going on inside a product or service. This is rare for who dimensions—unless you are dealing with medical data. For products, “what is going on inside” may be other products and services, in the case of product bundle sales, or components and parts, in the case of design and manufacturing. To answer questions about these, a data warehouse design must handle “bill-of-materials” information—and that is another example of a variable-depth hierarchy.

Describing Heterogeneous Products


Heterogeneous products and services can have too many specialist dimensional attributes to fit comfortably into a single product dimension

Product dimensions can become very complex when an organization like Pomegranate deals with very different types of products—such as hardware, software, third party accessories, consulting services, licenses and support subscriptions—that all need to be described in very different ways. These heterogeneous descriptions can lead to wide sets of dimensional attributes that are only valid for certain product types. This is the same mixed business problem as dealing with multiple customer types, but greatly compounded by the fact that organizations usually know a lot more about their products than their customers and have many more specialist ways of describing them. Even though product dimensions rarely approach the row count of customer dimensions, the row length of product dimensions that attempt to describe every radically different product type can cause just as many performance, usability and manageability problems. With large sets of specialist attributes for heterogeneous products and services, BI users will have to scroll through pages of attributes to find the ones that interest them and will be daunted when trying to find correlated attributes. You may even exceed the maximum number of columns in a single table supported by your DBMS.
Large sets of specialist attributes should be grouped together in their own swappable subset dimensions, based on a defining characteristic such as product type

For both query performance and usability, you may want to break a monolithic product dimension into several swappable subset dimensions (SD), as in Figure 6-4, based on defining characteristics (DC) such as product type or subcategory. Each swappable dimension would contain exclusive (X) attribute groups that require specialist knowledge to use or interpret correctly. For example, a retailer may have more than 300 attributes that describe clothing products—in minutia. These would be fascinating to clothing buyers but of little interest to finance or logistics. If query performance is not an issue, swappable subsets can be delivered as views; otherwise materialized views or separate tables may be needed to overcome poor performance or column number limitations.

Balancing Ragged Product Hierarchies


Product hierarchies can sometimes look like variable-depth hierarchies but they are in most cases ragged hierarchies

Product hierarchies can sometimes appear to be of variable-depth, especially when source data is held in recursive structures within an Enterprise Resource Planning (ERP) package, or when every department seems to have a different number of levels on the product hierarchy charts pinned to their walls. However, what is most often thought of as “the product hierarchy” is not a true variable-depth hierarchy—a hierarchy of products within products—but actually a ragged hierarchy of products within groups that are based on the physical, organizational, or geographic properties of products. Many of the attributes used to assign products to these groups are not mandatory, or are exclusive to certain subsets, leading to raggedness when you try to create a single conformed hierarchy of all products, across all business processes and departments.

Ragged hierarchies have a fixed number of uniquely nameable levels. They can be implemented in a dimension by defining non-mandatory attributes for the hierarchy levels that have missing values

Ragged hierarchies, as described in Chapter 3, can look similar to variable-depth hierarchies but the important distinction that makes them easier to deal with is that they have a known maximum number of named levels. This means that they can be implemented in a dimension by simply defining the missing or unused levels as nullable. Figure 6-26 shows an example of a product dimension containing a ragged hierarchy (that matches the hierarchy chart of Figure 3-8). It contains the product “POMServer”, which does not have a subcategory, perhaps because it is the only product of its type. This simple “flattening” of the hierarchy, into a fixed number of columns within the dimension, is in stark contrast to the complexity of building a separate hierarchy map, but it can result in a “Swiss cheese” dimension, full of “NULL holes” that show up as gaps on reports. Even if these holes are filled in with the stakeholders’ preferred label, such as “Not Applicable”, they can cause problems for drill-down analysis: all the “Not Applicable” values are grouped together and cannot be further drilled on. Also, it does not inspire stakeholder confidence in the data warehouse when BI applications use such a common level as a product subcategory and return “Mobile,” “Desktop,” and “Not Applicable”.

Figure 6-26 Balancing a ragged hierarchy

Balance slightly ragged hierarchies with the help of stakeholders: by asking them to fill in the missing values

In most cases, the best approach is to balance a ragged product hierarchy by filling in the missing values with the stakeholders during a modelstorming workshop as part of conforming these dimensional attributes. Where filling in the gaps with the stakeholders is not possible—stakeholders cannot agree on the appropriate new or existing values or there are just too many missing values to tackle in the available time—there are three methods for automatically generating usable interim values for the missing levels (that will induce the stakeholders to create their own):

Temporary balancing can be achieved by filling in a missing level with values from the level directly above or below

Top down balancing: A value is copied down into the missing level from the level directly above it. For example, the POMServer CATEGORY value “Computing” is copied into SUBCATEGORY.

Bottom up balancing: A value is copied up into the missing level from the level directly below it. For example, the POMServer PRODUCT TYPE value “Server” is copied into SUBCATEGORY.

Top down and bottom up balancing: Gaps are filled with new unique values created by concatenating the values directly above and below the missing level. For example, “Computing/Server” might be used to fill the SUBCATEGORY gap for “POMServer”.
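The three interim balancing methods can be sketched as a single row-level transformation; the function and method names here are illustrative, not BEAM✲ notation:

```python
# Fill a missing (None) hierarchy level in a flattened product dimension
# row using one of the three interim balancing methods described above.

def balance_level(above, missing, below, method="both"):
    """above/below: values from the levels directly above and below the gap."""
    if missing is not None:
        return missing                  # level already populated: no balancing
    if method == "top_down":
        return above                    # copy value down from the level above
    if method == "bottom_up":
        return below                    # copy value up from the level below
    return f"{above}/{below}"           # concatenate above and below values

# POMServer has no SUBCATEGORY: CATEGORY is "Computing", PRODUCT TYPE "Server"
print(balance_level("Computing", None, "Server", "top_down"))   # Computing
print(balance_level("Computing", None, "Server", "bottom_up"))  # Server
print(balance_level("Computing", None, "Server"))               # Computing/Server
```

An ETL process would apply this to each nullable level column, leaving already-populated values untouched so stakeholder-supplied replacements always win.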

When a ragged hierarchy is discovered during data profiling, if only a very small percentage (1-2%) of it is ragged (skips a level), this usually indicates errors in the data rather than intentional design. The errors should be corrected and a simple balanced hierarchy defined.

Multi-Level Dimension Pattern


Problem/Requirement
A business event needs to be described using various levels of a dimensional hierarchy

Having different business processes associated with different levels of the Product hierarchy is common. For example, sales plans are set on a monthly basis at the brand level, whereas sales transactions are recorded daily for individual products. These different business processes are handled by separate fact tables, and the different levels of product detail should be handled by separate dimensions too: a full PRODUCT dimension for sales facts and a BRAND [RU] rollup dimension for sales plans. Attaching the appropriate dimension to each fact table clearly documents its fixed product granularity. However, there are circumstances where a single business process can be associated with different levels of the Product hierarchy. For example, a web page event on a Pomegranate website can describe a visitor viewing a single product, multiple products of the same product type, a product category description or no product information at all.

Solution
It is possible to attach a product description to the majority of page visits recorded on Pomegranate’s websites, especially the online store pages. But not every
page refers to products; some pages describe multiple products: whole product
categories, subcategories, or specific brands. You can easily handle non-product
pages by using the “special” zero surrogate key that represents “missing product”,
as discussed in Chapter 5. In a similar way, you can use other “special” surrogate
key values to help you describe the page visits that relate to the higher levels in the
product hierarchy by designing a multi-level dimension.

A multi-level dimension (ML) contains additional rows that represent level values within its hierarchy

Multi-level dimensions contain additional rows that represent all the multiple levels within their hierarchies that are needed to describe mixed-level facts. For example, a multi-level product dimension contains records for each product and additional records for each brand, subcategory, and category if facts need all these levels. Figure 6-27 shows a multi-level PRODUCT dimension, denoted by the code ML, that contains example additional rows that represent entire categories (SKs -1 and -2), a subcategory (SK -3), and a brand (SK -4). The complete table would contain one additional row for every value at every level needed.

Figure 6-27 Multi-level Product dimension

ML: Multi-Level dimension that contains additional members representing multiple levels in its hierarchy. Also used to document an event detail or dimensional foreign key that represents a multi-level dimension.

Multi-level dimensions contain a LEVEL TYPE attribute which documents the meaning of each row

Multi-level dimensions also contain an additional attribute LEVEL TYPE that documents the meaning of each row in the dimension. The majority of rows will be normal members that represent individual products (or employees in the case of a multi-level employee dimension). Their LEVEL TYPE defaults to the name of the dimension itself, whereas the additional rows will be labeled after the level attribute in the hierarchy they represent. LEVEL TYPE is useful for ETL processes that manage the use of these additional records, and can also be used by queries that want to constrain on specific level facts only. LEVEL TYPE can be ignored by most queries that simply want to roll up all the facts to a particular level. For example, a query using PRODUCT [ML] could group by CATEGORY and count web page visits to get the total pages viewed for each category; the figures would automatically include pages for individual products, brands, and subcategories within each category, as well as pages for the categories themselves.
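To see why LEVEL TYPE can be ignored for a straight rollup, consider a handful of invented multi-level dimension rows and page visit facts:

```python
# Count page visits per category using a multi-level PRODUCT dimension.
# Because brand, subcategory and category rows carry their own CATEGORY
# value, a simple group-by includes them automatically -- no LEVEL TYPE
# filter is needed for a plain rollup.
from collections import Counter

product_ml = {                      # surrogate key -> (LEVEL TYPE, CATEGORY)
    1:  ("Product", "Computing"),
    2:  ("Product", "Mobile"),
    -1: ("Category", "Computing"),  # additional row for a whole category
    -3: ("Subcategory", "Computing"),
}
page_visit_facts = [1, 1, 2, -1, -3]   # PRODUCT KEY of each page visit fact

visits_by_category = Counter(product_ml[k][1] for k in page_visit_facts)
print(visits_by_category)  # Counter({'Computing': 4, 'Mobile': 1})
```

A query that wanted only the category-description pages would instead constrain on `LEVEL TYPE = "Category"` before grouping.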

You capture multi-level dimension requirements when stakeholders tell group themed event stories using example values that normally appear at higher levels in hierarchies; e.g., they give you “MI6” when you were expecting “Bond”.

A multi-level employee dimension would allow you to handle events where groups of employees occasionally act like individual employees. For example, sales transactions are normally assigned to individual employees but when several members of staff are involved or the employee is unavailable (perhaps the individual has left the company or the transaction was customer self-service) sales facts can be assigned an EMPLOYEE KEY that represents a team, branch or division. In Chapter 9, we will combine multi-level and multi-valued employee dimensions to describe how joint sales commissions can be calculated.

Do not use a multi-level dimension to describe fixed-level facts

The additional flexibility of multi-level dimensions can be confusing, so they should never be used where their flexibility is unnecessary. Create separate single and multi-level versions of a dimension to make their usage explicit. If a star schema has a fixed level of dimensional detail, use normal (single-level) dimensions with no LEVEL TYPE attributes. The presence of a LEVEL TYPE in the star implies that facts are multi-level when that is not the case. If a fact table truly needs a multi-level dimension you should explicitly document it by marking the dimensional foreign key as ML in the fact table, as in Figure 6-28.

Figure 6-28 Documenting single and multi-level fact tables

Consequences
Never use a multi-level dimension to create facts with mixed meanings

You should never use a multi-level dimension to change the meaning of a fact. For example, do not store target revenue at the brand level and actual sales revenue at the product level in the same fact table. Sales and planning are two very different business processes, two different verbs. How would you name and easily describe the resulting fact table? Even sticking to a single business process, do not store summary sales for a category in a product sales fact table. Performance-enhancing summaries require their own aggregate fact tables (described in Chapter 8). If you used a multi-level dimension to store targets, summaries and actuals, the resulting revenue fact would not be additive across LEVEL TYPE. To avoid over-counting, every query would have to remember to constrain to a single LEVEL TYPE—a recipe for disaster. The multi-level product dimension works perfectly with the page visit fact table, in Figure 6-28, because it does not change the meaning of the facts; they are all page visits no matter if they are for a product or a category. That is why dwell time and total pages viewed remain additive across LEVEL TYPE.

Parts Explosion Hierarchy Map Pattern


Problem/Requirement
A product bill of materials represents a variable-depth hierarchy of components

Stakeholders need to analyze product sales down at the level of the components that went into the products, using product bill of materials (BOM) data. The bill of materials for a product can be represented as a variable-depth hierarchy of components within components; for example, Figure 6-29 shows the BOM hierarchy for a new product sold by employee James Bond. It reveals that the “POMCar” is made up of an “off the shelf” Aston Martin DB5 and an enhanced safety pack that is itself made up of a number of interesting gadgets. A bill of materials like this is typically stored in an operational system using M:M recursive structures that allow the components of a product or service to be made up of other products and services.

Figure 6-29 Bill of materials for a POMCar

Solution
A reverse hierarchy map joins to fact tables by its parent key and allows facts to be allocated to child levels

A BOM can be represented by the PARTS EXPLOSION hierarchy map in Figure 6-30. This is a reverse hierarchy map which joins to product facts (and the product dimension) by its parent key (PRODUCT KEY), as in Figure 6-31, allowing the facts to be rolled down to or filtered on child components. It contains a SUB ASSEMBLY flag that indicates “Y” if a component is made up of other identifiable components, and QUANTITY, which records the number of components that go into the finished product. This is similar to a distant weighting factor in that it needs to be adjusted in the hierarchy map based on its parent quantities. For example, a single defense system contains 4 motion sensors, but a POMCar contains 2 defense systems, so the quantity of motion sensors it contains is 8.
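The quantity adjustment (2 defense systems × 4 motion sensors = 8) generalizes to multiplying QUANTITY values down every path of the BOM, just like distant weighting factors. A sketch using the Figure 6-29 components (component names abbreviated and partly assumed):

```python
# Explode a bill of materials: the total quantity of each component in a
# finished product is the product of the QUANTITY values along its path.

def explode(bom, product, multiplier=1):
    """bom: {parent: [(component, quantity), ...]} recursive source data.
    Yields (component, total_quantity) for every descendant of product."""
    for component, qty in bom.get(product, []):
        total = multiplier * qty
        yield component, total
        yield from explode(bom, component, total)   # descend into sub-assemblies

bom = {
    "POMCar": [("Aston Martin DB5", 1), ("Safety Pack", 1)],
    "Safety Pack": [("Defense System", 2)],
    "Defense System": [("Motion Sensor", 4)],
}

parts = dict(explode(bom, "POMCar"))
# parts["Motion Sensor"] is 8: 4 sensors per system x 2 systems per car
```

A real parts explosion would sum the quantities when the same component appears under several sub-assemblies; `dict()` is enough for this small example.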

Figure 6-30 PARTS EXPLOSION hierarchy map

Add a cost or revenue recovery weighting factor to a BOM hierarchy map to allocate revenue facts to product components (or the sub-products of a bundled offering) based on their quantity and unit value.

Figure 6-31 Component Analysis

Consequences
Don’t try to use the PARTS EXPLOSION hierarchy map pattern to describe the bill
of materials for anything as complex as a real car, submarine or aircraft—unless
you are prepared for a very large table.

Summary
Mini-dimensions track historic values for very large volatile dimensions, like CUSTOMER, that
cannot use the Type 2 SCD technique. Volatile HV attributes are moved to a separate mini-
dimension to keep the size of the main dimension under control and the historical values are
related back to the main dimension via fact table relationships. Mini-dimensions typically band
high cardinality values to control their size and volatility and to provide better report row
headers and filters.

Snowflaking makes sense for very large dimensions when a large set of lower-cardinality,
seldom used attributes can be normalized into outriggers. The calendar dimension can be a
particularly useful outrigger for any dimension that contains embedded dates.

Swappable dimensions (SD) are used to break up large mixed type dimensions into specialist
subsets that are easier to use and faster to query. Swappable dimensions can be swapped into a
star schema in place of one another because they share a common surrogate key.

Hybrid SCD requirements for current value and historical value reporting are best handled by creating separate hot swappable CV versions of HV dimensions. These CV dimensions can be created as materialized views using simple self-joins of HV dimensions.

Hierarchy maps (HM) are used to store variable-depth hierarchies in a report-friendly format
and avoid recursive structures that cannot easily be queried by BI tools.

Multi-valued hierarchy maps (MV HM) are used to represent multi-parent hierarchies that are
typically stored in source systems as M:M recursive relationships.

Multi-level dimensions (ML) describe business events that vary in their level of dimensional
detail. A multi-level dimension will contain additional special value members that represent
higher levels in the dimension’s hierarchy.
7

WHEN AND WHERE


Dimensional Design Patterns for Time and Location

The past is a foreign country: they do things differently there.


— L.P. Hartley, The Go-Between

Time is the most frequently used dimension for BI analysis

Every business event happens at a point in time or represents an interval of time. Time is the primary way that BI queries group (“show me monthly totals”), filter (“show me sales for Financial Q1”), and compare business events (“How are we doing year to date, versus last year?”). That is why every fact table has at least one time (when) dimension.

Location dimensions and attributes are frequently used too

Most business events occur at a specific geographical or online location. Many interesting events represent changes of location. Hence, a large number of fact tables have distinct where dimensions in addition to the location attributes that can be found in who and what dimensions, such as customer and product.

Time and location are separate dimensions but can affect one another

Although when and where are separate dimensions, they can influence one another: time zones, holidays and seasons are all examples of location-specific time attributes that are affected by event geography. Similarly, analytically significant locations, such as the first and last locations in a sequence of events, are timing-specific location dimensions, affected by event chronology.

This chapter describes when and where patterns

In this chapter, we describe dimensional design patterns for efficiently handling time and location, in particular, patterns for correctly analyzing year-to-date facts, and journeys—facts that represent changes in space and time, that are all about where and when.
Chapter 7 Design Challenges At a Glance:
Efficient date and time reporting
Correct year-to-date analysis
Time zones, international holidays and seasons
National language support
Trip and journey analysis

204 Chapter 7 When and Where

Time Dimensions
When details are modeled as physical time dimensions

Every event contains at least one when detail, which should always be modeled as a time dimension, rather than left as a timestamp in the fact table. But why do you need a time dimension when you have datetime data types, date functions and date arithmetic built into data warehouse databases and BI tools?

Physical date dimensions help to simplify the most common grouping and filtering requirements of BI queries

Descriptive time attributes, such as day of week, month, quarter and year, are constantly used to group and filter the information on BI reports. Deriving them from raw timestamps in every query is woefully inefficient and puts an unnecessary burden on BI users and BI tools, causing mistakes and inconsistencies. Why decode the month or day of week every time they are needed, when they could be stored once in a dimension and reused consistently and efficiently, like any other dimensional attribute? Also, many commonly used time attributes—such as fiscal periods, holidays, and seasons of the year—simply cannot be derived from timestamps alone because they are organization or location specific.

You should build a physical time dimension to:


Avoid duplication of calendar logic in each report or BI application
Remove date arithmetic from constraints to increase index use
Insulate queries from DBMS-specific date functions
Support organization-specific fiscal periods
Define conformed time hierarchies
Provide consistent time-related business labels and definitions
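A calendar dimension like this is generated once by an ETL script rather than computed in every query. A minimal Python sketch of a row-builder—the epoch date, the YYYYMMDD key format and the Monday-based day numbering are assumptions for illustration, not Pomegranate's actual conventions:

```python
# Build calendar dimension rows with pre-decoded date attributes and
# epoch-based "overall" counters, so BI queries never have to derive
# them from raw timestamps.
from datetime import date, timedelta

EPOCH = date(2011, 1, 1)  # earliest date in the warehouse (assumed)

def calendar_row(d):
    return {
        "DATE_KEY": int(d.strftime("%Y%m%d")),  # sequential ISO-style key
        "DAY": d.strftime("%A"),                # day name
        "DAY_IN_WEEK": d.isoweekday(),          # 1-7 (Monday-based here)
        "MONTH": d.strftime("%B"),              # month name
        "MONTH_IN_YEAR": d.month,               # 1-12
        "YEAR": d.year,
        "DAY_OVERALL": (d - EPOCH).days + 1,    # epoch day counter
        "MONTH_OVERALL": (d.year - EPOCH.year) * 12
                         + (d.month - EPOCH.month) + 1,
    }

# One year of rows; a real build would also add fiscal periods, holidays
# and seasons, which cannot be derived from the date value alone.
rows = [calendar_row(EPOCH + timedelta(days=n)) for n in range(365)]
```

Interval filters such as “last 60 days” then become simple range constraints on DAY_OVERALL, even across year boundaries.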

Date and time of day are modeled as separate dimensions to match their dimensional usage, manage their size and make the time granularity of facts explicit

Time is actually best modeled dimensionally by splitting it into date and time of day. This may seem odd at first, but it does reflect how time is queried. Almost every query will group or filter on sets of days (Years, Quarters, Months or Weeks). Many queries will do the same with periods within a day (AM/PM, work shift, peak periods). But very few queries will use arbitrary periods that span dates and times (e.g., “sales totals by periods of 2 days and 8 hours”). Financial queries are grouped by the date-related fiscal calendar, ignoring time of day altogether, while operational and behavior queries can group months of data together by time of day to see peak and average activity levels. In recognition of these query schisms, when details (logical time dimensions) should be implemented as two distinct and manageable physical dimensions: a calendar date dimension and a clock time of day dimension, each with its own surrogate key. Separating date and time like this also makes the time granularity of facts explicit. If time of day is not significant (or not recorded) for a business event, its fact table design simply omits the clock dimension and includes only the calendar dimension.

Calendar and clock are role-playing dimensions

Figure 7-1 shows typical examples of Calendar and Clock dimensions related to an order fact table. Each of these dimensions plays two roles: representing distinct Order and Delivery dates and times. Although only one physical instance of each dimension will exist, BI tools should present the two roles as separate dimensions using views that rename each time attribute based on its role. For example, ORDER TIME, ORDER DATE, ORDER MONTH and DELIVERY TIME, DELIVERY DATE and DELIVERY MONTH.

Figure 7-1 CALENDAR and CLOCK dimensions used to play two roles with ORDER FACT

The ORDERS FACT table in Figure 7-1 documents Delivery Time (duration) as a
fact. This is the elapsed time in hours taken to fulfill the order. This duration
would be difficult to calculate and aggregate using the separate time dimensions
alone, and is best stored as a fact.
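Storing Delivery Time once, at load time, is far simpler than reconstructing it from four role-playing date and time-of-day dimension keys in every query. A sketch of the ETL-side calculation, with invented timestamps:

```python
# Compute Delivery Time (elapsed hours) once during the ETL load and store
# it as a fact, rather than deriving it from the separate Calendar and
# Clock dimensions in every query.
from datetime import datetime

order_ts = datetime(2011, 3, 28, 9, 30)      # order date + time of day (invented)
delivery_ts = datetime(2011, 3, 30, 15, 30)  # delivery date + time of day (invented)

delivery_time_hours = (delivery_ts - order_ts).total_seconds() / 3600
# 2 days and 6 hours -> 54.0 hours, stored directly in the fact table
```

Once stored as a fact, the duration is fully additive and can be averaged or totaled without any date arithmetic at query time.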

Calendar Dimensions
A good calendar dimension should include all the date-related attributes that stakeholders need. Ideally BI tools should never have to decode a date to provide a good report row header or filter

Calendar dimensions should support all the groupings of day, week, month, quarter, year, fiscal period, and season that are needed as report row headings and query filters; for example, CALENDAR in Figure 7-1 contains the commonly used calendar attributes: DAY (Sunday–Saturday), DAY IN WEEK (1–7), MONTH (January–December), MONTH IN YEAR (1–12), and YEAR. It also contains several “Overall” attributes such as DAY OVERALL and MONTH OVERALL. These are epoch counters that increment for each new Day, Week or Month from the earliest date in the data warehouse (the epoch date). Overall values are used for calculating interval constraints that can cross year boundaries, such as: “last 60 days” or “last 4 weeks”. The BEAM✲ excerpt of Pomegranate’s CALENDAR, in Figure 7-2, shows that the company has 13 fiscal periods per year, and that its fiscal year runs February to January—not January to December. The full dimension would make all of Pomegranate’s calendar information available for reporting, so that BI users do not have to decode any dates or remember which fiscal periods contain 29 days rather than 28 or even the name of the current period.

Figure 7-2: Pomegranate CALENDAR dimension (excerpt)

The Data Warehouse Toolkit, Second Edition, Ralph Kimball, Margy Ross (Wiley,
2002) pages 38–41 provide further examples of useful calendar attributes.

Date Keys
Calendar dimensions, like every other dimension, should be modeled with integer surrogate keys. But unlike other surrogate keys, date keys should have a consistent sequence that matches calendar date order. Date keys are integer surrogates, but they should be in calendar date order. Sequential date keys have two enormous benefits:
Date key ranges can be used to define the physical partitioning of
large fact tables. Chapter 8 discusses the benefits of doing this.

Date keys can be used just like a datetime in a SQL BETWEEN join
to constrain facts to a date range.

Historic value hierarchy maps (HV, HM) that use effective dating to track history
can use sequential date keys (EFFECTIVE DATE KEY and END DATE KEY) rather
than datetimes to improve efficiency when joining to fact tables. For example,
WHERE Reporting_Structure.Employee_Key = Sales_Fact.Employee_Key
AND Sales_Date BETWEEN Reporting_Structure.Effective_Date_Key
                   AND Reporting_Structure.End_Date_Key

This will join the historically correct version of the reporting structure HR hierar-
chy to the salary facts. Joins like this are complex (or theta) joins that are hard to
optimize and need all the help they can get.

ISO Date Keys


Something else is unusual about the DATE KEYs in Figure 7-2: they are based on the ISO date format YYYYMMDD, which breaks the rule that surrogate keys should not be “intelligent.” For date keys—and only for date keys—the benefits of breaking this rule outweigh any negatives:

ETL benefits: Date keys can be derived from the source date values directly, rather than with a surrogate key lookup table. This can be significant when processing events that contain many when details that all need to be converted into fact table date keys. ISO keys are easy to generate.

DBMS benefit: The readable ISO format makes it easier to set up fact table partition ranges. ISO keys are easy to read, which can be good (for ETL) and bad (for BI).

BI benefits: None! BI queries should not use the YYYYMMDD format as a quick way of filtering facts and avoiding joins to the Calendar dimension with its consistent date descriptions. Best not to tell BI developers or users—keep this little secret between the ETL team and the DBAs.
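The direct derivation is trivial to implement; a minimal Python sketch (the function name is illustrative, not from the book):

```python
from datetime import date

def iso_date_key(d: date) -> int:
    """Derive a YYYYMMDD integer date key directly from a source date,
    with no surrogate key lookup table required."""
    return d.year * 10000 + d.month * 100 + d.day

assert iso_date_key(date(2011, 3, 17)) == 20110317
# Keys remain in calendar date order across year boundaries:
assert iso_date_key(date(2010, 12, 31)) < iso_date_key(date(2011, 1, 1))
```

Because the key is pure arithmetic on the source date, it can be applied to every when detail of an event without any lookup I/O.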

Epoch-Based Date Keys


Epoch-based date keys are generated by subtracting an epoch or origin date from the date; for example, the DATE KEYs in Figure 7-5 have been generated using an epoch of 1/1/1900. Epoch keys can be a good alternative to ISO format if your ETL toolset is faster at date arithmetic than date reformatting. Epoch keys are also small contiguous numbers that may work better with some DBMS query optimizers than ISO keys, which are much larger—with a gap of 8870 between every December 31st and January 1st. The downside of epoch keys is that they are harder to read when setting up partition ranges. However, this may also be a BI benefit, because epoch keys are far less likely to be abused by queries using them directly as filters or decodes (instead of using the appropriate calendar dimension attributes). Which is the best approach: ISO or epoch? If performance is paramount, you should speed test both with your ETL toolset and DBMS platform. You should also wait until you have read Date Version Keys, later in this chapter, before you decide on your date surrogate key strategy.

Epoch keys are also easy to generate, and are more compact than ISO keys, but less easy to read.
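Epoch keys are equally easy to derive; a Python sketch, assuming the 1/1/1900 origin used in the Figure 7-5 example:

```python
from datetime import date

EPOCH = date(1900, 1, 1)  # assumed origin date, as in the Figure 7-5 example

def epoch_date_key(d: date) -> int:
    """Generate a small, contiguous date key: days elapsed since the epoch."""
    return (d - EPOCH).days

# Contiguous across year boundaries, unlike YYYYMMDD keys:
assert epoch_date_key(date(2011, 1, 1)) - epoch_date_key(date(2010, 12, 31)) == 1
```

The contiguity is what makes epoch keys attractive to optimizers; the trade is that a partition boundary like 40542 tells a DBA nothing without decoding it.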

Create a version of your CALENDAR dimension that is keyed on a date rather than
a surrogate key (as a materialized view). This can be useful as an outrigger that
can be joined to date data type dimensional attributes such as FIRST PURCHASE
DATE in CUSTOMER or HIRE DATE in EMPLOYEE. This allows them to be
grouped and filtered using all the rich CALENDAR attributes for very little extra
ETL effort. A date-keyed CALENDAR can also be useful for prototyping BI queries
using sample data that has not yet been loaded into fact tables and converted to
DATE KEYs. But it should never be used in place of a surrogate key-based
CALENDAR for querying fact tables.

Populating the Calendar


Because the CALENDAR dimension is relatively small and static, it is often pre-populated with all the dates needed for the foreseeable future. For example, loading the calendar with 20 years of data—enough to cover 10 years of history and 10 years into the future—would create a modest dimension of only 7,308 records. Having said that, calendars often need to cover a wider date range than first anticipated. For example, a financial services data warehouse might only hold 10 years of transactions, but may need a CALENDAR dimension that can cope with customer dates of birth up to 120 years ago and policy maturity dates 50 years into the future. For many of these future dates holiday information will not be available and will need to be left as null.

CALENDAR dimensions often need to cover wider date ranges than you think: to cope with birth dates and future maturity dates.

Spreadsheets, database functions, stored procedures, and ETL tools are all appro-
priate for populating the calendar—any of these can quickly generate the standard
calendar attributes from any origin date. Search online for “date dimension genera-
tor” to find SQL code and spreadsheets that you can reuse. Table 7-1 includes
additional, less automated, sources for enriching the calendar.

Table 7-1: Calendar attribute sources

ATTRIBUTE              EXAMPLE                     SOURCE
Standard calendar      Day, Month, Quarter         Spreadsheets, SQL functions/stored procedures, ETL tools, online date dimension generators
Fiscal calendar        Fiscal Period, Fiscal Year  Finance department
Holiday schedule       Holiday Flag                HR, manufacturing, national calendar
Seasonal information   Sales Season                Sales, marketing, national calendar
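The standard-calendar attributes really can be generated in a few lines from any origin date; a hedged Python sketch (attribute set and numbering conventions are illustrative; Pomegranate's calendar, for instance, runs Sunday–Saturday, and fiscal attributes would still come from the finance department):

```python
from datetime import date, timedelta

def generate_calendar(start: date, end: date) -> list:
    """Generate standard calendar attributes for every date in [start, end]."""
    rows = []
    d = start
    while d <= end:
        rows.append({
            "date_key": d.year * 10000 + d.month * 100 + d.day,  # ISO-style key
            "day": d.strftime("%A"),
            "day_in_week": d.isoweekday(),        # 1 = Monday ... 7 = Sunday here
            "month": d.strftime("%B"),
            "month_in_year": d.month,
            "year": d.year,
            "day_overall": (d - start).days + 1,  # epoch counter from start date
        })
        d += timedelta(days=1)
    return rows

rows = generate_calendar(date(2011, 1, 1), date(2011, 12, 31))
assert len(rows) == 365
assert rows[0]["date_key"] == 20110101 and rows[0]["day"] == "Saturday"
```

Fiscal, holiday, and seasonal attributes from Table 7-1 would then be merged onto these generated rows.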

BI Tools and Calendar Dimensions


You should design your CALENDAR dimension with the features and limitations of your BI toolset in mind. For example, some BI tools require specific date columns to help calculate time series measures and make efficient time comparisons. Many BI tools have the ability to define a default display format for each column; for example, DISCOUNT can be defined to always display as a two-digit percentage. You can use this feature, if available, to create a “correctly sorted month” report item by defining MONTH as a date column that stores the first (or last, just be consistent) day of each month, with a BI display format of “Mmm YYYY”. Even though MONTH is represented as a date, it will group the facts correctly because it contains only 12 distinct values for each year; it will display the month name correctly thanks to the default display format; and, most importantly, it sorts correctly in calendar order because it is a date. This saves BI users from having to pick two columns: MONTH NAME to display and MONTH NUMBER to sort by. Holding the month as a date also enables automatic national language translation of month names, if your BI toolset supports localization.

Design the CALENDAR dimension to take advantage of your BI tool features.
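Outside of any particular BI tool, the "correctly sorted month" trick can be illustrated in a few lines (a Python sketch; the month values are illustrative):

```python
from datetime import date

# Months held as first-of-month dates sort correctly in calendar order...
months = [date(2011, m, 1) for m in (1, 2, 3, 10, 11, 12)]
assert sorted(months) == months

# ...whereas month names sorted as text do not (December before January, etc.):
names = [d.strftime("%B") for d in months]
assert sorted(names) != names

# A display format then renders the date as a month label:
assert months[0].strftime("%b %Y") == "Jan 2011"
```

The BI tool's default display format plays the role of `strftime` here: the stored value sorts, the format displays.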

Period Calendars
A day granularity calendar is not the only calendar you will need. Periodic snapshots and aggregate fact tables hold weekly, monthly, or quarterly facts and will require rollup (RU) calendar dimensions. Theoretically you could attach the CALENDAR dimension to these higher granularity fact tables by using the last date of the period that the facts represent; for example, a monthly sales snapshot could join to the CALENDAR using the last day of the month—the date on which the snapshot was taken. However, this is not a good idea, because it does not explicitly document the time granularity of the facts, and could lead BI users to incorrectly believe they can analyze the monthly sales facts using any calendar attribute, including day-level attributes like DAY OF WEEK or WORKDAY flag.

Do not use the standard day CALENDAR dimension with higher level periodic snapshots and aggregate fact tables.

Month Dimensions
Instead of using a day calendar with monthly fact tables you should create a MONTH rollup dimension similar to the one shown in Figure 7-3, and define monthly fact tables using MONTH KEY foreign keys. This makes the granularity of these fact tables explicit, and limits queries to using only the valid monthly attributes. You can build the MONTH dimension from the CALENDAR dimension using a materialized view. MONTH KEY can be created using the DATE KEY for the last day of each month: the MAX(DATE_KEY) if the materialized view groups by month.

Monthly fact tables should use a MONTH dimension to make their time granularity explicit.

Even though CALENDAR and MONTH dimensions have different time granularities
they are still conformed dimensions because they use common attribute values:
they are conformed at the attribute level.

Figure 7-3: Period calendars

Some BI tools find it difficult to cope with separate day and month calendar tables
and prefer all common date dimension attributes to be defined using a single
table. If this is the case, having a MONTH KEY that matches the last day of the
month DATE KEY can be useful. In that way, BI tools that need to, can use the
CALENDAR dimension instead of MONTH at query time.

Offset Calendars
Events such as insurance claims or policy payments can benefit from having their own specialized calendar dimension in addition to the standard calendar. A POLICY MONTH dimension like the one shown in Figure 7-3 would be used to offset the facts from the creation date or last renewal date of the policy, rather than from January 1 or the first day of the financial year as the normal calendar dimension would. For example, if a policy renews on April 1, an August claim fact for that policy would be labeled as MONTH “August” or MONTH NUMBER 8 by CALENDAR but POLICY MONTH 4 by the POLICY CALENDAR.

An offset calendar dates facts from a fact-specific origin date; e.g., policy facts are dated from a policy start date.

An offset calendar, like POLICY MONTH, can be used in conjunction with a standard MONTH dimension to define a MONTHLY POLICY SNAPSHOT with a granularity of POLICY by MONTH by POLICY MONTH. This fact table will contain exactly twice as many rows as a standard monthly snapshot and will allow the facts to be queried by either calendar or policy month or a combination of both.
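The offset logic the ETL would apply can be sketched in Python; this assumes policy months are numbered as whole months elapsed since the last renewal date, which reproduces the "POLICY MONTH 4" example above (the function and its boundary rule are illustrative):

```python
from datetime import date

def policy_month(event: date, renewal: date) -> int:
    """Offset-calendar month number: whole months elapsed since the policy's
    last renewal (assumption: the renewal day-of-month marks the boundary)."""
    months = (event.year - renewal.year) * 12 + (event.month - renewal.month)
    if event.day < renewal.day:
        months -= 1  # this policy month's boundary has not been reached yet
    return months

# A policy renewing April 1: an August claim is calendar month 8, policy month 4.
claim = date(2011, 8, 15)
assert claim.month == 8
assert policy_month(claim, date(2011, 4, 1)) == 4
```

A shared helper like this keeps calendar-month and policy-month keys consistent across all the fact loads that use the offset calendar.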

Year-to-Date Comparisons
Problem/Requirement
To perform year-to-date (YTD) comparisons—such as YTD Sales 2011 versus
YTD Sales 2010—the following needs to be known about the date range:

The “from date” when the year began. This seems obvious, but are we talking about the beginning of the calendar year, or the organization’s fiscal year or tax year?

The “to date.” Are you running the YTD calculation up to now or to some specific date in the past? If you are defaulting to “up to now”, what does “now” mean? Do you have complete data loaded right up to today or yesterday?

Which days to include. Should YTD figures from previous years include facts up to the same “to date” in those years, or the same number of days (this copes with the extra day caused by February 29 in leap years)? If it is based on the number of days, is that calendar days or workdays (for example, the same number of weekdays excluding public holidays)?

What is the “year to date” date for valid comparisons with previous years?
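The "same number of days" option diverges from the "same calendar date" option precisely in leap years; a small Python sketch of the prior-year "to date" calculation (names are illustrative):

```python
from datetime import date, timedelta

def same_day_count_to_date(to_date: date, prior_year: int) -> date:
    """Prior-year 'to date' based on the same number of days from Jan 1,
    rather than the same calendar date (copes with February 29)."""
    day_offset = (to_date - date(to_date.year, 1, 1)).days  # 0-based offset
    return date(prior_year, 1, 1) + timedelta(days=day_offset)

# In a leap year the two definitions give different answers:
assert same_day_count_to_date(date(2012, 3, 1), 2011) == date(2011, 3, 2)
# Between two common years they agree:
assert same_day_count_to_date(date(2011, 3, 1), 2010) == date(2010, 3, 1)
```

A workday-count variant would use the WORKDAY IN YEAR attribute described next, rather than raw day arithmetic.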

The CALENDAR dimension can support consistent year-to-date (YTD) calculations by providing conformed definitions for the beginning of each year (calendar and fiscal) and which workdays to include. The attributes needed to do this are:

DAY (NUMBER) IN YEAR
DAY (NUMBER) IN FISCAL YEAR
WORKDAY IN YEAR
WORKDAY IN FISCAL YEAR
WORKDAY FLAG

While these calendar attributes help tremendously, there is still the question of what date the “year to date” should be. For data warehouses that are loaded nightly, common sense might suggest a “year to date” of yesterday (SYSDATE –1). However, not every business process runs on the same schedule, and therefore not every fact table is loaded nightly. Some fact tables may be loaded weekly, monthly, or on-demand when source data extracts become available—a common requirement for external data feeds. This causes problems when trying to compare YTD figures for this year with YTD figures for last year. YTD figures for this year may not contain data up to yesterday, whereas the YTD figures for last year will contain data right up to yesterday minus one year.

You need to know when the YTD facts were last loaded to make valid comparisons with previous years.

Even when fact tables are loaded nightly, they may not be loaded completely. ETL errors will occur from time to time, and complete data will not be available for reporting until these errors are fixed. It may also be quite normal for some ETL processes to encounter “late-arriving data” where the complete set of events for a particular date will not be fully available until several days (or weeks) after that date; for example, roaming call charges from international mobile networks, or medical insurance claims submitted long after treatments were given. Comparisons between the current year and last year are inaccurate whenever data is complete for last year while the current year is still a work in progress.

Because of ETL errors or “late-arriving data”, you also need to know the last complete load for YTD facts.

Solution
Information about the status of each fact table—when it was last loaded and the
last complete day’s worth of data it contains—should be stored in the data ware-
house rather than in the heads of ETL support staff or BI users. It should be avail-
able as data in a format that BI tools can readily use.

The FACT STATE table (shown in Figure 7-4) supports valid YTD comparisons by storing the recency and completeness of each fact table in a format that can easily be used with the CALENDAR dimension. It contains the most recent load date and the last complete load date of each fact table. The most recent load dates should be updated automatically by all fact-loading ETL processes. For ETL processes that are subject to unpredictable late-arriving data you may have to manually set the LAST COMPLETE LOAD DATE.

Figure 7-4: FACT STATE table

To use FACT STATE information, you add FACT STATE to your fact table query and filter it on the fact table name you are using. You can then use any of its attributes in place of a SYSDATE-based calculation. Unfortunately, because the FACT STATE table is not “properly” joined to any other table in the query, many BI tools complain about a possible Cartesian product. Even if your BI tool doesn’t complain, using FACT STATE in this manner can be confusing for both BI users and developers, not to mention dangerous—if it is not properly constrained to the correct fact table. To overcome this issue, you can provide the FACT STATE information as part of a fact-specific calendar dimension.

A FACT STATE table contains all the necessary YTD information but it can be difficult to use for BI queries.

Fact-Specific Calendar Pattern


A fact-specific calendar is built by merging the dynamic FACT STATE row for a fact table with the static rows of the standard CALENDAR dimension. This creates a version of the calendar that is “aware” of the YTD status of the facts that it is designed to work with. Figure 7-5 shows an example fact-specific calendar SALE DATE, built by joining the one row in FACT STATE (where FACT_TABLE = “SALES_FACT”) to every row in CALENDAR.

Figure 7-5: SALE DATE: a fact-specific calendar with added FACT STATE information

At first sight, it seems wrong, or at the very least wasteful, to repeat the same FACT STATE information on every row in the new calendar, but remember this calendar is still tiny by fact table standards, and now it is simple to compare its attributes to their equivalent FACT STATE attributes. Because the fact-specific calendar will always be present in every meaningful query involving its specific fact table, the MOST RECENT and LAST COMPLETE attributes can be used just as easily as the DBMS system variable SYSDATE, without having to worry about constraining FACT STATE on the right fact table or a BI tool (or developer) complaining about a missing join. For example, to compare 2011 (the current year) YTD sales with 2010, based on the most recent load date, a query would contain the following simple SQL:

A fact-specific calendar makes ETL load dates as easy to use as SYSDATE.

SELECT Year, SUM(Revenue) AS Revenue_YTD
WHERE (Year = 2010 OR Year = 2011)
AND Day_In_Year <= Most_Recent_Load_Day_In_Year
GROUP BY Year

To select the last three complete weeks of facts, the constraint would be:
WHERE Week_Overall BETWEEN Last_Complete_Week_Overall - 2
                       AND Last_Complete_Week_Overall

You should create a fact-specific calendar for each fact table that is used for YTD comparisons, ideally as (materialized) views so that they will be updated automatically whenever the FACT STATE table is updated. If a fact table has a single time dimension, its fact-specific calendar can be given a unique role-specific name, such as SALE DATE (shown in Figure 7-5). If a fact table has multiple date dimensions, each one must use the same (more generically named) fact-specific calendar as its role-playing (RP) time dimension. It is possible for all fact-specific calendars to share the same conformed dimension name if each one is defined within a separate fact-specific database schema (that also contains its matching fact table). The naming approach you can adopt will depend on how your BI toolset qualifies tables when accessing multiple star schemas simultaneously.

Create a fact-specific calendar view for each fact table used for YTD analysis.
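The whole pattern can be demonstrated end to end with an in-memory database; a hedged sketch using Python's stdlib sqlite3 (the table and column names are illustrative simplifications of the book's schema, not the full design):

```python
import sqlite3
from datetime import date, timedelta

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE calendar (date_key INTEGER, day_in_year INTEGER, year INTEGER)")
db.execute("""CREATE TABLE fact_state (
    fact_table TEXT,
    most_recent_load_day_in_year INTEGER,
    last_complete_load_date_key INTEGER)""")

# Populate a 90-day calendar fragment with ISO-style date keys.
start = date(2011, 1, 1)
for i in range(90):
    d = start + timedelta(days=i)
    db.execute("INSERT INTO calendar VALUES (?, ?, ?)",
               (d.year * 10000 + d.month * 100 + d.day, i + 1, d.year))

# One dynamic status row for the sales fact table.
db.execute("INSERT INTO fact_state VALUES ('SALES_FACT', 76, 20110310)")

# The fact-specific calendar: FACT STATE's single (filtered) row merged
# onto every static calendar row.
db.execute("""CREATE VIEW sale_date AS
    SELECT c.*, f.most_recent_load_day_in_year, f.last_complete_load_date_key
    FROM calendar c, fact_state f
    WHERE f.fact_table = 'SALES_FACT'""")

# YTD-style filtering now needs no SYSDATE and no separate FACT STATE join.
n = db.execute("""SELECT COUNT(*) FROM sale_date
                  WHERE day_in_year <= most_recent_load_day_in_year""").fetchone()[0]
assert n == 76
```

Refreshing the FACT STATE row automatically refreshes every query that goes through the view, which is the point of the pattern.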

To help keep the SQL that builds fact-specific calendars simple, the YTD comparison attributes within CALENDAR should be mirrored in FACT STATE; for example, if there is a QUARTER IN FISCAL YEAR attribute in CALENDAR there should be a MOST RECENT LOAD QUARTER IN FISCAL YEAR and a LAST COMPLETE QUARTER IN FISCAL YEAR in FACT STATE.

FACT STATE attributes should mirror calendar attributes to keep view building simple.

You can expand fact-specific calendars to hold additional Y/N indicator flags—
such as MOST RECENT DAY, MOST RECENT MONTH, PRIOR DAY, and PRIOR
MONTH—that are based on the MOST RECENT LOAD DATE. Some BI tools may
also find it useful to have a MOST RECENT DAY LAG column that numbers every
date in the calendar relative to the MOST RECENT LOAD DATE; i.e., the most
recent date is 0, the previous day is –1, the following day is +1.

Using Fact State Information in Report Footers


The system date (SYSDATE) is often used in report headers or footers to provide basic time context for a report. You can add FACT STATE information to produce a more descriptive report footer, such as:

Report run on 23rd March 2011. Report reflects data available up to 17th March 2011. The last complete week's data is for week 10; data up to week 12 is included but is incomplete.

The bold values above are derived from SYSDATE, MOST RECENT LOAD DATE, LAST COMPLETE WEEK IN YEAR, and MOST RECENT LOAD WEEK IN YEAR, respectively. FACT STATE tables can be expanded to hold additional audit and data quality information, such as whether the latest facts have been signed off or not. This information, too, is handy stuff to print in a report footer.

FACT STATE information can be used to provide descriptive report footers that explain the data available.

Conformed Date Ranges


In addition to defining meaningful date ranges for YTD comparisons on a single fact table, a FACT STATE table can help define the set of sensible comparisons that can be performed across fact tables. For example, you can't meaningfully compare current YTD sales with YTD commissions based simply on current year totals if SALES FACT contains data up until the end of May and COMMISSION FACT contains data up until the end of April. However, the two processes can be compared on a "year to end of April" basis. You derive this conformed date range from FACT STATE using the earliest LAST COMPLETE DATE among all the fact tables involved in the analysis.

FACT STATE information helps define conformed date ranges for comparing business processes using multiple star schemas.
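The derivation itself is just a minimum over the participating fact tables; a tiny Python sketch with illustrative dates:

```python
from datetime import date

# Illustrative LAST COMPLETE DATE values from FACT STATE:
last_complete = {
    "SALES_FACT": date(2011, 5, 31),       # complete to end of May
    "COMMISSION_FACT": date(2011, 4, 30),  # complete to end of April
}

# The conformed "to date" is the earliest last-complete date involved.
conformed_to_date = min(last_complete.values())
assert conformed_to_date == date(2011, 4, 30)  # "year to end of April"
```

Any cross-process report footer can then state that conformed date explicitly.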

Clock Dimensions
A clock dimension contains useful time of day descriptions, such as Hour of Day, Work Shift, Day Period (Morning, Afternoon, Evening, Night), and Peak and Off-Peak periods. Its granularity is typically one row per minute, half hour, or hour of the day—whatever level of detail is needed to provide the row headers and filters that BI users need. Figure 7-6 shows a typical CLOCK dimension with minute granularity. It contains 1,440 rows—one for each minute in a day—plus a zero TIME KEY row for unknown or not applicable time of day. You should avoid defining clock dimensions with a granularity of one row per second unless there really are useful rollups of less than a minute. For most business processes, time at the precision of a second or less is not useful as a dimension (as a report row header or filter), but it may be useful as a fact for calculating exact durations. Storing precise timestamps as facts allows the time dimensions to remain small and concentrate on being good dimensions—sources of good descriptions for report row headers and filters.

Time down to the second is best treated as a fact.

Figure 7-6: CLOCK dimension

CLOCK in Figure 7-6 is an HV dimension because work shifts and peak time can
change but their historical names and times must still be used to describe historic
facts. A standard CALENDAR is an FV dimension because date descriptions are
fixed and do not change. A fact-specific calendar is a CV dimension because it
must contain the current ETL status dates for its specific fact table.
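A minute-grain CLOCK dimension is as easy to generate as the calendar; a Python sketch in which the day-period boundaries are illustrative assumptions, not the book's definitions:

```python
def day_period(hour: int) -> str:
    """Illustrative day-period boundaries; a real CLOCK would use the
    stakeholders' own definitions."""
    if 6 <= hour < 12:
        return "Morning"
    if 12 <= hour < 18:
        return "Afternoon"
    if 18 <= hour < 22:
        return "Evening"
    return "Night"

clock = [{"time_key": 0, "time": "N/A"}]  # zero row: unknown/not applicable
for m in range(1440):
    hour, minute = divmod(m, 60)
    clock.append({
        "time_key": m + 1,
        "time": f"{hour:02d}:{minute:02d}",
        "hour": hour,
        "am_pm": "AM" if hour < 12 else "PM",
        "day_period": day_period(hour),
    })

assert len(clock) == 1441          # 1,440 minutes plus the zero-key row
assert clock[720]["time"] == "11:59"
```

Work Shift and Peak/Off-Peak attributes would be added the same way, driven by the stakeholders' definitions rather than arithmetic.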

Day Clock Pattern - Date and Time Relationships


Problem/Requirement
The standard attributes of time—such as Hour, Minute, AM/PM, and Minute in Day—are independent of date; for example, 11:59 a.m. is always 11:59 a.m. no matter what the date or day of the week is. This is why you model a clock dimension separately from a calendar dimension. But as you embellish the clock with additional attributes—such as peak/off-peak and work shift name—you often find that some of these time descriptions vary by date. For example, 11:59 a.m. might be classified as “Work Time” on Friday March 27, 2010, and “Play Time” on Saturday March 28, 2010. Does this mean that you have to recombine date and time and create a dimension at a granularity of one minute for every day in the data warehouse?

Certain time of day descriptions vary based on date attributes.

Solution
Thankfully not! Date and time don’t have to be combined to solve this problem. Time of day descriptions, like work shift or peak/off-peak, are seldom dependent on the actual date (March 27 or March 28) but on the day type (weekday, weekend, holiday, or unusual day). You can handle this level of variation in the CLOCK dimension by using the TIME KEY to represent versions of a minute. Figure 7-7 shows a DAY CLOCK dimension, with a granularity of one record per minute, per day of the week, per day type. It holds 14 versions of each minute—one for each day of the week, plus an additional version for each day of the week when it falls on a holiday. This results in 20,160 rows in total. If CLOCK attributes vary only by weekday, weekend, and holiday then you would just need three versions of each minute, cutting the table down to 4,320 rows.

A DAY CLOCK contains a version of each minute for each day type; e.g., weekday, weekend.

Figure 7-7: DAY CLOCK dimension with weekend and holiday variations

Resist any temptation to combine CALENDAR and CLOCK dimensions into one.
The resultant dimension would be unnecessarily large and difficult to maintain,
having 525,600 records (365×1440) for each year at the granularity of minute.
Don’t even think about it down to the second.

If work shift start times, or any other CLOCK attributes, change on a specific date rather than “on Saturdays”, infrequent change can be handled by defining CLOCK as an HV dimension with a Type 2 SCD TIME KEY. If date-specific change is occurring on a more regular basis it may be seasonal; e.g., summer descriptions and winter descriptions. Check that values don’t cycle back before you treat them as normal HV changes that would grow the dimension year on year. You may just need a few seasonal versions of a minute as well as day versions.

Time of day attributes that vary based on actual dates can be handled by a seasonal or HV CLOCK dimension.

Day Clock Consequences


HV clock dimensions which contain special versions of a minute, like DAY
CLOCK, keep the dimensional model simple and easy to query, but fact loading
ETL processes must be designed to assign the correct TIME KEY value based on
time of day and:

Day type, which can be looked up from the CALENDAR dimension.

Location type, which can come from an explicit where dimension such as STORE, or the implicit where details embedded within dimensions such as CUSTOMER, EMPLOYEE, or SERVICE.

The current version of the minute, where CLOCK.CURRENT=‘Y’, unless the ETL processes are loading late-arriving facts and older versions of the time descriptions would be valid.

Clock dimensions that contain special versions of minutes require more complex TIME KEY ETL lookups.

Time Keys
TIME KEY in Figure 7-7 is a normal surrogate key with no implicit time meaning. Unlike DATE KEY it is not derived from time and is not in time sequence (though the first 1,440 are). By keeping time keys “meaningless” you can start with a simple clock dimension and expand it (by creating new rows) to cope with attribute variations as they arise. TIME KEYs are normal surrogate keys that are not based on time sequence; this allows them to cope with change and variation when it arrives. For example:

Time of day attributes that vary by location. For example, certain branch types may have longer operating hours than others, or different TV channels may have different advertising slot names and lengths.

Time of day descriptions may simply change. The standard attributes of time
such as hour and minute cannot change (unless everyone gets new decimal
watches) and are defined as fixed value (FV) attributes. But an organization
may decide to change the start time of its peak service. You can define the
Peak/Off Peak attribute as HV to preserve the peak/off peak status of historical
descriptions. The TIME KEY can act like any other HV surrogate key and allow
an ETL process to create new versions of the minutes that are moving from
peak to off peak and vice versa.

International Time
Problem/Requirement
To analyze global business events, a data warehouse needs to handle international time correctly. For customer (or employee) behavioral analysis, local time of day, weekday status, holiday status, and season are important. An organization-wide standard time perspective—irrespective of event location—is equally important for measuring simultaneous operational activity and accounting for financial transactions in the correct fiscal period. International events must be analyzed by local and standard time.

Regardless of how events are originally recorded—using local time or the standard time of a central application server set to Greenwich Mean Time—converting between the two requires an understanding of event geography, time zones, and “daylight saving” that is beyond individual queries. Just how many time zones are there? It’s not 24!

Solution
If standard organization and local customer time are important, the data warehouse should provide both as readily available dimension roles to avoid inconsistent and inefficient time zone calculations within reports. For consistency, a shared ETL process should perform all time zone conversions, and the results should be used to overload international facts with additional time dimension keys. Figure 7-8 shows how local time is modeled in a star schema—by overloading a global sales fact table with extra date and time of day keys (LOCAL DATE KEY and LOCAL TIME KEY) so that the CALENDAR and CLOCK dimensions can play the dual roles of Standard Sale Time and Local Sale Time.

Overload the facts with additional time dimensions to provide dual time perspectives.

Figure 7-8: Sales fact table overloaded with local and standard time dimensions

Consequences
All dimensional overloading patterns require additional ETL processing and make fact tables larger, but the trade-off is faster, simpler, more consistent BI queries.
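The shared conversion step can be sketched with Python's stdlib zoneinfo (available from Python 3.9); the store-to-time-zone mapping is an illustrative assumption, and in practice it would come from a where dimension:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Illustrative mapping from store to IANA time zone, maintained by ETL.
store_tz = {"NYC-01": "America/New_York", "LON-01": "Europe/London"}

def local_event_time(utc_ts: datetime, store: str) -> datetime:
    """Convert a standard (UTC) event timestamp to the store's local time,
    from which LOCAL DATE KEY and LOCAL TIME KEY would be derived."""
    return utc_ts.astimezone(ZoneInfo(store_tz[store]))

# A sale recorded at 01:30 UTC on March 17 is a March 16 evening sale locally:
sale = datetime(2011, 3, 17, 1, 30, tzinfo=timezone.utc)
local = local_event_time(sale, "NYC-01")
assert (local.year, local.month, local.day, local.hour) == (2011, 3, 16, 21)
```

Note that zoneinfo handles daylight saving (New York is UTC-4 on this date, not UTC-5), which is exactly the complexity that should live in one shared ETL process rather than in individual queries.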

Multinational Calendar Pattern


Problem/Requirement
For a single-country data warehouse, adding holiday schedules and season descriptions to the calendar dimension is relatively straightforward. But when a data warehouse goes global, these attributes become problematic, because holidays and seasons are location-specific or geopolitical time attributes that vary by location, just as time zones do. If the number of countries to be covered is small—and will remain that way—then their holiday variations can be handled dimensionally by a small repeating group of attributes; for example, if a company operates only in the UK, a single SEASON and the following holiday attributes may be sufficient:

ENGLISH HOLIDAY FLAG
WELSH HOLIDAY FLAG
NORTHERN IRISH HOLIDAY FLAG
SCOTTISH HOLIDAY FLAG

A national calendar table holds geopolitical time attributes keyed on a combination of date key and country, which can lead to over-counting

However, if the data warehouse is expected to cover more than a few countries, you will need a more robust solution. NATIONAL CALENDAR in Figure 7-8 attempts to solve the geopolitical attribute problem by using a composite key of date and country to record holiday information for each date and country combination as separate rows. Unfortunately, this design demands that BI users and developers remember to constrain NATIONAL CALENDAR to a single country when querying the facts, otherwise their answers will be overstated by the number of countries they “let into” the query. For example, if NATIONAL CALENDAR holds holiday information for ten countries and a busy sales manager forgets to correctly constrain the calendar, an ad-hoc analysis of holiday sales revenue will be overstated ten times. The figures would be wrong even if the query filters sales to just one branch for one holiday, because even a single sales transaction on that date will be joined to, and over-counted by, the multiple countries that observe that holiday. Commissioned sales staff may be happy with this oversight—few other BI stakeholders will be so enthusiastic.
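The over-counting risk can be demonstrated with a few lines of SQLite. The schema, names, and figures below are illustrative assumptions for the demonstration, not the book's Figure 7-8 design:

```python
import sqlite3

# Miniature sketch of the over-counting problem with a date+country calendar.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE national_calendar (date_key INTEGER, country TEXT, holiday_flag TEXT);
    CREATE TABLE sales_fact       (date_key INTEGER, revenue REAL);
""")
# March 17, 2011 appears once per covered country in the calendar
con.executemany("INSERT INTO national_calendar VALUES (?, ?, ?)", [
    (20110317, "Ireland",       "Y"),
    (20110317, "United States", "N"),
])
con.execute("INSERT INTO sales_fact VALUES (20110317, 100.0)")  # one real sale

# Forgetting to constrain the calendar to one country doubles the revenue:
overstated = con.execute("""
    SELECT SUM(f.revenue) FROM sales_fact f
    JOIN national_calendar c ON c.date_key = f.date_key
""").fetchone()[0]

# Constraining to a single country gives the true figure:
correct = con.execute("""
    SELECT SUM(f.revenue) FROM sales_fact f
    JOIN national_calendar c ON c.date_key = f.date_key
    WHERE c.country = 'Ireland'
""").fetchone()[0]

print(overstated, correct)  # 200.0 100.0
```

With ten countries in the calendar, the unconstrained join would multiply the single sale by ten, exactly as the sales manager example describes.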

NATIONAL CALENDAR, in Figure 7-8, is a multi-valued dimension. It contains multiple date values for each fact. If not used carefully it has the potential to over-count the facts. Chapter 9 covers multi-valued dimensions in detail.

Country-specific calendar views are safer to use but they limit analysis to one country at a time. They are not a good match for international facts

A safer solution for ad-hoc queries is to provide country-specific calendar views that pre-join CALENDAR to NATIONAL CALENDAR constrained to a single country. BI users can then choose (or be defaulted to) the most appropriate calendar view. Unfortunately, this solution limits analysis to one country at a time, and even then, BI users must still take care to constrain the geography of their queries to precisely match their chosen calendar, otherwise the geopolitical time attributes they use will not actually match the facts. Country-specific calendar dimensions are an international data warehousing anti-pattern: they do not match international fact tables.
Dimensional Design Patterns for Time and Location 219

Solution
To overcome the “one country at a time” query limitation and prevent calendar and fact mismatch you need a different calendar design that truly matches multinational fact tables. MULTINATIONAL CALENDAR in Figure 7-9 looks remarkably like a standard calendar dimension, but it handles date descriptions that vary geographically by storing multiple versions of the dates that have varying descriptions, each with a unique DATE KEY; for example, Figure 7-9 shows the three versions of March 17, 2011 needed to support the different combinations of SEASON and HOLIDAY in the UK, U.S., South Africa, and Ireland on that date.

Figure 7-9 Multinational calendar dimension showing 3 versions of March 17th 2011

A multinational calendar uses a date key that represents a geopolitical version of a date to match multinational facts

But how do these multiple versions of a date behave in fact queries? The answer is “just like a single version of the date” when you ignore multinational attributes. For example, all sales for March 17, 2011 will roll up to a single line on a report if they are grouped solely on CALENDAR_DATE. Only if sales are grouped by SEASON or HOLIDAY (attributes that vary internationally) will the report contain any additional lines, which is exactly what you want. In this way, the multinational calendar is similar to an HV employee dimension that uses surrogate key values to represent historical versions of an employee, except here the surrogate keys represent geopolitical versions of a date.

With a multinational calendar, simple queries can safely cross national boundaries

The benefit of the multinational calendar is that it keeps both the model and queries simple while handling the complexity of the geopolitical attributes. BI users are totally unaware of the multiple versions of a date; they do not have to think about which national calendar to use, their queries can cross national boundaries, and they can use whatever calendar attributes interest them.
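The roll-up behavior can be sketched with an in-memory SQLite database. The date keys, table names, and revenue figures below are illustrative assumptions, not the contents of Figure 7-9:

```python
import sqlite3

# Three geopolitical versions of the same calendar date, each with its own key.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE calendar   (date_key INTEGER PRIMARY KEY, calendar_date TEXT, holiday TEXT);
    CREATE TABLE sales_fact (date_key INTEGER, revenue REAL);
""")
con.executemany("INSERT INTO calendar VALUES (?, ?, ?)", [
    (4062200, "2011-03-17", "Non-holiday"),
    (4062201, "2011-03-17", "St. Patrick's Day"),
    (4062202, "2011-03-17", "Public Holiday"),
])
con.executemany("INSERT INTO sales_fact VALUES (?, ?)", [
    (4062200, 100.0), (4062201, 50.0), (4062202, 25.0),
])

# Grouping on the non-varying attribute collapses all versions to one line:
by_date = con.execute("""
    SELECT c.calendar_date, SUM(f.revenue) FROM sales_fact f
    JOIN calendar c ON c.date_key = f.date_key
    GROUP BY c.calendar_date
""").fetchall()

# Grouping on a geopolitical attribute splits the report, as intended:
by_holiday = con.execute("""
    SELECT c.holiday, SUM(f.revenue) FROM sales_fact f
    JOIN calendar c ON c.date_key = f.date_key
    GROUP BY c.holiday
""").fetchall()

print(by_date)          # [('2011-03-17', 175.0)]
print(len(by_holiday))  # 3
```

Note that no country filter is needed anywhere: the facts carry the correct date version, so totals are never overstated.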

Consequences
BI user interfaces that provide date lists driven from a multinational calendar must do a Select Distinct. But that should be the default for all value lists anyway! ETL fact loading processes must know how to assign the correct DATE KEYs based on the when and where details of business events. You also need to think carefully about how ETL processes create DATE KEYs for multiple versions of a day in the first place.
220 Chapter 7 When and Where

Date Version Keys


Problem/Requirement
Multiple DATE KEYs for a date must still sort in date order for efficient partitioning and join processing

When creating surrogate key values for multiple versions of a day it is important to preserve their date order sequence so that, for example, all versions of March 17, 2011 are sorted together, ahead of all versions of March 18, 2011 (as shown in Figure 7-9). This is vital for efficient fact table partitioning and date range join processing that uses SQL BETWEEN logic.
Solution
Maintain DATE KEY sort order by appending a fixed length version number

You can maintain surrogate key date order sequence by appending a version number to the end of the standard sequential date key—effectively scaling it by the number of version digits. Figure 7-9 shows an epoch date key (generated using a reference date of January 1, 1900) with a two-digit version number appended. Two digits allow the calendar to support up to 100 versions (0-99) of each date. The same technique can also be applied to ISO format date keys, in which case YYYYMMDD would become YYYYMMDDVV, where VV is the version number.

Building version numbers into your date keys is a good idea even if your data
warehouse or data mart will never go international. You never know when an extra
version of a date will come in handy.
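As a sketch, the two key formats might be generated like this. The January 1, 1900 reference date follows the text; the helper names are my own:

```python
from datetime import date

EPOCH = date(1900, 1, 1)  # reference date for the epoch-style key

def date_key(d: date, version: int = 0) -> int:
    """Epoch date key scaled by two version digits (00-99)."""
    assert 0 <= version <= 99
    return (d - EPOCH).days * 100 + version

def iso_date_key(d: date, version: int = 0) -> int:
    """YYYYMMDDVV variant of the same idea."""
    assert 0 <= version <= 99
    return int(d.strftime("%Y%m%d")) * 100 + version

# Every version of March 17 sorts ahead of every version of March 18,
# so date-range BETWEEN logic and partitioning still work:
print(date_key(date(2011, 3, 17), 99) < date_key(date(2011, 3, 18), 0))  # True
print(iso_date_key(date(2011, 3, 17), 2))  # 2011031702
```

Scaling by 100 rather than concatenating strings keeps the key an integer, which matters for the key-size considerations discussed below.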

You can create a date version for every country or just one for each possible variation on a date

The number of date versions needed depends on your multinational business requirements. You can create a date version for every country (200+). This might be appropriate if there are many geopolitical attributes and the combination of possible values is greater than the number of countries. Alternatively, if the only attribute that varies by location is HOLIDAY (Y or N), then you need only two versions of a day: one for HOLIDAY = ‘Y’ and one for ‘N’. Only one version would be needed for any date that is globally a holiday or non-holiday. A financial organization might use a calendar with six versions of each day, one for each of its global markets.

Needing a date version for each country is unlikely, because many will share
common geopolitical attribute values. Create a single “00” standard version for
each day, and then add versions as needed when you encounter regional or
international variations.

Consequences
Because CALENDAR is the most commonly occurring role-playing dimension, it
is important to keep DATE KEYs small when modeling for multinational versions.
If you really need more than ten versions of a date and you have chosen
YYYYMMDD format date keys, adding a two digit version number will require an
8-byte integer. If you can live with ten versions or less—or use an epoch-based date
key—a 4-byte integer will suffice. Smaller date keys are always a good thing—
especially for larger fact tables!

International Travel
Events with pairs of when and where details are typically movements with interesting distance, duration and speed measures

To enable BI carbon footprint analysis, Pomegranate stakeholders have modeled the national and international flights taken by their global sales and consulting force. The resulting EMPLOYEE FLIGHTS event table (Figure 7-10) contains 6 event stories—6 flights taken by employee Bond during July 2011. These are typical movement stories containing pairs of when and where details that give rise to interesting when and where related measures, such as distance, duration and speed, in addition to other explicit facts such as their associated costs; e.g., CO2 emissions.

Figure 7-10 Flight events for employee James Bond

Movement doesn’t have to be from one geographic location to another; it can be between virtual locations, such as the URLs of a website, or between members of a social network. Many of the same questions apply: How long does it take to navigate from page A to B or pass intelligence from James Bond to Jason Bourne and how far apart are they (measured in page links or people rather than miles)?

Figure 7-11 shows the flight events modeled as a star schema using the CALENDAR, CLOCK, and AIRPORT dimensions to play multiple roles of departure and arrival times and locations. This design can easily be used to answer many of the stakeholders’ questions:

Which Employees travel the most frequently and furthest?
Which Airlines are used most often?
Which Airlines have the lowest CO2 figures on the routes we use?

But it makes one rather important question surprisingly difficult:

Where do our employees need to travel to on business?

The default from and to details may not answer the most important where questions

ARRIVAL AIRPORT will tell you where airlines are flying employees to—but that’s not quite the same thing. Figure 7-10 shows that Bond took three flights on July 18th, each with the REASON of attending a conference. He did not, of course, attend three conferences in one day, nor did he actually have to go to Amsterdam or Minneapolis. He simply chose that route from London to the one conference he needed to attend in Phoenix. Apparently that routing had lower CO2 emissions per passenger than a direct alternative because it used a larger, newer aircraft.

Figure 7-11 Flight star schema

The most interesting where details are typically the first and last points in a journey

Bond’s first multi-flight journey can be worked out manually by browsing all his flights and spotting the short gaps between the connecting flights and the longer gap that precedes his flight on July 21—which represents a different journey from Phoenix to New York for a consulting engagement. But getting a journey-level perspective on all of the flights in a large fact table via BI queries is difficult, because it involves comparing pairs of flights in the correct order on a per-employee basis. DW/BI designers don’t want to hear that a query is difficult.

If stakeholders use the prepositions “from” and “to” to connect where details to the main clause of an event, it is an obvious clue that the event represents movement. Ask stakeholders for related stories such as those in Figure 7-10 to discover if individual movement events are part of a sequence that describes a greater journey from an origin to a final destination.

Solution
Overload every fact with the first and last locations within a meaningful sequence

The FLIGHT FACT table, shown in Figure 7-12, has been modified to contain two extra airport foreign keys, representing the journey origin and journey destination locations not found in the original EMPLOYEE FLIGHT event details. With these additional AIRPORT roles, it suddenly becomes trivial to answer questions about where frequent flyer employees are located (Journey Origin) and where they really have to go to (Journey Destination). These incredibly useful first and last locations are hidden amongst all the flight information but can be found by applying a time-based business rule: “all flights taken by the same employee, no more than four hours apart, are legs of the same journey”. This test would be difficult for BI tools using non-procedural SQL but relatively simple for ETL processes with access to full procedural logic.
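A minimal sketch of such an ETL read-ahead pass, assuming the four-hour rule and illustrative field names (the times are simplified and time zones are ignored here):

```python
from datetime import datetime, timedelta

MAX_GAP = timedelta(hours=4)  # the business rule from the text

def tag_journeys(flights):
    """Given one employee's flights sorted by departure time, tag each row
    with its journey origin and journey destination airports."""
    journeys, current = [], []
    for f in flights:
        # A gap longer than four hours starts a new journey
        if current and f["departs"] - current[-1]["arrives"] > MAX_GAP:
            journeys.append(current)
            current = []
        current.append(f)
    if current:
        journeys.append(current)
    for legs in journeys:
        for leg in legs:  # overload every leg with first/last locations
            leg["journey_origin"] = legs[0]["from"]
            leg["journey_destination"] = legs[-1]["to"]
    return flights

flights = [  # loosely modeled on the Bond example: three legs, then a new journey
    {"from": "LHR", "to": "AMS", "departs": datetime(2011, 7, 18, 6, 0),  "arrives": datetime(2011, 7, 18, 8, 30)},
    {"from": "AMS", "to": "MSP", "departs": datetime(2011, 7, 18, 10, 0), "arrives": datetime(2011, 7, 18, 18, 30)},
    {"from": "MSP", "to": "PHX", "departs": datetime(2011, 7, 18, 20, 0), "arrives": datetime(2011, 7, 18, 23, 0)},
    {"from": "PHX", "to": "JFK", "departs": datetime(2011, 7, 21, 9, 0),  "arrives": datetime(2011, 7, 21, 16, 0)},
]
tag_journeys(flights)
print(flights[0]["journey_destination"])  # PHX: where the employee really had to go
print(flights[3]["journey_origin"])       # PHX: the July 21 flight starts a new journey
```

The single forward pass per employee is what makes this cheap in ETL and painful in set-based SQL.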

Figure 7-12 Flight fact table with dimensional overloading

First and last events often contain why and how details that describe the cause and effect of all the movements within a sequence

Often, the locations of first and last events represent something even more interesting than additional where dimensions; they represent why and how; for example, the first web log entry for a visitor arriving at a website contains the URL previously clicked on—this is usually a search engine or banner ad. In which case it represents why the visit took place and contains referral information, such as the advertising partner or search string. Similarly, the last URL visited is significant because it can describe the outcome of the visit—how it went. For example, if the last URL is a purchase checkout confirmation page then the visit was a successful sales transaction and each click leading up to the purchase can be labeled as such.

Because first and last locations are so significant, they should be attached to all the events in a sequence to help describe events more fully. Do this by overloading the fact table with additional location foreign keys or brand new why and how dimension keys.

Consequences
Adding useful dimensions from related events is another example of dimensional
overloading that requires extra ETL processing and additional fact table storage. In
this case, ETL must make multiple passes of the input data to read ahead, decide
which events are related and then go back and load the facts with this extra infor-
mation. However, this is well worth doing, so that common BI questions can be
answered without resorting to complex and inefficient SQL.

Time Dimensions or Time Facts?


Flight facts could be overloaded with actual departure and arrival time dimensions, but would they be useful dimensions?

In addition to overloading the flight fact table with where dimensions, you might consider overloading it with when dimensions too—documenting the actual departure and actual arrival times of each flight, if these were available. This would allow Pomegranate to measure the on-time performance of airlines. But should these additional when details be modeled as dimensions? Would they provide useful new ways of grouping the data in addition to the existing scheduled time dimensions?

If actual and scheduled dates vary by very little there may be no value in defining the actuals as dimensions

If stakeholders asked for flights to be summarized by the ACTUAL ARRIVAL DATE dimension rather than the SCHEDULED ARRIVAL DATE dimension, it would make little difference to the answers they saw, unless many flights arrive a day (or more) late. Even then, comparing the two sets of dates dimensionally would produce skewed measures of airline performance; for example, a flight scheduled to arrive at 23:59 on March 31st could be only two minutes late but would be reported as arriving in a different fiscal quarter. In contrast, a flight scheduled to arrive at 8:55 a.m. could be just over 15 hours late and still roll up to the same day, when compared using ACTUAL ARRIVAL DATE and SCHEDULED ARRIVAL DATE. It would appear that the actual arrival and departure dates separated from their time of day components have no value as dimensions. Creating and indexing additional foreign keys for them in the fact table would be a waste.

Actual timestamps make good facts that can be used to calculate additional delay and duration facts

However, the actual timestamp values themselves could be held in the fact table because they are valuable for calculating delays that can be used to measure airline performance (perhaps filtering to ignore two-minute delays but looking for anything over two hours). Better still, the FLIGHT DELAY could be calculated during ETL and stored as an additive fact along with FLIGHT DURATION, as shown in Figure 7-12. Both of these facts should be pre-calculated—rather than force the BI users to perform the time arithmetic, especially because the timestamps involved are in different time zones!
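A sketch of that pre-calculation, using Python's timezone-aware datetimes with illustrative fixed offsets (a real ETL job would apply full time zone rules, and the times shown are invented):

```python
from datetime import datetime, timezone, timedelta

# Fixed offsets standing in for real time zone rules (July: BST and MST)
LONDON  = timezone(timedelta(hours=1))
PHOENIX = timezone(timedelta(hours=-7))

actual_departure  = datetime(2011, 7, 18, 9, 10, tzinfo=LONDON)
scheduled_arrival = datetime(2011, 7, 18, 14, 55, tzinfo=PHOENIX)
actual_arrival    = datetime(2011, 7, 18, 15, 40, tzinfo=PHOENIX)

# Aware datetimes make cross-zone arithmetic safe: subtraction is done in UTC
flight_delay_minutes    = (actual_arrival - scheduled_arrival).total_seconds() / 60
flight_duration_minutes = (actual_arrival - actual_departure).total_seconds() / 60

print(flight_delay_minutes)     # 45.0
print(flight_duration_minutes)  # 870.0 (14.5 hours, despite the local clock times)
```

Storing the results as additive facts spares BI users from ever repeating this arithmetic at query time.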

Fact tables can be usefully overloaded with facts calculated using the next event

Figure 7-12 shows one more time-related fact called Layover Duration, which is the time spent at the arrival airport (or city) before taking the next flight. This is an example of fact overloading, again performed by ETL reading ahead and picking up details from the next related event.

The actual departure and arrival dates do not make good additional time dimensions in this particular example because they do not vary significantly from the scheduled dates—they are usually the same date or one day later. For many other business processes where actuals do vary significantly from targets or schedules, actual dates would make very useful time dimensions indeed.

National Language Dimensions


International data warehouses need to be multilingual

Data warehouses that have to deal with international locations and time zones will also have to provide national language support (NLS). Stakeholders will want to ask business questions in their own language and have the results translated.

National Language Calendars


Use the localization features of your BI tools and DBMS to support local date formats and month name translation

Multilingual calendar presentation styles can often be handled by the localization features built into database management systems and/or BI tools; for example, you can configure language and default date presentation format (MM/DD/YYYY for the USA and DD-MM-YYYY for Europe) at the database schema level or in the BI tool metadata layer to reformat dates into the appropriate local language for presentation. Changing the presentation format at the database or BI tool level preserves the correct date sort order of the underlying queries.

If BI users and developers require national language support for reporting element names while constructing ad-hoc queries (for example, Italian users want to select “Mese Fiscale” and “Motivo per il Volo” rather than “Fiscal Month” and “Flight Reason”), attribute name translation should be handled by the BI tool semantic layer rather than database views. This keeps the SQL or OLAP query definitions portable across borders.

Swappable National Language Dimensions Pattern


Problem/Requirements
Stakeholders want reports in English, French and Italian

Pomegranate has BI users in the UK, U.S., France, and Italy who want their reports to use the local language for descriptive labels—such as full product descriptions or flight reasons. One possible design is to create additional dimension columns for each of the required languages (for example, FRENCH FLIGHT REASON and ITALIAN FLIGHT REASON). But this approach overcomplicates the dimensions, especially if many attributes need localization, and many languages have to be supported. It also requires reports to be rewritten to use each new language column as other countries come on line.

Solution
Use hot swappable dimensions

Instead, a more scalable design is to create separate hot swappable dimensions (SD) for each language. Each language version would be identical in structure (identical table name, identical column names, and identical surrogate key values) but with its descriptive column contents translated as required. These language-specific dimensions would then be selected based on the schema the BI user logs into. For example, Italian user IDs would default to the schema with Italian versions of the PRODUCT and FLIGHT REASON dimensions.

Create separate hot swappable dimensions for each reporting language

With this approach standard reports can be developed once and run unaltered (as long as they do not filter on translated descriptions) in multiple offices with localized results. For example, a CO2 footprint report in the London office that categorizes travel reasons as “Conference,” “Consulting,” and “Return Home”, would display “Congresso,” “Consulto,” and “Casa di Ritorno” when in Rome.

Using separate hot swappable dimensions for national languages means that you
can add new languages at any time without affecting the existing schemas and
reports. This allows you to deliver an agile solution with a single language initially
and then go global, without incurring technical debt.
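One way to sketch the swap is with SQLite attached databases standing in for per-language schemas. The structure and surrogate keys are identical; only the descriptions differ. All names and translations here are illustrative:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("ATTACH ':memory:' AS en")  # English schema
con.execute("ATTACH ':memory:' AS it")  # Italian schema

# Identical table name, columns, and surrogate key values in each schema
for schema, labels in {"en": ["Conference", "Consulting", "Return Home"],
                       "it": ["Congresso", "Consulto", "Casa di Ritorno"]}.items():
    con.execute(f"CREATE TABLE {schema}.flight_reason "
                "(reason_key INTEGER PRIMARY KEY, reason TEXT)")
    con.executemany(f"INSERT INTO {schema}.flight_reason VALUES (?, ?)",
                    enumerate(labels, start=1))

def reasons(user_schema):
    # The same report SQL runs unaltered; only the user's default schema differs
    return [r[0] for r in con.execute(
        f"SELECT reason FROM {user_schema}.flight_reason ORDER BY reason_key")]

print(reasons("en"))  # ['Conference', 'Consulting', 'Return Home']
print(reasons("it"))  # ['Congresso', 'Consulto', 'Casa di Ritorno']
```

Because sorting is driven by the shared surrogate key rather than the translated text, both versions of the report return rows in the same order.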

Consequences
When translating dimensional attributes, care must be taken to preserve their
cardinality; for example, 50 distinct product descriptions in English must remain
50 distinct product descriptions in French and Italian—so that reports contain the
same number of rows with the same level of aggregation when translated.
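A simple translation QA check along these lines might look like the following; the data and helper name are hypothetical:

```python
# A translated attribute must keep the same cardinality as the original,
# otherwise grouped reports return different numbers of rows per language.
english = {"P1": "Laptop 15in", "P2": "Laptop 17in", "P3": "Tablet"}
italian = {"P1": "Portatile 15in", "P2": "Portatile 17in", "P3": "Tavoletta"}

def preserves_cardinality(original, translated):
    assert original.keys() == translated.keys(), "missing translations"
    return len(set(original.values())) == len(set(translated.values()))

print(preserves_cardinality(english, italian))  # True

# A careless translation that collapses two products would be caught:
bad = dict(italian, P2="Portatile 15in")  # duplicates P1's description
print(preserves_cardinality(english, bad))  # False
```

Running a check like this as part of the dimension load catches translation collisions before they silently change report aggregation.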

Preserve sort order and cardinality

National language versions of a dimension sort differently. Cryptic business keys (BK) are often stripped from dimensions if they are never required for display purposes. However, they can be used (without being displayed) to provide consistent sort order when standard reports are delivered in multiple languages.

Summary
Time is modeled dimensionally by separating date and time of day into CALENDAR and
CLOCK dimensions which should contain all the descriptive time attributes BI users need.

Period Calendars, such as MONTH are built as rollups of the standard CALENDAR. They are
used to explicitly define the time granularity of higher level fact tables.

Fact-specific calendars, built using ETL fact state information, are used to ensure valid YTD
comparisons.

International facts should be overloaded with additional time keys to support standard and local
time analysis.

Location-specific date descriptions and day-specific time descriptions can be handled by using
the time surrogate keys DATE KEY and TIME KEY to represent versions of a date or minute.

Journey analysis can be enhanced by overloading movement facts with additional location keys
and why and how dimensions based on the first and last locations in a meaningful sequence.

Separate hot swappable language-specific dimensions are used to support national languages.
8
HOW MANY
Design Patterns for High Performance Fact Tables and Flexible Measures

How many times must a man look up…
— Bob Dylan, Blowin’ in the Wind

Everything that can be counted does not necessarily count;
everything that counts cannot necessarily be counted.
— Albert Einstein

This chapter covers techniques for incrementally designing and developing high-performance fact tables and flexible measures

In this chapter we describe how the three fact table patterns—transaction fact tables, periodic snapshots, and accumulating snapshots—are implemented to efficiently measure discrete, recurring and evolving business events. We particularly focus on the agile design of accumulating snapshots, by describing how the requirements for these powerful but complex fact tables can be visually modeled as evolving events using event timelines, our final BEAM✲ modelstorming tool. We also describe the BEAM✲ notation for capturing fact additivity and fully documenting the limitations of semi-additive facts, such as balances. We conclude with techniques for optimizing fact table performance and multi-fact table reporting by concentrating on design patterns for aggregates and other derived fact tables that accelerate and simplify BI queries.

Chapter 8 Design Challenges At a Glance

Point in time event measurement
Periodic measurement
Evolving process measurement
Modeling evolving event milestones and duration measures
Incremental development of complex fact tables
Flexible fact definition
Fact table performance
Correctly querying multiple fact tables at once
Cross-process analysis using simple BI tools
228 Chapter 8 How Many

Fact Table Types


There are three fact table types. They vary in how they represent time

Facts are stored in three types of fact table: transaction fact tables, periodic snapshots, and accumulating snapshots that correspond to the three event story types: discrete, recurring, and evolving. Table 8-1 shows how each type represents time, and how it is maintained by ETL.

Table 8-1  Fact table types

FACT TABLE TYPE         BEAM✲ CODE  STORY TYPE  TIME                      TIME DIMENSION(S)            ETL PROCESSING
Transaction fact table  [TF]        Discrete    Point in time or          Transaction date (and time)  Insert
                                                short interval
Periodic snapshot       [PS]        Recurring   Regular predictable       Period (e.g., Month) or      Insert (and update
                                                interval                  period end date (and time)   if period-to-date)
Accumulating snapshot   [AS]        Evolving    Irregular unpredictable   Multiple milestone dates     Insert and update
                                                longer interval           (and times)

Transaction Fact Table


Transaction fact tables store point in time or short duration facts

Transaction fact (TF) tables are used to store point-in-time events, such as retail sales purchases, or short duration events, such as phone calls, that are completed by the time they are loaded into the data warehouse. These discrete events are the atomic-level details of business processes—the individual transactions captured by the operational system. Point in time facts have a single time dimension representing when the facts occurred. For short duration facts, the time dimension usually represents start time and can be accompanied by a second end time dimension, or simply a duration fact, if end time will not be used for grouping or filtering. If date and time of day are significant, each logical time dimension will be split into physical CALENDAR and CLOCK dimensions as described in Chapter 7. Figure 8-1 shows a BEAM✲ transaction fact table SALES FACT [TF] with a granularity of receipt line item: one record for each different product on a customer’s sales receipt.

Figure 8-1 Transaction fact table
Design Patterns for High Performance Fact Tables and Flexible Measures 229

Financial transaction fact tables often have an additional “book date” or applicable financial period dimension to handle late-arriving transactions and adjustments. The generic version of this is an audit date dimension, which can be added to any fact table to record when facts are inserted.

Transaction fact tables are insert only – which speeds up their ETL processing

Transaction fact tables are insert only because all the information about their transactional facts is known at the time they are loaded into the data warehouse, and does not change—unless errors occur. Even then, if the errors are operational rather than ETL, they are often handled as additional adjustment transactions that must be inserted. This helps to keep ETL processing as simple and efficient as possible—an important consideration when loading hundreds of millions of rows per day. Although transaction fact tables can be extremely deep, they are generally narrow—containing only the small numbers of facts captured by operational systems on any one transaction.

Consequences
Transaction fact tables often need to be supplemented with snapshots for BI usability and query performance

Transaction fact tables are the bedrock of dimensional data warehouses. Because they do not summarize operational detail, they provide access to all the dimensions and facts of a business process. In theory, this means they can be used to calculate any business measure. However, in practice—due to their size and the complexity of many business measures—they can’t be used directly to answer every question. For example, transaction fact tables are impractical for repetitively calculating running totals over long periods of time. For efficiency, cumulative facts, such as balances, are best modeled as recurring events and implemented as periodic snapshots.

Periodic Snapshot
Periodic snapshots store regularly recurring facts

Periodic snapshots (PS) are used to store recurring measurements that are taken at regular intervals. Recurring measurements can be atomic-level facts that are only available on a periodic basis (such as the minute by minute audience figures for a TV channel), or they can be derived from more granular transactional facts.

Periodic snapshots can contain atomic-level facts but are typically used to hold measures derived from more granular transactions

Most data warehouses use daily or monthly snapshots to store balances and other measures that would be impractical to calculate at query time from the raw transactions. For example, compare the cost of calculating product revenue and product stock level for April 1st 2011 using atomic-level sales and inventory transactions. Product revenue is calculated by summarizing that one day’s worth of sales transactions, whereas the product stock level calculation requires every inventory transaction prior to April 1st 2011 to be consolidated. To efficiently answer stock questions you need a periodic snapshot, such as STOCK FACT shown in Figure 8-2. This is a daily snapshot of in-store product inventory that records the net daily effect of inventory transactions, rather than the transactions themselves.

Figure 8-2 Periodic snapshot fact table

Periodic snapshots have fewer dimensions than transaction fact tables but more facts

Although periodic snapshots share many dimensions with their corresponding transaction fact tables, they will generally have fewer of them—because some will be lost when transactions are rolled up to a daily or monthly level. Periodic snapshots will typically have more facts than transaction fact tables. Their design is more open-ended—limited not by what is captured on a transaction, but only by the imagination of the BI stakeholders. Adding new facts to a transaction fact table is rare—the operational systems would have to be updated to capture more information. But periodic fact tables are more frequently refactored with additional facts as BI stakeholders become more creative in defining measures and key performance indicators (KPIs).

Periodic snapshots are typically loaded on an insert-only basis

Like transaction fact tables, periodic snapshots are typically maintained on an insert-only basis. For example, daily stock levels for each product at each location, shown in Figure 8-2, will be inserted into the STOCK FACT table at the end of each day. Most monthly snapshots are maintained the same way—with new facts inserted at the end of each snapshot period (month). However, for some monthly snapshots, such as a customer account snapshot for a bank, there are benefits in updating them on a nightly basis:

Some monthly periodic snapshots can be updated on a daily basis, to improve ETL processing and provide period-to-date measures

Stagger the ETL Workload: If ETL processing waits until the end of the month it has to aggregate a whole month’s worth of transactions for each account. This makes the last night of the month a particularly heavy night: if ETL fails, information for the whole of the last month will be unavailable. However, if ETL is run nightly for the snapshot, it has only to insert or update a day’s worth of transactions for only the accounts that had activity on that day, and if it fails the table is only one day out of date.

Provide Month-to-Date Facts: Although a monthly snapshot can be useful for trending historical customer activity, it is on average 15 days out of date. If it contains an extra month-to-date row for each customer account it can be used to support additional operational reporting requirements.

Load monthly (and quarterly) snapshots on a nightly basis to improve ETL performance and support period-to-date reporting.
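A sketch of the nightly update logic, assuming an illustrative schema; the update-then-insert pattern keeps it portable to SQLite versions without native UPSERT:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE account_snapshot (
    account TEXT, month TEXT, mtd_amount REAL,
    PRIMARY KEY (account, month))""")

def apply_daily_activity(account, month, amount):
    """Upsert one night's net activity: only accounts that were active
    that day are touched, spreading the ETL workload across the month."""
    cur = con.execute(
        "UPDATE account_snapshot SET mtd_amount = mtd_amount + ? "
        "WHERE account = ? AND month = ?", (amount, account, month))
    if cur.rowcount == 0:  # first activity this month creates the MTD row
        con.execute("INSERT INTO account_snapshot VALUES (?, ?, ?)",
                    (account, month, amount))

apply_daily_activity("A-1", "2011-07", 250.0)  # night of July 1
apply_daily_activity("A-1", "2011-07", -40.0)  # night of July 2
mtd = con.execute("SELECT mtd_amount FROM account_snapshot").fetchone()[0]
print(mtd)  # 210.0
```

If a nightly run fails, the month-to-date row is only one day stale, rather than the whole period being missing after a failed end-of-month load.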

Consequences
Many end-of-period measures are complex and time consuming to calculate from raw transactional facts. If the necessary measures are already available from a reliable operational system it is often better to load a periodic snapshot directly from an additional source rather than attempt to reproduce the operational business logic with ETL processing by loading from a transaction fact table.

For case studies describing the use of periodic snapshots, see:

Data Warehouse Design Solutions, Christopher Adamson, Michael Venerable (Wiley, 1998), Chapter 6, “Inventory and Capacity”, and Chapter 8, “Budgets and Spends”

The Data Warehouse Toolkit, Second Edition, Ralph Kimball, Margy Ross (Wiley, 2002), Chapter 3, “Inventory”, and Chapter 15, “Insurance”

Accumulating Snapshots
Accumulating snapshots store evolving events

Accumulating snapshots (AS) are used to store evolving events: longer running events that represent business processes with multiple milestone dates and facts that change over time. They are so named because each evolving event accumulates additional fact and dimension information over time, typically taking days, weeks, or months to become complete.

Unlike transaction fact tables, and most periodic snapshots, accumulating snap- Accumulating
shots are designed specifically to be updated. Facts are inserted into an accumulat- snapshots are
ing snapshot shortly after events begin and are updated whenever event statuses updated with
change. This leaves the fact table containing the final status of every completed ongoing event
event and the current status of all open events. activity
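The insert-then-update lifecycle described above can be sketched as a simple ETL upsert. The schema and function names below are illustrative, not from the book, and the example uses SQLite purely as a stand-in warehouse:

```python
import sqlite3

# Minimal sketch of accumulating snapshot maintenance (illustrative schema).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE lending_fact (
        loan_id      INTEGER PRIMARY KEY,  -- one row per loan
        loan_date    TEXT NOT NULL,
        due_date     TEXT NOT NULL,
        return_date  TEXT,                 -- NULL until the return milestone occurs
        returned     INTEGER DEFAULT 0     -- state count: 1 once the book is back
    )""")

def record_loan(loan_id, loan_date, due_date):
    """Insert a fact row shortly after the event begins."""
    conn.execute(
        "INSERT INTO lending_fact (loan_id, loan_date, due_date) VALUES (?, ?, ?)",
        (loan_id, loan_date, due_date))

def record_return(loan_id, return_date):
    """Update the same row when the return milestone is reached."""
    conn.execute(
        "UPDATE lending_fact SET return_date = ?, returned = 1 WHERE loan_id = ?",
        (return_date, loan_id))

record_loan(1, "2011-06-01", "2011-06-15")
record_loan(2, "2011-06-03", "2011-06-17")
record_return(1, "2011-06-10")

open_loans = conn.execute(
    "SELECT COUNT(*) FROM lending_fact WHERE returned = 0").fetchone()[0]
print(open_loans)  # 1 open loan; the completed loan keeps its final status
```

Completed events keep their final status while open events are revisited by each load, which is exactly why these tables need updatable keys rather than append-only loading.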

Accumulating snapshots have multiple milestone time dimensions.

Figure 8-3 shows an accumulating snapshot for library book lending. It contains examples of books that have been borrowed and returned (completed events), books that are overdue (evolved events), and books that have just been borrowed (new events). LENDING FACT has multiple time dimensions—like all accumulating snapshots—representing the milestones that a book loan can go through. Only two of these (LOAN DATE and DUE DATE) are available when a loan is created.

Figure 8-3: Accumulating snapshot fact table
232 Chapter 8 How Many

Accumulating snapshots can usefully contain duration and state count facts that match their milestone time dimensions.

LENDING FACT also contains a duration (OVERDUE DAYS) and a state count (OVERDUE COUNT). Durations are typical accumulating snapshot facts. If there are a small number of interesting durations, they can be stored as explicit facts. If there are many possible durations because there are a number of milestone dates, the fact table should physically store the milestones as timestamp facts and BI applications should access it through a view that calculates the durations. State counts are another characteristic accumulating snapshot fact. They typically match the milestone dates and simply record a 1 if a milestone has been reached or 0 if it has not. They allow queries to quickly sum the number of events at each milestone in a single pass without decoding dates or applying complex filters. LENDING FACT could be extended with additional state counts for returned, lost, and on loan books.

Consequences

Accumulating snapshots are difficult to build, especially when they merge events from multiple source systems.

Accumulating snapshots that support end-to-end business process measurement are some of the most valuable fact tables, and are very popular with stakeholders, but they can be extremely difficult to build. Many ETL nightmares are caused by trying to merge multiple operational sources and transaction types in one pass into the perfect accumulating snapshot. The code involved is complex and difficult to quality assure, often resulting in delays. And when the snapshot is finally delivered, while it may answer the initial questions perfectly, all too soon stakeholders can hit a BI brick wall when they need to drill into missing details. This happens because accumulating snapshots typically summarize a process from the perspective of the initial event and only record the current status of the overall event. For example, an order processing snapshot that summarizes deliveries for each order line would help to spot problems with fulfillment performance, but would lack the delivery details needed to explain why the problems are occurring.

Develop accumulating snapshots incrementally by modeling evolving events and delivering milestone star schemas.

The agile approach to successfully delivering an accumulating snapshot is to build it incrementally. Using BEAM✲, snapshot requirements are captured by modeling an evolving event (described shortly) that is implemented over a number of short development sprints by remodeling its milestones as simpler discrete events. The resulting transactional star schemas are far easier to build and test individually and can provide early BI value ahead of the accumulating snapshot, which is incrementally created by the relatively straightforward merging of facts that already use conformed dimensions. The added bonus of this approach is reduced technical debt: the atomic-level transaction stars contain all the details stakeholders will need for drill-down analysis in the future.

The Data Warehouse Toolkit, Second Edition, Ralph Kimball, Margy Ross (Wiley, 2002) contains four interesting accumulating snapshot case studies:

Chapter 5, "Order Processing"
Chapter 12, "Education" (college admissions)
Chapter 13, "Healthcare" (billing tracking)
Chapter 15, "Insurance" (claims processing)

Fact Table Granularity

Granularity describes a fact table's level of detail: the meaning of each fact row. It must be clearly documented.

A fact table's granularity is its level of detail: the meaning of each fact row in the table. Granularity can be stated in business terms and/or dimensionally. For example, the business definition of granularity for an order fact table is "one record per order line item", while the dimensional granularity is "orders by date, time, customer, and product". Transaction fact table and accumulating snapshot granularity tends to be defined in business terms while periodic snapshot granularity is defined dimensionally. Whichever approach you choose (often both, for the benefit of stakeholders and the DW team), stating and clearly documenting the granularity is an essential step in fact table design. Fact tables that have fuzzy or mixed granularity definitions are impossible to build and use correctly.

Granularity can be stated in business terms or dimensionally by listing GD columns.

Granularity is documented in the model by recording the combination of granularity dimensions (GD) that uniquely identify each fact. For most transaction fact tables and accumulating snapshots the list of GD columns will include a degenerate transaction ID dimension; for example, a call detail fact table with a business granularity of "one row per phone call" can use a degenerate CALL REFERENCE NUMBER [GD] to uniquely identify each row. This succinct granularity definition is very useful for ETL processing, but for BI purposes it can be helpful if the granularity can also be defined using dimensions that are more likely to be queried—such as customer and call timestamp (assuming a customer can only make one call at a time). These alternative granularity definitions can be documented using numbered GD codes. For example, CALL REFERENCE NUMBER [GD1] and CUSTOMER KEY [GD2], CALL DATE KEY [GD2], CALL TIME KEY [GD2].

For accumulating snapshots and period-to-date snapshots that must be updated, GD columns, especially degenerate IDs, are used to define unique update indexes for fast ETL processing. For advice on fact table indexes see Indexing later in this chapter.
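As a sketch of the unique update index idea, the granularity dimension can be declared as a unique index so the grain is enforced as well as documented. The column names are illustrative only, and SQLite stands in for the warehouse database:

```python
import sqlite3

# Illustrative only: a call detail fact table whose degenerate
# CALL REFERENCE NUMBER [GD] backs a unique update index.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE call_fact (
        call_reference_number TEXT NOT NULL,   -- [GD] degenerate dimension
        customer_key          INTEGER NOT NULL,
        call_date_key         INTEGER NOT NULL,
        call_seconds          INTEGER NOT NULL
    )""")
# The unique index both documents the grain and gives keyed ETL updates a fast target.
conn.execute("CREATE UNIQUE INDEX call_fact_gd ON call_fact (call_reference_number)")

conn.execute("INSERT INTO call_fact VALUES ('CR-001', 42, 20110601, 120)")
try:
    conn.execute("INSERT INTO call_fact VALUES ('CR-001', 42, 20110601, 300)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True  # the grain definition is enforced, not just documented
print(duplicate_rejected)
```

A mixed-grain load attempt now fails loudly at ETL time instead of silently corrupting query results.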

Modeling Evolving Events

Related events that represent a process sequence can be modeled as multi-verb evolving events, initially or retrospectively.

Evolving events represent business processes that are complex enough, or take enough time to complete, that they are described as sequences of smaller milestone events. You can think of them as multi-verb events because each milestone can represent a discrete event (verb). These multiple verbs can be modeled as a single evolving event in two ways:

Initially as evolving: An evolving event can emerge directly in response to a modelstorming "Who does what?" question when stakeholders think of an event as only the beginning of a time-consuming process that needs to be measured end-to-end. If this happens stakeholders will instinctively tell process stories with multiple when details, as described in Chapter 2, that represent the milestones that must be reached to complete the process.

Retrospectively as evolving: You can remodel an evolving event from multiple discrete events, when you discover, with the help of an event matrix, that they represent a process sequence (as described in Chapter 4).

Adding milestone details is straightforward when there is a 1:1 relationship between events.

Whichever route you come to it by, modeling an evolving event involves adding multiple milestone details to an event table, as in the Figure 8-4 example which shows shipment and delivery milestones added to CUSTOMER ORDERS. Adding these milestone details is straightforward when there is a 1:1 relationship between all the events, because their granularity is unchanged by merging them; for example, if each order is associated with exactly one shipment, followed by exactly one delivery, all the details would naturally align and no information is lost by aggregating multiple events, or "made up" by allocating portions of events.

When an evolving event can have repeating milestones, the most recent or total milestone details are stored as part of the event.

However, if an evolving event story can have repeating milestones (multiple occurrences of a specific milestone) there is a 1:M relationship between events, and something has to be done to bring everything to the same granularity. For example, if a single order line item for 100 units results in 4 staggered shipments from the warehouse that are then batched up in 2 deliveries by the carrier, the 4 shipment events and 2 delivery events need to be reduced to a single record to match the order event. The simplest way to align the multiple milestones is to record the totals for their additive quantities and the most recent values for all other details. For example, DELIVERY DATE and CARRIER, in Figure 8-4, hold the last delivery date and the last carrier (if more than one carrier was used) and DELIVERED QUANTITY holds the running total number of items delivered, so far, for each order line item.
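The alignment rule just described, totaling additive quantities while keeping the most recent value for everything else, can be sketched as follows. Field names and the sample shipments are invented for illustration:

```python
# Sketch: collapse repeating shipment milestones into one record per order line,
# summing additive quantities and keeping the most recent value for other details.
shipments = [  # illustrative milestone events for one order line item of 100 units
    {"ship_date": "2011-06-02", "shipped_quantity": 25},
    {"ship_date": "2011-06-05", "shipped_quantity": 25},
    {"ship_date": "2011-06-09", "shipped_quantity": 30},
    {"ship_date": "2011-06-12", "shipped_quantity": 20},
]

def align_milestones(events, additive, date_key):
    latest = max(events, key=lambda e: e[date_key])            # most recent values
    totals = {q: sum(e[q] for e in events) for q in additive}  # running totals
    return {**latest, **totals}

row = align_milestones(shipments, additive=["shipped_quantity"], date_key="ship_date")
print(row)  # {'ship_date': '2011-06-12', 'shipped_quantity': 100}
```

The single aligned row now matches the order event's granularity without losing the quantities that must remain additive.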

Ask how many questions to discover repeating milestones.

You discover the cardinality of milestones by asking how many questions about each milestone verb—these are hidden in the prepositions for the milestone details, particularly in the milestone when details. For the evolving CUSTOMER ORDERS event you would ask the following questions based on SHIP DATE:

How many shipments can there be for an ordered product?

If the stakeholders' answer is "more than one", you should ask:

When there is more than one ship date for an order, which one will you use to measure the order process?

The best way to ask this type of question, about multiple milestone values, is to get stakeholders to fill out the evolving event table with example stories.
Figure 8-4: Evolving orders event

If all the repeated milestone values are needed they must be modeled as discrete events.

Typically, BI queries will use the most recent values for a repeating milestone, but if stakeholders say they need all the values then you will have to model the milestone as a separate discrete event at its atomic level of detail. If you have already done so you can point the stakeholders at its event table. Either way, you still want to push for a single value definition for each repeating detail so that you can add it to the evolving event. To help stakeholders to understand why the most recent value would be useful, remind them that the role of the new event table is to summarize the current progress or final state of each evolving event story.

If milestone events have a M:M relationship it may not be appropriate to combine them in the same evolving event.

If stakeholders continue to struggle to give you a single value definitive answer for a detail, then it probably does not belong in the evolving event. This can happen where there is a M:M relationship between milestones and more complex allocations are needed. In which case, it may not be appropriate to combine any of the details from the milestone. If the initial event and a milestone turn out to have a M:1 relationship, this is not so problematic but some allocation of additive quantities will be needed. For example, if 2 different orders for 100 units are partially fulfilled by a single shipment of 190 units, a SHIPPED QUANTITY of 100 must be assigned to the first evolving order event and 90 assigned to the second.
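The M:1 allocation in that example can be sketched as a simple fill-in-order rule. This first-come-first-served policy is one assumption among many possible allocation rules:

```python
# Sketch: allocate one shipment's quantity across the open orders it fulfills,
# earliest order first, so additive quantities still sum correctly.
def allocate(shipped_quantity, open_order_quantities):
    allocations = []
    remaining = shipped_quantity
    for ordered in open_order_quantities:
        assigned = min(ordered, remaining)
        allocations.append(assigned)
        remaining -= assigned
    return allocations

# Two orders of 100 units partially fulfilled by a single 190-unit shipment:
print(allocate(190, [100, 100]))  # [100, 90]
```

Whatever rule is chosen, the allocated quantities must sum back to the shipment total so that the facts stay additive.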

Tell process stories that describe the typical, min and max intervals between milestones.

If you determine that a milestone's detail belongs in the event, you should use its examples to tell interesting process stories. For milestone dates, ask stakeholders to give you examples that will represent typical, minimum, and maximum intervals between milestones. If a detail has already been modeled as part of a discrete event, you may be able to reuse values from its event table, but these must make sense in combination with the examples already present in the evolving event. For example, if you have used a relative time value like "Today" for ORDER DATE you might leave SHIP DATE as missing to show that initially there is no shipment yet for an order loaded into the data warehouse today.

Use missing values to describe the initial state of an event. You also want to describe its final states, completed or otherwise.

As you add new details, you may have to alter some of the existing example dates to bring out interesting scenarios, such as the initial and final states of the event. The initial state will have missing values for all the milestone details that have not happened yet. This means some details that are mandatory in discrete events must become optional in the evolving event. For example, CARRIER is always present on a CARRIER DELIVERY event but will be "Unassigned" in the evolving CUSTOMER ORDERS event if an order has not been shipped yet. If there can be more than one final state, try to capture additional process stories for each possible outcome; for example, ask stakeholders to give you stories of successfully completed orders and cancelled orders.

When you have finished modeling an evolving event, it is a good idea to reorder the details after the main clause in W and process order—keeping all the whens, whos, whats, wheres, and so on together in the order they appear chronologically. Doing so can make a complex evolving event much easier to read.

Evolving Event Measures

Accumulating snapshots are fact rich. They contain additional evolving measures.

Evolving events are implemented as accumulating snapshots. These are fact-rich fact tables because they typically combine several events, each of which brings its own facts. These combined facts can then be used to calculate additional evolving measures such as event counts, state counts, and durations that are worth storing in the accumulating snapshots to simplify process performance reporting.

Event Counts

Event counts record the number of repeated milestones.

When an evolving event has a 1:M relationship with its milestone events, define additional event measures—such as (number of) SHIPMENTS or (number of) DELIVERIES, in Figure 8-5—to record the number of aggregated/repeated events.

Figure 8-5: Event and status counts

State Counts

State counts record if an event has completed a milestone. They are useful because repeated milestones mean you cannot use milestone dates alone to evaluate progress.

Each milestone date or embedded verb within an evolving event represents a state that the event can reach. Stakeholders will often have questions about how many orders, applicants, claims, etc. have reached a particular state. Answering these questions can be greatly simplified by adding state counts, such as SHIPPED and DELIVERED in Figure 8-5. These counts are 1 or 0 depending on whether a state has been achieved or not. They can be incredibly useful because state logic can often be more complex than you think; for example, you might imagine that count(DELIVERY_DATE) would be an efficient way to count order items that have reached the DELIVERED state but, due to partial deliveries, it's not quite that simple. Instead, you have to test that DELIVERED QUANTITY = ORDERED QUANTITY.

Event status business rules can become complex. They should be evaluated once during ETL processing and the results stored as additive state counts to provide simple, consistent answers for all BI queries.
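The DELIVERED rule above can be sketched as an ETL-time state count calculation. The sample rows are invented; the column names follow the running order example:

```python
# Sketch: evaluate state-count business rules once, during ETL, and store
# additive 0/1 flags instead of re-deriving status logic in every BI query.
order_items = [  # illustrative accumulating snapshot rows
    {"ordered_quantity": 100, "delivered_quantity": 100, "delivery_date": "2011-06-12"},
    {"ordered_quantity": 100, "delivered_quantity": 90,  "delivery_date": "2011-06-12"},
    {"ordered_quantity": 50,  "delivered_quantity": 0,   "delivery_date": None},
]

for row in order_items:
    # Counting non-null delivery dates would wrongly count the partial delivery
    # as done; the state count applies the full business rule instead.
    row["delivered"] = 1 if row["delivered_quantity"] == row["ordered_quantity"] else 0

# State counts are additive: summing them answers "how many items are delivered?"
print(sum(row["delivered"] for row in order_items))  # 1
```

Only the fully delivered item counts; the partially delivered item has a delivery date but a state count of 0, which is exactly the distinction the flag preserves.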

Durations

Milestone timestamps can be used in pairs to create duration facts. These should be named by modeling them with stakeholders.

The multiple when details of an evolving event can be used dimensionally like any other details (for grouping and filtering), but they can also be used in pairs to calculate the elapsed time between milestones. Some of these durations will be key measures of process performance. It's not always obvious which ones are significant or what they should be called by looking at the raw when details. You find out by using timelines to modelstorm the durations with stakeholders. You should also discover the appropriate unit of time measurement (day, hour, or minute), and the acceptable minimum and maximum intervals between events that can be used as alert thresholds to drive conditional reporting applications.

Identifying and naming the right duration measures enables stakeholders to efficiently analyze process bottlenecks.

Additional Process Performance Measures

Quantities from different milestones can be combined to create process performance measures.

Just as the multiple when details of a newly modeled evolving event can be used to create interesting durations, other quantity details from the separate discrete events can be combined to create additional process performance measures. For example, you could use ORDER REVENUE, COST AT SHIPPING, and DELIVERY COST to calculate MARGIN. You should model these additional derived measures with stakeholders to capture their formulas and business names, and add them to the event table with examples.

Event Timelines

Use event timelines to visually model milestones and durations.

The best way to discover the important milestone when details and duration measures of an evolving event is to use an event timeline—like Figure 8-6. You should draw a timeline showing each of the milestone dates of an evolving event in chronological order, so that you can examine each milestone pairing visually, and ask stakeholders for business names for the intervals between them. The most important intervals are likely to have pre-existing names—a sure sign that they have business value and should be modeled as facts—but new and significant intervals can quickly be discovered and named in this way too.

You should try to get a name for each significant duration, but when you have several milestone dates you can end up with a lot of potential durations—too many to name in some cases. The number of durations is equal to: Number of timestamps × (Number of timestamps − 1) / 2. So if an evolving event has six milestones, you have 6 × 5 / 2 = 15 possible durations.

Start by modeling the fixed points on the timeline: the initial when detail and any target dates.

Typically, the most interesting durations will be those measured from the initial event date (Order Date) or from a target date (Delivery Due Date). Start by adding these fixed points on the timeline. These are the fixed value (FV) dimensions of the evolving event. With these in place use the white space on the timeline to prompt stakeholders for the other milestone events and their chronology.

Model durations by naming the gaps between milestone events.

Once you have all the events on the timeline (you may have copied the event sequence from your event matrix), you can then start discovering durations by pointing at the milestone pairs and asking stakeholders to name the gaps between them. Any meaningful duration you discover should be added to the timeline, as in Figure 8-6, which shows three important durations for an evolving order.

Figure 8-6: CUSTOMER ORDERS event timeline showing repeating milestones

Add durations to the evolving event table as derived facts, to document their UoM and range of values.

After naming the durations on the timeline, add them to the evolving event table with example values. As mentioned in Chapter 2, you may question the wisdom of adding so many derived facts (DF) to the event table, but the event table is still a BI requirements model, not yet a physical accumulating snapshot design. Its purpose is to document the measures stakeholders will need, not dictate a physical structure. By adding a duration to the event table you are documenting its name, unit of measurement (UoM), and value range. You are not making a decision about how, if at all, it will be physically stored. Duration definitions can be implemented as database views or report items in BI tool metadata layers.

Number milestone dates DT1-DTn to reference them in duration definitions.

Figure 8-7 shows three durations—PACKING TIME, DELIVERY TIME, and DELIVERY DELAY—added to the event as derived facts. Their definitions can be recorded using simple spreadsheet-like formulas by numbering the event milestones DT1 to DT4. For example, PACKING TIME is defined as DT2 − DT1: the interval between ORDER DATE [DT1] and SHIP DATE [DT2]. The DTn numbering can also be used to record the chronological order of the milestones.

Figure 8-7: Duration measures

DF: Derived Fact, calculated from other columns in the same table.
DTn: Date/Time, numbered in chronological order for use in duration formulas.

All the durations within an event should be defined using the same unit of measure; for example, all [days] or all [hours]. This avoids errors when durations are compared or used in calculations.
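A sketch of how DTn-numbered milestones yield derived duration facts in a single unit of measure. PACKING TIME = DT2 − DT1 comes from the text; the other two formulas, the DT assignments, and all date values are illustrative assumptions:

```python
from datetime import date

# Sketch: derive duration facts from DTn-numbered milestone dates, all in [days].
milestones = {  # invented values for one evolving order
    "DT1": date(2011, 6, 1),   # ORDER DATE
    "DT2": date(2011, 6, 3),   # SHIP DATE
    "DT3": date(2011, 6, 8),   # DELIVERY DATE
    "DT4": date(2011, 6, 6),   # DELIVERY DUE DATE (target, assumed)
}

def days_between(start, end):
    return (milestones[end] - milestones[start]).days

packing_time   = days_between("DT1", "DT2")  # DT2 - DT1 (from the text)
delivery_time  = days_between("DT2", "DT3")  # DT3 - DT2 (assumed formula)
delivery_delay = days_between("DT4", "DT3")  # DT3 - DT4: positive means late (assumed)

print(packing_time, delivery_time, delivery_delay)  # 2 5 2
```

Keeping every duration in the same unit ([days] here) is what makes them safe to compare or combine, per the rule above.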

Using Timelines for Documentation

Timelines should be used to permanently document duration definitions within the model.

Although the DF formulas in the column types of Figure 8-7 are useful for ETL and BI developers, the best way to document the meaning of each duration for the wider audience is to use an event timeline. The timelines you create in a modelstorming workshop should also become a permanent part of the model documentation, along with the other BEAM✲ artifacts: event tables, dimension tables, hierarchy charts, and the event matrix.

Timelines are part of the definition of an accumulating snapshot. They also make great training material for BI users.

Timelines are to evolving events as hierarchy charts are to dimensions. Dimension tables need hierarchy charts to document the levels of their conformed hierarchies. Accumulating snapshots need timelines to document their event sequences and durations. Just as you can use relative spacing on a hierarchy chart to show relative aggregation, spacing on a timeline can show the relative durations of stages within a process, and highlight the most time-consuming events that must be carefully monitored. Timelines are an essential part of the training material for stakeholders who need to work with complex evolving events.

Timelines can be added to the footers or summaries of reports, as simple static graphics, to explain the meaning of the duration figures they contain.

Using Timelines for Business Intelligence

Use dynamic timelines for data visualization on reports and dashboards.

Timelines are not only useful for modeling and documenting event sequences, they make great tools for visualizing process flow in BI applications too. Dynamic or animated versions of the timelines you model can be used on reports and dashboards to display live status counts and durations. In Figure 8-8, an example dashboard for monitoring CUSTOMER ORDERS shows the average durations between milestone events and the number of order items at each stage.

BI developers can create dynamic timelines within reports by displaying duration measures as horizontal stacked bar charts.

The Back of the Napkin, Dan Roam (Portfolio, 2008), Chapter 12, "When can we fix things", contains some great ideas on drawing timelines to solve when problems. Other chapters describe how to draw pictures to solve other 7Ws (who, what, where, how many, why and how) related business problems.

Figure 8-8: Timeline dashboard

Developing Accumulating Snapshots

When milestone events have a 1:1 relationship and are handled by the same operational system, accumulating snapshots can be developed directly.

When you have added all the useful evolving measures to an evolving event, it is time to profile its data sources and, all being well, design an accumulating snapshot fact table such as ORDER FACT [AS], shown in Figure 8-9. This is an accumulating snapshot version of the original transactional ORDER FACT [TF] from Chapter 5. How you go about developing this or any accumulating snapshot depends on the data profiling results. If profiling confirms that all the milestones have a 1:1 relationship and can be obtained from the same source system then you can build the accumulating snapshot directly. This would be the approach for the LENDING FACT [AS] snapshot in Figure 8-3. Because there is a 1:1 relationship between book loans and book returns, no detail is lost, and most importantly for ETL, because these different transactions are handled by a single operational system (for each library) there will be no conformance issues in merging them.

If milestone events have more complex relationships or are handled by different operational systems, accumulating snapshots should be developed incrementally.

ORDER FACT [AS] has a more complex 1:M or M:M relationship between its milestones, which are handled by multiple sales and logistics systems managed by Pomegranate, its distributors, and carriers. Attempting to populate this fact table directly is a high risk, "big development up front" strategy (another form of BDUF that you should avoid). It is unlikely that an accumulating snapshot with such complex sourcing issues could be successfully delivered in one or two normal length development sprints. So, while it's being developed what would be demonstrated, or validated by stakeholder prototyping? Nothing? That's not particularly agile. Instead each of the accumulating snapshot's milestones can initially be developed and delivered, on a more agile basis, as separate transaction fact tables.

Transaction fact tables are used to stage accumulating snapshot data, validate the design, and deliver early BI value.

If an evolving event is modeled retrospectively, you will already have all (or most) of the discrete event definitions for its milestones (you may have discovered additional milestones while modeling the evolving event). These are the blueprints for transaction fact tables that can stage the milestone events prior to merging them in the ultimate accumulating snapshot. If you don't yet have these, you can model them using the techniques described in Chapters 2-4 by pulling out their verbs from the milestone prepositions and asking: "who does each of these?"

Figure 8-9: ORDER FACT accumulating snapshot

Staged accumulating snapshot ETL processing will need to be streamlined if real-time DW/BI requirements exist.

For real-time DW/BI, the latency introduced by staging each milestone in its own fact table first may prevent an accumulating snapshot being updated urgently enough for current day reporting requirements. If streamlining the ETL process becomes paramount and the milestone fact tables are not needed for queries, they can become un-indexed staging tables that are truncated at the end of every load cycle, or be replaced by ETL processes that act as virtual tables, piping their inserts or updates directly to the inputs of the accumulating snapshot process. If a real-time snapshot and queryable detailed fact tables are required, the staging tables can be implemented as un-indexed real-time partitions (covered shortly) that are fully indexed and merged with their fact tables by conventional overnight ETL.

Fact Types

Additivity describes how easy or difficult it is to sum up a fact and get meaningful results.

If the most important property of a fact table is its granularity, the most important property of a fact is its additivity—which tells you whether or not its values can be summed to produce meaningful answers. This is important because stakeholders almost never want to see individual fact values. Instead they want to summarize them, and the easiest way to do that is to sum them. Facts are divided into three types based on their additivity: fully additive, non-additive, and semi-additive.

Fully Additive Facts

(Fully) additive facts can be summed using any combination of their available dimensions.

(Fully) additive (FA) facts produce meaningful results when summed using any combination of the available dimensions. For example, REVENUE in Figure 8-1 can be summed across customers, products, time, and locations—and will always produce a correct total revenue. Additive facts are the easiest to use because there are no special rules about which dimensions they work with, so default measures can be quickly defined in BI tools using the SQL sum() function. For this reason it is always best to record fact information in its most additive form.

Additive facts must use a single standard unit of measure.

The first rule for defining an additive fact is to use a single unit of measure. For example, while modeling an event you may identify a quantity that is recorded in multiple currencies that are documented as [£, $, ¥]. The corresponding fact needs to be converted into a standard currency, otherwise the fact will not be additive across currency.

Store facts in a single unit of measure to make them additive and avoid aggregation errors. If BI applications need to view facts in different units of measure—e.g., report sales in local and standard currency, or product movements in shipping crates rather than retail units—provide conversion factors. These should be stored centrally in the data warehouse (as facts) rather than in the BI applications—because they can change.
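The standard-currency rule can be sketched as a load-time conversion, with the rates held centrally rather than hard-coded in reports. All rates and amounts below are invented:

```python
# Sketch: make a multi-currency fact additive by converting it to a standard
# currency (USD here) at load time. Rates are invented for illustration and
# would be stored centrally in the warehouse because they change.
conversion_rates = {"USD": 1.0, "GBP": 1.6, "JPY": 0.0125}

source_rows = [
    {"amount": 100.0,  "currency": "USD"},
    {"amount": 50.0,   "currency": "GBP"},
    {"amount": 8000.0, "currency": "JPY"},
]

# Naively summing the raw amounts (8150.0) would be meaningless; the
# standardized fact can be summed across any dimension.
total_usd = sum(r["amount"] * conversion_rates[r["currency"]] for r in source_rows)
print(total_usd)
```

Once every row is in one unit of measure, sum() behaves correctly no matter which dimensions the query groups by.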

Non-Additive Facts

Non-additive facts can never be summed to produce meaningful answers.

Non-additive (NA) facts cannot be summed, even if they are in the same unit of measure. For example, UNIT PRICE cannot be summed to produce a meaningful total—even if all unit prices are recorded in dollars. Instead UNIT PRICE can be averaged, or used to create an additive SALE VALUE (UNIT PRICE × SALE QUANTITY) fact. BI users will likely want to use this additive measure more often than UNIT PRICE, so it should be stored in the fact table, and if storage is an overriding concern, the non-additive fact should be derived, at query time, by just the reports that need it.

Percentages are non-additive. Only their additive components should be stored as facts.

Percentages are non-additive; two product purchases with a discount of 50% do not equate to a 100% discount. Because of this, percentages make terrible facts, but they do make great measures and KPIs that BI users will want to see on reports and dashboards. Facts like DISCOUNT should be stored as an additive monetary amount (as in Figure 8-1), allowing BI tools to calculate the correct percentages within the context of a report.
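The point about storing additive components can be sketched by comparing a sum of percentages with a percentage of sums computed at report time. The figures are invented:

```python
# Sketch: store additive DISCOUNT and SALE VALUE amounts; derive the
# percentage in the report as a ratio of sums, never by summing percentages.
rows = [  # illustrative fact rows, each with a 50% discount
    {"sale_value": 200.0, "discount": 100.0},
    {"sale_value": 100.0, "discount": 50.0},
]

wrong = sum(r["discount"] / r["sale_value"] * 100 for r in rows)  # summed percentages
right = sum(r["discount"] for r in rows) / sum(r["sale_value"] for r in rows) * 100

print(wrong, right)  # 100.0 50.0
```

Summing the row-level percentages suggests a nonsensical 100% discount, while the ratio of the stored additive amounts gives the correct 50% for any report context.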

Timestamps are non-additive facts, but pairs of timestamps can be subtracted to produce duration facts that can be treated as additive or semi-additive.

Non-additive facts Percentages and unit prices can easily be converted into additive facts, but other
can be aggregated quantities cannot. These facts have to be clearly documented as non-additive along
using other functions with their compatible alternative methods of aggregation for creating useful meas-
such as min, max or ures. For example, TEMPERATURE NA is a non-additive fact that can be aggre-
average gated using functions such as min, max, and average.

Semi-Additive Facts
Semi-additive facts are harder to work with than additive or non-additive facts

Additive facts are easy to work with—they can be summed with impunity. Non-additive facts require a little more creativity to aggregate, but after an appropriate measure formula has been found they too are relatively straightforward to deal with: you simply never sum them up. Semi-additive facts are more problematic.

Semi-additive facts can be summed but not across their non-additive dimension(s)

A semi-additive (SA) fact can be summed up some of the time but you can't sum it up all of the time. To be more precise: a semi-additive fact cannot be summed across at least one dimension: its non-additive dimension. For example, yesterday's STOCK LEVEL cannot be added to today's STOCK LEVEL. It is non-additive across the time dimension. But STOCK LEVEL is additive across other dimensions. It can be summed for all stores and/or all products (apples and pomegranates?) to give a correct total stock level, as long as the query is constrained to a single day—a single value of the non-additive dimension.

To fully document a semi-additive fact the SA fact code is used in conjunction with at least one NA dimension code

Semi-additive facts are fully documented by marking them as SA and their non-additive dimension(s) as NA. If there is a single semi-additive fact in a fact table, or if all semi-additive facts have the same non-additive dimension(s), this is sufficient. However, if there are multiple semi-additive facts with differing non-additive dimensions, the SA and NA codes are linked by numbering, to pair each SA fact to its NA dimension(s). For example, Figures 8-2 and 8-10 show the BEAM✲ table and matching enhanced star schema for STOCK FACT, a daily periodic snapshot of in-store inventory. Both show that STOCK LEVEL SA1 is non-additive across STOCK DATE KEY NA1, whereas ORDER COUNT SA2 is non-additive across PRODUCT KEY NA2. This semi-additive fact documentation can be used to correctly define measures in BI tools and some multidimensional databases. SQL does not natively understand that some numbers are semi-additive, which can cause averaging and counting issues for the unwary BI developer.

Averaging Issues
Semi-additive facts can be averaged but not by using AVG( )

Although semi-additive facts cannot be summed over their non-additive dimensions, they can often be averaged (carefully) over them. Unfortunately the SQL AVG() function may not be up to the job; for example, if stakeholders ask:

What was the average stock of Advanced Laptops in the SW region last week?
Design Patterns for High Performance Fact Tables and Flexible Measures 245

Figure 8-10: Periodic snapshot containing semi-additive facts

and the stock data is as follows: the product category “Advanced Laptop” contains two products: POMBook Air and POMBook Pro; the SW region contains 10 stores; each day last week, every SW store stocked 20 POMBook Airs and 60 POMBook Pros (let's keep it simple); last week had 7 days (like every other week).

Periodic semi-additive facts, such as balances, must use a time average which divides the total by the number of non-additive time periods in the query

AVG(Stock_Level) will return 40, which is the wrong answer to the stakeholders' question. 40 is the average of 60 and 20, which is what you get when half the data has a value of 60 and the other half has a value of 20. The AVG() function—the equivalent of SUM(Stock_Level)/COUNT(*)—sums up 70 store/day records with 20 laptops and 70 with 60 (5,600) and divides by the number of records (140). To get the correct average for a category in a region, you must not divide by the number of products (in the category) or the number of stores (in the region). Instead you must only divide by the number of non-additive dimensional values (7 days). The correct SQL for this is: SUM(Stock_Level)/COUNT(DISTINCT Stock_date). The correct answer is 800.

Averaging a semi-additive balance correctly requires you to understand the time granularity of its fact table. For a daily snapshot, an average is calculated by dividing by the number of distinct days: the number of non-additive time periods.
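The averaging pitfall can be reproduced end to end. The following sketch uses Python's sqlite3 with an invented STOCK FACT layout that mirrors the example data above (10 SW stores, 2 Advanced Laptop products, 7 days of stock at 20 and 60 units):

```python
import sqlite3

# Hypothetical daily stock snapshot matching the worked example in the text.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE stock_fact "
    "(stock_date TEXT, store TEXT, product TEXT, stock_level INT)"
)
rows = [
    (f"2011-06-0{d}", f"Store {s}", product, level)
    for d in range(1, 8)                       # 7 days
    for s in range(1, 11)                      # 10 SW stores
    for product, level in [("POMBook Air", 20), ("POMBook Pro", 60)]
]
con.executemany("INSERT INTO stock_fact VALUES (?, ?, ?, ?)", rows)

# AVG() divides by the number of records (140), not the number of days.
wrong = con.execute("SELECT AVG(stock_level) FROM stock_fact").fetchone()[0]

# The time average divides the total by the number of non-additive periods.
right = con.execute(
    "SELECT SUM(stock_level) * 1.0 / COUNT(DISTINCT stock_date) FROM stock_fact"
).fetchone()[0]
print(wrong, right)  # 40.0 800.0
```

The same SUM/COUNT(DISTINCT) formulation is how a BI tool's "time average" measure should be defined over any periodic snapshot.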

Counting Issues
Unique counts are semi-additive or non-additive facts

ORDER COUNT in Figures 8-2 and 8-10 is yet another example of a semi-additive fact that you must handle carefully. As long as queries are constrained to a single product, ORDER COUNT can be summed across days and locations to give a total number of unique orders. But if a query needs the total number of orders for the “Advanced Laptop” category it's in trouble, because it will over-count any orders that contain both POMBook Airs and POMBook Pros. Unfortunately, there is no way to get the correct answer from STOCK FACT.

Atomic-level fact tables are required to provide fully additive unique counts

Storing counts, such as ORDER COUNT or CUSTOMER COUNT, in a periodic snapshot can seem like a great idea for query efficiency (to save re-counting millions of records), but once they have been calculated to match the granularity of a snapshot they may not be as additive as you hope, often turning out to be semi-additive or non-additive when you try to sum them further. If so, the only way to calculate a correct unique count is to go back to the transactions and count them distinctly within the context of the query. The status counts SHIPPED and DELIVERED, in Figures 8-5 and 8-9, do not suffer from this problem because they count order item states uniquely at their atomic level of detail, whereas the event counts SHIPMENTS and DELIVERIES do, because they count shipments and deliveries aggregated to the order item level. If stakeholders want the total number of deliveries this month vs. last month they cannot get the answer from ORDER FACT [AS] using sum(Deliveries). Instead they need to use DELIVERY FACT [TF] to count(distinct Delivery_Numbers).

Think of degenerate dimensions as non-additive facts. They cannot be summed but can be counted distinctly to produce useful additive measures. For example, count(distinct Receipt_Number) provides an additive count of unique sales transactions/shopping baskets.
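The order-count trap can be shown with a few rows. This sketch (Python's sqlite3, invented table and column names) stores one row per order line; an order containing both laptop models is double-counted when per-product distinct counts are summed:

```python
import sqlite3

# Hypothetical atomic order detail: order 1001 contains both laptop models.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE order_item (order_number INT, product TEXT)")
con.executemany("INSERT INTO order_item VALUES (?, ?)", [
    (1001, "POMBook Air"), (1001, "POMBook Pro"), (1002, "POMBook Pro"),
])

# Summing pre-aggregated per-product order counts over-counts order 1001...
overstated = con.execute(
    "SELECT SUM(n) FROM (SELECT COUNT(DISTINCT order_number) AS n "
    "FROM order_item GROUP BY product)"
).fetchone()[0]

# ...whereas counting distinctly at the atomic level, within the context
# of the query, gives the true number of unique orders.
unique = con.execute(
    "SELECT COUNT(DISTINCT order_number) FROM order_item"
).fetchone()[0]
print(overstated, unique)  # 3 2
```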

Heterogeneous Facts Pattern


Heterogeneous products that are described differently are often measured in the same way

In Chapter 6, we discussed product dimension design for handling heterogeneous products, which involved moving large sets of exclusive (Xn) attributes into their own more efficient swappable subset dimensions. Thankfully, heterogeneously described products are often measured homogeneously; for example, a large retailer might sell everything from milk to DVD players, but it doesn't matter whether items are best described by fat content (“2% semi-skimmed”) or technical features (“Blu-ray recording”): they are all measured the same way—by quantity sold, revenue, cost, and margin, using the same sales fact table.

Problem/Requirement
Heterogeneous products that have heterogeneous facts can give rise to inefficient “one size fits all” fact table designs

However, in certain businesses—such as banking—heterogeneous products will have heterogeneous facts: very different ways of being measured. This can make fact table designs that attempt to provide an integrated view of the business very inefficient. For example, Figure 8-11 shows a small portion of a monthly account snapshot that will allow all major product types (checking, saving, mortgage, loan, and credit card accounts) to be analyzed. Unfortunately this “one size fits all” fact table will be very wide and sparsely populated. The dimensional keys and a small set of core facts that measure all account types (ACCOUNT BALANCE and TRANSACTION COUNT) would always be present, but the majority of the facts are marked as exclusive (Xn) with their validity based upon the defining characteristic PRODUCT CATEGORY [DC]. These will be null most of the time, making the table “fact rich but data poor”. Depending on the database technology used, the null facts may take up far less storage space than valid facts, but if there are hundreds of facts in total across all lines of business this design will still be extremely difficult to manage and likely to perform poorly.

Figure 8-11: Exclusive facts

Solution
Create a core fact table for cross-product analysis and custom fact tables for each exclusive fact set

Limit MONTHLY ACCOUNT SNAPSHOT to the common facts (and possibly a few frequently used specialist facts like INTEREST CHARGED and CHARGES) and create a small set of custom fact tables, one for each major product family, based on the exclusive fact sets in Figure 8-11. The core fact table will contain a row for every account each month, and the custom fact tables—such as MONTHLY CHECKING FACT and MONTHLY MORTGAGE FACT shown in Figure 8-12—would contain rows for their account types only. The custom fact tables will contain the common facts too, so that BI users don't have to query multiple fact tables.

Figure 8-12: Core and custom fact tables for exclusive facts

When heterogeneous products have many heterogeneous facts, even if they share
a common granularity, a monolithic fact table design may not be ideal. If the facts
come from different operational systems with different access methods and
maintenance cycles, separate core and custom fact tables will be easier to build
and maintain, and can be a better fit with BI user groups.

Factless Fact Pattern


Factless fact tables record events where there is nothing to measure except the event occurring

While some fact tables can have too many facts, others can contain none. Factless fact tables are used to track events where there is nothing to measure except the event occurrence itself. For example, SEMINAR ATTENDANCE FACT in Figure 8-13 contains one row for every prospect (and existing customer) who attends a sales seminar to hear about Pomegranate products. Prospects do not pay for the privilege, nor are the seminar costs notionally allocated to them, so there are no monetary facts. The only thing to measure is the number of people who attend, which can be obtained by simply counting the rows in the fact table.

Figure 8-13: Factless fact table

Factless fact tables can be used as coverage tables to record dimensional relationships in the absence of other events

Factless fact tables are also used as coverage tables to track dimensional relationships in the absence of other events; for example, a promotion coverage table that records products on promotion—regardless of sales—or a monthly healthcare eligibility snapshot that records the fact that a person is covered by a medical plan that month. Coverage fact tables are often used in combination with transaction fact tables to answer questions about what didn't happen (but should have); e.g., “Which products were promoted but didn't sell?” or “How many people were covered but didn't claim?”

If the number of non-events is not too high, a 0/1 fact can be added to count “what didn't happen”

To answer the “Who didn't attend but should have?” question about seminars, there is a case for making SEMINAR ATTENDANCE FACT a normal fact table by adding an ATTENDANCE fact. This would be 1 if an invited prospect attends and 0 for a “no show”. Normally fact tables don't record events that didn't happen, because there are just too many of them. Airlines don't record all the flights you didn't take today—even if you are their best frequent flyer. But in the case of sales seminars Pomegranate didn't invite the whole world, so the number of extra records for invitees that did not attend would be manageable.

A dummy additive fact (equal to 1) can be added to support aggregate navigation

A dummy fact (always equal to 1) can be added to factless fact tables to provide an additive fact that can be summed. This makes it easier to build aggregates of large factless fact tables that can be used “invisibly” by aggregate navigation (see Aggregation later in this chapter). The aggregate will have the same fact but it will hold values other than 1. Also, some BI tools only recognize a table as a fact table if it has at least one fact.
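The dummy-fact idea can be sketched in a few lines. This illustration (Python's sqlite3, invented table names) adds an ATTENDANCE fact that is always 1 in the base factless table, so that a per-seminar aggregate can simply sum it:

```python
import sqlite3

# Hypothetical factless attendance table with a dummy ATTENDANCE fact (= 1).
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE seminar_attendance_fact "
    "(seminar_key INT, prospect_key INT, attendance INT)"
)
con.executemany(
    "INSERT INTO seminar_attendance_fact VALUES (?, ?, 1)",
    [(1, 10), (1, 11), (2, 10)],
)

# An aggregate built by summing the dummy fact holds values other than 1,
# and answers "how many attended?" without counting base rows.
per_seminar = con.execute(
    "SELECT seminar_key, SUM(attendance) FROM seminar_attendance_fact "
    "GROUP BY seminar_key ORDER BY seminar_key"
).fetchall()
print(per_seminar)  # [(1, 2), (2, 1)]
```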

In Figure 8-13, the PRODUCT dimension is defined as a multi-level (ML) dimension and the PRODUCT KEY in SEMINAR ATTENDANCE FACT is marked as ML too, documenting that it makes use of the multi-level feature of the dimension. This design allows the star schema to record attendance for seminars that are single product launches and seminars that promote entire product categories.

Fact Table Optimization


Because fact tables are so large—accounting for the vast majority of the storage and
I/O activity of a dimensional data warehouse—it is essential to design them for
high performance. The techniques for optimizing fact table performance are
downsizing, indexing, partitioning, and aggregation.

Downsizing
Improve fact table performance by reducing row width

The first way to improve performance is to design fact tables that are as compact as possible without compromising their usability. The following checklist gathers together techniques for reducing fact table row width:

Use integer surrogate keys as dimensional foreign keys. Keep business keys in
dimensions.

Use date keys instead of datetime data types—especially if time is unused.

Reduce the number of dimension keys—combine small why and how dimen-
sions (see Chapter 9).

Move free text comments and lengthy sets of degenerate flags into their own
physical dimensions and replace them with short foreign keys (See Chapter 9).

Don’t store a large number of facts that can easily be calculated intra-record;
e.g., don’t store all the durations that can be calculated from a smaller number
of milestone timestamps.

Limit fact history to only the data that is useful for BI

The next thing to consider is the length of each fact table. You should try to limit history to what the BI users really need. Don't use fact tables as an expensive archival strategy. If the auditors need more history than the BI users, they should get that from the operational system of record, not the data warehouse. Regulatory requirements are not analytical requirements, so don't automatically load 20 years of transactional history just because it exists. If the business has changed substantially in that time, how far back can queries go and make valid comparisons? Also, the further back you go the harder it becomes to load the data, because data quality challenges tend to increase.

When modeling business events with stakeholders, ask for event stories describing the earliest when details that BI users will need to work with.

The most interesting data is the most recent. If you have years of history to load,
start with the current year and work backwards—partitioning can help to do this
efficiently. Don’t bother loading the oldest data until stakeholders ask for it.

Indexing
Create query indexes on foreign keys to support “star join optimization”

After you have done all that you can to control the size of a fact table, the next issue to consider is how to index it for query performance. Here you should seek your DBMS vendor's advice on defining some form of “star join index.” This generally involves creating a bitmap index on each dimensional foreign key—but techniques vary by DBMS and by version, with new data warehousing index strategies being added all the time (we hope).

More query indexes can improve BI performance but slow down ETL

Whatever indexing techniques you use, there is inevitably a trade-off between query performance and ETL processing time. Your priorities should be heavily biased towards query performance—but BI users can only query what you can manage to load in the available time—so index thoughtfully!

Accumulating and period-to-date snapshots also need an ETL update index

In addition to query indexes, accumulating snapshots and period-to-date (PTD) periodic snapshots need an ETL index to support efficient updates. This will be an OLTP-style unique index using GD columns such as the ORDER ID degenerate dimension in CUSTOMER ORDERS. Transaction fact tables and most periodic snapshots are insert-only, so they do not require a unique index, as long as ETL processes can guarantee fact uniqueness.

If a fact is frequently used for ranking or range banding you should consider
indexing it to speed up sorting, and joining to a Range Band dimension (described
in Chapter 9).

Partitioning
Large fact tables can be partitioned on date key ranges

Partitioning allows large tables to be stored as a number of smaller physical datasets based on value ranges. If your DBMS supports table partitioning you should consider partitioning large fact tables on the surrogate key of their primary date dimension. Partitioning on date can be made simpler by carefully designing your calendar dimension surrogate keys (see Chapter 7, Date Keys for details). Partitioning has a number of benefits for ETL, query performance and administration:

Loading into empty partitions speeds up ETL processing. Partition swapping enables 24/7 BI access

ETL performance: Partitions with local indexes that can be dropped and rebuilt independently allow ETL processes to use bulk/fast mode loads into an empty partition while they are un-indexed. If only the most recent partitions of accumulating and PTD snapshots are being updated, unique update indexes (that are used for ETL, not queries) can be dropped on historic partitions. Partition swapping allows ETL to update the data warehouse while queries continue to run.

Partitions can be truncated to rapidly delete unneeded history

Fact table pruning: Many fact tables need a fixed amount of history (24 months, 36 months). Monthly partitions allow older data to be efficiently removed by truncating a partition rather than row-by-row deletion of millions of records.

Un-indexed “hot partitions” can support real-time ETL inserts

Real-time support: Fact tables that need to be refreshed frequently throughout the day can be implemented using real-time “hot partitions”. These are special un-indexed in-memory partitions that are trickle-fed from the operational source. During the day queries use these like any normal partition, and at night their data is merged with the fully indexed historical partitions.

Query optimizers can use partition pruning and parallel access to speed up certain queries

Query performance: DBMS optimizers will ignore partitions that are outside of a query's date range, and some can read multiple partitions in parallel. But splitting a table into too many small partitions can also hurt performance, especially for broad queries that must “stitch” many partitions together. This can be avoided by creating aggregates to answer the broad queries.

For more information on real-time partitions and ETL processing see:

The Data Warehouse Toolkit, Second Edition, Ralph Kimball and Margy Ross (Wiley, 2002), pages 135–139.
The Data Warehouse ETL Toolkit, Ralph Kimball and Joe Caserta (Wiley, 2004), Chapter 11, “Real-Time ETL Systems.”

Some DBMSs allow you to partition on more than one dimension. This can be
useful when a particular dimension is frequently used to constrain queries or
represents the way source data extracts are organized for ETL processing; for
example, by organization, geography, or data provider.
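As a concrete illustration, here is roughly how date-range partitioning and partition pruning might look in PostgreSQL-style declarative DDL. The table, columns, and smart date key values (YYYYMMDD integers) are invented for the example, and the exact syntax varies considerably by DBMS and version:

```sql
-- Hypothetical PostgreSQL-style range partitioning on a date surrogate key.
CREATE TABLE sales_fact (
    date_key     integer NOT NULL,
    product_key  integer NOT NULL,
    revenue      numeric(12,2)
) PARTITION BY RANGE (date_key);

CREATE TABLE sales_fact_2011_q4 PARTITION OF sales_fact
    FOR VALUES FROM (20111001) TO (20120101);

-- Fact table pruning: drop the oldest partition
-- instead of deleting millions of rows one by one.
DROP TABLE sales_fact_2011_q4;
```

Smart integer date keys make the range boundaries readable, which is one reason Chapter 7 recommends designing calendar surrogate keys carefully.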

Aggregation
Aggregates act as group by indexes for existing fact tables

An aggregate (AG) fact table is a stored summary of a base fact table. It acts like a group by index on the base facts—speeding up queries that do not need to return detailed figures. Aggregates are an essential complement to traditional where clause indexes. A star-join index optimizes highly constrained queries that need to summarize smaller quantities of data, whereas aggregates optimize broad, loosely constrained queries that need to summarize large quantities of data. Aggregates are derived fact tables that are very similar to periodic snapshots, dimensionally and in terms of granularity. They differ from periodic snapshots in that they do not provide any new facts. Instead, they simply contain summarized versions of the additive facts from base fact tables.

DBMS aggregate navigation automates aggregate usage

Historically, data warehouse queries were written to use specific aggregates in the form of summary data marts. Today, many DBMSs provide aggregate navigation that automatically redirects queries to the best (smallest) aggregate. When this happens, the aggregates are invisible to the BI users and query tools.

Small high performance aggregates can be designed using the lost, shrunken and collapsed patterns

Aggregates must be designed so that they match the GROUP BY and WHERE clauses of the most popular queries, or they will not be used. They also must be designed so that they are many times smaller than existing fact tables—to provide the performance improvements that justify the cost of maintaining them. Twenty times smaller is a useful guideline—which can lead to a corresponding query performance boost. The three types of aggregate design in a dimensional data warehouse are lost, shrunken, and collapsed.

Lost Dimension Aggregate Pattern


Lost dimension aggregates are created by summarizing a fact table using a subset of
its dimensions. Figure 8-14 shows a lost dimension aggregate formed by dropping
the customer and store dimensions.

Figure 8-14: Lost dimension aggregate

Lost dimension aggregates are the easiest aggregate type to build, because no dimensional joins are needed; for example, a lost aggregate can be built by:

CREATE MATERIALIZED VIEW Daily_Product_Sales AS
SELECT Date_Key, Product_Key, SUM(Revenue) AS Revenue
FROM Sales_Fact
GROUP BY Date_Key, Product_Key

At least one of the lost dimensions must be a granularity dimension (GD)—part of the fact table granularity—for the aggregate table to be smaller than its base fact table. Choose which dimensions to drop wisely, so that the aggregate is sufficiently smaller but can still be used to answer a broad range of queries.

Shrunken Dimension Aggregate Pattern


Shrunken aggregates use rollup dimensions

Shrunken dimension aggregates are created by summarizing a fact table using one or more shrunken or rollup (RU) dimensions instead of the base dimensions. Figure 8-15 shows a shrunken dimension aggregate that was formed by rolling up Dates into Months, and Stores into Regions.

Figure 8-15: Shrunken dimension aggregate

Aggregates can shrink and lose dimensions

Notice that the Customer dimension has been dropped. This is not uncommon for shrunken dimension aggregates—because dropping the most granular dimension is often needed to significantly reduce the aggregate's size. Sales by Month, Region, Product, and Customer would contain nearly as many rows as the base fact table—thereby negating its performance benefits.

Shrunken dimension aggregates are more complex to build, requiring additional rollup dimensions and dimensional joins, and more difficult to maintain using incremental refresh. But they can be designed to satisfy a broad range of queries.

Materialized views can be used to build shrunken aggregates and their matching rollup dimensions

Materialized views can be used to build aggregates and their matching rollup dimensions. When building rollup dimensions, you can reuse their base dimension keys to create rollup keys if you carefully select the first or last key value that matches the rollup dimension granularity. For example, you can use the DATE KEY of the last day of each month as the MONTH dimension's MONTH KEY, or use the STORE KEY of the first store in a region as the REGION dimension's REGION KEY. The actual surrogate key value selected does not matter, as long as it is used consistently in the rollup dimension.
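The rollup-key trick can be sketched in SQL. The following illustration (run here through Python's sqlite3, with an invented date dimension using smart YYYYMMDD keys) derives each MONTH KEY as the last DATE KEY of its month:

```python
import sqlite3

# Hypothetical date dimension with smart integer surrogate keys (YYYYMMDD).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE date_dim (date_key INT, month_name TEXT)")
con.executemany("INSERT INTO date_dim VALUES (?, ?)", [
    (20110130, "2011-01"), (20110131, "2011-01"),
    (20110227, "2011-02"), (20110228, "2011-02"),
])

# The MONTH rollup dimension reuses the DATE KEY of the last day in each
# month as its MONTH KEY, so no new surrogate keys need to be generated.
month_dim = con.execute(
    "SELECT MAX(date_key) AS month_key, month_name "
    "FROM date_dim GROUP BY month_name ORDER BY month_name"
).fetchall()
print(month_dim)  # [(20110131, '2011-01'), (20110228, '2011-02')]
```

The same MAX(date_key) expression can be used in the aggregate's GROUP BY so the shrunken fact table and its rollup dimension stay in step.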

Collapsed Dimension Aggregate Pattern


Collapsed aggregates are pre-joined aggregates

Collapsed dimension aggregates are created by summarizing a fact table using selected dimensional attributes, and storing the facts and the dimensional attributes in a single, denormalized summary table. Figure 8-16 shows a collapsed dimension aggregate for Sales by Quarter and Product Type.

Figure 8-16: Collapsed dimension aggregate

Collapsed dimension aggregates can offer additional query acceleration because the dimensions and facts are pre-joined. However, if many attributes are included the increased record length will make the table too large. An aggregate might have 20 times fewer records than its base fact table, but if it is three times wider it will not deliver sufficient performance improvements to justify itself.

Aggregation Guidelines
The following guidelines will help you get a good set of aggregates in place:

Budget up to a 100% overhead for aggregate storage and ETL processing.

Create aggregates that are approximately 20 times smaller than their base fact tables. Spread aggregates by designing aggregates of aggregates (400 times smaller than the base fact tables).

Use (fast refreshable) materialized views whenever possible to build aggregates, and enable DBMS aggregate navigation and query rewrite features.

Design invisible aggregates that the DBMS will automatically redirect queries
to. Don’t allow BI users, reports, or dashboards to become directly dependent
on an aggregate. Hide them from query and reporting tools.

Trust base star schemas to handle highly constrained queries using star join
indexes—and focus aggregates on addressing broad summary queries.

Monitor aggregate utilization, drop those that are seldom used, and add new
aggregates as query patterns change.

Make sure that you initially build aggregates that will speed up comparisons
against budgets, targets, and forecasts. These are the most obvious quick-win
aggregates.

Mastering Data Warehouse Aggregates by Christopher Adamson (Wiley, 2006) provides definitive advice on designing, building, and using invisible aggregates within a dimensional data warehouse.

Drill-Across Query Pattern


Problem/Requirement
Cross-process analysis requires queries to access multiple fact tables

As new star schemas and business processes are added to the data warehouse, BI users' questions will inevitably become more sophisticated because they will want to perform cross-process analysis. When they do, it's important to understand how their queries should access multiple fact tables to compare and combine measures. For example, Figure 8-17 shows two HR processes—salary payments and absence/leave tracking—that need to be compared to answer the question: “Which employees were highly paid but were frequently absent in 2011?”

Figure 8-17: Querying multiple fact tables

Joining two fact tables—don't try this at home!

Because the two fact tables SALARY FACT and ABSENCES FACT share conformed EMPLOYEE and CALENDAR dimensions, it appears straightforward to join them using their common surrogate keys, as in the following query:

SELECT Employee_ID, Employee, SUM(Salary), SUM(Absence)
FROM Salary_Fact s, Absences_Fact a, Calendar c, Employee e
WHERE s.Employee_Key = e.Employee_Key
AND a.Employee_Key = e.Employee_Key
AND s.Date_Key = c.Date_Key
AND a.Date_Key = c.Date_Key
AND c.Year = 2011
AND e.Employee_ID = '007'
GROUP BY Employee_ID, Employee ORDER BY 3

While the above SQL appears perfectly valid, it will not produce the correct totals for James Bond—or for any other employee, even if the “007” constraint were removed.

Queries that attempt to directly join fact tables using single SQL SELECT statements can overstate the facts!

Report 3: 2011 Employee Analysis, in Figure 8-18, shows the results of the previous query—but first take a look at the two smaller reports that preceded it. Report 1 shows that employee James Bond has received three salary payments totalling £160,000. Report 2 shows that he has been absent 6 days. Now look at Report 3. It shows that James earned £320,000 and was absent 18 days. Something is clearly not right here: his salary has doubled and his absences have tripled!

Figure 8-18: Overstating the facts

Joining across a M:M relationship causes over-counting because SQL joins first, then aggregates the “too many rows” created by the join

This over-counting is known as the “many to one to many problem”, “fan trap” or “chasm trap”. It occurs when the tables being joined have a M:M relationship. SQL has to evaluate the WHERE clause, which performs the joins, ahead of the GROUP BY clause, so in the example the many Bond salaries (3) are joined to the many Bond absences (2), creating too many rows, which are then summed up. Even if the fact tables have a 1:M relationship, any facts from the 1 side of the relationship will be overcounted. This is an insidious problem because the aggregation that's inherent in most BI queries will hide the “too many rows”. The only totally safe join between fact tables is when there is a 1:1 relationship. This is very rare and hard to guarantee. Even then performance can be poor when millions of facts are joined.

Solution
Multiple fact tables should be accessed using drill-across queries that issue multi-pass SQL

BI applications can avoid the M:M problem by performing drill-across queries. Drilling across means lining up measures from different business processes using conformed row headers. A drill-across query does this by issuing multi-pass SQL: sending separate SELECT statements to each star schema. These separate queries aggregate the facts to the same conformed row-header level before they are merged to produce a single answer set. Drilling across would provide the correct answer to Report 3 by running a query to summarize salaries by Employee ID and another to summarize absences by Employee ID, and then merging (full outer join) the two correctly aggregated answer sets.
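The difference between the single-pass join and drilling across can be reproduced with a handful of rows. This sketch uses Python's sqlite3 and simplified fact tables keyed directly on an employee ID (3 salary payments totalling 160,000 and 2 absence events totalling 6 days, as in the James Bond reports); the merge here is an inner join because only one employee is present, whereas a real drill-across merge would be a full outer join:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE salary_fact (employee_id TEXT, salary INT)")
con.execute("CREATE TABLE absences_fact (employee_id TEXT, absence INT)")
con.executemany("INSERT INTO salary_fact VALUES (?, ?)",
                [("007", 50000), ("007", 50000), ("007", 60000)])
con.executemany("INSERT INTO absences_fact VALUES (?, ?)",
                [("007", 2), ("007", 4)])

# Single-pass join: 3 salary rows x 2 absence rows = 6 rows,
# so both facts are overstated when the GROUP BY sums them.
wrong = con.execute(
    "SELECT SUM(s.salary), SUM(a.absence) FROM salary_fact s "
    "JOIN absences_fact a ON s.employee_id = a.employee_id"
).fetchone()
print(wrong)  # (320000, 18)

# Drill-across: aggregate each star separately to the conformed
# row header (employee_id), then merge the small answer sets.
right = con.execute(
    "SELECT s.employee_id, s.total_salary, a.total_absence FROM "
    "(SELECT employee_id, SUM(salary) AS total_salary "
    " FROM salary_fact GROUP BY employee_id) s "
    "JOIN "
    "(SELECT employee_id, SUM(absence) AS total_absence "
    " FROM absences_fact GROUP BY employee_id) a "
    "ON s.employee_id = a.employee_id"
).fetchone()
print(right)  # ('007', 160000, 6)
```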

Multi-pass SQL summarizes fact tables one at a time, then joins the results

Drill-across or multi-pass query support is a key feature of BI tools. It helps to manage query performance by keeping individual queries simple. By accessing fact tables one at a time, the queries can be optimized as star joins and take advantage of aggregate navigation. They may also be run in parallel by the DBMS.

Choose BI tools that have drill-across/multi-pass functionality. As an alternative to multi-pass, some tools will generate multiple inline views within a single query.

Drill-across enables distributed data warehousing. Stars can be placed on different DBMSs

Drill-across also supports distributed data warehousing. You can scale a dimensional data warehouse by placing star schemas and OLAP cubes on multiple database platforms in multiple locations. Multi-pass queries allow these to be accessed as a single data warehouse. Distributed data warehouses can use different hardware, operating systems, and DBMSs for each database server—as long as they contain stars or cubes with conformed dimensions that can be queried by a common BI toolset using drill-across techniques.

As a general rule, fact tables shouldn’t be directly joined. Most fact tables have a
1:M or M:M relationship, which results in the facts being overstated when meas-
ures are calculated. Instead they should be queried by drilling across.

Drill-across queries work well for summary-level process comparisons

Drill-across works very well when queries need to combine summarized facts; for example, when business processes are compared at a monthly or quarterly level, individual multi-pass queries will access millions of facts, but the answer sets will be aggregated to conformed row-header levels before they are returned, and BI tools will only have to merge reports' worth of data—a few hundred rows.

Consequences
Drill-across queries become inefficient when cross-process analysis involves atomic-level comparisons.

However, drill-across doesn’t work for every type of cross-process or multi-event analysis. For example, Figure 8-19 shows the sad state of a BI user who is trying to compare orders and shipments. He is trying to ask questions such as “What was the average delay on shipping an order item over the last six months?” and “How many unshipped items are there YTD this year vs. last year?” but his queries never seem to finish, or perhaps even start. The problem is that these questions require individual line items from each transaction fact table to be compared before they are aggregated. This can result in multi-pass SQL that returns millions of rows that the BI tools must attempt to merge. Even when smart BI tools can construct the correct in-database joins, performance can still be poor.

Figure 8-19
Unhappy BI user:
difficult drill-across
analysis
258 Chapter 8 How Many

Derived Fact Table Patterns


Problem/Requirement
Missing evolving events cause BI pain and suffering.

The unhappy user’s comparison problems, in Figure 8-19, are not so much drill-across limitations, as poor or missing design. Orders and shipments are not discrete events that can be fully analyzed in isolation using transaction fact tables. They are evolving event milestones that constantly need to be compared to each other and to deliveries, returns and payments to provide key measures of process performance. Ad-hoc queries shouldn’t have to try to join these events together every time, especially if there are complex M:M relationships between them.

Solution
Figure 8-20 shows what the user really needs: an orders accumulating snapshot
that can be queried using simple single-pass SQL. Following the agile approach to
Developing Accumulating Snapshots, outlined earlier in this chapter, this snapshot is
delivered as a derived fact table (DF), by merging the two existing order and
shipments transaction fact tables.

Figure 8-20
Happy BI user:
derived fact table
to the rescue

Derived fact tables solve difficult BI with simple ETL rather than complex SQL.

Derived fact tables are built from existing fact tables to simplify queries. They use additional ETL processing and DBMS storage, rather than more complex BI and SQL, to answer difficult analytical questions. In addition to aggregates, there are three other types of derived fact table: sliced, pivoted, and merged.

A sliced fact table contains a subset of a base fact table.

Sliced fact tables contain subsets of base fact tables; for example, UK sales derived from a global sales fact table. Sliced fact tables can support restricted row-level access and data distribution needs as well as enhanced performance for users who only need a subset of the data. They are often used in conjunction with swappable dimensions (SD) that contain matching subsets of dimensional values.
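A sliced fact table can be sketched as nothing more than a filtered copy of its base fact table, created by ETL on each load. The minimal example below (invented schema, with Python's sqlite3 as a runnable stand-in for the warehouse DBMS) derives a UK slice from a global sales fact:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Base fact table (illustrative schema and data)
    CREATE TABLE global_sales_fact (country TEXT, revenue REAL);
    INSERT INTO global_sales_fact VALUES ('UK', 10.0), ('US', 25.0), ('UK', 5.0);

    -- Derived sliced fact table: same grain and columns, UK rows only
    CREATE TABLE uk_sales_fact AS
    SELECT * FROM global_sales_fact WHERE country = 'UK';
""")

rows = conn.execute("SELECT COUNT(*), SUM(revenue) FROM uk_sales_fact").fetchone()
print(rows)  # (2, 15.0)
```

Because the slice keeps the base table's grain and columns, it can be queried by exactly the same star queries, just over less data.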

A pivoted fact table transposes base fact table rows into fact columns.

Pivoted fact tables transpose row values in a base fact table into columns; for example, a fact table with nine facts derived from a base transaction fact table with a single fact that records nine transaction types. Pivoted fact tables make fact comparisons and calculations simpler. The same rows-to-columns approach can also be used to create bitmap dimensions (see Chapter 9) that support combination constraint queries.
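The rows-to-columns transposition can be sketched with conditional aggregation, one common way (though not the only way) to implement a pivot. The account/transaction-type schema below is invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Base transaction fact: a single AMOUNT fact, many transaction types
    CREATE TABLE txn_fact (account TEXT, txn_type TEXT, amount REAL);
    INSERT INTO txn_fact VALUES ('A1', 'Deposit',    100.0),
                                ('A1', 'Withdrawal',  30.0),
                                ('A1', 'Fee',          5.0),
                                ('A2', 'Deposit',     80.0);

    -- Derived pivoted fact table: one row per account, one column per type
    CREATE TABLE txn_pivot_fact AS
    SELECT account,
           SUM(CASE WHEN txn_type = 'Deposit'    THEN amount ELSE 0 END) AS deposits,
           SUM(CASE WHEN txn_type = 'Withdrawal' THEN amount ELSE 0 END) AS withdrawals,
           SUM(CASE WHEN txn_type = 'Fee'        THEN amount ELSE 0 END) AS fees
    FROM txn_fact
    GROUP BY account;
""")

# Fact comparisons are now simple column arithmetic, not self-joins
rows = conn.execute(
    "SELECT account, deposits - withdrawals - fees FROM txn_pivot_fact ORDER BY account"
).fetchall()
print(rows)  # [('A1', 65.0), ('A2', 80.0)]
```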

A merged fact table combines multiple base fact tables, summarized to a common granularity.

Merged fact tables combine facts and dimensions from two or more base fact tables, summarized to a common granularity; for example, a fact table that combines targets with summarized actual sales, or an accumulating snapshot derived from milestone transaction fact tables. Merged fact tables simplify cross-process analysis by replacing complex drill-across queries and expensive joins with single star queries.

DF: used as a table code to identify a derived fact table constructed from one or
more existing fact tables. Used as a column code to identify derived facts that can
be calculated (possibly in a view) from other facts.

DW/BI retrospectives should re-examine the design periodically, to see if additional ETL or derived fact tables can simplify difficult queries.

Data warehouse designs routinely fail to take full advantage of derived fact tables. Often there is a false impression that once the fact tables on the matrix have been loaded: “That’s the data stored now, the major ETL development is done, and everything from here on out is BI”. This can leave BI users and developers struggling to answer increasingly complex business questions. In all forms of agile development, project teams hold end of sprint meetings, known as retrospectives, to discuss what was successful and what could be improved. For DW/BI, retrospectives should include BI developers sharing their most common reporting complexities with the team to see whether these queries can be simplified by derived fact tables and other ETL enhancements.

Consolidated data marts are the periodic equivalent of accumulating snapshots.

Merged fact tables are often referred to as consolidated data marts when they are used to combine and summarize facts from several different business processes on a periodic basis. These “one-stop shop” data marts are incredibly popular with stakeholders because they provide high performance fact access in a format suitable for simpler BI tools. Common consolidated data marts include:

Customer relationship management data marts that provide a so-called “360° customer view” by summarizing measures from all the individual fact tables that relate to “customer touch points”.

Profitability data marts that combine revenue with all the elements of cost, to support product or service profitability analysis.

Consequences
Don’t attempt to build consolidated data marts before you have loaded atomic detailed star schemas.

There is often pressure from business stakeholders to dispense with the details and build highly summarized consolidated data marts directly from operational sources to provide “quick win” key performance indicator (KPI) dashboards. Unfortunately, data marts that summarize many different business processes and consolidate multiple operational sources are literally the last thing you should build. Apart from the ETL risks, the lack of detail rapidly undermines confidence in the KPIs when users cannot drill down deep enough to explain the figures and view actionable information. Instead, consolidated data marts should be developed incrementally as derived fact tables: derived from atomic-level fact tables.

Summary
Transaction fact (TF) tables record the atomic-level, point-in-time facts associated with discrete
events.

Periodic snapshots (PS) provide additional atomic-level facts by sampling continuous business
processes and new aggregated facts by summarizing atomic transactional facts at regular
intervals.

Accumulating snapshots (AS) bring together the milestone events of a business process and combine their transactional facts to provide additional performance measures.

Both periodic and accumulating snapshots provide high performance access to measures that
would be impossible or impractical to calculate at query time, from atomic transaction fact
tables alone.

Apart from its type, the most important definition of a fact table is its granularity, which must
precisely state, in business terms or dimensionally (GD), the meaning and uniqueness of each
fact table row.

Event timelines are used to visually model the milestone events and duration measures of
evolving events that can be implemented as accumulating snapshots.

Accumulating snapshots that need to be sourced from multiple operational systems or contain
repeating milestones (with 1:M or M:M relationships) should be developed incrementally—by
first implementing transaction fact tables for their individual milestone events.

Fact additivity describes any restrictions on how a fact can be summed to produce a meaningful
value. Fully additive (FA) facts can be summed with no restrictions, using any combination of
available dimensions. Semi-additive (SA) facts must not be summed across their non-additive
(NA) dimension(s). Non-additive (NA) facts must not be summed.

Fact tables can be optimized by appropriate downsizing, indexing, partitioning and aggregation.

Cross-process analysis should be handled by drilling-across multiple fact tables one at a time
using multi-pass SQL or by building derived fact (DF) tables that merge commonly compared
fact tables.
WHY AND HOW
9
Dimensional Design Patterns for Cause and Effect

There is occasions and causes why and wherefore in all things.


— William Shakespeare (1564–1616), "King Henry V", Act 5, scene 1

How am I doing?
— Ed Koch, Mayor of New York 1978–1989

Why and how dimensions are closely linked: they describe cause and effect.

Some of the most valuable dimensions in a data warehouse attempt to explain why and how events occur. Why dimensions are used to describe direct and indirect causal factors. They are often closely linked to the how dimensions that provide all the remaining event descriptions that are not related to the major who, what, when and where dimension types. Together why and how represent cause and effect and complete the 7W dimensional description of a business event.

This chapter describes why and how dimension design patterns.

In our final chapter we cover dimensional design patterns for describing how events occur and why facts vary. We focus particularly on bridge table patterns for representing multiple causal factors and multi-valued dimensions in general. We describe how bridge table weighting factors are used to preserve atomic fact granularity and avoid ETL time fact allocations. We also describe how bridge tables can be augmented with multi-level dimensions and pivoted dimensions to efficiently handle barely multi-valued reporting and complex combination constraints. We conclude with step, range band and audit dimension techniques for analyzing sequential events, grouping by facts and handling ETL metadata.

Chapter 9 Design Challenges At a Glance

Direct and indirect causal factors
Attributing multiple causes to a fact
Dealing with barely multi-valued dimensions efficiently
Handling complex combination constraints
Understanding sequential behavior
Range band reporting
Tracking data quality and lineage
262 Chapter 9 Why and How

Why Dimensions
Why details become causal dimensions that help to explain why facts occur in the way they do.

The why details of an event become causal dimensions, such as promotion, weather, or just reason. Causal dimensions explain why business events occur when they do, in the way that they do. They describe what stakeholders believe are the influential factors for a business event; for example, price discounts driving up sales transactions, or storms triggering home insurance claims. Causal factors fall into two categories: direct and indirect.

Direct causal factors have a recorded influence on the facts.

Promotional discounts are examples of causal factors that are directly related to the facts. You know with absolute certainty when they are or are not related to a sale because the promotional code (or discounted product code) and the discounted price are recorded or not recorded as part of the sale transaction.

Indirect causal factors may or may not have influenced the facts.

Other causal factors—such as weather conditions, sporting events, or advertisement campaigns—are only indirectly related to facts. Stakeholders may know that these took place at the same time, in the same location as the facts they want to measure, but they can only speculate that they had an effect on them.

Causal factors can be internal: under the control of the organization, or external: beyond its control.

Causal factors can also be described as external or internal. Weather and sporting events are examples of external causes that an organization has no control over (unless it is sponsoring the sporting event), whereas price discounts and advertising are examples of internal causal factors which the organization does control. Some internal causes—like seminars, sales calls and advertising—can be significant business events in their own right, and warrant dedicated fact tables to analyze their associated costs and activities by who, what, where, and when. In these cases causal dimensions may be conformed across multiple cause and effect star schemas that typically represent process sequences.

Internal Why Dimensions


PROMOTION is an internal why dimension that can contain a mixture of direct and indirect causal attributes.

Figure 9-1 shows a simple PROMOTION dimension. This is an internal why dimension that typically contains a combination of discount, display and advertising descriptions. These are a mixture of direct and indirect causal factors. DISCOUNT TYPE is a direct causal factor captured on every transaction along with a DISCOUNT amount fact ($0.00 when there is no discount). Advertising attributes such as CHANNEL are indirect causal factors if there is no way to know for sure that customers saw the adverts. However, if the DISCOUNT TYPE is “Coupon” or “Discount Code” and an advert contains the information that the customer must supply at the point of sale, then it becomes a direct causal factor.

The special “No Promotion” record (PROMOTION KEY zero) will be the most used
record in the PROMOTION dimension, if most products are not on promotion
every day.
Dimensional Design Patterns for Cause and Effect 263

Figure 9-1
PROMOTION
dimension

Indirect causal values are often more difficult to source than direct causal values.

While a PROMOTION dimension may be a small dimension, with only a few hundred promotional condition combinations, it can be challenging to build and assign to the facts because of its mix of direct and indirect causal factors. Direct causal factors are usually straightforward to assign because they are captured by the operational system but many of the interesting indirect causes may not be. For example, a sales system will not (reliably) record whether discounts are also promoted by TV ads or special (in-store or on-website) product displays because this information is not needed to complete each sale transaction and print a valid invoice/receipt (which must show any direct discount details). A richly descriptive promotion dimension will require this information to be sourced from elsewhere—typically from less formal data sources, like spreadsheets and word processing documents, and its ETL processing will need to be sophisticated enough to assign the full combination of promotional conditions correctly.

The DW/BI team may have to build small data entry applications to capture causal
descriptions and timetables when this information is “known to business but not
known to any system.”

PROMOTION may be conformed across sales and promotion cost star schemas.

If BI users need to analyze promotion return on investment (ROI), the data warehouse will need an additional Promotion Spend Fact table—using the same conformed PROMOTION dimension. BI users can then run drill-across queries against both PROMOTION SPEND FACT and SALES FACT to compare promotion costs to sales revenue uplift.

Unstructured Why Dimensions


Direct causal factors are often captured as free-format text reasons. These non-additive text facts should be removed from fact tables and placed in text dimensions.

In some cases valuable direct causal details are attached to transactions as unstructured comments. These potentially large textual columns should be removed from fact tables and placed in separate dimensions, to maximize fact table performance for the majority of queries that just want to rapidly aggregate the additive facts. The resulting text dimensions are why dimensions. Figure 9-2 shows an example COMMENT dimension that contains reasons that salespeople have given for varying the price for specific customers. This could be turned into a better why dimension by adding additional attributes that codify the free-format text reasons based on interesting keywords. Embellishing the table with low cardinality sets of descriptive tags would provide better report row headers and more consistent filters.
264 Chapter 9 Why and How

Figure 9-2
COMMENT
Dimension

In the absence of structured why details, a simple COMMENT dimension will still allow BI users to search events using causal keywords and display comments on reports when they find exceptional transactions. COMMENT dimensions can be improved in future iterations by adding additional attributes and using “text-mining” ETL routines to tag comments.
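Such a tagging routine can start very simply. The sketch below (the keywords, tags, and comments are all invented for illustration) derives a single low cardinality reason tag from free-format comments by keyword matching; real text-mining ETL would be considerably more sophisticated:

```python
# Hypothetical keyword-to-tag rules for codifying free-format price-change comments
REASON_TAGS = {
    "match":   "Competitor Price Match",
    "loyal":   "Customer Loyalty",
    "damaged": "Damaged Goods",
    "bulk":    "Volume Discount",
}

def tag_comment(comment: str) -> str:
    """Return a low cardinality tag for a free-format comment (first keyword wins)."""
    text = comment.lower()
    for keyword, tag in REASON_TAGS.items():
        if keyword in text:
            return tag
    return "Other"  # consistent fallback row header for untagged comments

print(tag_comment("Price matched against BlueBerry store offer"))  # Competitor Price Match
print(tag_comment("Very loyal customer, 10 years"))                # Customer Loyalty
print(tag_comment("Goodwill gesture"))                             # Other
```

The resulting tag column gives reports a small, consistent set of row headers while the original comment text remains available for display and keyword search.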

External Why Dimensions


WEATHER is an external why dimension. It could be added to any fact table with a time and location granularity that matches the weather feed.

Figure 9-3 shows an example WEATHER dimension. This is an external why dimension. It does not attempt to document every distinct temperature and weather condition—it contains general weather descriptions that are useful for reporting and analyzing non-weather events, not weather facts. Because weather is dependent on time and location, any event with a where detail can potentially include weather as a dimension; for example, events that might be usefully analyzed by weather include product sales, travel reservations, seminar attendance, insurance claims, and television viewing. Adding a weather dimension requires an external feed that matches the event’s time and location granularity.

Figure 9-3
WEATHER
Dimension

Causal dimensions—such as weather—that do not alter the granularity of fact tables can be added later when a reliable external data source has been found.

Multi-Valued Dimensions
Causal factors are multi-valued dimensional attributes where there is more than one cause of the same type for a fact.

One of the challenges of causal dimensions—especially with external indirect causes—is that there may be more than one cause for any given fact. For example, Figure 9-4 shows an EVENT CALENDAR table that documents several sporting events that may have influenced product sales in July 2010. This table makes it easy for BI users to answer the question: “How much did we make during the World Cup?”, because they don’t have to remember the dates, just pick the single event from a drop-down list. As such this table is WHERE clause friendly and could be used to store other event types with dynamic date ranges for which consistent date range filters would be useful; for example, “Business” events like “Last 60 days, Current year”, “Last 90 days, Current Year” and the same ranges from the previous year.

Weighting Factor Pattern


Problem/Requirement
Grouping by a multi-valued attribute can cause over-counting.

If BI users want to group (rather than filter) a report by the multiple sporting events they can get into trouble because many of these sporting events overlap. They must be careful how they interpret a report that shows $30M sales during the World Cup and $10M sales during Wimbledon. They must not add totals that overstate the sales. The business has not made $40M because Wimbledon took place during the World Cup. This problem arises because EVENT is a multi-valued dimensional attribute—it can have more than one value for a single atomic-level fact like a customer product purchase.

Figure 9-4
Event Calendar

Solution
Over-counting can be avoided by providing a weighting factor.

For BI users who need to group by multiple events, Figure 9-5 shows an alternative version of the sporting schedule that is more GROUP BY friendly. EVENT DAY CALENDAR stores each (sporting) event and date combination—a 14 day event like Wimbledon will be stored as 14 rows. This may appear a little wasteful but it has two benefits:

Each date/event combination can be given a weighting factor to allow facts to be allocated amongst the multiple events that occur on the same day.

It simplifies the fact table join—this is now a simple inner join on the single
EVENT DATE KEY just like a standard CALENDAR dimension instead of a
BETWEEN join on START DATE KEY and END DATE KEY.

Figure 9-5
Event Day Calendar

Weighting factors for each multi-valued group (e.g., all the events on a day) must total to 1.

If you take a look at the example data in Figure 9-5 you will see that the weighting factors for any one date add up to 1 (100%). For example, on June 21 2010 both the World Cup and Wimbledon are taking place so they both receive 50% of the sales activity by giving them a weighting factor of 0.5 (50%). Whereas on July 14 2010 the Tour De France is the only significant sporting event taking place (perhaps this is the only event that Pomegranate has sponsored that day) so it gets a weighting factor of 1 (100%). Now when sales are grouped by EVENT, sales revenue can be “correctly weighted” by multiplying each atomic revenue fact by the sporting event weighting factor for the day it was recorded, as shown in the SQL below:

SELECT Event, SUM(Revenue * Weighting_Factor) as Weighted_Revenue
FROM Sales_Fact s, Event_Day_Calendar d
WHERE s.Sale_Date_Key = d.Event_Date_Key
GROUP BY Event

Consequences
The “correct” weighting factor split can depend on the facts being queried and who is doing the querying.

Of course these may not be the “correctly weighted” figures at all—if a business sells more tennis rackets than soccer balls, the Wimbledon/World Cup split should be quite different. Allocation is usually problematic because different stakeholder groups have different ideas about how the atomic facts should be split. However, the one thing no one can argue about is the weighted total. If the weighting factors always add up to 1 for any day, the grand total for all the days covered by a report will be correct—so long as no events are filtered out.

Useful impact reports can be constructed by querying both the weighted facts and unweighted versions of the facts. The unweighted facts can be displayed in the body of the report for each row header; for example, World Cup $30M and Wimbledon $10M. The weighted facts can be aggregated within the BI tool to produce a correct grand total for the report; for example, $30M for the two events (because they completely overlap).
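The weighted report and its grand total can be sketched end to end using the Figure 9-5 events with invented revenue figures (Python's sqlite3 stands in for the warehouse DBMS; amounts are small round numbers rather than the narrative's $millions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE event_day_calendar (event_date_key TEXT, event TEXT,
                                     weighting_factor REAL);
    CREATE TABLE sales_fact (sale_date_key TEXT, revenue REAL);
    -- June 21: two overlapping events split the day 50/50; weights per day sum to 1
    INSERT INTO event_day_calendar VALUES
        ('2010-06-21', 'World Cup',      0.5),
        ('2010-06-21', 'Wimbledon',      0.5),
        ('2010-07-14', 'Tour De France', 1.0);
    INSERT INTO sales_fact VALUES ('2010-06-21', 30.0), ('2010-07-14', 12.0);
""")

# Weighted report: each atomic fact multiplied by that day's weighting factor
weighted = conn.execute("""
    SELECT event, SUM(revenue * weighting_factor) AS weighted_revenue
    FROM sales_fact s JOIN event_day_calendar d
      ON s.sale_date_key = d.event_date_key
    GROUP BY event ORDER BY event
""").fetchall()
print(weighted)  # [('Tour De France', 12.0), ('Wimbledon', 15.0), ('World Cup', 15.0)]

# Grand total of the weighted facts equals true sales: no over-counting
total = sum(r for _, r in weighted)
print(total)  # 42.0
```

Grouping by event without the weighting factor would report 30.0 for each of the two June 21 events and a misleading 72.0 total, which is exactly the over-counting the weighting factor prevents.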

Modeling Multi-Valued Groups


Multi-valued (MV) event details are discovered by telling group themed stories.

Sporting events in the previous example are multi-valued indirect causal factors. Because they are not fundamental details of product sales events—they are only related to time—the multi-valued modeling challenge they represent can be addressed after the sales star is implemented or (better still) ignored altogether. This is not the case when a direct causal factor is a significant multi-valued detail of an event. For example, imagine you are modelstorming Pomegranate’s medical insurance claims. When you ask the stakeholders “Who does what?” they reply:
Doctor claims amount.

By working through the 7Ws you discover that a DOCTOR (who) claims an
amount for a TREATMENT (what) given to a PATIENT [Employee] (who) on a
specific TREATMENT DATE (when), as shown in the BEAM✲ event table, Figure
9-6. These are all single-valued details that convert readily to dimensions. But a
problem arises when you come to the why question:

Why does a doctor treat a patient?

When you ask for examples for the resulting DIAGNOSIS why detail, you discover that a claim contains multiple diagnoses—there is typically more than one thing wrong with a patient—and the diagnosis codes (ICD10 codes) submitted as part of every claim are not linked to the specific treatments.

Figure 9-6
MEDICAL
TREATMENTS
event table
containing
group themed
examples

Multi-valued (MV) and multi-level (ML) event details are discovered by telling group themed stories.

You capture this business knowledge about multiple diagnoses by marking DIAGNOSIS as MV to denote a multi-valued detail. Generally you discover MV details by getting stakeholders to tell group themed stories. You do this after they have told all their other themed stories (typical, different, missing, repeat) by pointing at each detail and asking stakeholders if they can give you an example of the same type of event that would contain groups of that detail; e.g. a group of customers or a group of products. For most events and most details they won’t be able to because multi-valued details (and multi-level (ML) details, which you also find this way) are the exception rather than the rule—thankfully.

Multi-Valued Bridge Pattern


Problem/Requirement
Changing the granularity of a fact table to remove a M:M relationship “hard-codes” fact allocations.

The multi-valued group of diagnoses for each treatment creates a M:M relationship between the event and a diagnosis dimension. This could be addressed in the star schema design by changing the physical fact table granularity from one row per claim line item to one row per claim line item per diagnosis (CLAIM ID GD, TREATMENT_KEY GD, DIAGNOSIS_KEY GD). However, this causes a significant allocation problem. If 10M claims have an average of 5 itemized treatments for an average of 3 diagnosis codes each, this would immediately triple a 50M row fact table to 150M rows. While the extra 100M rows will adversely affect query performance, the real issue is what facts do you put on these extra rows? Doctors submit 50M “atomic” claim amounts; how do you go about splitting these amongst their multiple diagnoses to create 150M additive CLAIM AMOUNT facts? Looking at the example events you may have some ideas on how Bond’s treatment costs should be allocated to his symptoms but with hundreds of millions of claim facts to process, automating this would be difficult and few stakeholders will agree on how you should “hard-code” these fact allocations.

Solution
Use a multi-valued bridge table (MV) to resolve a M:M relationship between a fact table and a dimension.

Fact allocation problems can be avoided by leaving the fact table granularity unaltered, and using a multi-valued bridge table (MV) instead, to resolve the M:M relationship. For example, DIAGNOSIS GROUP [MV], shown in Figure 9-7, can be used to join unaltered claim facts to a DIAGNOSIS dimension. It does this by storing the multiple DIAGNOSIS KEYs of a claim as separate rows of a diagnosis group, each with a now familiar WEIGHTING FACTOR. Diagnosis groups are created and assigned a surrogate key (DIAGNOSIS GROUP KEY) as unique claim diagnosis combinations are observed during ETL. These bridge table keys are added to the facts as they are loaded so that tables can be joined as in Figure 9-8.

Figure 9-7
DIAGNOSIS
GROUP
multi-valued
bridge table

Bridge tables avoid the political issues of hard-coding fact allocations. They provide flexible multi-valued reporting: users can choose how they weight the facts at query time.

Not only does the bridge table resolve the technical problem of the M:M relationship, it sidesteps the political issues of how to split the atomic facts and provides greater reporting performance and flexibility. By not increasing the number of facts or altering their values, queries that stick to the normal single-value dimensions to analyze busiest doctors, sickest patients or most expensive treatments run as fast as possible and produce “unarguable” answers. Even queries that filter on one specific ICD10 code produce similar fast “unarguable” answers. Only when BI users want to analyze by multiple diagnoses do they have to consider the weighting factors and argue about allocations. When they do, they can choose to ignore weighting factors and look at the unweighted treatment costs, use the default (crude) weighting factors in DIAGNOSIS GROUP (which add up to 100% for each diagnosis group) or model their own weighting factors in swappable versions of DIAGNOSIS GROUP and use those instead.

When a multi-valued dimension, such as DIAGNOSIS, is constrained to a single value the multi-valued allocation problem goes away for that query and any additive facts, such as Claim Amount, can be summed without over-counting.

Figure 9-8
Using a
multi-valued
bridge table
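The fact-to-bridge-to-dimension joins can be sketched with a minimal invented dataset: one atomic claim fact, a two-diagnosis group with a 50/50 default split, and weighted versus unweighted totals side by side (the ICD10 codes are illustrative only):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE claim_fact (claim_id TEXT, diagnosis_group_key INTEGER,
                             claim_amount REAL);
    CREATE TABLE diagnosis_group (diagnosis_group_key INTEGER,
                                  diagnosis_key INTEGER, weighting_factor REAL);
    CREATE TABLE diagnosis (diagnosis_key INTEGER, icd10_code TEXT);

    -- One atomic claim fact pointing at a group of two diagnoses (50/50 split)
    INSERT INTO claim_fact VALUES ('C1', 1, 200.0);
    INSERT INTO diagnosis_group VALUES (1, 10, 0.5), (1, 20, 0.5);
    INSERT INTO diagnosis VALUES (10, 'S61.0'), (20, 'T14.1');
""")

rows = conn.execute("""
    SELECT dx.icd10_code,
           SUM(f.claim_amount)                      AS unweighted,
           SUM(f.claim_amount * g.weighting_factor) AS weighted
    FROM claim_fact f
    JOIN diagnosis_group g ON g.diagnosis_group_key = f.diagnosis_group_key
    JOIN diagnosis dx      ON dx.diagnosis_key = g.diagnosis_key
    GROUP BY dx.icd10_code ORDER BY dx.icd10_code
""").fetchall()
print(rows)  # [('S61.0', 200.0, 100.0), ('T14.1', 200.0, 100.0)]
```

The unweighted column shows each diagnosis's full impact, while the weighted column's grand total (200.0) still matches the single atomic claim amount: the fact itself was never split or duplicated in the fact table.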

If diagnosis combinations seldom repeat, a simple ETL process could create a new diagnosis group for every claim using the CLAIM ID as a DIAGNOSIS GROUP KEY. This would avoid having to add a new foreign key to the claim facts as the degenerate CLAIM ID would already be present. However, if combinations frequently reoccur a more sophisticated ETL process can reuse common diagnosis groups (to reduce bridge table growth) by using a dedicated surrogate key for the bridge table. This approach would be likely to have greater BI value because frequently occurring groups could be more easily found (by ranking on DIAGNOSIS GROUP KEY—breaking our rule on hiding surrogate keys) and their weighting factors adjusted by hand if necessary.

Consequences
Check that fact granularity is correct (atomic) before using a bridge table design.

When you discover a potential multi-valued dimension you should first check that the granularity of the facts is correct before complicating the design with a bridge table. If this is an aggregation of the available operational details you may be able to turn the multi-valued dimension into a normal dimension by going down to the atomic level of detail. For example, if you modeled an invoice fact table with a granularity of one row per invoice then PRODUCT would be a MV dimension. Modeling the atomic invoice line items easily solves this. However, if you are already at the atomic-level you can avoid “splitting the atom” and creating meaningless (unstable subatomic) measures by using a bridge table.

Optional Bridge Pattern


Problem/Requirement
A bridge table over-complicates queries that rollup a barely multi-valued dimension to a single-valued level.

Causal (why) dimensions are not the only multi-valued dimensions. Frequently the who dimension for a single fact can be multi-valued. For example, multiple doctors can perform a surgical procedure and multiple customers can purchase a joint policy. Multi-valued uses of who dimensions can be implemented by building bridge tables as in the previous pattern. However, using bridge tables in every query can be excessive when a multi-valued who dimension is only barely multi-valued and the majority of queries want to rollup the facts past the multiple individuals to a single-valued who hierarchy level. For example, most product sales are made by a single employee but a small percentage are made by teams of two employees working together. Employee level sales reports need to split any team sales facts between employees based on their seniority or role within their teams, but most reports only need to total sales to the branch level or above—ignoring the members of a team by using the branch where the team is based.

Solution
Use a multi-level (ML) dimension to avoid joining through a bridge table.

When a dimension is barely multi-valued, a bridge table can be avoided by making the dimension multi-leveled (ML) so that it contains additional records for the small number of multi-valued groups needed. For example, the multi-level EMPLOYEE [HV, ML] dimension, in Figure 9-9, holds normal employee records for sales consultants and additional records for sales teams made up of two or more consultants. It contains example dimension members for two employees Holmes and Watson, and handles facts where they have worked together (when the game is afoot) by treating their team “Holmes & Watson” as a pseudo-employee. This allows EMPLOYEE to join directly to the sales fact table (as in Figure 9-10) and rollup all their individual and joint sales to the appropriate branch at the time of sale. For example, Watson’s individual sales will be rolled up to Afghanistan or London depending on when they occurred. His joint sales with Sherlock Holmes will always be rolled up to London.

Figure 9-9
Multi-level
EMPLOYEE
dimension
containing additional
team rows

All SK column example values will be replaced by integer surrogate keys in the
physical model. The (2) records show the effect of an HV attribute change for
employee John Watson (moving back to London).
Dimensional Design Patterns for Cause and Effect 271

Figure 9-10
Joining the multi-
level dimension
directly to the facts

A bridge table will still be needed for queries that must use a multi-valued
weighting factor.

For most queries that need to total sales the efficient direct join will be ideal, but
those queries that calculate team sales splits will still need to treat EMPLOYEE as
multi-valued. They can do so by joining through an optional bridge table, such as
TEAM shown in Figure 9-11, that provides the team split percentage (the
equivalent of a weighting factor) for each team member. The presence of the
optional bridge effectively makes the direct join a shortcut that can be used
whenever PERCENTAGE is not needed.

Figure 9-11
Joining through the
TEAM optional
bridge table

An optional bridge table must use the same surrogate key values as its multi-level
dimension.

To be able to optionally join through a bridge, or directly to the facts, both the
optional bridge table and the ML dimension must use the same surrogate keys,
effectively making them swappable dimensions. For example, the Figure 9-12
BEAM✲ diagram for TEAM shows that the bridge table key TEAM KEY is actually
a foreign key role of EMPLOYEE KEY. TEAM uses the special pseudo-employee
key values, shown in Figure 9-9, to record the members and percentage splits for
each team. It also uses normal employee key values on the records where TEAM
KEY and MEMBER KEY are the same and PERCENTAGE is 100. These act like
teams of one, allowing the bridge table to join employees to 100% of their
individual sales facts—the equivalent of a direct join.

Figure 9-12
Multi-valued and
Multi-leveled
bridge table

Adding multiple levels to the bridge table increases reporting flexibility.

TEAM contains an additional attribute MEMBERSHIP TYPE to describe these
“Team Split” and “Employee” records. It also records a third membership type of
“Team” for some of the 100% records (highlighted in bold in Figure 9-12). These
records allow the bridge to join facts to the team level records (e.g. “Holmes &
Watson”) in EMPLOYEE as well as normal employee records. This makes TEAM
[HV, MV, ML] a multi-level (ML) as well as multi-valued bridge table, enabling it
to be used to flexibly query both team sales and employee sales in a single pass. For
example, consider the following query:

SELECT Employee_Name, SUM(Revenue)
FROM Employee E, Team T, Sale_Fact S
WHERE E.Employee_Key = T.Member_Key
  AND T.Team_Key = S.Employee_Key
GROUP BY Employee_Name

Bridge table levels must be carefully filtered to avoid double-counting.

This returns both team sales and employee total sales (including their team sales)
— a very useful report — but care must be taken not to add a grand total because it
would double-count team sales. Filtering on MEMBERSHIP TYPE removes this
limitation and makes the following additive reports available:

Employee individual sales (excluding their team sales):

WHERE Membership_Type = "Employee"

Employee team sales (excluding their individual sales):

WHERE Membership_Type = "Team Split"

Employee total sales:

WHERE Membership_Type IN ("Employee", "Team Split")

Team sales:

WHERE Membership_Type = "Team"

Team sales and employees’ individual sales (excluding their team sales):

WHERE Membership_Type <> "Team Split"

This last filter is the equivalent of the shortcut join that avoids the bridge table.
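The effect of these MEMBERSHIP TYPE filters can be sketched with an in-memory SQLite database. The table and column names follow the query above, but the revenue figures and the 60/40 split are invented for illustration, and the Team Split rows are weighted by PERCENTAGE so that the “Employee” plus “Team Split” totals stay additive:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Illustrative schema: names follow the book's query, data is invented.
cur.executescript("""
CREATE TABLE employee (employee_key INTEGER PRIMARY KEY, employee_name TEXT);
CREATE TABLE team (team_key INTEGER, member_key INTEGER,
                   percentage INTEGER, membership_type TEXT);
CREATE TABLE sale_fact (employee_key INTEGER, revenue REAL);

INSERT INTO employee VALUES
    (1, 'Sherlock Holmes'), (2, 'John Watson'), (3, 'Holmes & Watson');

-- 'Employee' rows are teams of one; pseudo-employee 3 is split 60/40.
INSERT INTO team VALUES
    (1, 1, 100, 'Employee'),
    (2, 2, 100, 'Employee'),
    (3, 1,  60, 'Team Split'),
    (3, 2,  40, 'Team Split'),
    (3, 3, 100, 'Team');

INSERT INTO sale_fact VALUES (1, 100.0), (2, 50.0), (3, 200.0);
""")

# Employee total sales: individual sales plus each member's weighted share
# of any team sales they took part in.
rows = cur.execute("""
    SELECT e.employee_name, SUM(s.revenue * t.percentage / 100.0)
    FROM employee e
    JOIN team t ON e.employee_key = t.member_key
    JOIN sale_fact s ON t.team_key = s.employee_key
    WHERE t.membership_type IN ('Employee', 'Team Split')
    GROUP BY e.employee_name
    ORDER BY e.employee_name
""").fetchall()
print(rows)  # Holmes: 100 + 60% of 200 = 220; Watson: 50 + 40% of 200 = 130
```

Because the split percentages are applied, the two employee totals add back up to the grand total revenue with no double-counting.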

Consequences
Multi-level bridge tables are complex to build and complex to use correctly.

Multi-level bridge tables are complex. Including the multiple levels provides
complete reporting flexibility, but at a price. Queries must filter the bridge table
correctly to avoid double-counting or misinterpreting the results. Keeping the
multiple levels in the dimension and bridge synchronized also requires significant
additional ETL processing. For example, one change to Watson’s location requires
2 new rows in EMPLOYEE and 4 new rows in TEAM to keep Watson and the
Holmes & Watson team in sync. All these new rows have been marked (2) in
Figures 9-9 and 9-12 to highlight the second versions of Watson, his team, and his
team splits in the two tables created by his return from Afghanistan to London.

Pivoted Dimension Pattern


Problem/Requirement
BI users want to analyze event combinations.

A number of Pomegranate products are highly configurable (e.g. the POMCar).
For marketing and manufacturing purposes, BI users would like to analyze
customer option choices, particularly which options are frequently chosen together,
and which options are added to base products that already had certain other
“options as standard”. The options themselves are also products and services
stored in the PRODUCT dimension—some can be sold standalone, others cannot.
Figure 9-13 shows a proposed enhancement to the order processing star schema to
handle the option analysis requirements. It includes an OPTION PACK [MV]
bridge table that allows the PRODUCT dimension to play the additional roles of
custom ordered option (by adding OPTION PACK KEY to ORDERS FACT) and
standard option (by adding a STANDARD PACK KEY foreign key to PRODUCT).

Figure 9-13
OPTION PACK
bridge table added
to ORDER FACTS

By creating two role-playing views (OPTION and STANDARD OPTION) it is easy
to construct a query for the users’ first question type:

How many products with option 4 as standard were
customized by adding option 5?

Their second question type:

What were the most popular customized products ordered
with option 2 or option 3?

is also straightforward because the unique SERIAL NUMBER of each customized
product has also been added to ORDERS FACT. This enables customized products
to be counted distinctly so as not to double-count orders where the customer has
chosen both option 2 and option 3 for the same product.

Counting unique events that are filtered on multiple multi-values requires an
appropriate unique degenerate ID in the fact table to be counted uniquely.

Unfortunately, even with the role-playing bridge table and a unique degenerate ID
the proposed design does not easily answer their third question type:

How many products were purchased with options 2, 3
and 14, but without options 4, 5 and 190?

Combination analysis can involve complex set logic SQL.

The AND logic of the option combinations complicates matters. The users cannot
answer this question with simple SQL that might contain: “WHERE Option=2
and Option=3…” because OPTION can be equal to both 2 and 3 at the same
time! Instead they must:

1. Run 3 queries to find products with one of the 3 required options.
2. INTERSECT the results to find only the products with all 3 options.
3. Run 3 more queries to find products that have the 3 unwanted options.
4. INTERSECT the results to find the products with all 3 unwanted options.
5. Use SQL MINUS to take the second set of products away from the first.

These 9 subqueries can be executed as a single SQL SELECT but users would not
be able to construct them (or other combination questions) using simple ad-hoc
query tools. Even if they could, the queries would not necessarily perform well.

Multi-valued bridge tables often give rise to complex multi-valued combination
constraints. The AND logic becomes complex because the constraint needs to be
placed simultaneously on the multiple rows in the bridge table. It is far easier for
SQL to constrain multiple columns in this way than multiple rows.

Solution
Use a pivoted dimension to turn complex row constraints into simple column
constraints.

If the number of options available across all customizable products is limited (e.g.
200 in total) and relatively static (e.g. new options are only added once a year) this
row problem can be turned into a column solution with a bit of lateral thinking.
Figure 9-14 shows an OPTION PACK FLAG dimension. This is a pivoted
dimension (denoted by the code PD) that stores the same option combinations as
the bridge table, but as columns rather than rows. It requires 200 columns to do so
but these columns are just bit (or single byte) flags and they make combination
constraints very easy to build in SQL. For example, the filter for the previous user
question becomes:

WHERE Option2 = "Y" AND Option3 = "Y" AND Option14 = "Y" AND
      Option4 = "N" AND Option5 = "N" AND Option190 = "N"

A bridge table and pivoted dimension are swappable dimensions that can be used
together.

The example data in Figure 9-14 shows that the option pack the users are looking
for is OPTION PACK KEY 1. This is the same value as the more complex set
based queries would eventually find in the bridge table because the pivoted
dimension and the bridge table use the same surrogate key—they are swappable
versions of each other. This means that the fact table does not need to be altered to
add the pivoted dimension if the bridge table key is already present. There is value
in having both tables because the bridge table and OPTION dimension
combination is GROUP BY friendly and single value WHERE clause friendly
while the pivoted dimension is combination WHERE clause friendly. To make the
pivoted dimension user-friendly as well it should be built with meaningful names
for each option column; for example, MEMORY UPGRADE, CPU UPGRADE,
RAID CONFIGURATION etc.

Figure 9-14
OPTION PACK
pivoted dimension
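The swappable pair can be sketched with an in-memory SQLite database. Five invented option flags stand in for the 200 real ones, and SQLite’s EXCEPT plays the role of MINUS; both routes find the same OPTION PACK KEY:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Miniature of the pattern: five invented option flags instead of 200.
cur.executescript("""
CREATE TABLE option_pack_bridge (option_pack_key INTEGER, option_key INTEGER);
CREATE TABLE option_pack_flag (
    option_pack_key INTEGER PRIMARY KEY,
    option2 TEXT, option3 TEXT, option4 TEXT, option5 TEXT);

-- Pack 1 = {2, 3}; pack 2 = {2, 4}; pack 3 = {2, 3, 5}
INSERT INTO option_pack_bridge VALUES
    (1, 2), (1, 3), (2, 2), (2, 4), (3, 2), (3, 3), (3, 5);
INSERT INTO option_pack_flag VALUES
    (1, 'Y', 'Y', 'N', 'N'),
    (2, 'Y', 'N', 'Y', 'N'),
    (3, 'Y', 'Y', 'N', 'Y');
""")

# Row-oriented route: packs with options 2 AND 3 but WITHOUT 5 needs set logic.
set_logic = cur.execute("""
    SELECT option_pack_key FROM option_pack_bridge WHERE option_key = 2
    INTERSECT
    SELECT option_pack_key FROM option_pack_bridge WHERE option_key = 3
    EXCEPT
    SELECT option_pack_key FROM option_pack_bridge WHERE option_key = 5
""").fetchall()

# Column-oriented route: the same question as one single-row filter.
pivoted = cur.execute("""
    SELECT option_pack_key FROM option_pack_flag
    WHERE option2 = 'Y' AND option3 = 'Y' AND option5 = 'N'
""").fetchall()

print(set_logic, pivoted)  # both routes find pack 1
```

Because both tables share the surrogate key, either result set joins to the same fact rows.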

Add comma separated list attributes to flag dimensions to make them more report
display-friendly.

In Figure 9-14, the pivoted dimension has an additional OPTION PACK attribute
containing comma separated lists of option codes. This can be used in a query
GROUP BY clause or displayed in a report header/footer to describe the filters that
have been applied. In the user-friendly version of the pivoted dimension this would
be a long text column containing a list of descriptive option names (sorted in
alphabetic order). It can be useful to provide both versions; e.g., an OPTION
PACK NUMBER list of codes and an OPTION PACK list of descriptions.

If you need to build a column flag dimension, create the multi-valued bridge table
version first. Maintaining this type of table is easier with standard ETL routines
and simple SQL. After the bridge table is in place you can then create more
elaborate ETL routines that pivot its rows to create and maintain the column-
orientated version with meaningful column names generated for the descriptive
row values using SQL generated SQL.

Even without a bridge table, a pivoted dimension is needed to cope with complex
ad-hoc combination queries.

While bridge tables and pivoted dimensions often go together, the need for a
pivoted table is not limited to multi-valued dimensions. For example, if the
granularity of ORDERS FACT was one record per product option order line item
(the product plus each of its custom options as fact rows), this would avoid the
multi-valued bridge but the pivoted dimension would still be needed to easily
answer the combination questions. It would just be more work to build the pivoted
dimension from scratch without the bridge and the fact table would still need to be
altered to add the OPTION PACK KEY.

Column flag dimensions should only be populated with observed combinations,
otherwise they can easily grow to be bigger than fact tables. A bit flag dimension
with only 20 columns has over a million possible combinations.

If bridge table rows contain quantities, a pivoted dimension can contain count
columns.

If the business problem was more complex and varying quantities of each option
could be chosen to configure a product, the OPTION PACK bridge table would
need to contain an OPTION QUANTITY attribute and the OPTION PACK [PD]
pivoted dimension would contain option count columns rather than [Y/N] flags.
Similarly, if small numbers of options and option quantities were handled as
separate fact rows (to avoid a bridge table) and comparisons or combinations were
constantly used then a pivoted fact table might be created with option count facts.

Consequences
Pivoted dimensions are limited to relatively small and stable value populations.

Pivoted dimensions are limited by the maximum number of columns available in a
database table (usually between 256 and 1024) and the ETL involved in automating
the maintenance of volatile combination values is complex. A pivoted dimension
works well for Pomegranate because there are only a few hundred relatively stable
options (with several new ones being added manually each year) but it could not
cope with a possible 155,000 ICD10 diagnosis codes.

How Dimensions
How dimensions are often degenerate (DD) IDs that provide useful links to
operational source records and unique count measures.

How dimensions document any additional information about facts that is not
captured by other dimensions. The most common how dimensions are degenerate
(DD) transaction identifiers stored in fact tables. These dimensions describe how
facts come to exist by tying them back to the original source system transactions.
They can also be invaluable for providing unique transaction counts. For example,
an ORDER ID in an ORDERS FACT table can be used to count how many orders
contained at least one laptop product line item. Using COUNT(DISTINCT
Order_ID) ensures that individual orders with several line items for different
laptops will not be over-counted. As mentioned earlier, a degenerate ID that can be
uniquely counted is essential if a star schema has one or more multi-valued
dimensions.
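The distinct-count behavior is easy to demonstrate with an in-memory SQLite database (the order rows and POMBook product names here are invented for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Invented order line items: order 101 contains two different laptop lines.
cur.executescript("""
CREATE TABLE orders_fact (order_id INTEGER, product TEXT, quantity INTEGER);
INSERT INTO orders_fact VALUES
    (101, 'POMBook Pro', 1),
    (101, 'POMBook Air', 1),
    (102, 'POMBook Pro', 2),
    (103, 'POMPhone',    1);
""")

naive = cur.execute(
    "SELECT COUNT(order_id) FROM orders_fact "
    "WHERE product LIKE 'POMBook%'").fetchone()[0]
distinct = cur.execute(
    "SELECT COUNT(DISTINCT order_id) FROM orders_fact "
    "WHERE product LIKE 'POMBook%'").fetchone()[0]
print(naive, distinct)  # 3 2 -- order 101 would otherwise be counted twice
```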

Look out for conformed degenerate dimensions and add them to the event matrix
to help you discover event sequences and milestone dependencies that can be
modeled as evolving events.

Too Many Degenerate Dimensions?


Large sets of degenerate dimensions within a fact table should be remodeled as
separate dimensions.

Most transaction fact tables will contain at least one degenerate transaction ID that
cannot be stored more efficiently in a separate dimension table because its
cardinality approaches that of the fact table itself and it has no additional
descriptive attributes. However if a fact table contains many additional degenerates
you should try to prune these to keep the fact table record length under control
using the following guidelines:

If degenerates will be used to group or filter or will be browsed in combination,
they should be remodeled as attributes in a separate dimension.

Some degenerates can be remodeled as useful additive and non-additive facts.

Any degenerates that contain large unstructured comments should be replaced by
a surrogate key to a COMMENT dimension (as in Figure 9-2).

If a degenerate [Y/N] flag will be frequently counted it can be remodeled as a
low cardinality additive fact with the values 0, 1 that can be summed. This is
especially useful as aggregates can be built that use this fact.

If the degenerate is high cardinality and will be counted distinctly it should
remain in the fact table where it will act as a non-additive fact.

If a degenerate flag describes the type of value in an adjacent fact it may repre-
sent data that would be better modeled as separate additive fact columns
without the flag. For example, a REVENUE fact and a flag REVENUE TYPE
with the values: ‘E’ for estimate and ‘A’ for actual, should instead be modeled
as two facts: ESTIMATED REVENUE and ACTUAL REVENUE.

Sometimes a degenerate meets more than one of these criteria. For example, a
flag may be frequently counted, and used for grouping and constraining. In
which case, you can model it as both a fact and a dimensional attribute.

Creating How Dimensions


Move degenerates to a physical how dimension named after the fact table and
replace them in the fact table with a surrogate key.

If you identify degenerates that should be remodeled as dimensions, check to see if
any belong in existing dimensions. For any that do not, define a new dimension
with its own surrogate key and relocate the degenerates to it, replacing them in the
fact table with the new surrogate key. This new dimension is often called a “junk”
dimension, because of its tough-to-classify mix of attributes. But it really is not
junk at all. Instead, it is a non-conformed how dimension, specific to just this set of
facts, that can often be usefully named after its matching fact table. For example, a
CALL DETAILS FACT table may need a CALL DETAIL how dimension, and a
SALES FACT table may need a SALE TYPE dimension. If a fact table has multiple
small non-conformed dimensions—typically whys and hows—they can often be
merged to reduce the number of keys in the fact table.

Don’t tell stakeholders that any of their data is “junk”, especially when you are
modelstorming with them. If you are looking for a less pejorative term for your
non-conformed how dimensions, call them miscellaneous dimensions.

Range Band Dimension Pattern


Problem/Requirement
BI users want to group by the facts but need to rollup the answers.

BI users want to group by facts, such as REVENUE and ORDER QUANTITY, and
count the unique occurrences of customers, products or transactions. They need to
use a fact like a dimension and treat a dimensional attribute like a measure.
Converting a dimensional attribute like CUSTOMER ID into a measure can be
straightforward using COUNT(DISTINCT …) but it requires more work to turn
raw facts into good GROUP BY items. Because facts are mostly high cardinality,
continuously valued numbers, grouping by them rolls up very little data and
produces too many report rows: more data dump than readable report.

Solution
Provide a range band dimension to “turn facts into dimensions”.

Numeric range band dimensions such as RANGE BAND, shown in Figure 9-15,
are another type of how dimension. They are “how many” dimensions or “How do
you turn a fact into a dimension?” dimensions that convert continuously valued
high cardinality facts into better discrete row headers. Chapter 6 described how
high cardinality dimensional attributes should be stored as range band labels that
are more useful for grouping by. Range band dimensions allow this to be done
dynamically at query time to facts and other numeric dimensional attributes.

Figure 9-15
RANGE BAND
dimension

Range band dimensions convert high cardinality facts into useful low cardinality
report row headers.

Figure 9-15 is an example of a general-purpose range band dimension that can
store any number of range band groups. The example data shows two groups: “5
Money Bands” that would be used to group REVENUE into 5 bands and “4 Age
Bands” that can be joined to a customer or employee age to group a population
into 4 bands. Figure 9-16 shows how the RANGE BAND dimension is joined to
SALES FACT to count the number of products sold in each of the 5 revenue
ranges—effectively converting the REVENUE fact into a dimension on-the-fly.
The SQL for the query would be:
SELECT range_band, SUM(quantity_sold)
FROM sales_fact, range_band
WHERE range_band_group = "5 Money Bands"
  AND revenue BETWEEN low_bound AND high_bound
GROUP BY range_band

Index facts that are frequently used for range banding.

Range band dimensions allow BI users to define new bandings at any time—by
simply adding or changing dimension rows. The price for this flexibility will be
slower query performance because SQL BETWEEN joins are difficult to optimize.
If certain facts are frequently used for range banding they can be indexed to
improve join and sort processing. Normally only the dimensional foreign keys are
indexed. Facts are usually not indexed because indexes do not speed up their
aggregation. But for range banding queries, the facts are acting like dimensional
foreign keys.

Figure 9-16
Range banding
a fact
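The query runs as-is against an in-memory SQLite database. The band labels, bounds and fact rows below are invented, but the bounds are deliberately contiguous and non-overlapping:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE range_band (range_band_group TEXT, range_band TEXT,
                         low_bound REAL, high_bound REAL);
CREATE TABLE sales_fact (revenue REAL, quantity_sold INTEGER);

-- Invented bounds: contiguous and non-overlapping within the group.
INSERT INTO range_band VALUES
    ('5 Money Bands', '1 Under 100',     0.00,     99.99),
    ('5 Money Bands', '2 100-499',     100.00,    499.99),
    ('5 Money Bands', '3 500-999',     500.00,    999.99),
    ('5 Money Bands', '4 1000-4999',  1000.00,   4999.99),
    ('5 Money Bands', '5 5000+',      5000.00, 999999.99);
INSERT INTO sales_fact VALUES (50, 1), (250, 2), (250, 1), (750, 4), (6000, 1);
""")

rows = cur.execute("""
    SELECT range_band, SUM(quantity_sold)
    FROM sales_fact, range_band
    WHERE range_band_group = '5 Money Bands'
      AND revenue BETWEEN low_bound AND high_bound
    GROUP BY range_band
    ORDER BY range_band
""").fetchall()
print(rows)  # empty bands (here '4 1000-4999') simply produce no row
```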

Consequences
Range bands must be carefully defined. They must be unique with no gaps and no
overlaps.

RANGE BAND GROUP, LOW BOUND, and HIGH BOUND form the primary
key (PK) of the RANGE BAND dimension, and must therefore be unique. You
should set up the LOW BOUND and HIGH BOUND values for each range band
with care: they should not overlap, and no gaps should exist. In addition, the
RANGE BAND names must be unique within each RANGE BAND GROUP. The
short code ND1 (No Duplicates) in Figure 9-15 has been added to these columns
to indicate that they form a no duplicates group (number 1)—the combination of
column values within the group must be unique.

Step Dimension Pattern


Problem/Requirement
BI users want to understand sequential behavior by analyzing events relative to
the cause and effect events within a sequence.

Chapter 7 covered techniques for overloading sequential events, such as flights or
web page visits, with first and last locations. These powerful dimensions not only
provide extra where information, they typically describe why a sequence of events
started and how it finished. For example, the first URL in a web visit can be
converted into a REFERRAL why dimension that describes the banner ad or
search string that triggered each click, and the last URL can become a how
dimension that describes each click by its outcome; for example, “Just browsing”
or “Big shopping trip”. Armed with this additional why and how information, BI
users will often want to analyze the position of all the intervening events relative
to these pivotal cause and effect events.

Solution
A step dimension numbers each event in a sequence.

The humble looking STEP dimension, in Figure 9-17, helps BI users understand
sequential behavior. It allows ETL processes to explicitly label events with their
position in a sequence (from its beginning and from its end), along with the length
of the sequence. For example, a web browsing session of four page views by the
same visitor (IP address) within an agreed timeframe would be represented as four
rows in a PAGE VIEWS FACT table. The first page view event would be labeled as
step 1 of 4 by assigning it a STEP KEY of 7 (see Figure 9-17). The next page view
would be labeled as step 2 of 4 using STEP KEY 8, and so on.

Figure 9-17
STEP dimension

A STEP dimension enables positional analysis (better story telling) using simple
single-pass queries.

BI users can use the STEP dimension to easily identify page views belonging to
sessions of any length, rank pages by position within sessions, and answer
questions about the beginning, midpoint and ending of sessions for any interesting
subset of customers, time, and products. They can quickly find the good and bad
(“session killer”) last page visits of a session (LAST STEP = “Y”), or those that
precede session killers (STEPS UNTIL LAST = 1) using simple, single-pass SQL.
Answering questions like these without a STEP dimension would be too difficult
for all but the most SQL-savvy BI users.

Step dimensions can play multiple roles to describe sequences within sequences.

A STEP dimension can also play multiple roles for an event; for example, Figure
9-18 shows a PAGE VIEWS FACT table with two STEP dimension roles: STEP IN
SESSION which describes page position within the overall session, and STEP IN
PURCHASE which describes how close each page is to a purchase decision. Each
time a visitor clicks on a link to place a product in a shopping cart, STEP IN
PURCHASE would be reset and the next mini-sequence length calculated.

Figure 9-18
Using the STEP
dimension to
describe web page
visits

Events that are not part of a sequence use STEP row 0.

The STEP IN PURCHASE dimension role lets BI users analyze page visit
sequences that lead to product purchases and ones that don’t: page views that don’t
lead to a purchase would have a STEP IN PURCHASE KEY that points to the
“Not Applicable” row 0 in STEP.

Consequences
STEP dimensions grow in size quickly. You should set a maximum number of
steps for the majority of sequences.

STEP dimensions are relatively simple to populate from spreadsheets, but they
grow surprisingly quickly as the maximum number of steps increases. The formula
for calculating the number of rows needed for n total steps is: n(n+1)/2.
Therefore, 200 steps = 20,100 rows, and 1,000 steps would be more than half a
million rows! If 200 steps are more than adequate for 99% of all sequences,
pre-populate your STEP dimension accordingly, and create special step number
records greater than 200 if/when they are needed. These records would use special
STEP KEY values (e.g. the negative step number) and would contain the STEP
NUMBER but have missing values for the other attributes to denote that they are
steps in “exceptionally long” sequences. Often exceptionally long sequences are
the result of ETL processing errors or poorly defined business rules that fail to
spot the end of a normal sequence.
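Pre-populating a STEP dimension is a short scripting exercise. The sketch below uses invented attribute names that loosely follow Figure 9-17; with keys assigned in ascending sequence-length order it also reproduces the example above, where step 1 of a 4-step sequence gets STEP KEY 7:

```python
def build_step_dimension(max_steps):
    """Generate one row per (sequence length, step number) pair.

    Attribute names loosely follow Figure 9-17; surrogate keys start at 1
    (0 is assumed reserved for the "Not Applicable" row).
    """
    rows, step_key = [], 1
    for total_steps in range(1, max_steps + 1):
        for step_number in range(1, total_steps + 1):
            rows.append({
                "step_key": step_key,
                "total_steps": total_steps,
                "step_number": step_number,
                "steps_until_last": total_steps - step_number,
                "last_step": "Y" if step_number == total_steps else "N",
            })
            step_key += 1
    return rows

dim = build_step_dimension(200)
print(len(dim))            # 20100 rows: n(n+1)/2 for n = 200
print(dim[6]["step_key"])  # 7: step 1 of a 4-step sequence, as in the text
```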

STEP dimensions require additional ETL processing to make two passes of the
data.

Although designing and creating a STEP dimension is straightforward, attaching it
to the facts can require significant additional ETL processing. The events that
belong to the same sequence have to be identified by an appropriate business rule
(for example, all the page visits from the same IP address that are no more than 10
minutes apart) and counted in a first pass of the data; only then can the correct
STEP KEYs be assigned to each fact row in a second pass.

Overloading facts with STEP information and other richly descriptive why and how
dimensions takes significant additional effort from the ETL team. You should
make sure you take them to lunch—on a regular basis.

Audit Dimension Pattern


Problem/Requirement
Stakeholders want to query data lineage.

No treatment of how and why would be complete without covering the perplexed
stakeholder’s questions:

How did this data get into our data warehouse?

Why are the figures so high/low?

Too often the answers to these questions are locked away in an ETL tool metadata
repository—inaccessible to BI users who need this information the most.

Solution
An audit dimension provides summary ETL metadata in a dimensional format.

Figure 9-19 shows an AUDIT dimension that presents ETL statistics and data
quality descriptions in a dimensional form—tied directly to the facts—where they
can be queried by BI users, and used to provide additional context within the body
of reports or as header or footer information. The AUDIT dimension surrogate
key—AUDIT KEY—represents each execution of an ETL process. For example, if
there are five different ETL modules that support the nightly refresh of the data
warehouse, there would be at least five new rows added to the AUDIT dimension
each night. Each of these rows would have a unique AUDIT KEY, which would
appear in the fact table (and dimension) rows that were created or updated by the
given ETL execution—providing basic data lineage information on each fact (and
dimension): where it came from, and how it was extracted and loaded or last
updated.

Figure 9-19
AUDIT dimension

Audit dimensions can be expanded to provide basic data quality indicators.

Figure 9-19 also shows additional indicator attributes (in bold) that describe data
quality and completeness. The AUDIT dimension would contain additional rows
for each ETL module so that unusual fact records can be explicitly flagged if they
contain out of bounds (defined by example data, data profiling, or historical
norms), missing, adjusted or allocated values.
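One possible shape for the pattern, sketched with SQLite. The AUDIT column names, the out-of-bounds threshold, and the ETL module names are all invented; the key mechanic is the one described above: one AUDIT row per ETL execution, with its surrogate key stamped on every fact that execution loads:

```python
import sqlite3
import datetime

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE audit (
    audit_key INTEGER PRIMARY KEY,
    etl_module TEXT, load_date TEXT, out_of_bounds_indicator TEXT);
CREATE TABLE sales_fact (audit_key INTEGER, revenue REAL);
""")

def run_etl_module(module_name, revenues):
    """Insert one AUDIT row per ETL execution and stamp its key on each fact."""
    # Invented quality rule: flag the whole batch if any value is out of bounds.
    indicator = "Y" if any(r < 0 or r > 1_000_000 for r in revenues) else "N"
    cur.execute(
        "INSERT INTO audit (etl_module, load_date, out_of_bounds_indicator) "
        "VALUES (?, ?, ?)",
        (module_name, datetime.date.today().isoformat(), indicator))
    audit_key = cur.lastrowid
    cur.executemany("INSERT INTO sales_fact VALUES (?, ?)",
                    [(audit_key, r) for r in revenues])

run_etl_module("orders_load", [100.0, 250.0])
run_etl_module("returns_load", [-5000000.0])  # deliberately out of bounds

# BI users can now group or filter the facts by their lineage metadata.
rows = cur.execute("""
    SELECT a.etl_module, a.out_of_bounds_indicator, SUM(f.revenue)
    FROM sales_fact f
    JOIN audit a ON f.audit_key = a.audit_key
    GROUP BY a.etl_module, a.out_of_bounds_indicator
    ORDER BY a.etl_module
""").fetchall()
print(rows)
```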

Audit dimensions turn metadata into normal data that can be used to query the
facts.

Audit dimensions leverage the value of ETL metadata. By making it available
within each star schema they elevate metadata to the position of “real” data—
another how or why dimension that BI users can use to group or filter their reports
to help explain the figures they see.

You can find additional information on creating and populating Audit dimensions
in The Data Warehouse ETL Toolkit, by Ralph Kimball and Joe Caserta (Wiley,
2004), pages 128–131.

Summary
Why dimensions are used to store direct and indirect causal reasons. Direct causal factors such
as price discounts are typically easier to implement and attribute to facts than indirect factors
because they are captured as part of a business event and do not need to be inferred from addi-
tional internal or external sources.

Unstructured why details are often captured as free text comments. These should be stored in a
COMMENT why dimension rather than as degenerate dimensions within fact tables.

Multi-valued (MV) bridge tables are used to resolve multiple causal factors and other multi-
valued dimension relationships. Bridge tables avoid having to change the natural atomic granu-
larity of a fact table and hard-coding fact allocations at ETL time. Using a bridge table allows BI
users to choose how to weight the facts at query time. They also avoid multi-valued
issues altogether when queries do not use the multi-valued dimension.

Optional bridge tables and multi-level dimensions that share common surrogate keys can be
used to efficiently handle barely multi-valued dimensions. Queries that do not need to deal with
a multi-valued dimension level and its weighting factor can attach the multi-level dimension di-
rectly to the facts to rollup to single-valued hierarchy levels.

Pivoted dimensions (PD) are built by transposing row values into column flags or column
counts. They are used to simplify combination constraints that would otherwise be difficult to
place across multiple-rows. Pivoted dimensions are often implemented as swappable versions of
multi-valued bridge tables. For query flexibility it is useful to have both the row-oriented bridge
table for grouping and the column-oriented pivoted dimension for combination filtering. It is
also easier to build a pivoted dimension once the bridge table is in place.

Degenerate how dimension (DD) transaction IDs ensure that facts are traceable back to source
systems. They also provide unique event counts for use in multi-valued queries.

Physical how dimensions are typically non-conformed dimensions that are specific to a single
fact table. These miscellaneous dimensions provide a home for the unique combinations of de-
generate dimensions that are too numerous to leave in the fact table. They reduce the size of fact
tables and make it easier for users to browse the dimensional values combinations.

Range Band dimensions support the ad-hoc conversion of continuously variable facts and di-
mensional attributes into report-friendly discrete bands for grouping and filtering.

Step dimensions allow facts to be analyzed using their relative position within event sequences.
They enable BI users to discover events that closely follow or precede other significant cause
and effect events. They help the data warehouse to tell better stories.

Audit dimensions make ETL data lineage and data quality metadata available within star sche-
mas so that it can easily be used with BI reports.
I keep six honest serving-men
(They taught me all I knew);
Their names are What and Why and When
And How and Where and Who.
I send them over land and sea,
I send them east and west;
But after they have worked for me,
I give them all a rest.

—Rudyard Kipling, The Elephant’s Child

Time for a DW/BI Retrospective


APPENDIX A: THE AGILE MANIFESTO
Manifesto for Agile Software Development
We are uncovering better ways of developing software by doing it and helping others do it.
Through this work we have come to value:
Individuals and interactions over processes and tools

Working software over comprehensive documentation

Customer collaboration over contract negotiation

Responding to change over following a plan

That is, while there is value in the items on the right, we value the items on the left more.

The Twelve Principles of Agile Software


We follow these principles:

Our highest priority is to satisfy the customer through early and continuous delivery of valuable software.

Welcome changing requirements, even late in development. Agile processes harness change for the customer's competitive advantage.

Deliver working software frequently, from a couple of weeks to a couple of months, with a preference to the shorter timescale.

Business people and developers must work together daily throughout the project.

Build projects around motivated individuals. Give them the environment and support they need, and trust them to get the job done.

The most efficient and effective method of conveying information to and within a development team is face-to-face conversation.

Working software is the primary measure of progress.

Agile processes promote sustainable development. The sponsors, developers, and users should be able to maintain a constant pace indefinitely.

Continuous attention to technical excellence and good design enhances agility.

Simplicity—the art of maximizing the amount of work not done—is essential.

The best architectures, requirements, and designs emerge from self-organizing teams.

At regular intervals, the team reflects on how to become more effective, then tunes and adjusts its behavior accordingly.
© 2001 Agile Alliance
APPENDIX B: BEAM✲ NOTATION AND SHORT CODES

Table Codes
Event Story and Fact Table Types
[DE] Discrete Event (Chapters 2, 8). Event represents a point in time or short duration transaction that has completed. Implemented as a transaction fact table.

[RE] Recurring Event (Chapters 2, 8). Event represents measurements taken at predictable regular intervals. Implemented as a periodic snapshot fact table.

[EE] Evolving Event (Chapters 2, 4, 8). Event represents a process that takes time to complete. Implemented as an accumulating snapshot fact table.

[TF] Transaction Fact table (Chapters 5, 8). Physical equivalent of a discrete event [DE]. Typically maintained by insert only.

[AS] Accumulating Snapshot (Chapter 8). Physical fact table equivalent of an evolving event [EE]. Maintained by insert and update. Typically contains multiple milestone date/time dimensions with matching duration and state count facts.

[PS] Periodic Snapshot (Chapter 8). Physical fact table equivalent of a recurring event [RE]. Typically contains semi-additive facts.

[AG] Aggregate (Chapter 8). Fact table that summarizes an existing fact table.

[DF] Derived Fact table (Chapter 8). Fact table that is constructed by merging, slicing, or pivoting existing fact tables.

{Source} Data source (Chapter 5). Default source system table or filename.


Dimension Table Types


[CV] Current Value (Chapters 4, 5, 6). Table contains only current value dimensional attributes. Also known as a type 1 slowly changing dimension.

[HV] Historic Value (Chapters 4, 5, 6). Table contains at least one historical value dimensional attribute. Also known as a type 2 slowly changing dimension.

[RP] Role-Playing (Chapter 4). Dimension is used to play multiple roles; for example, Salesperson [RP] and Manager [RP] are both roles of the Employee dimension. Calendar [RP] is the most common role-playing dimension.

[RU] Rollup (Chapter 4). Dimension is derived from a more granular dimension. For example, Month [RU] is a rollup of the Calendar dimension containing the conformed dimensional attributes Month, Quarter, and Year.

[SD] Swappable Dimension (Chapter 6). Part of a set of dimensions with a common surrogate key that can be used in place of each other. Swappable dimensions are often used to provide subsets of a large dimension population for efficiency; for example, Business Customer is a swappable subset of Customer. Swappable dimensions can also be used to provide alternative historical views and national language support.

[ML] Multi-Level (Chapter 6). A dimension containing additional members representing higher levels in the dimension's hierarchy. Used when a fact table can be attached to a dimension at different levels. For example, sales transactions can be assigned to an individual Employee or a Team/Branch, and web advertisements can be for a specific product or a product category.

[HM] Hierarchy Map (Chapter 6). A table used to resolve a recursive relationship. Represents a variable-depth hierarchy. For example, Company Structure [CV,HM] is a current value hierarchy map (does not track hierarchy history).

[MV] Multi-Valued (Chapters 6, 9). A bridge table used to resolve a many-to-many relationship between a fact table and a multi-valued dimension. Or a hierarchy map [HM] that contains child members with multiple direct parents; for example, Reporting Structure [MV,HM] is a hierarchy map that connects employees to more than one direct manager. MV tables often contain a weighting factor that allows facts to be allocated across the multiple values at query time.

[PD] Pivoted Dimension (Chapter 9). A dimension that represents multiple row values as a set of column (bit) flags, used to simplify combination selection. Often built by pivoting a multi-valued bridge table or a fact table.

{Source} Data source (Chapter 5). Default source system table or filename.

Column Codes
General Column Codes
MD Mandatory (Chapter 2). Column value should be present under normal conditions. Column is defined as nullable so it can handle errors.

NN Not Null (Chapter 5). Column does not allow nulls. All SK and FK columns are not null by default.

ND, NDn No Duplicates (Chapter 9). Column must not contain duplicate values. The numbered NDn version is used to define a combination of columns that must be unique.

Xn Exclusive (Chapters 3, 8). A dimensional attribute that is not valid for all members of a dimension. Used in conjunction with a DC defining characteristic. Number coded to identify mutually exclusive attributes or attribute groups and to identify the defining characteristic each is paired with. Also used to denote exclusive facts that are only valid for certain dimensional values.

DC, DCn,n Defining Characteristic (Chapters 3, 8). Column value dictates which exclusive attributes or facts are valid. For example, Product Type DC defines which Product attributes are valid. Number coded when multiple defining characteristics exist in the same table.

[W-type], [dimension] Dimension type or name (Chapters 4, 6). The W (who, what, when, where, why, how) type of an event detail, or the dimension name when a detail is a role; for example, Salesperson [Employee] where Salesperson is a role of the Employee dimension. Also used to show a recursive relationship within a detail table.

{Source} Data source (Chapter 5). The name of a column or field in a source system. Can be qualified with a table or filename if necessary (when different from the table default).

Unavailable Unavailable or incorrect (Chapter 5). Column name or column code annotation, shown struck through, denoting that source data is unavailable or does not comply with the current column type definition. For example, a struck-through MD denotes that the source system does not treat the data as mandatory as it contains null or missing values; a struck-through Gender denotes that Gender is not available.

Data Types
C, Cn Character data type (Chapter 5). The numbered version is used to define the maximum length. Overrides the default length.

N, Nn.n Numeric data type (Chapter 5). The numbered version is used to define precision. Overrides the default precision.

DT, DTn Date/Time data type (Chapters 4, 5, 8). The numbered version is used in duration formulas for derived facts; for example, Delivery Delay DF=DT2-DT1. Numbering can also denote the default chronological order of milestones within an evolving event.

D, Dn Date data type (Chapters 5, 8). The numbered version is used in duration formulas for derived facts; for example, Project Duration DF=D2-D1. Numbering can also denote the chronological order of milestones within an evolving event.

T, Tn Text (Chapter 5). Long character data used to hold free format text. The numbered version is used to define the maximum length. Overrides the default length.

B Blob (Chapter 5). Binary large object used to hold documents, images, sound, objects, and so on.

Key Types
PK Primary Key (Chapter 5). Column or group of columns that uniquely identify each row in a table.

SK Surrogate Key (Chapter 5). Integer assigned by the data warehouse as the primary key for a dimension table. Used as a foreign key in fact tables. Also used to denote that example data in a BEAM✲ table column will be replaced by an integer foreign key in the physical model.

BK Business Key (Chapters 3, 5). A source system key.

NK Natural Key (Chapter 5). A (source system) key used in the real world.

FK Foreign Key (Chapter 5). A column that references the primary key of another table.

RK Recursive Key (Chapter 6). A foreign key that references the primary key of its own table. Often used to represent variable-depth hierarchies. Stores information needed to build hierarchy maps; for example, Parent Company Key in Company.

Dimensional Attribute Types


CV, CVn Current Value attribute (Chapters 3, 6). A dimensional attribute that holds the current value only. Source system updates overwrite the previous value. Supports current value (as is) reporting. Also known as a type 1 slowly changing dimensional attribute. The numbered version relates a CV attribute to a previous value (PV) version of itself; for example, Territory CV1 and Previous Territory PV1.

HV, HVn Historic Value attribute (Chapters 3, 6). A dimensional attribute that tracks historical values. Source system updates cause a new version of the dimensional record to be created, preserving the historically correct values. Supports historical value (as was) reporting. Also known as a type 2 slowly changing dimensional attribute. The numbered version is used in combination with CV to define conditional HV attributes. These are CV attributes that act as HV attributes only when another HV attribute with the same number changes; for example, Street CV, HV1 will only track changes when Zip Code HV1 changes at the same time.

FV Fixed Value attribute (Chapter 3). A dimensional attribute that should not change; for example, Date of Birth. FV attributes can, however, be corrected. When FV attributes are corrected they behave like CV attributes: the previous incorrect value is not preserved.

PVn Previous Value attribute (Chapter 6). A dimensional attribute that records the previous value of another current value attribute. Also known as a type 3 slowly changing dimensional attribute. PVn is always used in conjunction with a matching CVn to relate the previous value to the current value; for example, Previous Territory PV1 and Territory CV1. PV attributes can also be used to hold initial or "as at specific date" values; for example, Initial Territory PV1 or YE2010 Territory PV1.

Event Detail and Fact Column Types


MV Multi-Value (Chapters 6, 9). Event detail contains multiple values that must be resolved using a bridge table when converted to a dimensional model. Also marks a fact table FK that points to a multi-value bridge table.

ML Multi-Level (Chapter 6). Event detail represents various levels in a hierarchy, such as individual employees or teams/branches, that must be handled by a multi-level dimension containing additional members representing the required levels. Also marks a fact table FK that points to a multi-level dimension and makes use of the additional levels.

DD Degenerate Dimension (Chapters 2, 3, 4, 5). Dimensional attribute stored in a fact table. Has no additional descriptive attributes and therefore does not join to a physical dimension table. Typically used for transaction IDs (how details); for example, Order ID DD.

GD, GDn Granular Dimension (Chapters 2, 8). A dimension or combination of dimensions that defines the granularity of a fact table. The numbered version is used when alternative dimension combinations can define the granularity. For example, Call Reference Number GD1 or Customer GD2, Call Time GD2 define the granularity of a call detail fact table.

FA (Fully) Additive fact (Chapters 5, 8). A fact that produces a correct total when summed across any combination of its dimensions. For a fact to be additive it must be expressed in a single unit of measure. Percentages and unit prices are not additive.

SA, SAn Semi-Additive fact (Chapter 8). A fact that can be correctly totaled by some dimensions but not others. Semi-additive facts have at least one non-additive (NA) dimension. For example, an account balance cannot be summed over time: its non-additive (NA) dimension. Semi-additive facts are often averaged over their non-additive dimension. SA is always used in conjunction with at least one NA dimension foreign key to relate the semi-additive fact to its non-additive dimensions. The numbered version is used to relate multiple semi-additive facts in the same table to their appropriate NA dimensions. For example, Stock Level SA1 is non-additive across Stock Date Key NA1 whereas Order Count SA2 is non-additive across Product Key NA2.

NA, NAn Non-Additive fact (Chapter 8). A fact that cannot be aggregated using sum; for example, Temperature NA. Non-additive facts can be aggregated using other functions such as min, max, and average. Also marks a non-additive dimension of a semi-additive fact; the numbered version is used to relate non-additive dimensions to specific semi-additive facts when multiple SAs exist in the same table.

DF, DF=formula Derived Fact (Chapter 8). A fact that can be derived from other columns within the same table. May be followed by a simple formula referencing other facts or date/time details by number; for example, Unit Price DF=Revenue/Quantity.

[UoM], [UoM1, UoM2, ...] Unit of measure (Chapters 2, 4). Unit of measure symbol or description; for example, Order Revenue [$] or Delivery Delay [days]. The list form gives multiple units of measure required for reporting, with the default standard unit (UoM1) first. All quantities are stored in the standard UoM to produce an additive fact.
APPENDIX C: RESOURCES FOR AGILE DIMENSIONAL MODELERS
Here is our list of recommended resources to help you implement the ideas contained in the book.

Tools: Hardware and Software


Use Inclusive Tools. Take Pictures
We’re very interested in the use of tablet devices for collaborative data modeling, but until they support
seamless shared drawing and become ubiquitous, to the point where everyone is comfortable using them
to scribble all over your nascent designs, we recommend that you use low-tech whiteboards, flipcharts,
large Post-It notes, or whiteboard-on-a-roll for your modelstorming sessions.

The Wi-Fi–enabled cameras in smartphones and tablets can be tremendously useful for capturing
modelstorming results and quickly transferring them to a laptop for further review (stick to black ink on
your whiteboards and flipcharts to help with that). There are many apps that can automate the workflow
of cleaning up whiteboard images and moving them to shared folders for group viewing.

Go Large
Digital projectors are our number one high-tech collaborative modeling tools. It’s amazing how quickly
everyone can spot opportunities to improve a data model when it’s blown up large on the wall. Invite
your colleagues to the screening of your latest data model. Perhaps they can stay for a movie afterwards!

Avoid Database Modeling Tools When You’re Not Talking To Databases


We recommend that you use spreadsheets and presentation software for communicating with business
stakeholders, and ERD data modeling tools for communicating with databases and DBAs. ERD
modeling software is invaluable for forward and reverse-engineering physical database tables and
drawing detailed star schema diagrams for a technical audience, but can get in the way when
working with business people.

Try Our Template


We suggest that you try the BEAM✲ Modelstormer spreadsheet template (see Websites shortly). It
supports the transition between BEAM✲ models and physical database models by generating SQL DDL
that can be imported by many commercial database modeling tools.

Playing Planning Poker


There are a number of smartphone apps that simulate a deck of planning poker cards.


Books
Agile Software Development
Scrum and XP from the Trenches, Henrik Kniberg (InfoQ.com, 2007)
Not why do agile (like so many books) but how Henrik did agile.

Agile Analytics, Ken Collier (Addison-Wesley, 2011)


While our book concentrates on agile DW/BI data modeling, Ken’s book is a guide to being agile in many
of the other aspects of DW/BI projects.

Visual Thinking, Collaboration and Facilitation, Business Modeling


The Back of the Napkin: Solving Problems and Selling Ideas With Pictures, Dan Roam (Portfolio, 2008)
Blah Blah Blah: What To Do When Words Don't Work, Dan Roam (Portfolio, 2011)
Dan’s books contain great ideas that will inspire you to draw simple pictures of the 7Ws to help you
discover, understand and present your dimension data stories and BI designs.

Gamestorming: A Playbook for Innovators, Rulebreakers, and Changemakers, Dave Gray, Sunni
Brown, James Macanufo (O’Reilly Media, 2010)
Visual Meetings: How Graphics, Sticky Notes and Idea Mapping Can Transform Group Productivity,
David Sibbet (Wiley, 2010)
Books to help you facilitate modelstorming sessions and improve upon the collaborative techniques
we have introduced.

Business Model Generation: A Handbook for Visionaries, Game Changers, and Challengers,
Alexander Osterwalder, Yves Pigneur et al. (Wiley, 2010)
Check out the Business Model Canvas for more high-level collaborative modeling ideas.

Dimensional Modeling
Star Schema: The Complete Reference, Christopher Adamson (McGraw-Hill, 2010)

Dimensional Modeling Case Studies


Data Warehouse Design Solutions, Christopher Adamson, Michael Venerable (Wiley, 1998)
The Data Warehouse Toolkit, Second Edition, Ralph Kimball, Margy Ross (Wiley, 2002)

ETL
The Data Warehouse ETL Toolkit, Ralph Kimball, Joe Caserta (Wiley, 2004)
Mastering Data Warehouse Aggregates, Christopher Adamson (Wiley, 2006) Chapters 5 and 6 also
provide excellent coverage of dimensional ETL.

Database–Specific DW/BI Advice


The Microsoft Data Warehouse Toolkit, Joy Mundy, Warren Thornthwaite (Wiley, 2011)

Websites
decisionone.co.uk : DecisionOne Consulting, Lawrence Corr’s training and consulting firm.

llumino.com : Llumino, Jim Stagnitto’s consulting firm.

modelstorming.com : The companion website to this book where you can download the BEAM✲
Modelstormer spreadsheet, the BI Model Canvas (inspired by the Business Model Canvas) plus other
useful BEAM✲ tools and example models from the book and beyond. It also contains links to our rec-
ommended books, articles, websites, and training courses.
