Agile Data Warehouse Design
Collaborative Dimensional Modeling, from Whiteboard to Star Schema
Lawrence Corr
with Jim Stagnitto
Agile Data Warehouse Design is a step-by-step guide for capturing data warehousing/business intelligence (DW/BI) requirements and turning them into high performance dimensional models in the most direct way: by modelstorming (data modeling + brainstorming) with BI stakeholders.

This book describes BEAM✲, an agile approach to dimensional modeling, for improving communication between data warehouse designers, BI stakeholders and the whole DW/BI development team. BEAM✲ provides tools and techniques that will encourage DW/BI designers and developers to move away from their keyboards and entity relationship based tools and model interactively with their colleagues. The result is everyone thinks dimensionally from the outset! Developers understand how to efficiently implement dimensional modeling solutions. Business stakeholders feel ownership of the data warehouse they have created, and can already imagine how they will use it to answer their business questions.
Within this book, you will learn:
✲ Agile dimensional modeling using Business Event Analysis & Modeling (BEAM✲)
✲ Modelstorming: data modeling that is quicker, more inclusive, more productive, and frankly more fun!
✲ Telling dimensional data stories using the 7Ws (who, what, when, where, how many, why and how)
✲ Modeling by example not abstraction; using data story themes, not crow’s feet, to describe detail
✲ Storyboarding the data warehouse to discover conformed dimensions and plan iterative development
✲ Visual modeling: sketching timelines, charts and grids to model complex process measurement – simply
✲ Agile design documentation: enhancing star schemas with BEAM✲ dimensional shorthand notation
✲ Solving difficult DW/BI performance and usability problems with proven dimensional design patterns
Lawrence Corr is a data warehouse designer and educator. As Principal of DecisionOne Consulting, he helps organizations to review and simplify their data warehouse designs, and advises vendors on visual data modeling techniques. He regularly teaches agile dimensional modeling courses worldwide and has taught dimensional DW/BI skills to thousands of business/IT professionals.
Jim Stagnitto is a data warehouse and master data management architect specializing in the healthcare, financial services, and information service industries. He is the founder of the data warehousing and data mining consulting firm Llumino.
decisionone.co.uk
BEAM✲
modelstorming.com
Agile Data Warehouse Design
Collaborative Dimensional Modeling,
from Whiteboard to Star Schema
Lawrence Corr
with Jim Stagnitto
Agile Data Warehouse Design
by Lawrence Corr with Jim Stagnitto
No part of this book may be reproduced in any form or by any electronic or mechanical means including information presentation, storage and retrieval systems, without permission in writing from the copyright holder. The only exception is by a reviewer, who may quote short excerpts in a review.
This eBook is free of copy protection or functionality restrictions. You may view or print it for
personal use as you see fit. You may make copies for your own personal use (e.g., one for use while
traveling and one on a home computer or backup device) but you may not give the eBook file to
other people. The file is personalized with an email address on the cover and other identifying
information and belongs to that email user. Ownership cannot be transferred or sold. You may print the eBook but
it has been formatted specifically for on-screen viewing with no blank pages so margins and facing pages will not
print correctly. Generally it is cheaper and more efficient to order a paperback copy than print out the entire book.
Proofing: Laurence Hesketh, Geoff Hurrell
Illustrators: Gill Guile and Lawrence Corr
Cover Design: After Escher
Printing History:
November 2011: First Edition. January 2012: Revision. October 2012: Revision. May 2013: Revision.
Non-Printing History:
May 2013: eBook First Edition.
Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks.
Where those designations appear in this book, and DecisionOne Press was aware of a trademark claim, the
designations have been printed in caps or initial caps.
Limit of Liability/Disclaimer of Warranty: The publisher and the author make no representations or warranties with
respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties,
including without limitation warranties of fitness for a particular purpose. No warranty may be created or extended
by sales or promotional materials. The advice and strategies contained herein may not be suitable for every situation.
Neither the publisher nor the author shall be liable for damages arising herefrom. The fact that an organization or
Website is referred to in this work as a citation and/or a potential source of further information does not mean that
the author or the publisher endorses the information the organization or Website may provide or recommendations
it may make. Further, readers should be aware that Internet Websites listed in this work may have changed or
disappeared between when this work was written and when it is read.
Lawrence has developed and reviewed data warehouses for healthcare, telecommunications, engineering,
broadcasting, financial services and public service organizations. He held the position of data warehouse
practice leader at Linc Systems Corporation, CT, USA and vice-president of data warehousing products
at Teleran Technologies, NJ, USA. Lawrence was also a Ralph Kimball Associate and has taught data
warehousing classes for Kimball University in Europe and South Africa and contributed to Kimball
articles and design tips. He lives in Yorkshire, England with his wife Mary and daughter Aimee. Lawrence can be contacted at:
Jim Stagnitto is a data warehouse and master data management architect specializing in the healthcare,
financial services and information service industries. He is the founder of the data warehousing and data
mining consulting firm Llumino.
Jim has been a guest contributor for Ralph Kimball’s Intelligent Enterprise column, and a contributing
author to Ralph Kimball & Joe Caserta’s The Data Warehouse ETL Toolkit. He lives in Bucks County,
PA, USA with his wife Lori, and their happy brood of pets. Jim can be contacted at:
[email protected]
ACKNOWLEDGEMENTS
We would like to express our gratitude to everyone who made this book about BEAM✲ possible
(using BEAM✲ notation):
CONTENTS
INTRODUCTION ................................................................................................................................. XVII
PART I: MODELSTORMING ................................................................................................................... 1
CHAPTER 1
HOW TO MODEL A DATA WAREHOUSE.............................................................................................. 3
OLTP VS. DW/BI: TWO DIFFERENT WORLDS ......................................................................................... 4
The Case Against Entity-Relationship Modeling .......................................................................... 5
Advantages of ER Modeling for OLTP ...................................................................................... 6
Disadvantages of ER Modeling for Data Warehousing............................................................. 6
The Case For Dimensional Modeling........................................................................................... 7
Star Schemas............................................................................................................................ 8
Fact and Dimension Tables ...................................................................................................... 8
Advantages of Dimensional Modeling for Data Warehousing................................................... 9
DATA WAREHOUSE ANALYSIS AND DESIGN .......................................................................................... 11
Data-Driven Analysis.................................................................................................................. 11
Reporting-Driven Analysis.......................................................................................................... 12
Proactive DW/BI Analysis and Design ....................................................................................... 13
Benefits of Proactive Design for Data Warehousing ............................................................... 14
Challenges of Proactive Analysis for Data Warehousing........................................................ 15
Proactive Reporting-Driven Analysis Challenges.................................................................... 15
Proactive Data-Driven Analysis Challenges............................................................................ 15
Data then Requirements: a ‘Chicken or the Egg’ Conundrum................................................. 15
Agile Data Warehouse Design ................................................................................................... 16
Agile Data Modeling ................................................................................................................... 17
Agile Dimensional Modeling....................................................................................................... 18
Agile Dimensional Modeling and Traditional DW/BI Analysis ................................................. 19
Agile Data-Driven Analysis...................................................................................................... 19
Agile Reporting-Driven Analysis.............................................................................................. 19
Requirements for Agile Dimensional Modeling ....................................................................... 19
BEAM✲ ............................................................................................................................................ 21
Data Stories and the 7Ws Framework ....................................................................................... 21
Diagrams and Notation .............................................................................................................. 21
BEAM✲ (Example Data) Tables ............................................................................................. 21
BEAM✲ Short Codes.............................................................................................................. 22
Comparing BEAM✲ and Entity-Relationship Diagrams.......................................................... 22
Data Model Types ................................................................................................................... 23
BEAM✲ Diagram Types ......................................................................................................... 24
SUMMARY .......................................................................................................................................... 26
CHAPTER 2
MODELING BUSINESS EVENTS.......................................................................................................... 27
DATA STORIES ................................................................................................................................... 28
Story Types ................................................................................................................................ 28
Discrete Events ....................................................................................................................... 29
Evolving Events....................................................................................................................... 29
CHAPTER 3
MODELING BUSINESS DIMENSIONS ................................................................................................. 59
DIMENSIONS ....................................................................................................................................... 60
Dimension Stories ...................................................................................................................... 60
Discovering Dimensions............................................................................................................. 61
DOCUMENTING DIMENSIONS ................................................................................................................ 62
Dimension Subject ..................................................................................................................... 62
Dimension Granularity and Business Keys ................................................................................ 63
DIMENSIONAL ATTRIBUTES .................................................................................................................. 64
Attribute Scope........................................................................................................................... 64
CHAPTER 4
MODELING BUSINESS PROCESSES.................................................................................................. 95
MODELING MULTIPLE EVENTS WITH AGILITY ........................................................................................ 96
Conformed Dimensions.............................................................................................................. 97
The Data Warehouse Bus........................................................................................................ 100
The Event Matrix ...................................................................................................................... 102
Event Sequences ..................................................................................................................... 104
Time/Value Sequence........................................................................................................... 104
Process Sequence ................................................................................................................ 104
Modeling Process Sequences as Evolving Events ............................................................... 105
Using Process Sequences to Enrich Events......................................................................... 105
MODELSTORMING WITH AN EVENT MATRIX ......................................................................................... 105
CHAPTER 5
MODELING STAR SCHEMAS............................................................................................................. 129
AGILE DATA PROFILING ..................................................................................................................... 130
Identifying Candidate Data Sources......................................................................................... 131
Data Profiling Techniques ........................................................................................................ 132
Missing Values ...................................................................................................................... 132
Unique Values and Frequency.............................................................................................. 133
Data Ranges and Lengths .................................................................................................... 133
Automating Your Own Data Profiling Checks ....................................................................... 134
No Source Yet: Proactive DW/BI Design ................................................................................. 134
Annotating the Model with Data Profiling Results .................................................................... 135
Data Sources and Data Types .............................................................................................. 135
Additional Data...................................................................................................................... 137
Unavailable Data................................................................................................................... 137
Nulls and Mismatched Attribute Descriptions........................................................................ 137
MODEL REVIEW AND SPRINT PLANNING ............................................................................................. 138
Team Estimating ...................................................................................................................... 138
Running a Model Review ......................................................................................................... 139
Sprint Planning......................................................................................................................... 140
STAR SCHEMA DESIGN ..................................................................................................................... 141
Adding Keys to a Dimensional Model ...................................................................................... 141
Choosing Primary Keys: Business Keys vs. Surrogate Keys................................................ 141
Benefits of Data Warehouse Surrogate Keys ....................................................................... 142
Insulate the Data Warehouse from Business Key Change ................................................... 143
Cope with Multiple Business Keys for a Dimension .............................................................. 143
Track Dimensional History Efficiently.................................................................................... 143
Handle Missing Dimensional Values..................................................................................... 143
Support Multi-Level Dimensions ........................................................................................... 144
Protect Sensitive Information ................................................................................................ 144
Reduce Fact Table Size........................................................................................................ 144
Improve Query Performance................................................................................................. 145
Enforce Referential Integrity Efficiently ................................................................................. 145
Slowly Changing Dimensions................................................................................................... 146
Overwriting History: Type 1 Slowly Changing Dimensions ................................................... 147
CHAPTER 7
WHEN AND WHERE: DESIGN PATTERNS FOR TIME AND LOCATION ........................................ 203
TIME DIMENSIONS............................................................................................................................. 204
Calendar Dimensions............................................................................................................... 205
Date Keys.............................................................................................................................. 206
ISO Date Keys ...................................................................................................................... 207
Epoch-Based Date Keys ....................................................................................................... 207
Populating the Calendar........................................................................................................ 208
BI Tools and Calendar Dimensions....................................................................................... 208
Period Calendars ..................................................................................................................... 209
Month Dimensions ................................................................................................................ 209
Offset Calendars ................................................................................................................... 210
Year-to-Date Comparisons ...................................................................................................... 210
Fact-Specific Calendar Pattern ............................................................................................. 212
Using Fact State Information in Report Footers.................................................................... 213
Conformed Date Ranges ...................................................................................................... 214
CLOCK DIMENSIONS ......................................................................................................................... 214
Day Clock Pattern – Date and Time Relationships................................................................... 215
Time Keys ................................................................................................................................ 216
INTERNATIONAL TIME ........................................................................................................................ 217
Multinational Calendar Pattern................................................................................................. 218
Date Version Keys ................................................................................................................... 220
INTERNATIONAL TRAVEL .................................................................................................................... 221
Time Dimensions or Time Facts? ............................................................................................ 224
NATIONAL LANGUAGE DIMENSIONS .................................................................................................... 225
National Language Calendars.................................................................................................. 225
Swappable National Language Dimensions Pattern................................................................ 225
SUMMARY ........................................................................................................................................ 226
CHAPTER 8
HOW MANY: DESIGN PATTERNS FOR HIGH PERFORMANCE FACT TABLES AND FLEXIBLE MEASURES ................................................................................................................. 227
FACT TABLE TYPES .......................................................................................................................... 228
Transaction Fact Table ............................................................................................................ 228
CHAPTER 9
WHY AND HOW: DESIGN PATTERNS FOR CAUSE AND EFFECT ................................................ 261
WHY DIMENSIONS............................................................................................................................. 262
Internal Why Dimensions ......................................................................................................... 262
Unstructured Why Dimensions................................................................................................. 263
External Why Dimensions ........................................................................................................ 264
MULTI-VALUED DIMENSIONS ............................................................................................................. 265
Weighting Factor Pattern ......................................................................................................... 265
Modeling Multi-Valued Groups................................................................................................. 267
Multi-Valued Bridge Pattern ..................................................................................................... 268
Optional Bridge Pattern............................................................................................................ 270
Pivoted Dimension Pattern....................................................................................................... 273
HOW DIMENSIONS ............................................................................................................................ 276
Too Many Degenerate Dimensions?........................................................................................ 277
Creating How Dimensions........................................................................................................ 277
Range Band Dimension Pattern............................................................................................... 278
Agile techniques can help, but they must address data warehouse design, not just BI application development.

Agile, with its mantra of creating business value through the early and frequent delivery of working software and responding to change, has had just such a revolutionary effect on the world of application development. Can it take on the challenges of DW/BI? Agile’s emphasis on collaboration and incremental development coupled with techniques such as Scrum and User Stories, will certainly improve BI application development—once a data warehouse is in place. But to truly have an impact on DW/BI, agile must also address data warehouse design itself. Unfortunately, the agile approaches that have emerged, so far, are vague and non-prescriptive in this one key area. For agile BI to be more than a marketing reboot of business-as-usual business intelligence, it must be agile DW/BI and we, DW/BI professionals, must do what every true agilist would recommend: adapt agile to meet our needs while still upholding its values and principles (see Appendix A: The Agile Manifesto). At the same time, agilists coming afresh to DW/BI, for their part, must learn our hard-won data lessons.
This book is about BEAM✲: an agile approach to dimensional modeling.

With that aim in mind, this book introduces BEAM✲ (Business Event Analysis & Modeling): a set of collaborative techniques for modelstorming BI data requirements and translating them into dimensional models on an agile timescale. We call the BEAM✲ approach “modelstorming” because it combines data modeling and brainstorming techniques for rapidly creating inclusive, understandable models that fully engage BI stakeholders.
BEAM✲ is used for modelstorming BI requirements directly with BI stakeholders.

BEAM✲ modelers achieve this by asking stakeholders to tell data stories, using the 7W dimensional types—who, what, when, where, how many, why, and how—to describe the business events they need to measure. BEAM✲ models support modelstorming by differing radically from conventional entity-relationship (ER) based models. BEAM✲ uses tabular notation and example data stories to define business events in a format that is instantly recognizable to spreadsheet-literate BI stakeholders, yet easily translated into atomic-detailed star schemas. By doing so, BEAM✲ bridges the business-IT gap, creates consensus on data definitions and generates a sense of business ownership and pride in the resulting database design.
It is aimed at both new and experienced DW/BI practitioners. It’s a quick-study guide to dimensional modeling and a source of new dimensional design patterns.

For those new to data warehousing, this book provides a quick-study introduction to dimensional modeling techniques. For those of you who would like more background on the techniques covered, the later chapters and Appendix C provide references to case studies in other texts that will help you gain additional business insight. Experienced data warehousing professionals will find that this book offers a fresh perspective on familiar dimensional modeling patterns, covering many in more detail than previously available, and adding several new ones. For all readers, this book offers a radically new agile way of engaging with business users and kick-starting their next warehouse development project.
The bright modeler, not surprisingly, has some bright ideas. His tips, techniques and
practical modeling advice, distilled from the current topic, will help you improve
your design.
The experienced dimensional modeler has seen it all before. He’s here to warn you
when an activity or decision can steal your time, sanity or agility. Later in the book
he follows the pattern users (see below) to tell you about the consequences or side
effects of using their recommended design patterns. He would still recommend
you use their patterns though—just with a little care.
The note takers are the members of the team who always read the full instructions
before they use that new gadget or technique. They’re always here to tell you to
“make a note of that” when there is extra information on the current topic.
The agilists will let you know when we're being particularly agile. They wave their
banner whenever a design technique supports a core value of the agile manifesto or
principle of agile software development. These are listed in Appendix A.
The scribe appears whenever we introduce new BEAM✲ diagrams, notation con-
ventions or short codes for rapidly documenting your designs. All the scribe’s short
codes are listed in Appendix B.
The agile modeler engages with stakeholders and facilitates modelstorming. She is
here to ask example BEAM✲ questions, using the 7Ws, to get stakeholders to tell
their data stories.
The stakeholders are the subject matter experts, operational IT staff, BI users and BI
consumers, who know the data sources, or know the data they want—anyone who
can help define the data warehouse who is not a member of the DW/BI develop-
ment team. They are here to provide example answers to the agile modeler’s ques-
tions, tell data stories and pose their own tricky BI questions.
The bookworm points you to further reading on the current topic. All her reading
recommendations are gathered in Appendix C.
The agile developer appears when we have some practical advice about using soft-
ware tools or there is something useful you can download.
The pattern users have a solution to the head scratcher’s problems. They’re going to
use tried and tested dimensional modeling design patterns, some new in print.
Part I: Modelstorming
Collaborative modeling with BI stakeholders.

Part I describes how to modelstorm BI stakeholders’ data requirements, validate these requirements using agile data profiling, review and prioritize them with stakeholders, estimate their ETL tasks as a team, and convert them into star schemas. It illustrates how agile data modeling can be used to replace traditional BI requirements gathering with accelerated database design, followed by BI prototyping to capture the real reporting and analysis requirements. Chapter 1 provides an introduction to dimensional modeling. Chapters 2 to 4 provide a step-by-step guide for using BEAM✲ to model business events and dimensions. Chapter 5 describes how BEAM✲ models are validated and translated into physical dimensional models and development sprint plans.
cause (why), and effect (how), we document new and established dimensional
techniques from a dimensional perspective for the first time.
Cross-process analysis: Combining the results from multiple fact tables using
drill-across processing and multi-pass queries. Building derived fact tables and
consolidated data marts to simplify query processing.
Companion Website
Visit modelstorming.com to download the BEAM✲Modelstormer spreadsheet
and other templates that accompany this book. On the site you will find example
models and code listings together with links to articles, books, and the worldwide
schedule of training courses and workshops on BEAM✲ and agile data warehouse
design. Register your paperback copy online to receive a discounted eBook version.
PART I: MODELSTORMING
AGILE DIMENSIONAL MODELING, FROM WHITEBOARD TO STAR SCHEMA
Dimensional modeling supports data warehouse design.

In this first chapter we set out the motivation for adopting an agile approach to data warehouse design. We start by summarizing the fundamental differences between data warehouses and online transaction processing (OLTP) databases to show why they need to be designed using very different data modeling techniques. We then contrast entity-relationship and dimensional modeling and explain why dimensional models are optimal for data warehousing/business intelligence (DW/BI). While doing so we also describe how dimensional modeling enables incremental design and delivery: key principles of agile software development.
Collaborative dimensional modeling supports agile data warehouse analysis and design.

Readers who are familiar with the benefits of traditional dimensional modeling may wish to skip to Data Warehouse Analysis and Design on Page 11 where we begin the case for agile dimensional modeling. There, we take a step back in the DW/BI development lifecycle and examine the traditional approaches to data requirements analysis, and highlight their shortcomings in dealing with ever more complex data sources and aggressive BI delivery schedules. We then describe how agile data modeling can significantly improve matters by actively involving business stakeholders in the analysis and design process. We finish by introducing BEAM✲ (Business Event Analysis and Modeling): the set of agile techniques for collaborative dimensional modeling described throughout this book.
Figure 1-1: Entity-Relationship diagram (ERD)
Entities become tables, attributes become columns.

Within a relational database, entities are implemented as tables and their attributes as columns. Relationships are implemented either as columns within existing tables or as additional tables depending on their cardinality. One-to-one (1:1) and many-to-one (M:1) relationships are implemented as columns, whereas many-to-many (M:M) relationships are implemented using additional tables, creating additional M:1 relationships.
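As a minimal SQL sketch of these mappings (the table and column names are illustrative, not taken from the book), a M:1 relationship becomes a foreign key column on the ‘many’ side, while a M:M relationship becomes an additional table:

```sql
-- M:1: many orders belong to one customer, so the relationship
-- is implemented as a foreign key column on the 'many' side
CREATE TABLE customer (
    customer_id   INTEGER PRIMARY KEY,
    customer_name VARCHAR(100)
);

CREATE TABLE product (
    product_id   INTEGER PRIMARY KEY,
    product_name VARCHAR(100)
);

CREATE TABLE orders (
    order_id      INTEGER PRIMARY KEY,
    shipping_cost DECIMAL(10,2),                             -- order-level measure (used in a later example)
    customer_id   INTEGER REFERENCES customer(customer_id)  -- the M:1 relationship as a column
);

-- M:M: an order holds many products and a product appears on many orders,
-- so an additional table resolves the relationship into two M:1 relationships
CREATE TABLE order_line (
    order_id   INTEGER REFERENCES orders(order_id),
    product_id INTEGER REFERENCES product(product_id),
    quantity   INTEGER,
    PRIMARY KEY (order_id, product_id)
);
```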
ER models are typically in third normal form (3NF).

ER modeling is associated with normalization in general, and third normal form (3NF) in particular. ER modeling and normalization have very specific technical goals: to reduce data redundancy and make explicit the 1:1 and M:1 relationships within the data that can be enforced by relational database management systems.
Higher forms of normalization are available, but most ER modelers are satisfied
when their models are in 3NF. There is even a mnemonic to remind everyone that
data in 3NF depends on “The key, the whole key, and nothing but the key, so help
me Codd”—in memory of Edgar (Ted) Codd, inventor of the relational model.
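For example (an illustrative schema, not one from the book), a product table that repeats the category name is not in 3NF, because category_name depends on category_code, a non-key column, rather than on the key alone; normalization moves the dependency to its own table:

```sql
-- Not in 3NF: category_name depends on category_code (a non-key column),
-- violating "nothing but the key"
CREATE TABLE product_unnormalized (
    product_id    INTEGER PRIMARY KEY,
    product_name  VARCHAR(100),
    category_code CHAR(4),
    category_name VARCHAR(50)  -- transitive dependency via category_code
);

-- In 3NF: the transitive dependency becomes a separate table and a M:1 relationship
CREATE TABLE category (
    category_code CHAR(4) PRIMARY KEY,
    category_name VARCHAR(50)
);

CREATE TABLE product_3nf (
    product_id    INTEGER PRIMARY KEY,
    product_name  VARCHAR(100),
    category_code CHAR(4) REFERENCES category(category_code)
);
```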
3NF models are difficult to understand.

More importantly, queries will only produce the right answers if users navigate the right join paths, i.e., ask the right questions in SQL terms. If the wrong joins are used, they unknowingly get answers to some other (potentially meaningless) questions. 3NF models are complex for both people and machines. Specialist hardware (data warehouse appliances) is improving query/join performance all the time, but the human problems are far more difficult to solve. Smart BI software can hide database schema complexity behind a semantic layer, but that merely moves the burden of understanding a 3NF model from BI users at query time to BI developers at configuration time. That’s a good move but it’s not enough. 3NF models remain too complex for business stakeholders to review and quality assure (QA).
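To make the wrong-join-path problem concrete, here is a hedged sketch that reuses the illustrative order tables from earlier: combining an order-level measure with order-line detail (a classic fan trap) silently repeats the measure once per line, so the query answers a different, meaningless question:

```sql
-- Fan trap: shipping_cost is stored once per order, but joining to
-- order_line repeats it once per line, inflating the total
SELECT c.customer_name,
       SUM(o.shipping_cost) AS total_shipping   -- wrong: multiplied by the lines per order
FROM   customer   c
JOIN   orders     o  ON o.customer_id = c.customer_id
JOIN   order_line ol ON ol.order_id   = o.order_id
GROUP  BY c.customer_name;
```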
History further complicates 3NF.

ER models are further complicated by data warehousing requirements to track history in full to support valid ‘like-for-like’ comparisons over time. Providing a true historical perspective of business events requires that many otherwise simple descriptive attributes become time relationships, i.e., existing M:1 relationships become M:M relationships that translate into even more physical tables and complex join paths. Such temporal database designs can defeat even the smartest BI tools and developers.
Large readable ER diagrams are difficult to draw: all those overlapping lines.

Laying out a readable ERD for any non-trivial data model isn’t easy. The mnemonic “dead crows fly east” encourages modelers to keep crows’ feet pointing up or to the left. Theoretically this should keep the high-volume volatile entities (transactions) top left and the low-volume stable entities (lookup tables) bottom right. However, this layout seldom survives as modelers attempt to increase readability by moving closely related or commonly used entities together. The task rapidly descends into an exercise in trying to reduce overlapping lines. Most ERDs are visually overwhelming for BI stakeholders and developers who need simpler, human-scale diagrams to aid their communication and understanding.
Figure 1-2: Multidimensional analysis
Star Schemas

Star schemas are used to visualize dimensional models.

Real-world dimensional models are used to measure far more complex business processes (with more dimensions) in far greater detail than could be attempted using spreadsheets. While it is difficult to envision models with more than three dimensions as multi-dimensional cubes (they wouldn’t actually be cubes), they can easily be represented using star schema diagrams. Figure 1-3 shows a classic star schema for retail sales containing a fourth (causal) dimension: PROMOTION, in addition to the dimensional attributes and facts from the previous cube example.
Figure 1-3: Sales star schema
Star schema is also the term used to describe the physical implementation of a
dimensional model as relational tables.
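As a hedged sketch of such a physical implementation (the column details are illustrative, not the book’s actual schema), the Figure 1-3 sales star might be declared as one fact table with a foreign key to each of its four dimension tables:

```sql
-- Dimension tables: one per way of describing a sale
CREATE TABLE date_dim      (date_key      INTEGER PRIMARY KEY, full_date      DATE,
                            month         VARCHAR(20),         quarter        CHAR(2));
CREATE TABLE product_dim   (product_key   INTEGER PRIMARY KEY, product_name   VARCHAR(100),
                            category      VARCHAR(50));
CREATE TABLE store_dim     (store_key     INTEGER PRIMARY KEY, store_name     VARCHAR(100),
                            country       VARCHAR(50));
CREATE TABLE promotion_dim (promotion_key INTEGER PRIMARY KEY, promotion_name VARCHAR(100));

-- Fact table: one row per sales measurement, all descriptive context via dimension keys
CREATE TABLE sales_fact (
    date_key      INTEGER REFERENCES date_dim(date_key),
    product_key   INTEGER REFERENCES product_dim(product_key),
    store_key     INTEGER REFERENCES store_dim(store_key),
    promotion_key INTEGER REFERENCES promotion_dim(promotion_key),  -- the causal (why) dimension
    quantity_sold INTEGER,          -- additive facts
    sales_amount  DECIMAL(10,2)
);
```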
ER diagrams work best for viewing a small number of tables at one time. How
many tables? About as many as in a dimensional model: a star schema.
The term dimension in this book refers to a dimension table whereas dimensional
attribute refers to a column in a dimension table.
Dimensional hierarchies support drill-down analysis.

Dimensions contain sets of descriptive (dimensional) attributes that are used to filter data and group facts for aggregation. Their role is to provide good report row headers and title/heading/footnote filter descriptions. Dimensional attributes often have a hierarchical relationship that allows BI tools to provide drill-down analysis. For example, drilling down from Quarter to Month, Country to Store, and Category to Product.
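In SQL terms, drilling down is simply adding a grouping column. Using the illustrative sales star sketched above, drilling from Quarter to Month looks like this:

```sql
-- Summary by quarter
SELECT d.quarter, SUM(f.sales_amount) AS total_sales
FROM   sales_fact f
JOIN   date_dim   d ON d.date_key = f.date_key
GROUP  BY d.quarter;

-- Drill down: month becomes an additional report row header
SELECT d.quarter, d.month, SUM(f.sales_amount) AS total_sales
FROM   sales_fact f
JOIN   date_dim   d ON d.date_key = f.date_key
GROUP  BY d.quarter, d.month;
```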
Dimensions are small, fact tables are large.

Not all dimensional attributes are text. Dimensions can contain numbers and dates too, but these are generally used like the textual attributes to filter and group the facts rather than to calculate aggregate measures. Despite their width, dimensions are tiny relative to fact tables. Most dimensions contain considerably fewer than a million rows.
The most useful facts are additive measures that can be aggregated using any
combination of the available dimensions. The most useful dimensions provide
rich sets of descriptive attributes that are familiar to BI users.
Dimensional models are process-oriented. They represent business processes described using the 7Ws framework.

A deeper, less immediately obvious benefit of dimensional models is that they are process-oriented. They are not just the result of some aggressive physical data model optimization (that has denormalized a logical 3NF ER model into a smaller number of tables) to overcome the limitations of databases to cope with join intensive BI queries. Instead, the best dimensional models are the result of asking questions to discover which business processes need to be measured, how they should be described in business terms and how they should be measured. The resulting dimensions and fact tables are not arbitrary collections of denormalized data but the 7Ws that describe the full details of each individual business event worth measuring.
7Ws Framework
Where did it take place?
HoW many or much was recorded – how can it be measured?
Why did it happen?
HoW did it happen – in what manner?
The 7Ws are interrogatives: question forming words.

The 7Ws are an extension of the 5 or 6Ws that are often cited as the checklist in essay writing and investigative journalism for getting the ‘full’ story. Each W is an interrogative: a word or phrase used to make questions. The 7Ws are especially useful for data warehouse data modeling because they focus the design on BI activity: asking questions.
Fact tables represent verbs (they record business process activity). The facts they contain and the dimensions that surround them are nouns, each classifiable as one of the 7Ws. Six of the Ws (who, what, when, where, why, and how) represent dimension types. The seventh W, how many, represents facts. BEAM✲ data stories use the 7Ws to discover these important verb and noun combinations.
Star schemas usually contain 8-20 dimensions.

Detailed dimensional models usually contain more than 6 dimensions because any of the 6Ws can appear multiple times. For example, an order fulfillment process could be modeled with 3 who dimensions: CUSTOMER, EMPLOYEE, and CARRIER, and 2 when dimensions: ORDER DATE and DELIVERY DATE. Having said that, most dimensional models do not have many more than 10 or 12 dimensions. Even the most complex business events rarely have 20 dimensions.
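A common physical treatment of such repeated dimensions (a hedged sketch; order_fulfillment_fact and its columns are hypothetical) is to join a single date dimension table to the fact table under a different alias for each role it plays:

```sql
-- One physical date dimension playing two 'when' roles via table aliases
SELECT od.month              AS order_month,
       dd.month              AS delivery_month,
       SUM(f.order_quantity) AS total_quantity
FROM   order_fulfillment_fact f
JOIN   date_dim od ON od.date_key = f.order_date_key     -- role: ORDER DATE
JOIN   date_dim dd ON dd.date_key = f.delivery_date_key  -- role: DELIVERY DATE
GROUP  BY od.month, dd.month;
```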
Star schemas support agile, incremental BI.

The deep benefit of process-oriented dimensional modeling is that it naturally breaks data warehouse scope, design and development into manageable chunks consisting of just the individual business processes that need to be measured next. Modeling each business process as a separate star schema supports incremental design, development and usage. Agile dimensional modelers and BI stakeholders can concentrate on one business process at a time to fully understand how it should be measured. Agile development teams can build and incrementally deliver individual star schemas earlier than monolithic designs. Agile BI users can gain early value by analyzing these business processes initially in isolation and then grow into more valuable, sophisticated cross-process analysis. Why develop ten stars when one or two can be delivered far sooner with less investment ‘at risk’?
Figure 1-4: Data warehouse analysis and design biases
Data-Driven Analysis
Pure data-driven analysis avoided early user involvement.

Using a data-driven approach, data requirements are obtained by analyzing operational data sources. This form of analysis was adopted by many early IT-led data warehousing initiatives to the exclusion of all others. User involvement was avoided as it was mistakenly felt that data warehouse design was simply a matter of re-modeling multiple data sources using ER techniques to produce a single ‘perfect’ 3NF model. Only after that was built would it then be time to approach the users for their BI requirements.
Leading to DW designs that did not meet BI user needs.

Unfortunately, without user input to prioritize data requirements and set a manageable scope, these early data warehouse designs were time-consuming and expensive to build. Also, being heavily influenced by the OLTP perspective of the source data, they were difficult to query and rarely answered the most pressing business questions. Pure data-driven analysis and design became known as the “build it and they will come” or “field of dreams” approach, and eventually died out to be replaced by hybrid methods that included user requirements analysis, source data profiling, and dimensional modeling.
Packaged apps are especially challenging data sources to analyze.

Data-driven analysis has benefited greatly from the use of modern data profiling tools and methods but, despite their availability, it has become increasingly problematic as operational data models have grown in complexity. This is especially true where the operational systems are packaged applications, such as Enterprise Resource Planning (ERP) systems built on highly generic data models.
IT staff are comfortable with data-driven analysis.

In spite of its problems, data-driven analysis continues to be a major source of data requirements for many data warehousing projects because it falls well within the technical comfort zone of IT staff who would rather not get too involved with business stakeholders and BI users.
Reporting-Driven Analysis
Reporting requirements are gathered by interviewing potential BI users in small groups.

Using a reporting-driven approach, data requirements are obtained by analyzing the BI users’ reporting requirements. These requirements are gathered by interviewing stakeholders one at a time or in small groups. Following rounds of meetings, analysts’ interview notes and detailed report definitions (typically spreadsheet or word processor mock-ups) are cross-referenced to produce a consolidated list of data requirements that are verified against available data sources. The resulting requirements documentation is then presented to the stakeholders for ratification. After they have signed off the requirements, the documentation is eventually used to drive the data modeling process and subsequent BI development.
User involvement helps to create more successful DWs.

Reporting-driven analysis focuses the data warehouse design on efficiently prioritizing the stakeholders’ most urgent reporting requirements and can lead to timely, successful deployments when the scope is managed carefully.
future information needs beyond the ‘next reports’, because these needs are de-
pendent upon the answers the ‘next reports’ will provide, and the unexpected new
business initiatives those answers will trigger. The ensuing steps of collating re-
quirements, feeding them back to business stakeholders, gaining consensus on data
terms, and obtaining sign off can also be an extremely lengthy process.
Focusing too closely on current reports alone leads to inflexible dimensional models.

Over-reliance on reporting requirements has led to many initially successful data warehouse designs that fail to handle change in the longer term. This typically occurs when inexperienced dimensional modelers produce designs that match the current report requests too closely, rather than treating these reports as clues to discovering the underlying business processes that should be modeled in greater detail to provide true BI flexibility. The problem is often exacerbated by initial requirements analysis taking so long that there isn’t the budget or willpower to swiftly iterate and discover the real BI requirements as they evolve. The resulting inflexible designs have led some industry pundits to unfairly brand dimensional modeling as too report-centric, suitable at the data mart level for satisfying the current reporting needs of individual departments, but unsuitable for enterprise data warehouse design. This is sadly misleading because dimensional modeling has no such limitation when used correctly to iteratively and incrementally model atomic-level detailed business processes rather than reverse engineer summary report requests.
Figure 1-5: Reactive DW timeline
The lag between OLTP and DW roll-out is disappearing.

Today, DW/BI has caught up and become proactive. The two different worlds of OLTP and DW/BI have become parallel worlds where many new data warehouses need to go live/be developed concurrently with their new operational source systems, as shown on the Figure 1-6 timeline.
Figure 1-6: Proactive DW timeline
Proactive DW/BI addresses operational demands, avoids interim solutions and preempts BI performance problems.

DW/BI has steadily become proactive for a number of business-led reasons:

DW/BI itself has become more operational. The (largely technical) distinction between operational and analytical reporting has blurred. Increasingly, sophisticated operational processes are leveraging the power of (near real-time) BI and stakeholders want a one-stop shop for all reporting needs: the data warehouse.

Organizations (especially those that already have DW/BI success) now realize that, sooner rather than later, each major new operational system will need its own data mart or need to be integrated with an existing data warehouse.

BI stakeholders simply don’t want to support ‘less than perfect’ interim reporting solutions and suffer BI backlogs.
When source database schemas are not yet available, ETL development can still
proceed if ETL and OLTP designers can agree on flat file data extracts. Once
OLTP have committed to provide the specified extracts on a schedule to meet BI
needs, ETL transformation and load routines can be developed to match this
source to the proactive data warehouse design target.
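For example (a hypothetical agreed extract; the file name, columns, and PostgreSQL COPY syntax are all illustrative), ETL development can proceed against a staging table whose layout mirrors the agreed flat file exactly, insulating the transformation code from OLTP schema decisions that are still in flux:

```sql
-- Staging table matching the agreed daily extract file orders_YYYYMMDD.csv;
-- the column list is fixed by agreement with the OLTP team, not by their final schema
CREATE TABLE stg_orders (
    order_number  VARCHAR(20),
    order_date    DATE,
    customer_code VARCHAR(20),
    product_code  VARCHAR(20),
    quantity      INTEGER,
    unit_price    DECIMAL(10,2)
);

-- Example load; transformation into the star schema target follows
COPY stg_orders FROM '/data/extracts/orders_20130501.csv' WITH (FORMAT csv, HEADER true);
```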
Figure 1-7: Waterfall DW development timeline
Dimensional modeling enables incremental development.

Dimensional modeling can help reduce the risks of pure waterfall by allowing developers to release early incremental BI functionality one star schema at a time, get feedback and make adjustments. But even dimensional modeling, like most other forms of data modeling, takes a (near) serial approach to analysis and design (with ‘Big Requirements Up Front’ (BRUF) preceding BDUF data modeling) that is subject to the inherent limitations and initial delays described already.
Agile data warehousing is highly iterative and collaborative.

Agile data warehousing seeks to further reduce the risks associated with upfront analysis and provide even more timely BI value by taking a highly iterative, incremental and collaborative approach to all aspects of DW design and development, as shown on the Figure 1-8 timeline.
Figure 1-8: Agile DW development timeline
Agile focuses on the early and frequent delivery of working software that adds value.

By avoiding the BDUF and instead doing ‘Just Enough Design Upfront’ (JEDUF) in the initial iterations and ‘Just-In-Time’ (JIT) detailed design within each iteration, agile development concentrates on the early and frequent delivery of working software that adds value, rather than the production of exhaustive requirements and design documentation that describes what will be done in the future to add value.
For DW design, the minimum valuable working software is a star schema.

For agile DW/BI, the working software that adds value is a combination of queryable database schemas, ETL processes and BI reports/dashboards. The minimum set of valuable working software that can be delivered per iteration is a star schema, the ETL processes that populate it and a BI tool or application configured to access it. The minimum amount of design is a star.
Agile database development needs agile data modeling.

To design any type of significant database schema to match the early and frequent delivery schedule of an agile timeline requires an equally agile alternative to the traditionally serial tasks of data requirements analysis and data modeling.
Iterative, incremental and collaborative all have very specific meanings in an agile development context that bring with them significant benefits:

Collaborative modeling combines analysis and design and actively involves stakeholders.

Collaborative data modeling obtains data requirements by modeling directly with stakeholders. It effectively combines analysis and design and ‘cuts to the chase’ of producing a data model (working software and documentation) rather than ‘the establishing shot’ of recording data requirements (only documentation).

Evolutionary modeling supports incremental development by capturing requirements when they grow and change.

Incremental data modeling gives you more data requirements when they are better understood/needed by stakeholders, and when you are ready to implement them. Incremental modeling and development are scheduling strategies that support early and frequent software delivery.

Iterative data modeling helps you to understand existing data requirements better and improve existing database schemas through refactoring: correcting mistakes and adding missing attributes which have now become available or important. Iterative modeling and development are rework strategies that increase software value.
Agile dimensional modeling focuses on business processes rather than reports.

Agile modeling avoids the ‘analysis paralysis’ caused by trying to discover the ‘right’ reports amongst the large (potentially infinite?) number of volatile, constantly re-prioritized requests in the BI backlog. Instead, agile dimensional modeling gets everyone to focus on the far smaller (finite) number of relatively stable business processes that stakeholders want to measure now or next.

Agile dimensional modeling creates flexible, report-neutral designs.

Agile dimensional modeling avoids the need to decode detailed business events from current summary report definitions. Modeling business processes without the blinkers of specific report requests produces more flexible, report-neutral, enterprise-wide data warehouse designs.

Agile modeling enables proactive DW/BI to influence operational system development.

Agile data modeling can break the “data then requirements” stalemate that exists for DW/BI just before a new operational system is implemented. Proactive agile dimensional modeling enables BI stakeholders to define new business processes from a measurement perspective and provide timely BI input to operational application development or package configuration.

Evolutionary modeling supports accretive BI requirements.

Agile modeling’s evolutionary approach matches the accretive nature of genuine BI requirements. By following hands-on BI prototyping and/or real BI usage, iterative and incremental dimensional modeling allows stakeholders to (re)define their real data requirements.

Collaborative modeling teaches stakeholders to think dimensionally.

Many of the stakeholders involved in collaborative modeling will become direct users of the finished dimensional data models. Doing some form of dimensional modeling with these future BI users is an opportunity to teach them to think dimensionally about their data and define common, conformed dimensions and facts from the outset.
Never underestimate the affection stakeholders will have for data models that they
themselves (help) create.
Business stakeholders have little appetite for traditional data models, even conceptual models (see Data Model Types, shortly) that are supposedly targeted at them. They find the ER diagrams and notation favored by data modelers (and generated by database modeling tools) too complex or too abstract. To engage stakeholders, agile modelers need to create less abstract, more inclusive data models using simple notation and tools that are easy to use and easy to share. These inclusive models must easily translate into the more technically detailed, logical and physical, star schemas used by database administrators (DBAs) and ETL/BI developers.
To encourage collaboration and support iteration, agile data modeling needs to be quick. If stakeholders are going to participate in multiple modeling sessions they don’t want each one to take days or weeks: modeling sessions (modelstorms) should take hours rather than days. Agile modelers want speed too. They don’t want to wear out their welcome with stakeholders. The best results are obtained by modeling with groups of stakeholders who have the experience and knowledge to define common business terms (conformed dimensions) and prioritize requirements. It is hard enough to schedule long meetings with these people individually let alone in groups. Agile data modeling techniques must support modelstorming: impromptu stand up modeling that is quicker, simpler, easier and more fun than traditional approaches.
Stakeholders don’t want to feel that a design is constantly iterating (fixing what they have already paid for) when they want to be incrementing (adding functionality). They want to see obvious progress and visible results. Agile modelers need techniques that balance JIT modeling of current data requirements in detail with JEDUF modeling of ‘the big picture’ to help anticipate future iterations and reduce the amount of design rework.
Developers need to embrace database change. They are used to working with (notionally) stable database designs, by-products of BDUF data modeling. It is support staff who are more familiar with coding around the database changes needed to match users’ real requirements. To respond efficiently to evolutionary data warehouse design, agile ETL and BI developers need tools that support database impact analysis and automated testing.
Data warehouse designers also need to embrace data model change. They will naturally want to limit the amount of disruptive database refactoring required by evolutionary design, but they must avoid resorting to generic data model patterns which reduce understandability and query performance, and can alienate stakeholders. Agile data warehouse modelers need dimensional design patterns that they can trust to represent tomorrow’s BI requirements tomorrow, while they concentrate on today’s BI requirements now.
If agile dimensional modeling that is interactive, inclusive, quick, supports JIT and JEDUF, and enables DW teams to embrace change seems like a tall order, don’t worry; while there are no silver bullets that will make everyone or everything agile overnight, there are proven tools and techniques that can address the majority of these agile modeling prerequisites.
BEAM✲
BEAM✲ is an agile data modeling method for designing dimensional data warehouses and data marts. BEAM stands for Business Event Analysis & Modeling. As the name suggests it combines analysis techniques for gathering business event related data requirements and data modeling techniques for database design. The trailing ✲ (six point open centre asterisk) represents its dimensional deliverables: star schemas and the dimensional position of each of the 7Ws it uses.
BEAM✲ consists of a set of repeatable, collaborative modeling techniques for rapidly discovering business event details and an inclusive modeling notation for documenting them in a tabular format that is easily understood by business stakeholders and readily translated into logical/physical dimensional models by IT developers.
visible signs of progress. Stakeholders can easily imagine sorting and filtering the
low-level detail columns of a business event using the higher-level dimensional
attributes that they subsequently model.
Figure 1-9 Customer Orders BEAM✲ table
BEAM✲ short codes act as dimensional modelers’ shorthand for documenting generic data properties such as data type and nullability, and specific dimensional properties such as slowly changing dimensions and fact additivity. Short codes can be used to annotate any BEAM✲ diagram type for technical audiences but can easily be hidden or ignored when modeling with stakeholders who are uninterested in the more technical details. Short codes and other BEAM✲ notation conventions will be highlighted in the text in bold. Appendix B provides a reference list of short codes.
Figure 1-10 Order processing ER diagram
Example data captures more business information than ER models. By looking at the ERD you can tell that customers may place orders for multiple products at a time. The BEAM✲ table records the same information, but the example data also reveals the following:

Customers can be individuals, companies, and government bodies.
Products were sold yesterday.
Products have been sold for 10 years.
Products vary considerably in price.
Products can be bundles (made up of 2 products).
Customers can order the same product again on the same day.
Orders are processed in both dollars and pounds.
Orders can be for a single product or bulk quantities.
Discounts are recorded as percentages and money.
Additionally, by scanning the BEAM✲ table you may have already guessed the type of products that Pomegranate sells and come to some conclusions as to what sort of company it is. Example data speaks volumes—wait until you hear what it says about some of Pomegranate’s (fictional) staff!
Based on the detail levels described in Table 1-2, the order processing ERD in Figure 1-10 is a logical data model as it shows primary keys, foreign keys and cardinality, while the BEAM✲ event in Figure 1-9 is a conceptual model (we prefer “business model”) as this information is missing. With additional columns and short codes it could be added to the BEAM✲ table, but each diagram type suits its target audience as is. BEAM✲ tables are more suitable for collaborative modeling with stakeholders than traditional ERD based conceptual models, while other BEAM✲ diagram types and short codes complement and enhance ERDs for collaborating with developers on logical/physical star schema design.
BEAM✲ supports the core agile values: “Individuals and interactions over processes and tools”, “Working software over comprehensive documentation” and “Customer collaboration over contract negotiation”. BEAM✲ upholds these values and the agile principle of “maximizing the amount of work not done” by encouraging DW practitioners to work directly with stakeholders to produce compilable data models rather than requirements documents, and working BI prototypes of reports/dashboards rather than mockups.
DIAGRAM TYPE: BEAM✲ (Example Data) Table
USAGE: Modeling business events and dimensions one at a time using example data to document their 7Ws details. Example data tables are also used to describe physical dimension and fact tables and explain dimensional design patterns.
DATA MODEL: Business, Logical, Physical
PRINCIPAL AUDIENCE: Data Modelers, Business Analysts, Business Experts, Stakeholders, BI Users
CHAPTER: 2
Summary
Data warehouses and operational systems are fundamentally different. They have radically
different database requirements and should be modeled using very different techniques.
Star schemas record and describe the measurable events of business processes as fact tables and dimensions. These are not arbitrary denormalized data structures. Instead they represent the combination of the 7Ws (who, what, when, where, how many, why and how) that fully describe the details of each business event. In doing so, fact tables represent verbs, while the facts (measures) they contain and the dimensions they reference represent nouns.
Even with the right database design techniques there are numerous analysis challenges in
gathering detailed data warehousing requirements in a timely manner.
Both data-driven and reporting-driven analysis are problematic, increasingly so, with DW/BI
development becoming more proactive and taking place in parallel with agile operational
application development.
Iterative, incremental and collaborative data modeling techniques are agile alternatives to traditional BI data requirements gathering.
BEAM✲ is an agile data modeling method for engaging BI stakeholders in the design of their
own dimensional data warehouses.
BEAM✲ data stories use the 7Ws framework to discover, describe and document business
events dimensionally.
BEAM✲ modelers encourage collaboration by using simple modeling tools such as whiteboards
and spreadsheets to create inclusive data models.
BEAM✲ models use example data tables and alphanumeric short codes rather than ER data
abstractions and graphical notation to improve stakeholder communication. These models are
readily translated into star schemas.
Modeling Business Events

Business events are the individual actions performed by people or organizations during the execution of business processes. When customers buy products or use services, brokers trade stocks, and suppliers deliver components, they leave behind a trail of business events within the operational databases of the organizations involved. These business events contain the atomic-level measurable details of the business processes that DW/BI systems are built to evaluate.
BEAM✲ uses business events as incremental units of data discovery/data modeling. By prompting business stakeholders to tell their event data stories, BEAM✲ modelers rapidly gather the clear and concise BI data requirements they need to produce efficient dimensional designs.
In this chapter we begin to describe the BEAM✲ collaborative approach to dimensional modeling, and provide a step-by-step guide to discovering a business event and documenting its data stories in a BEAM✲ table: a simple tabular format that is easily translated into a star schema. By following each step you will learn how to use the 7Ws (who, what, when, where, how many, why, and how) to get stakeholders thinking dimensionally about their business processes, and describing the information that will become the dimensions and facts of their data warehouse—one that they themselves helped to design!
Chapter 2 Topics At a Glance:
Data stories and story types: discrete, recurring and evolving
Discovering business events: asking “Who does what?”
Documenting events: using BEAM✲ Tables
Describing event details: using the 7Ws and story themes
Modelstorming with whiteboards: practical collaborative data modeling
Data Stories
Data stories are to agile DW design what user stories are to agile software development: a lean requirements gathering technique. Both are written or told by business stakeholders. While user stories concentrate on functional requirements and are written on index cards, data stories concentrate on data requirements and are written on whiteboards and spreadsheets.
Business events, because they represent activity (verbs), have strong narratives. BEAM✲ uses these event narratives to discover their details (nouns) by telling data stories. BEAM✲ events are the archetypes for many similar data stories. “Employee drives company car on appointment date” is an event. “James Bond drives an Aston Martin DB5 on the 17th September 1964” is a data story. By following five event story themes, event stories, a specific type of data story, succinctly clarify the meaning of each event detail and help elicit additional details.
Story Types
BEAM✲ classifies business events into three story types: discrete, evolving, and recurring, based on how their stories play out with respect to time. Figure 2-1 shows example timelines for each type. Retail product purchases are examples of discrete events that happen at a point in time. They are (largely) unconnected with one another and occur unpredictably. Wholesale orders are evolving events that represent the irregular time spans it takes to fulfill orders. They too occur at unpredictable intervals. Interest charges are recurring events that represent the regular time spans over which interest is accrued. They occur in step with one another at predictable intervals.
Figure 2-1 Story type timelines
Discrete Events
Discrete events are “point-in-time” or short (duration) stories. They typically represent the atomic-level transactions recorded by operational systems: retail purchases, web page views, and phone calls, for example.
Discrete events are completed either at the moment they occur or shortly thereafter. By “shortly”, we mean within the ETL refresh cycle of the data warehouse; i.e., they have “finished” or reached some end state by the time they are used for BI. Discrete event stories are generally associated with a single verb (e.g., “buys”, “views”, “calls”) and a single timestamp. There are exceptions to the one verb, one timestamp rule, but for an event story to be discrete none of its details must change over time, otherwise it is evolving.
Evolving Events
Evolving events are longer-running stories (sagas) that can take several days, weeks, or months to complete. They are typically loaded into a data warehouse when their stories begin, before those stories have “finished”. Example evolving events include:
A customer orders a product online and waits for it to be delivered not have “finished”
A student applies for a place on a university course and is accepted
An employee processes an insurance claim
Evolving events often represent a series of discrete events (chapters if you like) that BI stakeholders view as milestones of a complex/time-consuming business process. In Figure 2-1 the arrows that follow each evolving order event mark the shipment, delivery, and payment milestones that have been reached. Each of the verbs: “order”, “ship”, “deliver” and “pay” can be modeled as separate discrete events, but from the stakeholders’ perspective the really important measures of the order fulfillment process only become visible when these events are combined to produce a multi-verb evolving event story.
Timelines (have you noticed how much we like them) are a great way to visualize evolving event stories and an invaluable tool for modeling milestones and interesting time intervals (duration measures). Modeling with timelines is covered in Chapter 8.
Recurring Events
Recurring events are periodic measurement stories that occur at predictable intervals, such as daily, weekly, and monthly (serials). In Figure 2-1 the arrowed line preceding each recurring event represents the period of time that the event measures: daily stock counts, monthly account balance snapshots, and hourly weather readings, for example.
Recurring events are typically used to sample and summarize discrete events, especially when cumulative measures, such as stock levels or account balances, are required that would be expensive to derive from the discrete events. For example, calculating an account balance at any point in time would require all the transactions against the account from all prior periods to be aggregated. Recurring events can also represent atomic-level measurement events that “automatically” occur on a periodic basis; for example, the hourly recording of rainfall at weather stations.
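The cost difference is easy to see in SQL. With only discrete transaction events, a balance question has to aggregate every prior transaction; a recurring snapshot stores the answer directly. The table and column names below are illustrative assumptions:

    -- Deriving a balance from discrete events: all prior transactions
    -- for each account must be scanned and summed.
    SELECT account_key, SUM(transaction_amount) AS balance
    FROM   account_transaction_fact
    WHERE  transaction_date <= DATE '2011-12-31'
    GROUP  BY account_key;

    -- Reading the same balance from a recurring (periodic snapshot)
    -- event: one pre-computed row per account per period.
    SELECT account_key, balance
    FROM   account_balance_snapshot_fact
    WHERE  period_end_date = DATE '2011-12-31';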
Table 2-1 Story types and their matching star schemas

BEAM✲ STORY TYPE    STAR SCHEMA TYPE/PHYSICAL DIMENSIONAL MODEL
Discrete            Transaction fact table
Recurring           Periodic snapshot
Evolving            Accumulating snapshot
Discrete events are implemented as transaction fact tables. All the detail that there is to know about discrete events is known before they are loaded into a data warehouse. This means that each discrete event story (fact record) is inserted once and never updated, greatly simplifying the ETL process.

Recurring events are implemented as periodic snapshot fact tables. Many of their interesting measures are semi-additive balances that must be carefully reported over multiple time periods.

Evolving events are implemented as accumulating snapshot fact tables. They are loaded into a data warehouse shortly after the first event in a predictable sequence, and are updated each time a milestone event occurs until the overall event story is completed.
Chapter 5 describes the basic steps involved in translating events into star sche-
mas. Chapter 8 provides more detailed coverage on designing transaction fact
tables, periodic snapshots and accumulating snapshots.
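As a rough physical sketch of the three story types (illustrative DDL, not the book's worked designs):

    -- Discrete event: transaction fact table.
    -- Rows are inserted once and never updated.
    CREATE TABLE order_fact (
      order_id        VARCHAR(20),      -- how detail
      customer_key    INTEGER,
      product_key     INTEGER,
      order_date_key  INTEGER,
      order_quantity  INTEGER,
      revenue         DECIMAL(12,2)
    );

    -- Recurring event: periodic snapshot fact table. One row per
    -- account per period; balance is a semi-additive measure
    -- (it sums across accounts but not across periods).
    CREATE TABLE account_balance_fact (
      account_key     INTEGER,
      period_key      INTEGER,
      balance         DECIMAL(12,2)
    );

    -- Evolving event: accumulating snapshot fact table. Milestone
    -- date keys start out unknown and each row is updated as
    -- shipment, delivery and payment milestones occur.
    CREATE TABLE order_fulfillment_fact (
      order_id           VARCHAR(20),
      order_date_key     INTEGER,
      ship_date_key      INTEGER,     -- updated when shipped
      delivery_date_key  INTEGER,     -- updated when delivered
      payment_date_key   INTEGER,     -- updated when paid
      order_to_ship_days INTEGER      -- duration measure, updated too
    );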
The 7Ws
BEAM✲ uses the 7Ws: who, what, when, where, how many, why, and how to discover and model data requirements as event details. Every event detail that stakeholders need falls into one of the 7 W-types. They are the nouns for people and organizations (who), things such as products and services (what), time (when), locations (where), reasons (why), event methods (how), and numeric measures (how many) that in combination form event stories.
Each of the 7Ws is also an interrogative: a word or phrase that can be used to construct a question, and that is precisely what you do with them. By asking stakeholders a who question, you discover the people and organizations they want to analyze. By asking stakeholders a what question you discover the products and services they want to analyze. By asking these questions in the right combination and sequence you discover the business events they need to analyze.
As you capture event details you can record their dimension type in the type row
of a BEAM✲ table. You will use this knowledge to help you model the details as
dimensions and facts. Part 2 of this book has chapters dedicated to the 7Ws,
covering common BI issues and dimensional modeling design patterns associ-
ated with each type.
Thinking Dimensionally
The 7W questions you ask to discover event details mirror the questions that stakeholders will ask themselves when they define queries and reports. For example, a stakeholder will think about where, when, and how many to build a query that asks: “Which sales locations are performing better than last year?” and who, when, what, and why to ask: “Which customers are responding early to product promotions?” When stakeholders start using the 7Ws they are thinking about their data dimensionally, because the 7Ws represent how data is naturally modeled dimensionally. Table 2-2 shows the type of data that each of the 7Ws represents together with examples of matching physical dimensions or facts.
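Table 2-2 is not reproduced in this extract, but using details from the running order example, the mapping runs along these lines (our reconstruction, not the original table):

    W-TYPE     EXAMPLE EVENT DETAILS       MATCHING DIMENSIONS/FACTS
    who        CUSTOMER, SALESPERSON       Customer, Employee dimensions
    what       PRODUCT                     Product dimension
    when       ORDER DATE                  Date/time dimensions
    where      SALES LOCATION              Location dimension
    how many   ORDER QUANTITY, REVENUE     Facts (measures)
    why        PROMOTION                   Causal dimension
    how        ORDER ID                    Transaction ID dimension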
Figure 2-2 BEAM✲ sequence: 7Ws flowchart
Once you become familiar with using the 7Ws you will find they flow naturally from one to another; for example, quantity (how many) answers lead to why questions. If you discover a discount quantity this would naturally lead to the question: “Why do some orders have discounts?” Similarly, the why answer: “because of promotions” might lead to the how question: “How are promotions implemented?” and the answer: “with discount vouchers and codes.”
There is no need to be a slave to the BEAM✲ Sequence. If stakeholders call out relevant event details at random (hopefully not all at once) or remember details out of sequence, that’s okay, but try to return to the flowchart as soon as possible to make sure all 7Ws are covered.
Put a simple version of the 7Ws flowchart up on the wall, so that who, what, when,
where, how many, why, and how can start working on everyone’s dimensional
imagination and stakeholders know your next question type.
The following sections describe each of the event modeling steps using an order processing example for Pomegranate Corp., our fictional multinational computer technology, consumer electronics, software, and consulting firm. In this initial worked example, order creation will be modeled in detail as a discrete event. In Chapter 4, shipments, deliveries and other related events will be modeled at a summary level. In Chapter 8, several of these events are combined as a single evolving event that allows stakeholders to more easily measure the performance of the entire order fulfillment process.
The answer to this blunt opening question is an event. An event is an action. An action means that a verb is involved. When a verb is involved, there is someone or something doing the action: the subject, and someone or something having the action done to it: the object. So linguistically, the answer will be a subject-verb-object combination: the simplest story possible.
“Who does what?” is really a mnemonic, a way of remembering to ask stakeholders to name the subject, verb, and object that identify an interesting event. It’s a short way of saying: “Think of an activity. Who (or what) does it? What do they do? Who or what do they do it to?”. Whatever form of the question you use, what you actually want to discover is an interesting business activity (verb) that is in scope, so it may need some qualification to work well. You might begin with your version of:
You now have what you need to begin modeling: a subject: customer, a verb: orders, and an object: product. This subject-verb-object combination is called the main clause of the event, and you will reuse it to ask most of the follow-up questions for discovering the “whole” story.
Getting an eager group of stakeholders to slow down and take things one event (verb) at a time can take some discipline. Try to reassure them (and yourself) that there will be plenty of time to capture all of this data, but you need to focus on one event at a time until it is complete—with all of its details. Don’t worry about the stakeholders; they will not forget their other favorite events while you are documenting the current one.
Summary events can always be added afterwards for query efficiency, if necessary, so long as the details are there. You should make sure you initially model the most granular discrete events the stakeholders are interested in. You can then model recurring and evolving events from these (in subsequent iterations) to provide easier access to the performance measures that stakeholders need. Chapter 8 covers the event modeling and dimensional design techniques for doing this.
Asking: “Who does what?” does not always ensure that you will get actual who and what details. Objects especially can be any of the 7Ws, including how many. For automated recurring events there may simply not be a responsible who subject that triggers them. For example, “Store stocks product” or “Weather station records temperature” are both perfectly valid events, but neither has who details. “Store” and “weather station” are where-type subjects, and “temperature” is an example of a how many object rather than a what object.
There is no need to fret over this or try to coax stakeholders into supplying actual people (who) and things (what) in every case, as this can get in the way of capturing their perspective on the event. The most important thing is that stakeholders supply a main clause containing a verb worth measuring. If their main clause doesn’t contain a who or what you will soon discover any that belong to the event as you use each of the 7Ws to discover more details.
Verbs
Verbs are one of the most difficult parts of any language. Because of numerous tenses, cases, and persons, the possible ways of expressing a verb can be confusing. For instance, the verb “to buy” can be written as “buy”, “bought”, “buying” and “buys”. To simplify events, use this last version “buys” which is the third-person singular present tense. This simply means it sounds right after “he”, “she”, or “it”. In English, this version of the verb always ends in “s”. For example, “to call” becomes “calls”, “reviewed” becomes “reviews”, “auditing” becomes “audits”, and “will sell” becomes “sells”. This standard form is intuitive and avoids awkward verb variations.
Figure 2-3 Initial event table
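The figure itself is not reproduced here; based on the notation conventions described next, a plausible whiteboard sketch of the initial table is:

    (space above for an event name, added later)

                  orders
    CUSTOMER      PRODUCT

    (empty rows below, ready for example event stories)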
Several important BEAM✲ notation conventions are indicated in Figure 2-3. The
subject (CUSTOMER) and object (PRODUCT) are capitalized, and have become
column headers. The verb (orders) is in lowercase, and is placed in its own row
above the object. This row will be used to hold other lowercase words shortly. The
capitalized column headers are the event details that will eventually become facts
or dimensions. The lowercase words will connect subsequent details to the event
and clarify their relationships with the main clause. They make event stories
readable, but are not components of the physical database design.
Draw data tables on whiteboards without borders between or below the example
rows. Fewer lines make freehand drawing quicker and neater, while at the same
time visually suggesting that the examples are open ended: that stakeholders can
add to them at any point to illustrate ever more interesting stories that help to
clarify exceptions.
The rest of the table is left blank, with several empty rows for example data and space above for an event name. Don’t attempt to name the event just yet because this may prejudice the details that the stakeholders provide.
The table is now ready to record event details and example data (event stories) to clarify the meaning of each detail. In Figure 2-3 there is also space reserved above the table as a scratchpad for recording detail about detail: important details that you may capture along the way that don’t belong directly to the event (see What? later in this chapter) but will need to be modeled as dimensional attributes after the event is complete.
When?
Every event story has at least one defining point in time. No meaningful BI analysis takes place without a time element. Therefore, immediately following the discovery of an event, you should ask for its when detail. You do so by repeating the main clause of the event to the stakeholders as a question, with a “when” appended or prepended:
On order date.
This is certainly what you are hoping for: a prepositional phrase containing a preposition: “on” followed by a noun: “order date”. If they respond with actual dates/times, ask what these should be called. You are looking for a noun to name this detail; after you have it you can then use the date/time values for example stories to help you understand the time nature of the event. The general form of a when question is: “Subject Verb Object when?” or “When does a Subject Verb Object? What do you call that date/time?” The required response is in the form: “on/at/every Time Stamp Name”.
The preposition on used with a when detail implies that the detail is recorded as a date, suggesting that the time of day is not available or is not important. An at preposition implies that time of day is recorded and is important. Whenever stakeholders give you example when values you should check that prepositions and examples match, so that event stories can be read correctly.
Prepositions
Prepositions are the words that link nouns, pronouns and phrases in sentences
and describe their relationships. These relationships include time (when),
possession (who/what), proximity (where), quantity (how many), cause (why)
and manner (how). Examples of typical prepositions include: with, in, on, at, by,
to, from and for. BEAM✲ uses prepositions to connect event details to the main clause and clarify their relationships.
After you have confirmed the prepositional phrase you add it to the event table, as shown in Figure 2-4, with the on preposition above the new detail ORDER DATE. Now that you have the subject, object, and initial when details you can begin to fill out the table with event stories.
Figure 2-4 Adding the first when detail
The preposition for a when detail is highly significant. “on order date,” “at call
time,” and “every sales quarter” each contain an important clue to the level of
time detail available (or necessary) for their respective events.
Asking for examples and getting useful answers is a clear indication that you
are being agile, that you are modeling with the right people: stakeholders who
know their own data.
Example data clarifies the meaning of each event detail as you discover it with
the minimum documentation.
Examples avoid abstraction. Stakeholders can start to visualize how their data
might appear on reports.
Wait until you have at least one when detail before collecting example data.
Having a when detail helps you get more interesting examples that tell a story.
The five event story themes are: Typical, Different, Repeat, Missing, and Group.
Figure 2-5 shows how the themes vary slightly across the 7Ws. The italic descriptions suggest the range of values that you want to illustrate for each “W” (by using the typical and different themes). Armed with this information you are now ready to start “modeling by example”: asking the stakeholders to tell you event stories for each theme.
Figure 2-5 Story theme template
Typical Stories
Each event table should start with a typical event story that contains common/normal/representative values for each detail. For who data this could be a frequent CUSTOMER. For what details this might be a popular PRODUCT. Similarly, for how many details you are looking for average values that match the other typical values. To fill out this example story you simply ask the stakeholders for typical values for each detail.
Different Stories
Following the typical example event you ask the stakeholders for another example with different values for each detail. If you ask for two different examples you can use them to discover the range of values that the data warehouse will have to represent. This is particularly important for when details because they indicate how much history will be required and how urgently the data must be loaded into the data warehouse.
For when details, use relative time descriptions such as Today, Yesterday, This Month, and 5 Years Ago to capture the most recent and earliest values so that the event stories remain relevant long after the model is completed. If the latest when is Yesterday, then you know that the data warehouse will demand a daily refresh for this particular business process. In Figure 2-6, the fourth and fifth example events show that the data warehouse will need to support 10 years of history for this event, and that a daily refresh policy is required.
If the latest when event story is Today, the data warehouse will need to be refreshed
more urgently than daily—perhaps in near real-time. Because this will significantly
complicate the ETL processing and increase development costs, you should con-
firm that this is a vital requirement with budgetary approval. If it is, you need to
find out if Today means “an hour ago” or “10 minutes ago”.
For who and what details, ask for old and new values as well as high and low ones—representing customers who have become inactive versus brand-new customers, or products that have been discontinued versus those just released.
Repeat Stories
Once you have collected a few different examples you ask for a repeat story—one that is as similar as possible to the typical story (the first row)—so you can discover what makes each event story unique. You do this by asking whether the typical values can appear again in the same combination; for example, you might ask:
The third event story in Figure 2-6 shows that this is possible. Each time you add a new detail to the event you return to the repeat story to see if that detail can be used to differentiate the event, by adding it to the previous question; for example, “Can this CUSTOMER order this PRODUCT again on the same day, from the same SALES LOCATION?” If that’s possible repeat the typical story values.
You might have uniqueness with the subject, object, and initial when details
alone, or you might not have it until you discover a how detail with your very last
question.
Figure 2-6 Adding event stories
Missing Stories
You ask for a missing story to discover which event details can have missing values (e.g. unknown, not applicable, or not available) and which are mandatory (always present). You use a missing story to document how stakeholders want to see missing values displayed on their reports. When you fill out a missing story (such as the fifth story in Figure 2-6) you use normal values for mandatory details and the stakeholders’ default missing value labels (e.g. “N/A”, “Unknown”) for the non-mandatory details. For quantities you must find out whether missing data should be treated as NULL (the arithmetically correct representation of missing) or replaced by zero, or some other default value. You document the mandatory details by adding the short code MD to their column type.
Missing stories can be unrealistically sparse, containing missing values for any detail that might ever be missing. It’s okay if there are more missing values than would ever be seen in a single real event story.
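When the model is eventually implemented, these missing-value decisions translate directly into column constraints and reporting defaults; a sketch with illustrative names:

    -- Mandatory details (MD) become NOT NULL columns; non-mandatory
    -- quantities stay nullable so that missing really means missing
    -- (NULL is ignored by SUM and AVG, unlike a default of zero).
    CREATE TABLE order_fact (
      customer_key INTEGER NOT NULL,   -- MD: always present
      product_key  INTEGER NOT NULL,   -- MD: always present
      order_date   DATE    NOT NULL,   -- MD: always present
      discount     DECIMAL(12,2)       -- may be missing (NULL)
    );

    -- Displaying missing values using the stakeholders' chosen label:
    SELECT COALESCE(CAST(discount AS VARCHAR(20)), 'N/A') AS discount_label
    FROM   order_fact;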
Occasionally you may have an event subject that is consistently missing. For
example, a retail sales event might be described as “CUSTOMER buys PRODUCT
in STORE”, but the customer name is never recorded. When the event is imple-
mented as a physical fact table this virtual detail will be dropped, but during event
storytelling it focuses everyone on the atomic-level event. Perhaps the event
stories should contain “Anon.” or “J. Doe”.
Evolving events by their nature will have a number of validly missing details that are unknown when the event begins; for example, ACTUAL DELIVERY DATE for an order or FINAL PAID AMOUNT for an insurance claim. However, if you find discrete and recurring events with a lot of missing details it is often a clue that you are trying too hard to model a “one size fits all” generic event that is difficult for stakeholders to use, and it may be better to model a number of more specific event tables where the details that really define distinct business events are always present.
Group Stories
You ask for example events containing groups to expose any variations in the meaning of a detail. For example, a typical order event consists of an individual customer ordering a product. But is this always the case? You should ask the stakeholders:
The last two example events in Figure 2-6 are group themed. From these you learn that customers can be organizations as well as individuals and orders can be placed for multi-product bundles. The knowledge that there are different types of customer (B2B: business-to-business and B2C: business-to-consumer), and product/service bundles, will make you think carefully about how you implement these details as dimensions. Chapter 6 covers mixed customer type dimensions, multi-level dimensions and hierarchy maps, while Chapter 9 covers multi-valued dimensions. These design patterns can be used to solve some of the more vexing modeling issues surfaced by group themed examples.
You should ask for just enough event stories so that everyone is clear about the
meaning of each event detail. Don’t get carried away trying to record every story—
that’s what the data warehouse is for—you want to concentrate on discovering all
the event details.
You add this new when detail to the event table, as shown in Figure 2-7. With each additional when you also capture examples before proceeding on to the next when detail. As you do this you may want to adjust some existing example date/times to illustrate interesting time intervals (exceptionally short and long stories) between milestones.
Figure 2-7 Adding a second when detail
If you have more than two when details, draw a simple timeline to help stake-
holders describe the chronological sequence and name the most interesting
durations between pairs of whens.
Recurring Event
If the event contains a when detail with an every preposition and the example stories confirm that it occurs on a regular periodic basis then the event is recurring. If so, it will often contain balance measures. You should check for these when you ask your how many questions.
Evolving Event
If you have two or more when details you may have an evolving event. If any of the when details are initially unknown and/or can change after the event has been created (and loaded into the data warehouse) then it is definitely an evolving event. If so, you should look out for changeable duration measures that make use of the multiple whens.
If an event is evolving you should ask the stakeholders for example stories that
illustrate the initial and final states—the emptiest event story possible, and a fully
completed one—to help explain how the event evolves.
Imagine for a moment that the stakeholders had responded to your additional
when question with:
This would make the event evolving if the actual delivery dates and payment dates
are unknown when orders are loaded into the data warehouse.
Notice that the “on” prepositions for these when details are preceded with the verbs “deliver” and “pay”. These verbs are events in their own right that occur some time after the initial order event. However, if stakeholders respond in this way they view them primarily as order event milestones. Therefore, you should continue to model them as when details of an evolving order event, but you may also want to model delivery and payment as separate discrete events too: you would ask “Who delivers what?” and “Who pays what?” to discover if there are important details that will be lost if deliveries and payments are only available at the order level.
If stakeholders provide multiple when details, pay attention to the verbs used in
prepositional phrases. The multiple verbs can identify a process sequence of
related milestone events. These events can be modeled as part of the current
evolving event and as discrete events in their own right if you suspect they have
more details. See Chapters 4 and 8 for more details on modeling evolving events.
Discrete Event
By a process of elimination, if an event is neither recurring nor evolving, it must be discrete. You reconfirm this each time you discover a new detail by asking if its example values can ever change. If details never change, or changes are handled as new adjustment events, then the event remains discrete. In Figure 2-7 both the ORDER DATE and the DELIVERY DUE DATE (if applicable) are known at the time of an order and do not change, so the order events, as modeled so far, are discrete.
Who?
Once you have identified the story type it’s time to double-back (to the top of the 7Ws flowchart in Figure 2-2) and find out whether any other whos are associated with the event. The general form of a who question is “Subject Verb Object from/for/with who?” Using the current subject, verb and object you might ask:
If so, you add the new who to the table and ask for example salespeople to match
the existing event stories. E.g., to continue a group themed story you might ask:
In Figure 2-8, the event stories introduce you to some of Pomegranate’s finest sales personnel, but also show that orders can be made without a salesperson, and that some orders are handled by sales teams rather than individual employees (continuing the group story theme).
Figure 2-8 Adding a second who detail
Don’t use real employee names in event stories. You may have to model stories
where employees underperform—you don’t want to point the finger at anyone in
the room or elsewhere. Try using famous fictional characters instead. This side-
steps any legal problems, and can be mildly entertaining, but don’t overdo it: you
don’t want to distract everyone from the real event stories and details.
What?
Next you ask for any additional whats associated with the event. The general form of the question is: “Subject Verb Object with/for what?” What questions are particularly useful when the main clause doesn’t already contain a what detail; for example:
might give you the what detail SOFTWARE PRODUCT, which would be added to the table with a “for” preposition. You can keep repeating variations on the what question to see if there are any more what details, but be careful not to collect “detail about detail” (see sidebar: Detail about Detail).
Where?
The next detail type to look for is a where. You ask for this by using the event’s main clause with a where appended:
You are trying to find out whether the event occurs at a specific geographic loca-
tion (or website address). If the stakeholders respond:
you would extend the table to record the website URL or retail store location as a where detail of the event. You might generalize the stakeholders’ response to: CUSTOMER orders PRODUCT at SALES LOCATION. Naming the detail SALES LOCATION enables you to record websites and retail stores in the same column. If you define a generalization detail like this you should make sure that its meaning is clearly documented by examples. In Figure 2-9 the examples for the new where detail SALES LOCATION show three different types of location: store, website and call center.
This sounds okay, but try placing the new detail after the subject:
Oops, clearly this no longer makes sense. Customers don’t have product types,
products do. Product type only makes sense if it appears directly after (to the right
of) PRODUCT. It is position sensitive. This tells you that product type describes
PRODUCT, not the event itself, and is therefore detail about detail.
Stripping out any details that are not directly related to the event is important, so that the event can be used to define an efficient fact table. However, you do not want to completely discard the important finding that PRODUCT TYPE is a detail about products. It’s obviously something that stakeholders want to report on. Instead of adding it to the table you can place it above the PRODUCT column in the space set aside for capturing detail about detail. You will use it shortly to define the PRODUCT dimension.
You can apply the same test to the SALESPERSON detail from the earlier who
question: swap it around the event main clause and listen to yourself saying:
You sound strange (like Yoda in Star Wars) but it still makes sense. You can see
that the additional who detail can be placed anywhere in the main clause and its
meaning is not lost. Therefore SALESPERSON is not position sensitive, and this
tells you that it is a detail about the event.
If you find yourself generalizing several details you should ask questions about how
similar the event stories really are. If stories have very different details you will
probably want to model them in separate event tables, because highly generalized
events rapidly become meaningless to stakeholders.
Figure 2-9 Adding where details
When you ask for additional where details emphasize that you are looking for locations that are specific to the whole event, not the existing who or what details. This helps avoid (for the moment) detail about detail—like customer address and product manufacturing address—that is not dependent on the event. These reference addresses will be modeled as dimensional attributes of CUSTOMER and PRODUCT once the event is complete (see Chapter 3).
Each time you finish collecting a W-type, it’s good practice to quickly scan
through the previous Ws to check for missing details. After you finish asking
where questions check to see if any of the where details remind the stakeholders
of additional whos, whats and whens.
Put the primary details: the event main clause and initial when detail, on your main whiteboard or first sheet. If you can’t fit those first three columns on your whiteboard, it’s too small or your writing is too big. The primary details can stay front and center while you add or remove extension sheets for blocks of the other Ws. We suggest you divide up the details as we have divided the latter chapters of this book, with at least one sheet each for who & what, when & where, how many, and why & how.
Have a scribe recording the model as you go. With traditional interactive model-
ing efforts, scribes are usually members of the data modeling team because of
the technical nature of the information they record and modeling tools they use.
With BEAM✲, the scribe can be anyone who can use a spreadsheet. This is an
ideal role for the on-site customer or product owner (one of the stakeholders) on
an agile team.
If you are limited for whiteboard space and lacking a scribe because of the
impromptu nature of your modelstorming, take pictures so you can erase as you
go. The cameras in most smartphones and tablet devices are more than ade-
quate for this and can take advantage of scanner apps that will automatically
clean up whiteboard images (reduce glare, increase contrast, fix perspective)
and email the results to your group. Don’t forget to turn off the flash.
If you have to erase as you go, leave the primary details and example data on
the board. If room permits (or on a separate flipchart) keep a visible “shopping
list” of the detail names you’ve had to erase.
Use any color you like as long as it’s black! If you’re going to take photos of your
work, stick to black whiteboard markers to improve the results. BEAM✲ notation
is deliberately non-color coded to help you here. Why do you see so many rain-
bow-colored whiteboard diagrams? Occasionally someone will have a well
thought out color scheme (but did they remember 8% of the male population
have color vision deficiency?). More often than not it’s because black is the
missing/dried-up pen. Go out and buy a box of black dry-wipe markers now!
Right now!
If you want to increase the level of interest, interactivity, contribution and energy
when you’re modelstorming give everyone a (black) marker. Get stakeholders
on their feet writing their own event stories on the board as soon as they’re used
to BEAM✲. How well this works depends on your style, their style, everyone’s
handwriting and the number of modelstormers. Having everyone edit the white-
board model together can work well for small groups of peers but no one wants
to feel they’re back at school being told to “Write that on the board”.
If you are using the BEAM✲Modelstormer spreadsheet you will find that the
primary details (the subject, object, and initial when detail) of each table are
frozen, so that you can scroll horizontally without losing the context of each
event story. This spreadsheet also draws a pivoted ER table diagram in sync
with the BEAM✲ table, so you can see a list of all the details at any time.
Appendix C provides recommendations for tools and further reading that will
improve your collaborative modeling efforts.
How Many?
How many questions are used to discover the quantities associated with an event that will become facts in the physical data warehouse and the measures and key performance indicators (KPIs) on BI reports and dashboards. Again, you repeat the main clause of the event as a question to the stakeholders, but this time with “how many” and its variants: “how much”, “how long” etc. inserted to make grammatical sense. For example:
In both cases you want the name of the quantity. Likely answers to these ques-
tions—ORDER QUANTITY and REVENUE—have been added to the event table
in Figure 2-10 along with examples that show a wide range of values. You should
ask how much/many questions for each detail to see if it has any interesting quanti-
ties that should be associated with the event. So you could ask:
The stakeholders would probably like to answer “thousands” but for each order
event story it is always one customer. For details like this where the answer is
always one, or zero if the detail is missing (not mandatory), there isn’t a useful
additional quantity to name and add to the event. When you have checked all the
details for quantities, you should follow up with the general question:
Figure 2-10 Adding quantities
Unit of Measure
When you ask the stakeholders for example quantity values, you should also discover their unit of measure. If you find that a quantity is captured in multiple units of measure it will need to be stored in a standard unit in the data warehouse to produce useful additive facts, so you should ask the stakeholders what that standard unit should be. (Chapter 8 provides details on designing additive facts.) You record the unit of measure in the quantity’s column type using square bracket type notation; e.g., [$] or [Kg]. The unit of measure is a more useful descriptive type for a quantity than [how many].
[ ]: Square brackets denote detail type (e.g. who, where) and unit of measure for
how many details.
If a quantity needs to be reported in multiple units of measure you can record them as a list with the standard unit of measure first. Figure 2-10 shows example events where REVENUE is captured in dollars and pounds. The column type [$, £, €, ¥] records that US Dollar is the standard unit for the data warehouse, but BI applications will also need to report REVENUE in Sterling, Euro and Yen.
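Physically, the fact would be stored once in the standard unit and converted for reporting; a sketch, assuming a hypothetical daily exchange rate table:

    -- REVENUE is stored in the standard unit (US dollars) to keep the
    -- fact additive; other currencies are derived at query time.
    SELECT f.order_id,
           f.revenue_usd,
           f.revenue_usd * r.usd_to_gbp AS revenue_gbp,
           f.revenue_usd * r.usd_to_eur AS revenue_eur
    FROM   order_fact f
    JOIN   exchange_rate_day r ON r.rate_date = f.order_date;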
Durations
You discover durations by asking how long questions. For example, “How long does it take a CUSTOMER to order a PRODUCT?” If the stakeholders view the event as a point in time there will be no duration (not recorded or not significant). Asking for durations is another way of testing if the event should be modeled as evolving. Duration calculations can expose missing when details and highlight other events (verbs) that are so closely related to the current event that they should all be part of an evolving event.
Derived Quantities
Some modelers may question the need for modeling duration quantities. If timestamps are present then durations can be calculated rather than stored. This is true, but BEAM✲ tables are BI requirements models for documenting data and reporting requirements, not physical storage models. By documenting durations and other derived measures as event details you have the opportunity to capture their business names and document their maximum and minimum values (in stories), which can be used as thresholds for dashboard alerts and other forms of conditional reporting.
BEAM✲ event tables do not translate column for column into physical fact tables.
When an event table is physically implemented as a star schema the majority of
its non numeric details will be replaced by dimensional foreign keys, and some of
its quantities can be replaced by BI tool calculations or database views. This
process is covered in Chapter 5.
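For example, a named duration such as ORDER TO DELIVERY DAYS can be delivered as a view over the stored when details rather than as a physical column (illustrative names; date arithmetic syntax varies by database):

    -- A duration measure derived from two when details.
    CREATE VIEW order_fulfillment_v AS
    SELECT order_id,
           order_date,
           delivery_date,
           delivery_date - order_date AS order_to_delivery_days
    FROM   order_fulfillment_fact;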
Why?
Capturing why details is the next step in modeling the event. As with the other “W” questions you ask a why question using the main clause of the event. This will focus the stakeholders on identifying the causal factors that specifically explain variations in event quantities. The why details you are looking for can include promotions, campaigns, special events, external marketplace conditions, regulatory circumstances or even free-form text reasons for which data is readily available. If the stakeholders respond with:
Product promotions.
you would expand the event table as shown in Figure 2-11, and add example stories that illustrate typical and exceptional circumstances. Notice that the typical promotion is “No promotion” and that the why detail has prompted the stakeholders to supply an additional DISCOUNT quantity. Chapter 9 provides detailed coverage on modeling why details as causal dimensions.
If event stories show wide quantity variations, point this out as you model why details, and ask stakeholders whether there are any reasons that would explain these variations. If causal descriptions are well recorded they may also lead you to discover additional quantities.
Figure 2-11: Adding a why detail
How?
You finish with how questions
The final “W” questions discover any how details. How refers to the actual mechanism of the business event itself. You discover these details by asking a how question using the main clause of the event:
Transaction IDs (how details) help to differentiate events
Often how details include transaction identifiers from the operational system(s) that capture each event. If the stakeholders respond with “Order ID” then you would add ORDER ID to the table as in Figure 2-12. ORDER ID might be an equally good answer to other how questions such as: “How do you know that a customer ordered a product; what evidence do you have?” or “How can you tell one similar order from another?” With these questions you are explicitly asking for operational evidence that these event stories exist and can be differentiated from one another.
Figure 2-12: Adding a how detail
How details can also be descriptive
You should ask further how questions to find out if there are any more descriptive how details. You are typically looking for methods and status descriptions. A suitably rephrased how question might be:
Event Granularity
Event granularity is the combination of event details that guarantee uniqueness
Each event story must be uniquely identifiable (otherwise there would be no way to identify duplicate errors). Therefore you must have enough details about an event so that each example story can be distinguished from all others by some combination of its values. This combination of detail is called the event granularity. Discovering the event granularity is the job of the repeat story theme. If every detail in the repeat story matches the typical story you do not have enough details to define the granularity.
Transaction IDs (how details) can be used to define granularity
In most cases, event granularity can be defined by a combination of who, what, when, and where details, but occasionally details stubbornly refuse to be unique through most of the 7Ws. While highly unlikely, perhaps the same customer really can order the same product at the same time at the same price from the same salesperson for delivery on the same date to the same location. In cases like this, the operational source system will have created a transaction identifier—such as Order ID (a how detail)—that can be used to differentiate these event instances.
Granular details are marked with the code GD
After you have discovered the event granularity you document it by adding the short code GD (granular detail) to the column type of each granular detail that in combination defines the granularity. Figure 2-12 shows that the order event granularity is defined by a combination of ORDER ID and PRODUCT. This would equate to order line items in the operational system.
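A granularity declaration like this can be verified against source data with a simple duplicate check: count event rows per combination of the GD details and flag any combination that occurs more than once. A sketch with hypothetical sample rows:

    from collections import Counter

    # Hypothetical order lines; (order_id, product) is the declared grain.
    events = [
        ("ORD1234", "iPip Blue Suede"),
        ("ORD007", "POMBook Air"),
        ("ORD1234", "iPip Blue Suede"),  # a duplicate at the declared grain
    ]

    grain_counts = Counter(events)
    duplicates = {grain: n for grain, n in grain_counts.items() if n > 1}
    print(duplicates)  # {('ORD1234', 'iPip Blue Suede'): 2}

Any duplicates either reveal data quality errors or show that the declared granular details are not yet sufficient.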
Sufficient Detail
Granularity is not enough, you want all the details
Just because you have enough details to define the event granularity does not mean you stop adding details. If the stakeholders are still providing relevant details, keep adding them. Event uniqueness is a minimum requirement. What you are aiming for is the complete set of details that tell the “full” story (or at least as much of it as is currently known).
Figure 2-13: Documenting the event
If the verb doesn’t sound right try a how detail
If the event verb doesn’t provide a good name try using one of the how details. For example, if the main clause was “Customer Buys Product” an event name of CUSTOMER BUYS might be replaced by CUSTOMER PURCHASES or CUSTOMER INVOICES using how details like PURCHASE ID, or INVOICE NUMBER.
Event subjects are subjective!
Before naming an event you may need to change its subject
The subject of an event is actually subjective: it is based on the stakeholders’ initial perspective. Different stakeholders can describe the very same event starting with a different subject. Once all its details have been discovered the initial event might be better described using one of those alternative points of view, by swapping around its details to establish a better subject and/or event name. For example, the event “SALESPERSON sells PRODUCT to DISTRIBUTOR” might be reordered as “DISTRIBUTOR orders PRODUCT from SALESPERSON” and named DISTRIBUTOR ORDERS rather than SALESPERSON SALES. The initial subject (SALESPERSON) has helped to tease out the event stories, but its work here as a subject is now done.
If the subject-verb combination doesn’t provide a good event name try using the
object and a how detail.
Draw a double bar on the right edge to denote a completed table
When you finish documenting an event using a spreadsheet (as in Figure 2-13) draw a double bar on the right edge to signify that the table is complete (for now). This is a helpful visual clue because BEAM✲ tables grow wider than the screen or printed page. If you can’t see the double bar you know you should scroll right or look for a continuation page to see more details.
The finished stories can now be told
By scanning the completed table stakeholders can now read their finished event stories, such as:
Elvis Priestly (is that really his name?) orders 1 iPip Blue Suede on the 18th May 2011, for delivery due on the 22nd May 2011, from James Bond at the POMStore NYC, for delivery to Memphis, TN, for $249, on no promotion with zero discount, using ORD1234.
Vespa Lynd orders a POMBook Air on the 29th June 2011, for delivery on the 4th July 2011, from POMStore.com, for delivery to London, UK, for £1,400 with a launch event 10% discount, using ORD007.
With a little tweaking of prepositions and reordering of details (mainly the how
manys) you can now construct an event story archetype for CUSTOMER ORDERS
from the examples. This final piece of documentation might say:
The generic customer orders story
Customer orders a quantity of products, on order date, for delivery on delivery due date, from salesperson at sales location, for delivery to delivery address, for revenue, on promotion with discount using order ID.
When you already have a library of common dimensions you can quickly move to the next event
In subsequent modelstorming iterations, when you have established a library of dimensions, stakeholders will want to move directly from event to event, rapidly telling event stories by reusing common details (dimensions) and examples. This is a habit that you want to encourage early on by using the event matrix techniques, described in Chapter 4.
When you and your stakeholders get the hang of telling event stories, BEAM✲ can proceed at a storming pace, describing many events in quick succession, but you should be careful not to model too many events at the story level. It is a balancing act. Stakeholders need to model multiple events, to describe their cross-process analytical requirements and be able to prioritize the most important event(s) for the next release—not necessarily the first event(s) you discover when modelstorming. However, telling many more detailed event stories than can be implemented in the next sprint is unnecessarily time consuming and can create unrealistic expectations of what will soon be available. It can also lead to the dreaded BDUF (Big Design Up Front) that does not reflect the business realities, changed requirements and available knowledge when it is eventually implemented.
You should reserve event stories for just-in-time (JIT) modeling of the detailed
data requirements for your next sprint/iteration/release depending on your agile
development schedule. Look to Chapter 4’s JEDUF (Just Enough Design Up Front)
techniques for modeling ahead to (even more rapidly) create higher-level models of
the events in future releases. These models provide just enough information to help
stakeholders decide the best events to model in detail now, and help you design
more flexible versions of those events: ones that will require less rework in future
DW iterations.
Summary
Business events represent the measurable business activities that give rise to DW/BI data requirements. BEAM✲ uses data storytelling techniques for discovering business events by modelstorming with business stakeholders.
BEAM✲ defines three event story types: discrete, evolving and recurring. They match the three
fact table types: transaction fact tables, accumulating snapshots and periodic snapshots.
Each BEAM✲ event consists of a main clause, containing a subject, verb and object, followed by prepositional phrases containing prepositions and detail nouns. Each subject, object and detail is one of the 7Ws; i.e., a noun that potentially belongs in a dimensional database design.
BEAM✲ modelers use the 7Ws to discover a business event and document its type, granularity,
dimensionality and measures—everything needed to design a fact table.
BEAM✲ modelers avoid abstract data models. They “model by example”: ask stakeholders to
describe their BI data requirements by telling data stories. BEAM✲ modelers document these
requirements using example data tables.
Event stories are sentences made up of main clause and preposition-detail examples.
Event stories succinctly describe event details and clarify their meaning by providing examples
of each of the five themes: typical, different, missing, repeat and group.
Typical and different themed stories explore data ranges and exceptions.
Missing stories help BEAM✲ modelers to discover mandatory details and document how BI
applications should display missing values.
Group stories uncover event complexities including mixed business models and multi-valued
relationships.
BEAM✲ short codes are used to document mandatory details, granular details and story type.
Other elements of the BEAM✲ notation document W-type, units of measure and completed
event models.
3
MODELING BUSINESS DIMENSIONS
I keep six honest serving-men
(They taught me all I knew);
Their names are What and Why and When
And How and Where and Who.
— Rudyard Kipling, The Elephant’s Child
Business events need dimensions to fully describe them for reporting purposes
Business events and their numeric measurements are only part of the agile dimensional modeling story. On their own, BEAM✲ event tables are not sufficient to design a data warehouse or even a data mart, because they do not contain all the descriptive attributes required for reporting purposes. For complete BI flexibility, stakeholders need both the atomic-level event details modeled so far and higher-level descriptions that allow those details to be analyzed in practical ways. The data structures that provide these descriptive attributes are dimensions.
BEAM✲ modelers draw hierarchy charts and tell change stories to define dimensions
In addition to the 7Ws and example data tables, BEAM✲ uses hierarchy charts and change stories to discover and define dimensional attributes. Hierarchy charts are used to explore the hierarchical relationships between attributes that support BI drill-down analysis, while change stories allow stakeholders to describe their business rules for handling slowly changing dimensions.
This chapter shows you how to model dimensions from event story details
In this chapter we describe how these BEAM✲ tools and techniques are used to model complete dimension definitions from individual event details. We will use the CUSTOMER and PRODUCT event story details from Chapter 2 for our example dimension modelstorming with stakeholders.
Dimensions
Dimensional attributes are the nouns and adjectives that describe events in familiar business terms
Dimensions are the significant nouns of a business or organization that form the subjects, objects, and supporting details of interesting business events. They are 6 of the 7Ws: the who, what, when, where, why, and how of every event story. Dimensional attributes further describe business events using terms that are familiar to the stakeholders. They represent the adjectives that make data stories more interesting. From a BI perspective, dimensions are the user interface to the data warehouse, the way into the data. Dimensional attributes provide all the interesting ways of rolling up and filtering the measures of business process performance. BI applications use dimensions to provide the row headers that group figures on reports and the lists of values used to filter reports. BI takes advantage of the hierarchical relationships between dimensional attributes to support drill-down analysis and efficient aggregation of atomic-detail measurements. The more descriptive dimensional attributes you can provide, the more powerful the data warehouse and BI applications appear. Consequently, good, richly descriptive customer and product dimensions can have 50+ attributes.
The data values of a dimension (or an individual dimensional attribute) are referred to as its members.
Dimension Stories
Dimension data stories have weak narratives. They are subject and object heavy but verb light
Dimensions, because they represent descriptive reference data (adjectives and nouns), lack the strong narrative of business events. Events (and event stories) are associated with “exciting”, active verbs such as “buy”, “sell”, and “drive” as used in: “customer buys product”, “employee sells service” and “James Bond drives Aston Martin DB5”. Dimensions, on the other hand, are associated with static verbs such as “has” and “is” that lead to weak narratives like “Customer has gender. Customer has nationality” and “Product has product type. Product has storage”. These are state of being events, archetypes for many “is/has” data stories such as “Vespa Lynd is female. She is British” and “iPOM Pro is a computer. It has a 500GB disk”. Important as these sentences are, we hardly think of them as stories because they lack the drama of “who does what, to whom or what, and when, and where” that propels you through all of the 7Ws to rapidly discover data requirements. Dimension stories are not exactly page-turners!
BEAM✲ modelers add drama and plot to help define dimensions
While data stories are highly effective at discovering dimensions and facts (as event details) and the 7Ws remain a powerful checklist at all times, additional techniques are needed to uncover the information hidden within the weaker narratives of dimension stories. BEAM✲ modelers have to add some drama to dimensions to help stakeholders tell more interesting stories that fully describe their attributes and business rules that affect ETL processing and BI reporting. To do this BEAM✲ modelers use two plot devices:
Hierarchy charts help you to ask questions about how dimensions are organized
Hierarchy charts are used to ask stakeholders questions about how they organize (an active verb) the members of a dimension (the values in event stories) into groups. These groups, because they generally have much lower cardinality than the individual members of the dimension, make good row headers and filters for reports—good starting points for telling fewer, higher-level event stories for BI analysis. Many of these groups have hierarchical relationships with one another. Exploring these hierarchies will prompt stakeholders for additional descriptive attributes and provides the information needed to configure BI drill-down and aggregation facilities.
Change stories describe how dimensional attributes change, and how they should reflect history for reporting purposes
Change stories are data stories that document how each dimensional attribute should handle historic values. By asking stakeholders how dimension members change (another active verb) they not only describe which attributes can and cannot change (but can be corrected), but also state their reporting preferences for using current values or historical values. Getting stakeholders to think about change reminds them of additional attributes that behave in similar ways to existing ones and can lead to the discovery of auxiliary business processes that capture changes. Some of these auxiliary processes can be significant enough that they need to be modeled as business events in their own right.
Discovering Dimensions
Dimensions are discovered as event details by telling event stories
Having modeled an event, there is no elaborate discovery technique for dimensions because they naturally come out of the event details that you already have. Each event detail that has additional descriptive attributes will become a dimension. For most details there is, at the very least, an identity attribute—a business key—that uniquely identifies each dimension member. This is typically the case for the who, what, when, where, and why details. The candidate dimensions from Chapter 2’s CUSTOMER ORDERS event are:
Missing from this list are the event quantities and any how details, such as ORDER ID, which do not have any additional descriptions that need to be modeled in separate dimension tables. When the event is converted into a star schema these details translate directly into facts and degenerate dimensions (denoted as DD) within the fact table. Chapter 9 covers instances where physical how dimensions are needed.
Documenting Dimensions
Dimensions are modeled using BEAM✲ tables
Dimensions are documented by taking each event detail (that has additional attributes), one at a time, starting with the event subject and modeling it as a BEAM✲ (example data) table. Figure 3-1 shows how CUSTOMER—the subject of CUSTOMER ORDERS—is used to define a dimension of the same name. Note that dimensions are singular whereas events are plural.
Dimension Subject
The event detail becomes the subject of the dimension
The event detail also becomes the first attribute of the dimension table—its subject. This mandatory attribute of the dimension is denoted by the code MD. You should ask the stakeholders for a suitable subject name. Typically, a dimension’s subject is its “Name” attribute, such as CUSTOMER NAME, PRODUCT NAME, or EMPLOYEE NAME. After you have a name for the attribute you populate it with the unique examples from the event table: notice in Figure 3-1 that the repeated customer names have been removed.
Figure 3-1: Modeling the Customer dimension
A business key must be unique and mandatory
You check that this is what you need (a single, universal, reliable identifier) by confirming with the stakeholders that:
CUSTOMER ID is mandatory. Every customer must have a value for this business key at all times. If this were not the case other business keys would be needed to augment this one.
Add the identity attribute immediately after the dimension subject
Assuming CUSTOMER ID passes these stakeholder tests (which should be confirmed by data profiling as soon as possible) you add this identity attribute to the dimension table with examples to match the existing customer names as shown in Figure 3-2. You also mark it with the column code BK to denote that it is a “Business Key” and because it is the only business key for CUSTOMER you also mark it as mandatory using MD. You can leave out the “has” preposition as it adds little value. When the relationship between an attribute and the dimension subject is more complex (and not apparent from the attribute name) you can add a preposition to help you read the dimension members as stories.
Figure 3-2: Adding a business key to a dimension
Discovering a suitable customer identifier can be difficult. There may be more than one
Asking for a customer business key can be a vexing question. If customer data comes from multiple sources, a single business key that identifies all customers uniquely across all business processes may simply not exist. If you are lucky, customers will have a single major source, or a master data management (MDM) application will have created a master customer ID that the data warehouse can use. If not, be prepared for some interesting discussions with the stakeholders. You may have to model several alternate business keys to have a reliable identifier for each customer.
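The two stakeholder tests for a business key, mandatory and unique, translate directly into data profiling checks that can run as soon as source extracts are available. A sketch over hypothetical source rows (the row structure and names are illustrative assumptions):

    customers = [
        {"customer_id": "C0042", "customer_name": "Elvis Priestly"},
        {"customer_id": "B0007", "customer_name": "Vespa Lynd"},
        {"customer_id": None, "customer_name": "Dean Moriarty"},  # breaks MD
    ]

    ids = [row["customer_id"] for row in customers]
    missing = ids.count(None)              # mandatory (MD) check
    duplicates = len(ids) - len(set(ids))  # business key (BK) uniqueness check
    print(f"missing ids: {missing}, duplicate ids: {duplicates}")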
Dimensional Attributes
Use any detail about detail that you have then ask for new attributes
Having defined the dimension’s granularity, you are now ready to discover the rest of its attributes. You may already have a short list of candidate attributes that were identified as detail about detail while modeling an event. Figure 3-3 shows the PRODUCT TYPE detail about detail being added to the PRODUCT dimension. These candidate attributes are a good place to start, because the stakeholders are obviously keen to use them. If you don’t have any candidate attributes then it’s time to ask the stakeholders:
This usually produces a list of attributes that should be tested, one at a time, to
ensure they are in scope.
Attribute Scope
Check that the new attribute is in scope for your current project, iteration or release
Before you add an attribute to a dimension you need to check that it is within the scope of the dimension and the current project iteration. You check the initial feasibility of an attribute by asking stakeholders whether they believe that data for the new attribute will be readily available from the same sources as other attributes and event details. You should be wary of attributes that don’t exist yet, or greatly expand the number of data sources you have to deal with in an iteration. Attempting to collect examples is a good way of weeding out attributes that are “nice to have if we had an infinite budget.” If stakeholders struggle to provide any examples, or can’t agree on common examples, the attribute may be of limited value.
Figure 3-3: Product dimension populated with detail about detail
If you are modeling directly into (projected) spreadsheet BEAM✲ tables, as in the
figure examples in this chapter, freeze the dimension subject and business key
columns, as you would the primary details of events, to keep them visible as you
scroll horizontally to add new attributes.
Check that the new attribute has only one value for each dimension member
Assuming that most of the attributes that stakeholders suggest are within the current scope, your next task is to check that they belong in the current dimension. You are looking for attributes that have a single value for each dimension member at any one moment in time (including “now”): these are attributes that have a one-to-one (1:1) or a one-to-many (1:M) relationship with the dimension subject when you disregard history. You will consider historical values shortly when you ask for change stories. For now, you check each new attribute by asking the following “moment-in-time” question:
The question is carefully phrased so that you are checking for the condition you don’t want, so if the answer is NO then the attribute belongs. NO is good! It tells you that the attribute does not have a M:M or M:1 relationship with the dimension, both of which would rule it out. For example, you might ask:
If the stakeholders answer NO, you add CUSTOMER TYPE to the dimension as in Figure 3-4. Try this question on something which doesn’t belong to CUSTOMER, like the detail about detail PRODUCT TYPE:
Product type obviously doesn’t belong to Customer and the YES answer confirmed it. In reality, common sense would have prevented you or the stakeholders from considering this as a CUSTOMER attribute.
Figure 3-4: Adding CUSTOMER TYPE to CUSTOMER
Sometimes you will also have to exercise a little intuition to interpret a YES (multiple values are possible at any moment) answer. If you ask:
A small fixed number of values can be modeled as separate attributes
then you have found a single-valued attribute that belongs. You should also find out how many other address values a customer can have. If customers have only two additional addresses, for example, home address and work address, then you can define them as separate attributes in the dimension by being more precise about their meaning/name.
If a proposed attribute has multiple values it may represent a separate dimension of the business event
If customers can have multiple addresses (some might have hundreds) then you may have discovered a missing where detail of CUSTOMER ORDERS. Addresses might be delivery addresses; customers are ordering gifts for their family and friends or resellers are ordering products for their clients. If that is the case, the addresses in question don’t belong to customers. They belong instead to the business event as a DELIVERY ADDRESS detail and subsequently as a separate dimension. This handles the genuine many-to-many (M:M) relationship between customers and these addresses.
A proposed attribute can be qualified (e.g. Main… or Primary…) to restrict it to a single value
If the answer to the question: “Can a [dimension name] have more than one [attribute name] (at one moment in time)?” is a resounding NO then the attribute belongs in the dimension. If the answer is a resounding YES it most likely doesn’t belong but may warrant more investigation before you rule it out. If no one is happy to see the dimension without a particular attribute, you may have to qualify the attribute in some way (for example, Primary Address), or adjust the dimension’s granularity to accommodate it.
Alternatively the dimension granularity might need to be adjusted to match the vital attribute
If the multiple addresses do belong to a customer—they are the offices or stores of a corporate customer—you might need to change the CUSTOMER dimension granularity to customer location, if event activity is tied to specific locations and stakeholders treat each location as an individual customer. In which case you would redefine the granularity as one row per customer per location and define a composite business key to uniquely identify each member.
Attribute Examples
Ask for examples for each new attribute that match the dimension subject members
After you have established that an attribute belongs in the dimension you add it to the BEAM✲ table and ask the stakeholders for example values that match the dimension subject. The dimension subject will already contain typical and exceptional (different and group themed) examples copied from an event (that you captured using the event story themes in Chapter 2). These usually prompt the stakeholders to provide interesting values for each new attribute, too. Stakeholders will typically give you values that they want to see on their reports.
If stakeholders cannot agree on examples, you may have homonyms: multiple attributes with similar names
The goal of using examples is to ensure that everyone is in clear agreement about the definition of each attribute. If the meaning or use of an attribute is unclear or contentious ask for additional examples. If stakeholders can’t agree on example values for an attribute it can indicate that you have discovered homonyms: two or more attributes with the same name but different meanings. If all the possible meanings are valid, then each set needs to be uniquely named, and modeled as a separate attribute with its own examples.
Descriptive Attributes
Check if codes such as business keys are smart keys: contain hidden meaning
When documenting examples for business keys and any other cryptic codes check if the values contain any hidden meaning. For example, the data in Figure 3-2 suggests that CUSTOMER ID is a “smart key” that can be used to differentiate business and consumer customers. How does this tally with CUSTOMER TYPE? You would investigate this further via data profiling. It might prove useful as an additional quality assurance test during ETL processing. For every code you are given you should ask the stakeholders:
Model descriptive attributes that decode all cryptic codes. No decodes in BI queries!
If YES, you want to convert this report logic into ETL code and define descriptive attributes for these labels in your dimensions, where they will be consistently maintained and available to everyone. Your motto should be “No SQL decodes at query time!” If BI applications need to decode dimensional attributes you and the stakeholders have not done a good enough job of defining the dimensions.
Beware of “smart keys” with embedded meaning. They seldom remain smart over
their lifetime, and often become overloaded with multiple meanings as business
processes evolve. BI applications should not attempt to decode smart keys and
other codes to provide descriptive labels. It is almost impossible for embedded
report logic to keep up with these codes as their meaning morphs over time. It
should be replaced by descriptive data in the data warehouse.
Codes provide consistent sort order for multi-language text
If you find more BI-friendly descriptive attributes for codes you can remove or hide the codes in the final version of a dimension, as long as stakeholders have no use for them on reports. However, if you are designing a multinational data warehouse that will translate descriptive attributes into several national languages, these otherwise cryptic codes are useful for consistently sorting reports that are re-run internationally. Chapter 7 covers techniques for handling national languages.
Boolean flag attributes that contain “Y” or “N” (e.g., RECYCLABLE FLAG) can usefully be augmented with matching report display-friendly attributes containing descriptive values (e.g., “Recyclable” and “Non-Recyclable”).
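Both kinds of decode, smart key prefixes and Y/N flags, belong in ETL transforms so the descriptive values are stored once in the dimension. A sketch of the idea; the B/C prefix reading of CUSTOMER ID and the helper names are hypothetical assumptions, not the book's implementation:

    def decode_customer_type(customer_id: str) -> str:
        """ETL-time decode of an assumed smart key prefix: B=Business, C=Consumer."""
        return {"B": "Business", "C": "Consumer"}.get(customer_id[:1], "Unknown")

    def decode_flag(flag: str, label: str) -> str:
        """Turn a Y/N flag into a report display-friendly description."""
        return label if flag == "Y" else f"Non-{label}"

    print(decode_customer_type("B0007"))   # Business
    print(decode_flag("N", "Recyclable"))  # Non-Recyclable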
Mandatory Attributes
Use MD to record that stakeholders believe an attribute to be mandatory
While you are filling out example data for an attribute ask whether it is mandatory. If the stakeholders believe it is, you should add MD to its column type. MD does not necessarily define attributes as NOT NULL in the data warehouse. MD may just represent the stakeholders’ idealistic view, while the data warehouse has to cope with the actual operational system data quality. By documenting allegedly mandatory attributes you are capturing rules that the ETL process can test for, and identifying potentially useful attributes for defining dimensional hierarchies.
Missing Values
Every dimension needs a missing row to document missing display values
One example you must fill in for every attribute is its “Missing” value. If the dimension subject has already been identified as possibly missing from an event story there will already be a missing subject copied from the event. If not you should add a missing row to the dimension just as you would to an event. You fill out this row by asking the stakeholders how they want “Missing” to be displayed for each attribute.
Even mandatory attributes need a missing value
Paradoxically you need to ask for missing values even for mandatory attributes. For example, if CUSTOMER TYPE is a mandatory attribute of CUSTOMER then for all SALES events involving Customers you can rely on the Customer Type to be present. But if Customers are missing for certain SALES events (for example, anonymous cash transactions) then CUSTOMER TYPE will also be missing. Figure 3-4 shows that when CUSTOMER is missing, the stakeholders want CUSTOMER TYPE to be displayed as “Unknown.”
If there are different types of missing, you need multiple missing stories
If stakeholders need the data warehouse to be able to differentiate between various types of “Missing” (e.g., “Not Applicable”, “Missing in error”, or “To be assigned later”) the dimension will need additional special case missing stories with different descriptive values and ETL processes will have to work a little harder to assign the correct “missing” version to the events. The implementation of this is discussed in Chapter 5.
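One common physical treatment, sketched below with hypothetical keys and labels (Chapter 5 covers the book’s actual implementation), is to seed the dimension with one special row per type of missing and have the ETL process assign the appropriate key when the customer is absent:

    # Hypothetical special-case member keys seeded into the CUSTOMER dimension.
    MISSING_MEMBERS = {
        "not_applicable": -1,  # displayed as "Not Applicable"
        "error": -2,           # displayed as "Missing in error"
        "tba": -3,             # displayed as "To be assigned later"
    }

    def assign_customer_key(customer_key, missing_reason="error"):
        """ETL rule: keep a real key; otherwise pick the right missing member."""
        if customer_key is not None:
            return customer_key
        return MISSING_MEMBERS[missing_reason]

    print(assign_customer_key(None, "not_applicable"))  # -1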
Don’t go overboard with examples. Dimensions usually have far more attributes than events have details, and you want to discover as many dimensional attributes as possible rather than exhaustively capture examples for only a few.
Figure 3-5: Exclusive customer attributes
Exclusive attributes have at least one defining characteristic
If you discover exclusive attributes, you need to find their defining characteristic(s): one or more attributes whose values control the validity/applicability of the exclusive attributes. For customer there is a single attribute, CUSTOMER TYPE, which dictates whether the attributes in X1 or X2 are valid. This is marked as a defining characteristic with the code DC. If a dimension contains multiple defining characteristics each DC code should be followed by a list of the exclusive group numbers controlled by the attribute. For example, if CUSTOMER TYPE was one of several DC attributes in CUSTOMER it would be marked DC1,2 because it selects between X1 and X2 group attributes only.
Exclusive attribute groups can be nested with other exclusive groups
Exclusive attribute groups can be nested if required. For example, if “for profit” and “non profit” organizations were described differently, an additional defining characteristic BUSINESS TYPE marked DC3,4 would govern their descriptive attributes marked X3 and X4. As these are all business related attributes they would be nested within the “Business” exclusive group X2 using the code combinations X2, X3 and X2, X4. Their defining characteristic BUSINESS TYPE is also a business only exclusive attribute so it should be marked in full as X2, DC3,4.
An attribute can only be mandatory when its Xn group is valid
Some of the exclusive attributes in Figure 3-5 are marked as mandatory (MD) but are not always present because they are exclusive to a subset of the dimension members. The code combination Xn, MD means exclusive mandatory attribute: the attribute is only mandatory when its exclusive group is valid.
Exclusive attribute subsets can be implemented as separate tables
Defining characteristic and exclusive attribute groups allow you to model subsets within a single BEAM✲ table. Subsets can help you later to define restricted views (or swappable dimensions, see Chapter 6) to increase usability and query performance. They also provide important ETL processing rules and checks.
Answers like these would lead you to attributes such as CUSTOMER TYPE and
PRODUCT TYPE (if you didn’t already have them). Table 3-1 illustrates other
7W-inspired questions and example answers for customer and products.
Table 3-1: Example 7W attribute questions and answers
What dates (whens) are important to know about a customer?
    Birth Date, Graduation Date, First Purchase Date, Last Purchase Date, Renewal Date
What milestone dates (whens) are there for a product?
    Launch Date, Arrival of First Competitor, Patent Expiration Date, Discontinued Date
Are there any single-valued quantities (how many) that describe or group customers?
    Life Time Value, Loyalty Score, Current Balance, Number of Employees, Number of Dependents
What quantities (how many) describe products?
    Weight, Size, Capacity, List Price
Dimensional Hierarchies
Dimensional hierarchies support BI drill-down analysis
Hierarchies provide a mechanism for dealing with details that are too numerous or small to work with individually. Dimensional hierarchies describe sequences of successively lower cardinality attributes. They allow individual business events to be consistently rolled up to higher reporting levels, and subsequently drilled-down on, to explore progressively more detail. Without hierarchies, BI reporting would be overwhelmed by detail.
When dealing with lots of things, we naturally tend to organize them hierarchically into fewer and fewer groups
So how much detail is too much? In everyday life we all seem to agree that 365 (or 366) are too many! Too many days to always deal with individually—too short a period of time to complete large tasks and activities or see trends. So we group our days into longer periods: weeks, months, quarters, terms, semesters, seasons etc. so that we can plan bigger things and see patterns and trends in our lives. Figure 3-6 shows how the days in the first quarter of 2012 can be grouped hierarchically. Organizations naturally do this with the many fiscal time periods, geographic locations, people, products, and services that they work with. The clue is in the name: organizations organize things and if there are enough things (their 7Ws) they typically organize them hierarchically. The majority of the 7Ws that describe a business will have de-facto hierarchies in place, and it is vitally important that they are standardized and made available in the dimensions in the data warehouse.
Figure 3-6: The calendar: a balanced hierarchy
Hierarchy definition provides a hook for catching additional attributes
Hierarchies provide a necessary hook to catch dimensional attributes. When you ask stakeholders about hierarchies you are asking them how they (would like to) organize their data. Discussing this activity is one technique for adding some otherwise missing narrative to dimension stories and prompting stakeholders for the BI friendly attributes you need to model good dimensions. When stakeholders think about their 7Ws hierarchically, they describe low level attributes that can be used as discriminators for similar dimensional members, and higher level attributes that can group together many dimensional members.
Hierarchies help expose “informal” stakeholder maintained data sources that can greatly enrich dimensions
When stakeholders describe their favorite hierarchy levels they will frequently provide you with additional “informal” data sources (spreadsheets, personal databases) they own that contain this categorical information. These stakeholder maintained sources often contain hierarchy definitions, vital to BI, that are missing from “formal” OLTP databases because they are nonessential for operational activity. Many operational applications happily perform their function at the bottom of each hierarchy with no knowledge of the higher level groupings that are imposed upon their raw transactions for reporting purposes. For example, orders can be processed day in, day out without the order processing system knowing how a single date is rolled into a fiscal period. Similarly, items can be shipped to the correct street number/postal code without knowing how the business currently organizes sales geographically (or how they might have been organized differently last year).
Hierarchies help you discover planning processes
Hierarchies exist so that organizations can plan. Discussing hierarchies with stakeholders will get them thinking about their planning processes, and will likely help you discover additional events and data sources that represent budgets, targets or forecasts. You must make sure the dimensions you design contain the common levels needed to roll up actual event measures for comparison against these plans.
BI users and BI tools require default hierarchies to enable simple drill-down
BI users like default hierarchies and BI tool “click to drill” functionality that allows them to quickly drill-down on an attribute without having to manually decide each time what to show next. For example, if users drill on “Quarter” they usually want to see monthly detailed data by default. Explicit hierarchies establish predictable analytical workflows that are very helpful to (new) BI users exploring the data for the first time. “Clicking to drill” is less laborious and error prone than manually adding and removing report row headers.
Hierarchies are used to optimize query performance
Everyone wants common drill-down and drill-up requests to happen quickly. Explicit hierarchies are needed to define efficient data aggregation strategies in the data warehouse. On-Line Analytical Processing (OLAP) cubes in multidimensional databases, aggregate navigation/query rewrite optimization in relational databases and prefetched micro cubes in BI tools, all take advantage of hierarchy definitions to maximize query performance.
Hierarchy Types
There are three hierarchy types: balanced, ragged and variable depth
Data warehouses and BI applications have to deal with three types of hierarchies: balanced, ragged (unbalanced) and variable depth (recursive). Each of these can come in two flavors: simple single parent and more complex multi-parent. Of the six varieties shown in Figure 3-7, single parent balanced hierarchies are the easiest to implement and use and should be the main focus of your initial modeling efforts.
Figure 3-7: Hierarchy types
Balanced Hierarchies
Balanced hierarchies have fixed numbers of levels
Balanced hierarchies have a fixed (known) number of levels, each with a unique level name. Time (when) is an example of a balanced hierarchy, as the example calendar data in Figure 3-6 shows. This example has four levels: day, month, quarter, and year. The hierarchy is balanced because there are always four levels; days always roll up to months, months to quarters, and quarters to years—there are no exceptional dates that do not belong to a month and only belong to a quarter or a year.
The number of members at each level can vary
Being balanced has nothing to do with the number of members (unique data values) at each hierarchy level. For example, even though the number of days in a month varies from 28 to 31, and days in a year can be 365 or 366, the calendar hierarchy is still balanced in depth. Figure 3-6 is not the only time hierarchy; alternative hierarchies of day → fiscal period → fiscal quarter → fiscal year and day → week → year may all exist in the same calendar dimension. Each of these is a separate balanced hierarchy.
Balanced hierarchy levels are mandatory attributes
A hierarchy is implemented in a dimension by adding an attribute for each of its levels. For a balanced hierarchy each of its fixed levels must be a mandatory attribute with a strict M:1 relationship with the parent attribute one level above it and a 1:M relationship with the child levels below it.
Ragged Hierarchies
Ragged hierarchies have missing levels (with zero members) that unbalance them
Ragged (or unbalanced) hierarchies are similar to balanced hierarchies in that they have a known maximum number of levels and each level has a unique name, but not all levels are present (have values) for every path up or down the hierarchy—making some paths appear shorter than others. Figure 3-8 illustrates a ragged product hierarchy, where a product (POMServer) does not belong to a subcategory. This product is effectively a subcategory all of its own.
Figure 3-8: Ragged product hierarchy
Try to balance slightly ragged hierarchies by removing levels or filling in missing values
You can model ragged hierarchies in a dimension by using non-mandatory attributes for the missing levels, but these gaps (missing values) cause problems for BI drilling. If a hierarchy is only slightly ragged you can often redesign it with the stakeholders’ help as a balanced hierarchy, to improve reporting functionality. This can involve removing levels that are not consistently implemented for all members or creating new level values to fill in the gaps (e.g. a subcategory value of “Server” for the Figure 3-8 example). See Chapter 6 for more details on balancing ragged hierarchies.
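Once stakeholders agree on the fill-in rule, balancing a slightly ragged hierarchy is a simple ETL transform. A sketch that supplies the agreed “Server” subcategory from the Figure 3-8 example; the row structure is a hypothetical assumption:

    products = [
        {"product": "POMBook Air", "subcategory": "Laptops", "category": "Computers"},
        {"product": "POMServer", "subcategory": None, "category": "Computers"},
    ]

    for row in products:
        if row["subcategory"] is None:
            # Fill the gap so every drill-down path has the same number of levels.
            row["subcategory"] = "Server"

    print(products[1]["subcategory"])  # Server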
Multi-Parent Hierarchies
Multi-valued hierarchies contain members with two or more parents at the same level
The time hierarchy in Figure 3-6 is a single parent hierarchy because each child level value rolls up to just one parent level value. In contrast, Figure 3-9 shows a multi-parent product hierarchy where a product (iPipPhone) belongs to more than one Product Type (it is part telephone, part media player). In a multi-parent hierarchy each child level can roll up to multiple parents. If a multi-parent product hierarchy is used to roll up sales to the Product Type level, something must be done to account for products that fall into multiple types. Their sales will need to be carefully allocated; otherwise revenue for products with two parents will be double-counted at the Product Type or Subcategory level.
Figure 3-9: Multi-valued Product hierarchy
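One way to avoid the double counting just described is to give each parent a weighting factor and split the product’s revenue across its product types; the weights for each product must sum to 1. A sketch with hypothetical weights:

    # Hypothetical allocation weights for the multi-parent iPipPhone.
    PARENT_WEIGHTS = {"Telephone": 0.6, "Media Player": 0.4}  # sums to 1.0

    def allocate_revenue(revenue: float) -> dict:
        """Allocate one product's revenue across its parent product types."""
        return {ptype: round(revenue * w, 2) for ptype, w in PARENT_WEIGHTS.items()}

    print(allocate_revenue(500.0))  # {'Telephone': 300.0, 'Media Player': 200.0}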
Multi-parent, variable-depth hierarchies represent M:M recursive relationships
Multi-parent hierarchies can also be ragged or variable depth. The latter are typically represented in source systems by M:M recursive relationships. Multi-parent hierarchies and variable depth hierarchies cannot be modeled directly in dimension tables. Chapter 6 covers additional structures (hierarchy maps) for coping with these complex hierarchies and handling fact allocation at query time across multiple parents. For the remainder of this chapter, assume hierarchies to be single parent hierarchies that are modeled within dimension tables.
A dimension can contain multiple hierarchies of different types. You should model
at least one balanced hierarchy for each dimension to help discover additional
attributes and common levels for comparisons across processes, and to enable
default BI drill-down facilities.
Hierarchy Charts
Hierarchy charts are a quick way to visualize hierarchies
Hierarchy charts are simple, quick to draw diagrams used to model single or multiple hierarchies. On a hierarchy chart a dimensional hierarchy is represented by a vertical bar with the dimension name at the bottom and the highest-level attribute of the hierarchy at the top. The levels are represented as marks on the bar, in ascending order. Figure 3-10 shows three example hierarchy charts for Time and Product.
Figure 3-10a, b, c: Hierarchy charts for Time and Product
Levels can be spaced evenly or relative to the aggregation they provide
When you draw a hierarchy chart you can space out the level tics evenly, as in Figure 3-10a and 3-10c, or in rough approximation of their relative aggregation, as in Figure 3-10b where levels that expose more details are placed further below their parent than levels that reveal fewer details. Relative spacing gives stakeholders a visual clue as to how much more detail they can expect to drill down to at each level, or how selective filters would be placed at various levels. You can also annotate levels with their approximate cardinalities, as in Figure 3-10a. Large gaps or jumps in cardinality on a hierarchy chart can prompt stakeholders for missing levels that would give them ‘finer grain’ drill-down and even more interesting descriptions.
Hierarchy charts can show single or multiple hierarchies
In addition to providing a visual comparison of levels within a single hierarchy, a hierarchy chart can also be used to compare multiple hierarchies for a single dimension, as in Figure 3-10b, or all the dimensions associated with an event, as in Figure 3-11.
Figure 3-11: CUSTOMER ORDERS hierarchy chart
type parents. In Figure 3-12c the variable depth of a human resources (HR) hierarchy caused by the recursive relationship (see Chapter 6) between managers and employees is modeled by adding a circular path between the two levels. These annotations can be combined to model the most complex hierarchies.
Figure 3-12a, b, c: Ragged, multi-parent and variable depth hierarchy charts
Modeling Queries
Event hierarchy charts can model the dimensionality of queries, OLAP cubes and aggregates
Event hierarchy charts which combine multiple dimension hierarchy charts for the same event can be used to model query definitions for report and dashboard design. One or more queries can be defined on an event hierarchy chart as lines connecting the referenced levels, as shown in Figure 3-13 where X marks levels that are used to filter the query (WHERE clause), and O marks those used to aggregate it (GROUP BY clause). In this way, event hierarchy charts can also be used to model the dimensionality of OLAP cubes and aggregate fact tables.
Figure 3-13: Query definition hierarchy chart
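The X and O marks map directly onto a query’s WHERE and GROUP BY clauses. The following sketch makes that mapping concrete over a few hypothetical event rows, filtering on the Quarter level (X) and aggregating by the Category level (O):

    from collections import defaultdict

    # Hypothetical atomic events with hierarchy levels already attached.
    orders = [
        {"quarter": "Q2", "category": "Computers", "revenue": 1400.0},
        {"quarter": "Q2", "category": "Media", "revenue": 249.0},
        {"quarter": "Q3", "category": "Media", "revenue": 99.0},
    ]

    totals = defaultdict(float)
    for row in orders:
        if row["quarter"] == "Q2":                     # X mark: WHERE quarter = 'Q2'
            totals[row["category"]] += row["revenue"]  # O mark: GROUP BY category

    print(dict(totals))  # {'Computers': 1400.0, 'Media': 249.0}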
Event hierarchy charts can be used while modeling events to help capture their
major dimensional attributes. By drawing a hierarchy chart above an event table
you can record detail about detail (dimensional attributes) in hierarchical order at
the same time that you model event details (dimensions and facts).
It’s a leading question if you add “hierarchically”, but stakeholders generally have a
good idea about the hierarchies they need, and will usually offer you candidate
levels in hierarchical order—which helps. If any are new attributes check that they
belong in the dimension (have a M:1 relationship with it) and ask for examples
before considering them as candidates.
Ask stakeholders to add their new levels to the hierarchy chart
Ask stakeholders where they think a candidate attribute sits on the hierarchy chart bar and add it to your diagram where they suggest, or better still get them to add it. Figure 3-14 shows SUBCATEGORY added to a PRODUCT hierarchy chart between BRAND and CATEGORY.
Figure 3-14: Adding SUBCATEGORY to the PRODUCT hierarchy at the correct level
Check that each candidate level has the correct parent child relationships
As each new candidate is added, you need to check that it is in the right position relative to the existing levels. It must have a M:1 relationship with its parent (if present) and a 1:M relationship with its child. If you or any stakeholders are unsure, you can check the relationship by methodically asking the following quick fire questions:
If a new level is M:1 with its parent, check that it is 1:M with its child
If you finish with “M” above the candidate (SUBCATEGORY) and “1” below the parent (CATEGORY) (NO, YES answers) as in Figure 3-14 then you have the M:1 relationship you are looking for and the candidate belongs in the hierarchy below the parent. If the child below the new level is the dimension itself then the candidate is in the correct position (you already know that the new level has a 1:M relationship with the dimension). Otherwise you test that the child relationship is 1:M by asking a few more quick fire questions while pointing at the hierarchy chart (pointing always helps):
If the answer is YES then put “M” just above the child. If the answer is NO put “1” just above the child.
If you finish with “M” above the child (BRAND) and “1” below the candidate
(SUBCATEGORY) (YES, NO answers) you have the right M:1 child relationship,
the candidate is in the correct position and you can move on to the search for
another level in the hierarchy.
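Data profiling can confirm the stakeholders’ answers: for each pair of adjacent levels, count the distinct parents of every child value; any child with more than one parent breaks the M:1 rule. A sketch with hypothetical (brand, subcategory) pairs:

    # Hypothetical (child, parent) level pairs profiled from source data.
    pairs = {
        ("iPip", "Media Players"),
        ("POMBook", "Laptops"),
        ("POMServer", "Servers"),
    }

    parents_per_child = {}
    for child, parent in pairs:
        parents_per_child.setdefault(child, set()).add(parent)

    violations = {c: p for c, p in parents_per_child.items() if len(p) > 1}
    print(violations if violations else "BRAND to SUBCATEGORY is M:1")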
Follow each hierarchy level name with a few example values as a bracketed list as
in the Figure 3-14 chart. This is especially useful if you are modelstorming a
hierarchy chart with limited whiteboard space and stakeholders cannot see a copy
of the dimension example data table.
Completing a Hierarchy
Check for “hot” planning levels before finishing each hierarchy
After you have found the correct position for the new level—or discarded it if it does not belong—you continue to find more levels by pointing at the existing ones and asking stakeholders whether any other levels exist above or below them. When they have finished providing new levels you should ask one more hierarchy related question:
Mark hot levels with an * and be prepared to model their additional matching events and rollup dimensions
If stakeholders identify planning levels, mark these with an asterisk (*) to denote that they are “hot”, i.e., likely to be particularly important levels for many BI comparisons. You may need to design aggregates or OLAP cubes at these levels to improve query performance (see Chapter 8). You should definitely model the planning events themselves along with any additional hot level rollup dimensions they require. E.g., Month is typically a hot level in the when hierarchy that is implemented as a rollup dimension (a separate physical dimension derived from the base calendar dimension) to match the granularity of plan and aggregate facts. (See Chapters 4 and 8.)
Hot levels exist where different W-type hierarchies intersect
Hot levels often appear at the points where different W-type hierarchies logically or physically intersect. For example, Category and Department are hot levels in the Product and Employee hierarchies (as denoted in Figure 3-12) because this is notionally where a what hierarchy of things (products) intersects with a who hierarchy of people (employees). At that point, a de facto 1:1 relationship exists between the HR and product hierarchies: a single employee (a product sales manager) responsible for a single group of employees (a department) is also responsible for a single group of products (a category). He or she will want many reports summarized to these levels.
Check all levels are mandatory to avoid ragged hierarchies
When you have finished modeling a hierarchy, check that each level is a mandatory attribute. If some are not mandatory then you may have a ragged hierarchy instead of your preferred balanced one. If data profiling confirms that certain level attributes contain nulls, then update the hierarchy chart to document the missing levels by putting brackets around their names (as in Figure 3-12a) prior to resolving the issue with stakeholders. It is especially important that hot levels are mandatory for successful cross-process analysis.
When you have completed all the hierarchy charts for a dimension, rearrange the level attributes in hierarchy order in the dimension table with the low level attributes first followed by higher level attributes (reading left to right). Hierarchical column order increases readability and helps to roughly document the hierarchical relationships within physical dimension tables.
Dimensional History
You must define how dimensional attributes handle history
When you have discovered all the attributes of a dimension, described them using examples, and modeled their hierarchical relationships, there is one more piece of information that stakeholders must tell you about each one: how to handle its history. This information is also known as a dimension’s temporal properties or slowly changing dimension (SCD) policy.
Slowly changing dimensions dramatically affect how historical events are reported
Stakeholders instinctively feel the need to preserve event history—especially legally binding financial transactions—but may think of dimensions as (relatively static) reference data that simply has to be kept up to date when it does change. While it is true that dimensions are relatively static compared to dynamic business events, slowly changing dimension history, or rather the lack of it, can have a profound effect on a data warehouse’s ability to accurately compare events over time and meet stakeholders’ initial and longer-term needs. For example, Dean Moriarty, a customer who was based in New York, relocates to California at the beginning of this year. Should a BI query for “order totals by state, this year vs. last year” associate all of Moriarty’s orders to his current location: California? Or should Moriarty’s orders last year be associated with New York (last year’s location), and only this year’s with California? What if BI users want to look at the biggest spenders in California over the last two years, should their queries include Moriarty based on his high spending while he was still in New York last year or exclude him because he hasn’t spent so much since moving to L.A.? Another way of asking these questions is “Should queries use the current or historical values of customer state?” Is there a simple answer?
Current value (CV) attributes provide “as is” reporting that matches operational reporting results
Dimensional attributes that only contain the current value descriptions provide “as is” reporting; i.e., they roll up event history using today’s descriptions, making it seem as if everything has always been described as it is now. For the previous example, current value (CV) attributes would roll up all of Moriarty’s orders to California (his current location) regardless of where, on the road, he was living when he placed them. This is typically the style of reporting that stakeholders are used to from their attempts to analyze history directly from operational systems. CV attributes enable the data warehouse to produce the same results as existing operational reports—often an initial acceptance criteria for stakeholders.
With current values only, dimensional history is lost and many potentially important BI questions cannot be answered correctly.
Unfortunately, problems arise when current values are the only descriptions available for DW/BI systems, which, by definition, must support accurate historical comparisons. CV attributes may be capable of answering questions such as: "Where are the customers, now, who bought … in the last three years?" or "What are the top selling products this year vs. last year?" But they cannot answer: "Where were those customers and what were they like when they bought our products?" or "Exactly what were products like (how were they described and categorized) at the time they were purchased?" These questions cannot be answered because dimensional history is lost when CV attributes are updated (overwritten).
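The difference is easy to demonstrate. Below is an illustrative Python sketch (the order rows for the Moriarty example are invented) showing how the same orders roll up differently under current value ("as is") and historic value ("as was") state descriptions:

# Invented orders for the Moriarty example. Each order row keeps the state
# that was historically correct when it was placed ("as was"); the
# current-value lookup describes everything with today's state ("as is").
orders = [
    {"customer": "Dean Moriarty", "year": 2023,
     "state_as_was": "New York", "revenue": 950.0},
    {"customer": "Dean Moriarty", "year": 2024,
     "state_as_was": "California", "revenue": 120.0},
]
current_state = {"Dean Moriarty": "California"}  # CV: one value per member

def rollup_by_state(rows, as_is):
    totals = {}
    for row in rows:
        state = (current_state[row["customer"]] if as_is
                 else row["state_as_was"])
        totals[state] = totals.get(state, 0.0) + row["revenue"]
    return totals

print(rollup_by_state(orders, as_is=True))   # all revenue lands in California
print(rollup_by_state(orders, as_is=False))  # New York keeps last year's orders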
Current value only designs make it impossible to reproduce reports when dimensions change.
Another limitation of current value only solutions is that they cannot reproduce previous historical analyses. The same report with exactly the same filters will often yield different results when run later—even though every detailed event remains unaltered—because the reference data used to group, sort and filter the events has changed when it should not have. This is a common bane of reporting from operational sources that stakeholders do not want repeated in the data warehouse.
Current value attributes can usefully recast history.
It's not all bad news for CV attributes. Even though they are historically incomplete/incorrect, they can be useful for certain types of historical comparisons that recast history: deliberately pretend everything was described as it is now. For example, a sales manager who wants to compare channel sales for this year versus last year may need to pretend that today's channel structures also existed last year, in order to make the comparison. This is exactly what CV channel description attributes will do.
HV attributes require more ETL resources, but provide more flexible reporting.
Preserving dimensional history requires more ETL work, but data warehouses that are built using historic value (HV) dimensions are more flexible. Not only can they correctly answer the "What were things really like when …?" questions by default, they can also be used with minimal effort to recast history to current values for "as is" reporting, or to a specific date, such as a financial year-end, for "as at" reporting. HV dimension techniques for supporting both "as is" and "as was" reporting are covered in Chapter 6.
Historic value dimensional attributes support the agile principle “Welcome chang-
ing requirements, even late in development.” Stakeholders are able to change
their mind about using current or historic values without ETL developers “tearing
their hair out” and having to reload the data warehouse.
If the answer is NO, label the attribute as fixed value with the short code FV and
copy its example value from the first row (the typical member) into its change story
as in Figure 3-15 to illustrate that it is unchanging over time. Then move on to the
next attribute.
If the answer is YES, the attribute’s values are not fixed and you need to ask a
follow-on question to discover if stakeholders want/need (not always the same
thing) historical values. For example, if PRODUCT TYPE is not FV ask:
Ask about historic values: you know current values are important already.
Make sure you never ask: "Do you want current values or historic values?" The answer, which is invariably "current values", tells you nothing you shouldn't already know. Of course stakeholders want current values—that's a given: they are incredibly interested in current business events and want those events to be described properly using current values just as they are in the operational systems. You want to discover if attribute history is equally important. It is also highly misleading to present historic values and current values as an either/or choice: HV attributes must include current values because current values are the historically correct descriptions for the most recent events.
Figure 3-15: Modeling dimensional history using a change story
Double-check that history really is unnecessary before defining a CV attribute.
If the answer to your "Do you need historic values?" question is NO, double-check with the stakeholders that they will only ever care about current values and are fully aware of the BI limitations they are settling for. The problems of misstated history and unrepeatable reports caused by CV-only attributes are best explained, using examples (see Documenting CV Change Stories shortly), to any stakeholder making this decision for the first time.
You may need to provide CV reporting behavior initially but store HV attributes to provide flexibility in the future.
For many attributes it is a good idea to treat the stakeholders' CV answers as reporting directives rather than storage directives. They, quite rightly, are telling you how they want their (initial) BI reports to behave: often exactly like existing operational reports they know (and love), but not how to store information in the data warehouse—that's not their role. For all but the largest dimensions and most volatile attributes it is possible to efficiently store (compliance ready) HV attributes but provide CV versions by default for reporting purposes.
You can document hybrid temporal requirements by combining HV and CV short codes.
If you, or the stakeholders, decide that both HV and CV data is needed for an attribute you can label it as an HV/CV or CV/HV hybrid attribute, with its default reporting behavior listed first. In Figure 3-15, SUBCATEGORY and CATEGORY default to HV but CV reporting will also be available. For both attributes their change stories reflect the more complex HV behavior.
Unless you are using specialist temporal database technology, the CV and HV
values of a hybrid attribute will need to be stored as separate physical columns in
the same dimension or in separate hot swappable dimensions. See Chapter 6 for
the hybrid slowly changing dimension design pattern to implement this.
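To make the separate-columns idea concrete, here is a minimal Python sketch (the attribute and key names are illustrative, not the book's template) of a hybrid attribute held as two physical columns: each historic change adds a new row version for the HV column, while the CV column is overwritten on every row version of the member so that "as is" rollups stay current.

# Minimal hybrid HV/CV sketch. A member can have many row versions (HV);
# the *_cv column is overwritten on all of them when the value changes.
def apply_hybrid_change(dim_rows, business_key, attribute, new_value):
    # Preserve history: append a new row version, copied from the
    # member's latest existing version (assumes the member exists).
    versions = [r for r in dim_rows if r["customer_id"] == business_key]
    new_version = dict(versions[-1])
    new_version[attribute + "_hv"] = new_value
    dim_rows.append(new_version)
    # Recast to current: overwrite the CV column on every row version.
    for row in dim_rows:
        if row["customer_id"] == business_key:
            row[attribute + "_cv"] = new_value

customer_dim = [{"customer_id": "C1",
                 "state_hv": "New York", "state_cv": "New York"}]
apply_hybrid_change(customer_dim, "C1", "state", "California")
# Historic rollups use state_hv; "as is" rollups use state_cv.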
Use CV/PV to document requirements for previous values.
For certain attributes that change very infrequently, stakeholders may be content with the current value and one previous value: the value before the last change. These attributes can be labeled CV/PV in the BEAM✲ model and implemented as separate columns in a physical star schema. This design pattern is known as a type 3 slowly changing dimension.
Source systems rarely provide update reasons to help detect corrections.
Ideally, source systems should provide reason codes for the most important HV attribute updates. Unfortunately, explicit reason descriptions are not often available. One of the many benefits of proactive dimensional modeling is that ETL designers can take advantage of preemptive HV definitions to request that not only update notification but update reason notification is built into a new operational system while it is still in the early stages of development.
Group change rules can help detect corrections and minor changes.
If update reasons are not available, the next best thing is to define business rules that identify important changes based on groups of attributes that should all change at the same time. An example group change rule, which tracks only "large" changes affecting several attributes at once, might be:
"If customer STREET address changes but ZIP CODE is unchanged, then handle the update as a correction (or minor move in the same zip code area): do not preserve the existing address. If STREET and ZIP CODE change together, track the customer's address history prior to this major relocation."
To discover these rules, ask for attributes that change together.
You can discover these rules by asking general questions like "What attributes change together?" or specific questions for each attribute such as "What other attributes must change when this changes?". You can also tie your questions to some activity; you might ask:
Asking for change groups can help find missed attributes.
Questions like this not only expose change dependencies between existing attributes, they can help uncover missed attributes too. They are another one of the BEAM✲ modeler's secret weapons for attribute discovery. Discussing the activity of change is another way of adding narrative to an otherwise static dimension and will get stakeholders thinking. They might respond:
"STREET, ZIP CODE and CITY should all change together. If only one or two of those change, it's probably a correction or a move within the same zip or city. If customers move locally—in the same city—we don't need their old addresses. But if they move city, we will want to use those previous addresses."
Use HVn to define a conditional HV group of attributes that must change together.
You can model a group change rule like this one, very concisely, by using numbered HV codes to define conditional HV groups: attributes that only act as HV when every member of the same numbered group changes at the same time. In Figure 3-16, the stakeholders' rule has been documented by marking STREET, ZIP CODE, and CITY as CV, HV1. They are each CV by default (the first temporal short code in the list) so that individual changes will be treated as corrections. Additionally, they are all members of the conditional group HV1 and will act as HV to preserve address history when a customer moves city (when all three attributes change) unless, that is, an exceptional customer manages to move to the very same street address in a different city. Perhaps ZIP CODE and CITY should be in a group of their own (HV3) to safeguard against missing this rare type of relocation.
Figure 3-16: Modeling a CV change story and group change rules
An attribute can belong to multiple HVn groups and be HV by default.
Notice that the three HV1 attributes in Figure 3-16 are also in group HV2 along with COUNTRY. This means that their history is tracked when all three (group HV1) change even if COUNTRY does not change, but COUNTRY itself will only be tracked when all four address attributes (group HV2) change.
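The conditional group logic lends itself to a small illustration. Here is a hedged Python sketch (the attribute names and group definitions are illustrative, echoing Figure 3-16 rather than reproducing BEAM✲ template logic) of how an ETL process might decide which changed attributes deserve history:

# Illustrative evaluation of conditional HV group change rules. An
# attribute is treated as HV only when every member of one of its
# numbered groups changes at the same time; any other change is a CV
# correction and is simply overwritten.
HV_GROUPS = {
    "HV1": {"street", "zip_code", "city"},
    "HV2": {"street", "zip_code", "city", "country"},
}

def attributes_to_track(old_row, new_row):
    changed = {a for a in new_row if new_row[a] != old_row.get(a)}
    tracked = set()
    for members in HV_GROUPS.values():
        if members <= changed:        # every member of the group changed
            tracked |= members
    return tracked

old = {"street": "1 Main St", "zip_code": "10001",
       "city": "New York", "country": "US"}
new = {"street": "9 Beach Rd", "zip_code": "90210",
       "city": "Los Angeles", "country": "US"}
print(attributes_to_track(old, new))
# street, zip_code and city changed together: a major move, so all three
# are preserved as history; country alone would have been a correction.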
Effective Dating
Add administrative attributes to each dimension for effective dating.
When you have captured change stories and temporal business rules for each attribute, add three more attributes to the dimension table: EFFECTIVE DATE, END DATE, and CURRENT, as in Figure 3-17. These additional administrative attributes enable ETL processes to track changes and flag the current version of each member. They effectively convert HV dimensions into minor event tables capable of recording numerous small events.
Figure 3-17: Effective dating a dimension table
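For readers who want to see the mechanics, here is a minimal Python sketch (the key and attribute names are illustrative) of how an ETL process might use the three administrative attributes to apply a historic change: the member's current row version is closed off with an END DATE, and a new current version is appended.

from datetime import date

FAR_FUTURE = date(9999, 12, 31)  # conventional "open" end date

def apply_hv_change(dim_rows, business_key, new_attributes, change_date):
    # Close the member's current row version and append a new one,
    # stamped with EFFECTIVE DATE, END DATE and the CURRENT flag.
    for row in dim_rows:
        if row["customer_id"] == business_key and row["current"]:
            row["end_date"] = change_date
            row["current"] = False
            dim_rows.append({**row, **new_attributes,
                             "effective_date": change_date,
                             "end_date": FAR_FUTURE,
                             "current": True})
            return

customer_dim = [{"customer_id": "C1", "state": "New York",
                 "effective_date": date(2010, 1, 1),
                 "end_date": FAR_FUTURE, "current": True}]
apply_hv_change(customer_dim, "C1", {"state": "California"}, date(2024, 1, 1))
# customer_dim now holds two row versions: New York (closed) and
# California (current), supporting both "as was" and "as is" reporting.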
Change stories demonstrate type 2 slowly changing dimension behavior, but don't use this ETL jargon with stakeholders.
With the addition of effective dating, readers who are familiar with how type 2 slowly changing dimensions are implemented will notice how closely the change story row in a BEAM✲ dimension matches this ETL technique. This is intentional, as BEAM✲ models are designed to be translated into physical dimensional models with minimal changes. It is also important that BEAM✲ modelers do not refer to HV attributes as type 2 SCDs, or attempt to modelstorm the final piece of the ETL puzzle, surrogate keys, with business stakeholders. Type n SCD terminology and surrogate keys (covered in Chapter 5) are appropriate star schema-level topics for discussion with ETL and BI developers, not stakeholders.
Minor Events
Not every event is a significant business process.
Occasionally, you will discover events that do not seem to have enough details or occur frequently enough to represent significant business processes in their own right; they seem more like dimensions. For example, imagine the answer to your next "Who does what?" event discovery question is:
Minor events have few details. They often represent external activity.
You model several event stories and end up with the CUSTOMER MOVES event table in Figure 3-18. This is a perfectly acceptable event, with a subject-verb-object main clause, containing a who subject (CUSTOMER), an active verb ("moves"), a where object (ADDRESS), and a when detail (MOVE DATE), but that's all. Despite asking all the 7Ws questions, it lacks any other who, what, why, how, or how many details. Why customers move, how much it costs them, or who helps them are unknowns because the event is external to Pomegranate's business. In BEAM✲ terms CUSTOMER MOVES is a minor event (despite being quite a major event for the customer). Minor events represent activities that are not always interesting or detailed enough for standalone analysis. But the data values arising from them are important for correctly labeling, grouping, and filtering the other, far more interesting, major events of the organization.
Figure 3-18: Minor CUSTOMER MOVES event
If the subject and the object of an event both describe the same thing (e.g.,
customers) and there are no other details except when, you can handle the event
object as an HV attribute of the subject dimension, as long as the change repre-
sented by the event does not occur too often. Daily or monthly change would
make it a rapidly changing dimension—better handled as an event.
If you discover a minor event with a small number of details (typically three Ws,
including when), ask how and when these details are captured. You may be able
to model their capture within a far more interesting major event.
One organization's minor event may be another organization's major event.
Figure 3-19 shows a very different version of the CUSTOMER MOVES event compared to the minor version of Figure 3-18. By reading the event stories in both tables you can see that these are actually the same events happening to the same people—but in Figure 3-19 they have been modeled in far greater detail for the data warehouse of a relocation company. This CUSTOMER MOVES is clearly a major event—for that company.
Figure 3-19: Major CUSTOMER MOVES event
Sufficient Attributes?
How do you know when you have sufficient attributes for a dimension or levels in
a hierarchy? There is no magic test; stakeholders will simply run out of ideas. If you
feel you have not discovered every possible attribute: don’t worry, be agile, press
on. As long as you have the major hierarchies and HV attributes, and a clear
definition of granularity for each dimension, additional attributes can be added
with relative ease in future iterations. That said, the great benefit of modelstorming
with stakeholders is the ability to define common (conformed, see next chapter)
dimensional attributes early on, so don’t miss your opportunity while you have
their initial attention.
Summary
A dimensional data warehouse is only as good as its dimensions. Good dimensions contain
dimensional attributes that describe business events using terms that are familiar and
meaningful to the BI stakeholders. This is the best reason for asking stakeholders to modelstorm
the dimensions they need using examples to clarify the terms they use.
Dimensions themselves are discovered by event modeling; most who, what, when, where and
why event details become dimensions. Dimensional attributes are discovered by modeling each
of these details as the subject of its own BEAM✲ dimension table. How details can be dimensionalized in this way too, but they typically do not have additional descriptive attributes.
Non-descriptive how details become degenerate dimensions, stored in fact tables along with the
how many details which become facts.
The first additional attribute that must be modeled for a dimension subject is its identity. This is
the business key (BK) which uniquely identifies each dimension member and defines the
dimension granularity. If a dimension is created from multiple source systems there may be
more than one BK for a dimension member. The unique BK for a dimension can be a composite of multiple source system keys.
Hierarchy charts describe how stakeholders (want to) organize dimensional members
hierarchically to support drill-down analysis and plan vs. actual reporting. Drawing these charts
helps to prompt stakeholders for additional hierarchical attributes and data sources. Hot
hierarchy levels, that represent popular levels of summarization, help to identify additional
planning events, rollup dimensions and aggregation opportunities.
Hierarchies can be balanced, ragged or variable depth. Each type can be single or multi-parent.
Single parent, balanced hierarchies are the easiest hierarchies to implement dimensionally and
the simplest to work with for BI. Additional techniques are needed to balance ragged
hierarchies and represent multi-parent and variable depth hierarchies (See Chapter 6).
Change stories describe how dimensional history is handled. The short codes CV, HV, FV, PV
are used to document the temporal properties of each attribute. These temporal codes can be
numbered to define group change rules involving multiple attributes.
Minor events are events that occur infrequently and contain few details. They typically do not
represent significant business processes that warrant modeling as separate events. Often they
can be modeled as HV dimensional attributes, or as additional details of other major events
(recognizable business processes).
4
Modeling Business Processes
BI stakeholders need multiple events for process measurement.
Designing a data warehouse or data mart for business process measurement demands that you quickly move beyond modeling single business events. All but the simplest business processes are made up of multiple business events, and BI stakeholders invariably want to do cross-process analysis. When you modelstorm these multi-event requirements you soon notice two crucial things:
Event sequences represent business processes and value chains.
Stakeholders model events chronologically. As you complete one event, stakeholders naturally think of related events that immediately follow or precede it. These event sequences represent business processes and value chains that need to be measured end-to-end.
Events share common dimensions that support cross-process analysis.
Stakeholders describe different events using many of the same 7Ws. When you define an event in terms of its 7Ws, stakeholders start thinking of other events with the same details, especially events that share its subject or object. These shared details, known to dimensional modelers as conformed dimensions, are the basis for cross-process analysis.
The event matrix is an agile tool for modeling multiple events.
In this chapter we describe how an event matrix, the single most powerful BEAM✲ artifact, is used to storyboard the data warehouse: rapidly model multiple events, identify significant business processes and conformed dimensions, and prioritize their development.
This can result in silo data marts that are unable to support cross-process analysis.
With a tightly-controlled initial scope, BI users can receive their agile data marts early and obtain valuable insight from them individually on a department by department basis. So far so good, but when users want to step up to cross-department, cross-process analysis they find they cannot make the necessary comparisons due to incompatible or missing descriptions and measures. Rebuilding each data mart from scratch is unthinkable, so data is re-extracted from the source systems so that each department can look at it "their way". The cost of this extra work and the inconsistent or conflicting answers that emanate from these "multiple versions of the truth" drive BI stakeholders crazy!
Figure 4-1: Silo data marts that cannot be shared: a data warehouse anti-pattern
With too limited a scope, data warehouse design incurs heavy technical debt.
Silo data marts are examples of technical debt. Agile software development intentionally takes on technical debt when "just barely good enough" code is released. This makes good sense when the business value of early working software outweighs the interest on the debt: the extra effort involved in refactoring the code in future iterations. However, for DW/BI projects, the cost of servicing high-interest technical debt (refactoring terabytes of incorrectly represented historical data) can be ruinously high.
Traditional BDUF does not match agile BI requirements.
The especially high interest of database technical debt could be argued as a good reason for taking a traditional, non-agile approach to data requirements gathering and data warehouse design itself, postponing agile practices for ETL and BI development. But the problem is that this fallback to "big design upfront" (BDUF) simply does not match the evolutionary nature of modern BI requirements, nor their delivery timescales. Plus, it is incredibly hard for a DW/BI project to become agile when it does not start off agile.
Agile dimensional modelers lower their technical debt by modeling ahead just enough to define conformed dimensions.
Instead, agile data warehouse modelers should stay agile (and dimensional), but lower their technical debt by balancing "just in time" (JIT) detailed modeling of business events for the next development sprint with "just enough design up front" (JEDUF) for cross-process BI in the future. To do so, modelers need to rapidly model ahead in just enough detail to discover which of the dimensions needed for the next sprint should also be conformed dimensions that will help to future-proof their designs for enterprise BI.
Conformed Dimensions
Conformed dimensions allow measures from different events to be combined and compared.
Figure 4-2 shows a Promotion Analysis Report that combines information from two events, CUSTOMER ORDERS and PRODUCT CAMPAIGNS, to explore the connection between campaign activity and sales revenue. The report is possible because the two different events have identical descriptions of PRODUCT and PROMOTION. These conformed dimensions allow measures from both events to be aggregated to a compatible level and lined up next to one another on the report. Lining up the answers or drilling across like this appears obvious, but if the events are handled by different operational systems (an Oracle-based order processing application and a SQL Server-based customer relationship management system) then this report might be the first time that the two sets of data have actually met. If each source system describes products and promotions differently and the individual star schemas use these non-conformed descriptions, the analysis would not be possible because the measures would not align.
Figure 4-2: Conformed dimensions enable cross-process analysis
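The drill-across mechanics can be sketched in a few lines of Python. This is an illustrative sketch only (the fact rows and product names are invented): each event's facts are aggregated to the conformed dimension level, and the two answer sets are then lined up on the conformed key.

from collections import defaultdict

def rollup(fact_rows, dim_key, measure):
    # Aggregate one event's facts to a conformed dimension level.
    totals = defaultdict(float)
    for row in fact_rows:
        totals[row[dim_key]] += row[measure]
    return totals

# Invented order and campaign facts sharing a conformed PRODUCT key.
orders = [{"product": "POD X", "revenue": 900.0},
          {"product": "POD X", "revenue": 350.0},
          {"product": "POD MINI", "revenue": 120.0}]
campaigns = [{"product": "POD X", "campaign_cost": 200.0}]

revenue = rollup(orders, "product", "revenue")
cost = rollup(campaigns, "product", "campaign_cost")

# Drill across: line the two answer sets up on the conformed dimension.
for product in sorted(revenue.keys() | cost.keys()):
    print(product, revenue.get(product, 0.0), cost.get(product, 0.0))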
Swappable, rollup and role-playing dimensions are conformed at the dimensional attribute level.
Swappable dimensions [SD] are subsets of conformed dimensions. For example, a CUSTOMER dimension (1M people) and a subset EXTENDED WARRANTY CUSTOMER dimension (100K people) are conformed if they describe the same customers in exactly the same way. These two dimensions would allow product sales and extended warranty claims to be compared for all customers or just warranty-holding customers. Swappable dimensions are covered in Chapter 6.
Rollup dimensions [RU] have conformed attributes in common with their base dimensions. Figure 4-3 shows an example of the conformed when dimensions CALENDAR and MONTH. These two dimensions can be used to compare daily and monthly granularity measures at the Month, Quarter, or Year level. Rollup dimensions are typically used to describe planning events and aggregate fact tables.
Figure 4-3: When dimensions, conformed at the attribute level
Conformed measures rely on compatible facts with common units of measure.
While dimensions are frequently shared across many business processes, facts are typically specific to a single process or event. However, they can be used to create conformed measures if they have compatible calculation methods and common units of measure that allow totaling and comparison across processes; for example, if sales revenue and support revenue are both pre-tax dollar figures they can be combined to produce region totals.
Conforming data is a political challenge. BEAM✲ tackles this by modeling with stakeholders who can make it happen.
Conforming data is not so much a technical challenge as a political one, requiring consensus on data definitions across many departments within an organization as well as operational systems. By modelstorming with stakeholders you highlight the value of conformed dimensions to the very people who can make them happen. Modeling multiple events by example, as BEAM✲ encourages you to do, quickly reveals inconsistencies that would otherwise thwart conformance. Stakeholders will work to address these inconsistencies and conform dimensions when they see the potential business value they provide.
Homonyms are non-conformed data terms with the same name but different meanings.
Homonyms are data terms with the same name but different meanings. They are the opposite of conformed dimensions and attributes. For example, both Pomegranate's Sales and Finance departments use the term "Customer Type" but Sales has five types of customer and Finance only three. If stakeholders cannot agree on a conformed customer type then you would have to define two uniquely named details: SALES CUSTOMER TYPE and FINANCE CUSTOMER TYPE. However, taking this approach for every homonym perpetuates incompatible reporting and weakens the analytical power of the data warehouse. Perhaps by discovering this problem through modelstorming examples, Sales and Finance stakeholders could agree on a new conformed version of Customer Type with four descriptive values.
Synonyms are conformable data terms with the same meaning but different names.
Synonyms are data terms with the same meaning but different names. Organizations will often use different names across different departments/business processes for what could be the same conformed dimension or attribute. For example, an insurance company might use the terms Customer, Enrollee, Subscriber, Policy Holder and Claimant interchangeably, while a pharmaceutical company may refer to the same person as a Physician, Doctor, Healthcare Provider or Practitioner.
Defining a data warehouse bus requires more initial work.
Compared to standalone data mart projects or the silo data mart anti-pattern, the data warehouse bus requires some more initial work to:
Model enough different business processes/events to identify potentially valuable conformed dimensions and expose conformance issues.
Build more robust ETL processes that actively conform dimensional attributes from multiple operational data sources, not just the event source(s) currently in scope.
The pay-back is reduced technical debt and greater long-term agility.
The reward for conforming is less technical debt and rework, and greater agility in the long run. Once the initial conformed dimensions have been defined, self-governing agile teams that promise to use them can work in parallel to develop data marts for individual business events or processes, becoming experts in their data sources and measurement.
While the inception costs of conforming are higher, the data warehouse bus is still
an agile JEDUF technique: once the bus has been defined, only the conformed
dimensions for the current development sprint need to be modeled in detail and
actively conformed; i.e. you can conform incrementally.
Figure 4-4: Data warehouse bus design pattern
A dimensional matrix is the ideal conformance planning tool for designing a data warehouse bus.
The most useful tool for planning conformance and designing a data warehouse bus is a dimensional matrix. This is a grid of rows representing business processes and columns representing dimensions, with tick marks at the intersections where a dimension is a candidate detail of a process. Figure 4-5 shows an example dimensional matrix for Pomegranate's manufacturing process. The simplicity of this diagram belies the power of the single-page overview it provides (even for a complex real-world design as opposed to this textbook example). The clarity of this model, in a format readily understood by stakeholders and IT alike, compared to individual tables or data warehouse-level ER diagrams, can be truly inspiring! Start scanning the matrix and see.
Figure 4-5: A dimensional matrix
Scan down the matrix columns to identify potential conformed dimensions.
Scanning down the dimension columns reveals the potential for dimensional conformance. Conformed dimensions that could form a data warehouse bus show up with multiple ticks. The contrast between these valuable dimensions that support cross-process analysis (Hurray!) and the non-conformed dimensions that do not (Boo!) should encourage everyone to work towards conformance.
Scan across the matrix rows to compare process complexity.
Scanning across the process rows helps to estimate the complexity of a business process: generally, the more dimension ticks, the more complex a process is likely to be and the more resource needed to define its business events and implement them.
Start with a high-level matrix to help you plan dimensionally.
It's a good idea to start your agile DW/BI project by creating a high-level matrix to help you plan your data warehouse design from a conformed dimensional process measurement perspective from the outset. You may want to add to it some of the additional features of the event matrix described below.
Use a high-level dimensional matrix to gain support from senior business and IT
management for conforming dimensions.
The event matrix contains details for BEAM✲ event storytelling and Scrum planning:
Event Sequences: business events, including their main clause short stories, recorded in time/value/process order.
Dimensions in BEAM✲ story sequence (who, what, where, why, and how). This helps you fill in the matrix using the 7Ws, read summary event stories, spot opportunities to reuse dimensions of the same W-type and focus on conforming the most important who and what dimensions: typically customer, employee and product.
Stakeholder Group columns for recording event interest and ownership. Ticks can be linked to attendee lists of who was involved in modelstorming the event details, or should be.
Importance and Estimate rows and columns for prioritizing events and dimensions on a Scrum product backlog and estimating their ETL tasks for a sprint backlog.
Figure 4-6: An event matrix
Event Sequences
Events are listed on the matrix in value chain sequence.
Look back at the event rows on the matrix in Figure 4-6 and you will notice that events are not listed alphabetically. Instead, they are listed in value sequence, beginning with MANUFACTURING PLANS and ending with WAREHOUSE SHIPMENTS. This sequence orders the events by the increasing value of their outputs. In this example, the sequence starts with potentially valuable planning followed by the procurement of lower value components, and proceeds through the building and shipment of higher value products. When business activity is ordered in this way it is often referred to as a value chain.
Time/Value Sequence
Value sequence can also represent time sequence, which helps stakeholders to think of the next and previous events.
Value sequence can also represent time sequence. Generally, low value output activity occurs before high value output activity, or at least that is how most of us think of business activity at a macro-level. For example, in manufacturing, procurement happens before product assembly, shipping, and sales. Similarly, in service industries, time and money is spent acquiring low value (high cost) prospects before converting them into potentially valuable customers and then into high value (low cost) repeat customers. In reality, value sequencing may not be a strict chronology because many of the micro-level business events described in a value chain occur simultaneously and asynchronously—not waiting for one another. However, time/value sequencing is highly intuitive and by documenting events in this way, the matrix helps stakeholders to think of next or previous events, and spot gaps (missing links) in their value chains.
Add events to an event matrix in the order in which they increase business value
by asking “Who does what next that adds value?”
Process Sequence
Events that occur in a strict sequence often represent process milestones.
Within the flexible chronology of value chains there will be stricter chronological sequences of events that must occur sequentially to complete a significant, time-consuming process such as order fulfillment or insurance claim settlement. These process sequences—which begin with a process (initiating) event and continue serially through a number of milestone events—are denoted on an event matrix by indentation.
Milestone events are indented beneath the event that triggers them. A * (degenerate dimension creator) often indicates the start of a process sequence.
Figure 4-6 shows a process sequence of PURCHASE ORDERS to SUPPLIER PAYMENTS. This documents that a delivery only occurs after a purchase order (PO) has been processed, and a payment is only made after a delivery has been received. Notice that these events share a conformed PO dimension. This may only be a degenerate PO NUMBER dimension in each event table, but it ties these events together at the atomic detail level and allows stakeholders to track the progress of each PO item through delivery and payment. Notice also that POs are created by PURCHASE ORDER events (denoted by a * on the matrix): PO numbers are generated when an employee raises a purchase order. This confirms the strict process sequence.
Figure 4-7: The "shape" of modelstorming from A to B
Time-box modelstorming meetings to four hours (maximum).
Like most agile activity, modelstorming should be time-boxed. For an initial modelstorm use four hours as a guideline. Reserve two hours for modeling the first (most important?) event table and its dimension tables, one hour for modeling related events on a matrix, and a further hour for prioritizing events and making sure the most important event(s) and dimensions are modeled in detail. Not enough time? Don't overrun. Schedule another.
Use an event matrix to identify the most important events and conformed dimensions.
So far, we have covered how to open a modelstorm, at point A, with the question "Who does what?", and use BEAM✲ tables and 7W data storytelling techniques to model the answer as a single event and matching dimensions in great detail. Now we describe how you get to point B's implementation decisions using an event matrix to rapidly storyboard several more events, in just enough detail, to identify the most important events and conformed dimensions for the next sprint. To show how the matrix gets you there we shall continue modeling Pomegranate's order processing BI requirements.
Include degenerate dimensions: they can be conformed too.
Don't forget to add any degenerate how dimensions, such as ORDER ID. Even though these dimensions are not modeled as tables (because they have no additional descriptive attributes) they still need to be recorded on the matrix because they can be conformed degenerate dimensions appearing in multiple events. You will see shortly how important they are for identifying process sequences.
Ask if the event creates any new dimension values.
Tick off the dimensions referenced by the event. As you do so, ask stakeholders if the event can create new values for any of its dimensions. For example, you might ask:
Mark any dimensions that can have new members created by the event (e.g., Customer, Delivery Address and Order ID) with a * rather than a tick to record this significant dependency. When you have finished, the matrix should look like Figure 4-8.
Figure 4-8: Adding CUSTOMER ORDERS to the event matrix
What happens next depends on the stakeholders' departmental perspective.
Stakeholders might say that "Packing follows Orders" or "Shipments follow Orders." If you were given both of these verbs at once, the next one in time sequence would be obvious, but when you are modeling less familiar events the sequence may not be so apparent to everyone, in which case you can draw a simple timeline to help sort them chronologically. With a mixed group of stakeholders, the answer to "what happens next?" can vary depending on their individual departmental perspectives.
Watch out for verb synonyms that represent the same event.
Watch out for instances where multiple verbs refer to the same event. Stakeholders may use several verbs for the same activity, or multiple activities may be indivisible: captured as a single transaction by the source system. For example, if products are packed and shipped by the same person within a short period of time, the two tasks may be recorded as a single shipment event. If you have any doubts, model each verb as a separate event, but if you uncover no extra details, or later discover they represent a single transaction, you can merge the events with no loss of information. Once you have a new verb (assume it is "ship") you can use it to ask a more focused "Who does what?" question to get the next event's subject-verb-object main clause:
Add the next event and any new dimensions. Tick off its conformed dimensions.
You add this new main clause to the matrix below CUSTOMER ORDERS, as in Figure 4-9, leaving enough room to add an event name later. You then check the dimension columns to see if the new subject (WAREHOUSE WORKER) and object (PRODUCT) are potential conformed dimensions. PRODUCT is already on the matrix so you should tick its use on the new event row, once you have confirmed that stakeholders are talking about the same products described in the same way as before. Though it seems unlikely, you should also confirm that shipping does not create new products, otherwise you would use a * instead of a tick.
Figure 4-9: Adding "warehouse worker ships product" to the event matrix
Role-Playing Dimensions
Check each new dimension for synonyms among the dimensions already on the matrix.
WAREHOUSE WORKER looks like a new dimension, but before you add it you should check if it is a synonym for an existing dimension; W-type can help you here. WAREHOUSE WORKER is a who, and there are already two other whos: CUSTOMER and SALESPERSON. Are either of these similar to warehouse workers? Customers obviously aren't, but warehouse workers and salespeople, while they're not the same people, may be the same type of who: employees of the same organization. If so, they would share many common attributes (e.g., Employee ID, Department, Hire Date) and could be modeled as a single role-playing conformed dimension. You should confirm with stakeholders:
Use [RP] to identify role-playing dimensions.
If the answer is NO, logistics could be handled by a contractor and warehouse workers are not Pomegranate employees. However, if the answer is YES then you have discovered two different roles for a conformed EMPLOYEE dimension. You record this by renaming the SALESPERSON dimension to EMPLOYEE and adding the dimension type code [RP] to denote that it is a role-playing dimension. This change needs to be made to the matrix as in Figure 4-9 and the dimension table as in Figure 4-10. However, you should leave the subject of the new shipping event on the matrix as "warehouse worker" to record the specific employee role that stakeholders used to describe the event.
Roles are documented as event details with a [ ] type identifying their RP dimension.
Role names such as WAREHOUSE WORKER and SALESPERSON are used as detail column headers in event tables, so the SALESPERSON column in CUSTOMER ORDERS does not need to be renamed, but you do need to associate it with the EMPLOYEE dimension. You document an event detail, such as SALESPERSON, as a role of an existing dimension by adding the role-playing dimension name to its column type using the [ ] type notation, as in Figure 4-10.
Figure 4-10: A role-playing dimension and an event detail role
[ ] type notation is used to record W-type, unit of measure, flag values and RP dimension names.
[ ] type notation can be used to qualify the type of any event detail or dimensional attribute. Initially it can be useful to type every event detail with its W-type, such as [who], [what] or [where], to help everyone think dimensionally using the 7Ws. Details that are dimension roles use this notation instead to document their role-playing dimension name; for example, [employee] or [calendar]. As other details are named after their dimension, they don't need this qualification. For quantities, their type is their unit of measure, for example, [£], [$], or [miles] as described in Chapter 2, while Yes/No flags can be documented with a type of [Y/N] showing their permissible values.
RP dimensions can play multiple roles in the same event.
A role-playing dimension can play multiple roles in the same event. For example, EMPLOYEE could appear twice in an evolving event containing both order and shipment details, as both SALESPERSON and WAREHOUSE WORKER. Similarly, CALENDAR—the most commonly occurring role-playing dimension—would play the roles of ORDER DATE and SHIP DATE.
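At the physical level a role is just the same dimension joined under a different alias. The following Python sketch (the employee keys and names are invented) shows one conformed EMPLOYEE dimension resolving two roles in a single event row, exactly as a star schema would join one dimension table twice:

# Illustrative sketch: one conformed EMPLOYEE dimension playing two
# roles (SALESPERSON and WAREHOUSE WORKER) in the same event row.
employee_dim = {
    101: {"name": "Sal Paradise", "department": "Sales"},
    202: {"name": "Ed Dunkel", "department": "Logistics"},
}

event = {"order_id": "O-1001",
         "salesperson_key": 101,        # role 1 -> EMPLOYEE
         "warehouse_worker_key": 202}   # role 2 -> EMPLOYEE

# Each role resolves through the same dimension under a different alias.
salesperson = employee_dim[event["salesperson_key"]]
warehouse_worker = employee_dim[event["warehouse_worker_key"]]
print(salesperson["name"], warehouse_worker["name"])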
When using [ ] notation to document an event role you can drop its generic W-
type (e.g., [who] or [what]) to save space, because this is already documented
within the dimension table and on the matrix.
Define conformed role-playing dimensions as early as possible.
Changing the name of a dimension (and its attributes) to make it more reusable at the design stage is painless compared to the refactoring and testing involved if the dimension had already been deployed. Hence the importance of modeling multiple events to identify conformed dimensions and their role-playing opportunities before the first star schema is deployed.
Don’t implement any dimension until you have used an event matrix to check
whether it should be conformed across multiple events.
Role-playing dimensions, while more conformed, may not initially appeal to stakeholders.
Is a role-playing employee dimension the right approach? Stakeholders can often feel uncomfortable with generalization like this (see the Party discussion below), if they cannot see any business benefit, i.e., cannot imagine ever wanting to group together the activities of salespeople and warehouse staff. If stakeholders do voice concerns, you should try to encourage them to see the "bigger picture" benefit of a conformed dimension beyond the current scope. You can also assure them that when they query sales or logistics they will have filtered lists of salespeople or warehouse staff to choose from, and will never have to search through all employees.
Use the new event's main clause with the 7Ws to ask for further details.
Once you have added the new event to the matrix, you ask for the rest of its details almost exactly as you would if filling out an event table: by turning its main clause into a series of questions using the 7Ws. The only difference is that you ignore when and how many questions, as you don't need that level of detail for the matrix. Using the who, what and where column headings on the matrix as a checklist, you might ask:
Add any new dimensions to the matrix and then tick off all dimensions used by the event.
In response to these who, what, where questions the stakeholder might identify CUSTOMER (who) and DELIVERY ADDRESS (where) as potential conformed dimensions, and introduce CARRIER (who), SHIP MODE (more of a how than a what) and WAREHOUSE (where) as new dimensions. When you, or better still the stakeholders themselves, have added these to the matrix, it should look like Figure 4-11.
Figure 4-11: Adding shipment dimensions to the matrix
Party and Party Role are common examples of generalized entities used to model all types of people and organizations.
A common generalization design pattern is the use of a single Party entity to represent all who details (persons and organizations), with an associative entity Party Role to represent their various types, positions, titles, and responsibilities (e.g., customer, employee, supplier). This database pattern is capable of recording the multiple positions that people might hold throughout their lives or the multiple responsibilities they might have simultaneously, but is it a good generalization to make when modeling a data warehouse? If BI stakeholders are explicitly looking for people who change roles—such as spies and criminals who change identities, or government regulators who become political lobbyists—then a generic who dimension that plays multiple roles might be exactly what they need.
Stakeholders may not see any obvious benefit in generalization.
However, if stakeholders are not terribly concerned about role switchers, or the available data sources simply lack any reliable means of capturing role changes, then this design flexibility is wasted. Worse still, it can get in the way of what BI stakeholders really want to do. For example, a single dimension representing Customers and Employees containing every possible who-related attribute would be very confusing to use compared to separate dimensions containing customer and employee specific attributes.
Generalization produces data models that are difficult for BI users to understand and query.
Agile data warehouse modelers must use generalization carefully. Data models that value flexibility over simplicity are notoriously difficult to understand and use for BI. They can work for transactional software products because their data structures are completely hidden from the users by application interfaces. But "universal data models" that rely on high levels of generalization or abstraction do not work so well for BI users who—despite the semantic layers provided by BI tools—need far simpler data warehouse designs to be able to construct and run ad-hoc queries efficiently.
Modelstorming data requirements specifically rather than generally promotes stakeholder design ownership.
One of the great benefits of modelstorming is that stakeholders feel a sense of ownership in the resulting design. If they have abstractions forced upon them they start to lose that feeling: it's no longer their model, their data—it could be anyone's. The only Party Roles most stakeholders recognize are Host, Guest, or Gatecrasher—or maybe political ones if that's their specialist field. In extreme cases where generalization is taken too far, to the point where the data model can be used to represent almost anything, it will actually mean nothing to stakeholders. This defeats the goal of modelstorming, which is not to design data structures that merely store data but to design ones that stakeholders will use and cherish. Modeling each interesting who, what, when, where, why and how as specifically as possible helps to promote the data model understanding needed to construct meaningful queries and interpret their results.
Postpone "technical benefit only" generalization until star schema design.
Stakeholders are happy with "reasonable" levels of generalization if they can see an obvious business benefit, such as a better understanding of the commonalities (conformance) between business processes that improves analysis. But if the benefits are purely technical—to cut down database administration or streamline ETL—then you should postpone generalization until you design your star schemas and ETL processes.
This sounds a lot like the main clause of the previous order event. The stakeholders have told you that orders are the reason for shipments. You can find the evidence for this (the conformed dimension that ties the two events together) by turning their answer into a how question:
Conformed degenerate dimensions represent how details of process events and why details of milestone events.
The answer reveals that the order how detail (ORDER ID [DD]) is effectively a why detail of shipment, indicating that the two events are likely to be part of a process sequence with shipments treated as milestones of CUSTOMER ORDERS. You can check how strict this sequence is by asking, "What about free samples or replacement products and parts; how are shipments for these processed?" If the answer is, "We don't want to consider these when measuring our sales fulfillment processes," then perhaps they are events for a marketing or product support data mart: another sprint, another day. Alternatively, stakeholders might tell you "Pseudo orders are generated when we ship samples or replacements". Either way, all the in-scope shipments are milestone events of orders and ORDER ID is a conformed degenerate how/why dimension.
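Because conformed degenerate dimensions tie events together at the atomic level, the link is easy to sketch. In the illustrative Python below (the order and shipment rows are invented), a shared ORDER ID lets each shipment milestone be traced back to its initiating order:

# Invented atomic-level rows from two event tables sharing ORDER ID.
orders = {"ORD-1": {"customer": "Dean Moriarty",
                    "order_date": "2024-01-05"}}
shipments = [{"order_id": "ORD-1", "ship_date": "2024-01-07",
              "shipment_no": "SHP-9"}]

# The conformed degenerate ORDER ID drills from milestone back to origin.
for ship in shipments:
    order = orders[ship["order_id"]]
    print(ship["shipment_no"], "fulfils the order placed on",
          order["order_date"], "by", order["customer"])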
Document process sequences by indenting milestone events.
Figure 4-12 shows the completed shipment event, now named PRODUCT SHIPMENTS, with its final how detail a SHIPMENT NUMBER degenerate dimension. The new event name and main clause have also been indented under CUSTOMER ORDERS to document the process sequence. Note that this sequence (shipments follow orders, not the reverse) is confirmed by CUSTOMER ORDERS being an ORDER ID creator (denoted by the * in the ORDER ID column). CUSTOMER ORDERS must occur first to create the ORDER IDs referenced by PRODUCT SHIPMENTS.
Figure 4-12: Adding why and how dimensions to shipments
Ask for the next event but don't worry about strict chronology.
After completing shipments, your search for the next event begins anew. This can be the next one in sequence, or simply the next one the stakeholders think of when they see popular dimensions like CUSTOMER and PRODUCT on the matrix. If their next event doesn't sound like the very next one chronologically, don't worry, just go with their train of thought—don't try and derail it. Missing "next" events are much easier to spot as gaps on the matrix once you add the events you are freely given. Imagine that the Pomegranate stakeholders respond to your next "Who does what?" question with:
Exceptional steps within a process are documented by bracketing their event names.
Figure 4-13 shows the matrix after PRODUCT RETURNS has been added along with its new PROBLEM REASON dimension. PRODUCT RETURNS is dependent on PRODUCT SHIPMENTS because customers have to order and receive their products to be able to return them, but this sequence of events is exceptional: only a small percentage of orders are returned. You can document an optional or exceptional event within a process by bracketing it. This acts as a visual clue that you might want to handle the event separately from mandatory/unexceptional process milestones. For example, order and shipment could be combined in a worthwhile evolving order event because almost every order leads to a shipment, but the much smaller number of returns might be better treated as part of a separate customer support process rather than complicate orders. Exceptional events often indicate that there may be missing events and other processes that need to be considered.
Figure 4-13: Adding PRODUCT RETURNS to the matrix
Figure 4-14: CARRIER DELIVERIES, CUSTOMER COMPLAINTS and SALES TARGETS added to the matrix
Model the first and last events in a process. They are the basis for almost all process performance measurement.
Trying to find the correct position for an event within a process sequence can often help to expose additional events that represent the end of one process and the start of another. In our example, deliveries are the final milestones for most orders. Complaints and returns, on the other hand, are thankfully not part of many orders. The indentation in Figure 4-14 shows how CARRIER DELIVERIES completes the order fulfillment process and CUSTOMER COMPLAINTS begins a new customer support process. Documenting the first and last events of a process is particularly important: they represent cause and effect, origin and outcome.
Add rollup (RU) dimensions next to their base dimensions and tick all the events that can be rolled up to their level.
Figure 4-14 shows another new event: SALES TARGETS. It is not part of the order or customer support processes, hence no indentation, but stakeholders believe that sales targets drive orders so they have placed the event before CUSTOMER ORDERS in time/value sequence. From its main clause "Salesperson has product type target" it is immediately obvious that it should take advantage of the conformed role-playing EMPLOYEE dimension. But it cannot reuse the conformed PRODUCT dimension because stakeholders have stated that targets are set for product types, not for individual products. The good news is that the event can still be conformed with PRODUCT at the product type level because this is a conformed PRODUCT attribute. You record this by adding a rollup dimension PRODUCT TYPE [RU] (immediately to the right of PRODUCT, if possible, to denote that it is derived from it) and ticking it for each PRODUCT-related event to denote that they can be compared to SALES TARGETS at the PRODUCT TYPE level. There is no need to model the rollup any further, at the moment, because it will not contain any new attributes, just PRODUCT TYPE and any other conformed product attributes above it in the product hierarchy, such as SUBCATEGORY and CATEGORY already defined in PRODUCT.
Simply pointing at each empty cell in turn like this takes full advantage of the
physical proximity of all the events and dimensions on the same spreadsheet or
whiteboard that the matrix provides and can often jolt someone into spotting a
valuable missing conformed detail at the last minute.
Event Rating Rules
Higher rated events are more important and should be implemented sooner—if possible.
Events that are truly unimportant (currently) can all have an importance of 100.
Events are rated in 100 importance point increments, e.g. 100, 200 (you'll see why the gaps are useful shortly).
The importance rating is only used to sort events by importance, not measure their relative business value. If Event A has importance 100 and Event B has importance 500, B is simply more important than A, not five times more important.
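As a quick illustration of these rules, here is a hedged Python sketch (the events and point values are invented, loosely echoing Figure 4-15) that validates a set of ratings and sorts the events into backlog order:

# Illustrative event importance ratings, following the rules above:
# 100-point increments, ties allowed only at the "unimportant" 100 level.
events = {"CUSTOMER ORDERS": 600, "PRODUCT SHIPMENTS": 500,
          "CARRIER DELIVERIES": 300, "PRODUCT RETURNS": 100,
          "CUSTOMER COMPLAINTS": 100}

def check_ratings(ratings):
    assert all(points % 100 == 0 for points in ratings.values()), \
        "rate events in 100-point increments"
    important = [p for p in ratings.values() if p > 100]
    assert len(important) == len(set(important)), \
        "important events need unique ratings"

check_ratings(events)
backlog = sorted(events, key=events.get, reverse=True)
print(backlog)  # most important first; ratings order events, nothing more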
Figure 4-15: Event importance rating
If you are using the downloadable BEAM✲ matrix template, you can hide and unhide the built-in planning columns (Figure 4-15), which include event importance, and planning rows (Figure 4-16), which in turn include dimension importance.
Before you use the importance column to actually sort the event rows make sure
you fill in the event sequence column first, so that you can re-sort events back
into time/value sequence when you have finished.
Start by rating the initial event highly. Then rate other events relative to it.
As soon as the importance rules are understood, start by rating the initial event that the stakeholders modeled. Theoretically, this should be the most important event, so you might suggest an importance based on that starting position; for example, if the matrix describes 10 events that haven't been implemented yet, suggest an importance of 1,000. This event may not stay the most important; stakeholders can easily give a higher importance to an event that was modeled in less detail at the last minute, but this opening gambit gets the rating game going. In Figure 4-15 the initial CUSTOMER ORDERS event has remained the most important and is followed by PRODUCT SHIPMENTS rather than CARRIER DELIVERIES (perhaps stakeholders realize that data will not be readily available from carriers). Stakeholders have also rated the customer support events as currently unimportant!
You may not wish to ask all the modelstorming stakeholders to vote on impor-
tance. Arguments may ensue! If you are using Scrum to manage your agile DW/BI
development, prioritizing requirements is the role of the product owner who
manages the product backlog. A subset of the stakeholders can act as a proxy for
the product owner and provide input to the product backlog prior to release and
sprint planning meetings. At these meetings, event importance will be adjusted by
the product owner in the cold light of source data profiling results and the DW
team’s ETL task estimates.
Add a dimension importance row and rate each dimension higher than its most important event.
When all events have been rated and any tied positions resolved—remember, important events should have unique importance ratings—add a dimension importance row to the matrix (as in Figure 4-16). Dimensions are rated after events because dimensions are only important if the events that use them are important. Now that the stakeholders have decided which events are important, you rate each dimension higher than its most important event, because it must be implemented before any fact table based on the event can be implemented (due to foreign key dependencies).
Dimension Rating Rules

Dimension importance rating follows the same rules as event rating, with a few additions/variations:

- A dimension should be rated higher than its highest importance event, unless it has already been implemented, in which case it should be 0. E.g., if Event B with importance 500 is the highest rated event using Dimension C, then C must have an importance between 505 and 595.
- A dimension should be rated lower than any higher importance events that do not use it.
Dimensions and events are rated so they sort correctly on a single product backlog.
In Figure 4-16, dimensions have been rated by first sorting events by importance. CUSTOMER ORDERS and PRODUCT SHIPMENTS have come top, so their importance points (600 or 500) are copied to their dimensions. Stakeholders have then rated order dimensions 610-670 and shipment dimensions 510-540. This numbering scheme allows dimensions and events to be sorted correctly when transferred to a single product backlog.
Figure 4-16: Dimension importance rating
Figure 4-17: A DW/BI product backlog
For more advice on Scrum, sprint planning and time-boxing read Scrum and XP
from the Trenches, Henrik Kniberg (InfoQ.com 2007).
When you have modeled the most important events and dimensions with examples, the modelstorm is complete. If not…
When you have finished rating all the events and dimensions on the matrix, if the most important events (usually the top 1 or 2) and all their dimensions have been modeled with examples, your modeling work is done, for now, and you can bring the modelstorm to an end. You have reached point B with enough information. However, if matrix-only events have been rated highly important you may have one or two more events to model in detail before you can proceed to star schema design and sprint planning.
On a shipment date.
Reuse conformed dimensions and examples wherever possible.
Add this to the table, as in Figure 4-18, and ask for event stories just as you did with the initial CUSTOMER ORDERS event. The only difference this time is that you will be using candidate conformed dimensions that already have examples. You want to re-use these examples where possible, to illustrate the conformance.
Figure 4-18: New PRODUCT SHIPMENTS event table
Don't use cryptic business keys to speed up event story telling.
You might be tempted to speed up the process by using shorter business keys rather than writing dimension subjects out in longhand. This may even appear to be good data modeling practice because event tables will then contain foreign key references to the dimension business keys that will surely make them easier to
translate into physical tables. But this rush to physically model events is counterproductive. Business keys are mostly cryptic codes that will rob event tables and stories of their readability and descriptive power, and—as you will see in Chapter 5—business keys do not make the best foreign keys (or primary keys) in a dimensional data warehouse.
Figure 4-19: Using abbreviations to tell event stories
Keep abbreviations unique by adding a sequence number if necessary.
To avoid confusing abbreviated stories, keep abbreviations unique within dimensions. If an abbreviation is not unique, just add a sequence number to it. For example, if your two favorite employees are James Bond and Jed Bartlet, they can appear in stories as JB1 and JB2. Of course employee James Bond is exceptional; his business key, employee ID 007, is so well known that it is more descriptive than his initials, and could be used very successfully in many eventful stories.
Test conformed dimensions by adding new examples.
When stakeholders give you new examples, try adding them to the appropriate dimension before using them in event stories. Apart from allowing you to use the examples by abbreviation, filling out their dimensional attributes is also a great test of conformance that helps you to spot missing or non-conformed attributes. For role-playing dimensions, you may have to adjust some existing attributes to match new roles; e.g., COMMISSION is a mandatory (MD) attribute of SALESPERSON, but would be a non-mandatory exclusive (X) attribute of a conformed EMPLOYEE [RP] dimension that must play the role of warehouse worker as well as salesperson.

Asking for examples encourages everyone to define and use conformed dimensions. Why make up new example values when you can copy them from a conformed BEAM✲ dimension table?
Don't forget to check for additional who, what, where, why and how details too, and add them to the matrix.
Before you move on to the next W-type, always check for additional details of the current type. Seeing the event stories build up will often prompt stakeholders to suggest additional details they couldn't think of when modeling at the matrix summary level. As soon as stakeholders confirm any additional who, what, where, why or how detail with relevant examples, add them to the matrix where they too might become conformed dimensions.
Mark new details with a [?] type as a reminder to model their dimensional attributes.
Figure 4-20 shows four who and where details added to the shipping event. CUSTOMER and DELIVERY ADDRESS, with their highly abbreviated examples, are conformed dimensions, while CARRIER and WAREHOUSE are new and have not been modeled as dimension tables yet. Any new details/dimensions, like these, can be marked as type [?] as a reminder that, while they may be on the matrix, they still need to be modeled at the attribute level, with examples, when the event is completed.
Figure 4-20: Adding details to the shipment event
Ask how many? to discover the event measures not modeled on the matrix.
Following the where details, you ask for the how manys. These quantitative details do not feature on the event matrix, just like the when details, and are the main reason for modeling the event in table form; the matrix shows how events are described using (conformed) dimensions. The how many examples show how events can be measured.
Add any existing why dimensions from the matrix and ask additional why questions to explain story variations in the measures.
In Figure 4-21, two new quantity details, SHIPPED QUANTITY and SHIPMENT COST, have been added, along with the ORDER ID why detail. The quantity examples are new, supplied by stakeholders, but the order IDs are copied from the previously modeled CUSTOMER ORDERS event because ORDER ID was identified, on the matrix, as a conformed degenerate dimension linking the events. With that existing why filled in, you ask for additional whys, remembering to ask why quantities vary. If you know that an event, such as shipment, is a process milestone you should ask why similar details vary (or do not vary) within the process; for example, you might ask why a shipped quantity would be less than the quantity on its original order.
Why answers can represent the need for additional examples as well as new why details.
This tells you there is a 1:M relationship between orders and shipments. It also tells you that you haven't yet found a combination of details that would make a shipment event unique. You record this by adding a new repeat story to the table, as in Figure 4-21, which demonstrates there can be multiple identical shipment events for the same order line item by duplicating the granular details (ORDER ID, PRODUCT) of an original order (ORD5466).
Figure 4-21: Adding new quantities, a why/how detail from a previous event and an additional repeat story
Concentrate on completing the event with the when, how many and granularity details not recorded elsewhere.
Thanks to its ORDER ID why detail you have the option to embellish PRODUCT SHIPMENTS with additional order dimensions, but because these dimensions are already well documented in the CUSTOMER ORDERS table and on the matrix, you can, if pressed for time, add them later without the stakeholders' involvement. Just make sure you let the stakeholders know you will be doing this. While you have their attention now, you should concentrate the modelstorm on capturing brand new shipment details, especially the when, how many and granularity details not recorded on the matrix. You also need to allow time for modelstorming the attributes of any new dimensions (the details you temporarily marked [?]).
You can add any useful order dimensions to shipments, but you should avoid the how many details, such as ORDER QUANTITY or REVENUE, because the 1:M relationship between orders and shipments would cause these measures to be overstated when summarized; e.g., two partial shipment events that both record (i.e. duplicate) the original ORDER QUANTITY of 10 units will produce a total of 20 units, with double the correct order REVENUE.
Don't copy measures from one event to another. This can lead to facts that double count.
Order measures are better left in the order event and its subsequent order fact table at their true granularity (the order line item) rather than also stored at the shipped line item. In Chapter 8, we cover how you would instead combine shipments with orders to produce a single evolving order event and model the additional measures that provides. For now, we will press on and complete the shipment event with how details.
When you finish modeling an event table, don't forget to model dimension tables for any details that you have marked as [?]. You still need to define some dimensional attributes for these details before ending the modelstorm.
Figure 4-22: Completed PRODUCT SHIPMENTS event
Sufficient Events?
Merge subject area matrices to provide a DW-wide overview of conformance.
After the earlier manufacturing events in Figure 4-6 are added to the sales targets, order processing and customer support events of Figure 4-14, the matrix should look like Figure 4-23. If this matrix inspires you to reuse more dimensions, particularly dimensions from process initiating events such as PURCHASE ORDERS or CUSTOMER ORDERS that could be carried over to their dependent milestone events, then the matrix is doing its job. It should encourage you to maximize dimensional reuse to make each event as descriptive as possible. In addition, if the similarities between the dimensions of PRODUCT SHIPMENTS and WAREHOUSE SHIPMENTS make you think that they might actually be the same type of event, then the matrix is also doing its job. It may turn out that wholesale shipments to resellers are quite different to retail shipments to consumers: these events might be handled by completely different systems. Even so, the matrix is again doing its job in highlighting an opportunity to conform the dimensions of both events, just in case there is business value in doing so.
The event matrix is a great technique for upholding the agile value of working software over comprehensive documentation. The event matrix is enough comprehensive documentation to help you create working software based on conformed dimensions, but if you do need more documentation, link to it from your event matrix spreadsheet cells; e.g., events and dimensions can be hyperlinked to their BEAM✲ tables or star schema models.
A matrix may never contain every event, and not every event it does contain will be implemented.
Although the event matrix in Figure 4-23 might be complete enough for several DW/BI development sprints, is it the complete matrix for the Pomegranate Data Warehouse? What about customer invoicing and payment events after orders, or product configuration prior to shipments? What about events in other subject areas such as HR, finance, R&D? Many of these events may be out of scope for some time, or may never capture sufficiently interesting additional details to be worth measuring.
The role of the matrix is to identify the conformed dimensions for the next release.
Rather than initially modeling every possible event on a matrix, agile DW designers concentrate on making the matrix complete enough for the next release. When a matrix contains enough detail to help prioritize the right events for the next release and understand their conformed dimensions that will be used again in future releases, its job, for now, is done.
Put a large version of the event matrix on the wall where everyone can see it. Regardless of your preferred methods for modeling events and dimensions—BEAM✲ tables or ER notation, flipcharts, whiteboards, or projected spreadsheets—viewing more than a few details at once is impossible. When event and dimension tables cover all your walls, or are buried in spreadsheets, a matrix enables stakeholders and the DW team to see the entire design at a glance.
Figure 4-23: A complete event matrix?
Keep the event matrix up to date! It’s not an initial planning tool or a one-time
modeling technique. Use it to document the ongoing data warehouse design.
Refer to it and update it whenever you are modelstorming. A well-maintained
matrix acts as a constant reminder to everyone to reuse and enhance conformed
dimensions.
Summary
“Just barely good enough” dimensional modeling can lead to the early and frequent deployment
of data marts that answer current departmental reporting requirements, but it also stores up
technical debt, in the form of incompatible data silos that cannot support cross-process analysis
and enterprise level BI. Due to the large data volumes associated with DW/BI, repaying this
debt can be ruinous.
To avoid silo data marts and reduce technical debt, agile DW designers need to model ahead of
the current development sprints and release plans, just enough to identify and define conformed
dimensions. These reusable components of a dimensional model enable drill-across reporting by
providing the consistent row headers and filters needed to combine and compare measures
from multiple business processes. A well documented, well publicized and well maintained set of conformed dimensions forms a data warehouse bus architecture that supports the incremental development of truly agile data marts.
Conformed dimensions are single dimension tables or synchronized copies shared by multiple
star schemas. They can also be swappable [SD] subsets or rollups [RU], derived from a base
dimension, conformed at the attribute level with identical business meaning and contents.
Generalized conformed dimensions that play multiple roles in the same or different events are
referred to as role-playing [RP] dimensions.
Agile dimensional modelers define conformed dimensions by modeling with examples, with
business stakeholders. BEAM✲ example data stories highlight the value of conformance to the
very people who can make it happen politically. Examples can quickly expose the inconsistent
business terms that would hinder conformance.
The event matrix is a modeling and planning tool that documents the relationship between
events and dimensions. It acts as a storyboard for an entire data warehouse design showing just
enough detail to help identify the most valuable conformed dimensions and prioritize their
development.
Listing events in time/value sequence on an event matrix helps you discover missing events by highlighting large time gaps or value jumps in process workflows. It also helps you identify strict chronological process sequences: candidate evolving events that combine all the milestone events of a business process to support end-to-end process performance measurement.
When modeling new events, abbreviated examples allow you to quickly tell stories by reusing
conformed examples where applicable. Unlike codes, abbreviations help to keep stories brief
and readable for stakeholders. They also support the validation, reuse and enhancement of
conformed dimensions.
Chapter 5: Modeling Star Schemas
We are all in the gutter, but some of us are looking at the stars.
— Oscar Wilde
This chapter is a guide to: verifying BEAM✲ models against available data sources; converting BEAM✲ models into star schemas; and validating DW designs by prototyping.

In this chapter we describe the star schema design process for converting BEAM✲ models into flexible and efficient dimensional data warehouse models. The agile approach that we take begins with test-first design, by using data profiling techniques to verify the BEAM✲ model against the data available in source systems. This results in an annotated model which documents source data characteristics and issues. This is used for model review with stakeholders and development sprint planning with the DW/BI team.

Next, the revised BEAM✲ model is translated into a logical dimensional model by adding surrogate keys. The resulting facts and dimensions are documented by drawing enhanced star schemas using a combination of BEAM✲ and ER notation. Finally, the star schemas are used to generate physical data warehouse schemas which are validated by BI prototyping and documented by creating a physical dimensional matrix.
Agile data profiling is done early as a modeling activity – before a target DW schema is created. It is:

- Done early, as a data modeling task to help define the dimensional model.
- Done frequently, to ensure that the model responds to change; this is especially important for new data sources that are being developed in parallel with the data warehouse.
- Done by DW/BI team members who will load the data, to give them a feel for its complexity that will help them with their ETL task estimates.
- Recorded in the business model so that data profiles can be used to review that BEAM✲ BI data requirements model with the stakeholders, before any technical data models are proposed.
Figure 5-1: Agile data profiling
The most expensive and painful way to discover the data profile of an operational
source is to create an idealized target schema, attempt to ETL the source into the
target and record all the errors. Don’t make this extremely late/non-existent data
profiling mistake. Agile data warehouse designers never create a detailed physical
model before profiling a source, unless they are deliberately doing proactive
DW/BI design to help define a brand new source.
Modeling Star Schemas 131
Think of agile data profiling as a form of test-driven design.
Agile data profiling is a form of test-driven (or test-first) design (TDD). Profiling the source data provides you with metrics that can be used to test the fit of a data warehouse model and the content of a data warehouse database before you develop your SQL DDL (data definition language) and ETL code. When profiling isn't possible yet, a proactive DW/BI design can be viewed as an advanced test specification for the new operational system; a test that asks "can the system supply the data required for BI to this specification?"
Events/facts often have unique sources.
For business events and the facts they provide, finding the system-of-record that creates and maintains the original transactions is relatively straightforward, as there is often only one source system for a specific event type. For example, the claims processing system would be the obvious and possibly only choice for sourcing claim submission events. But where should claiming customers or insurance products be sourced from? Identifying the system-of-record for dimensions like these can be far more challenging.
Conformed dimensions can have multiple sources that should be profiled to identify conformance and conflicts.
Conformed dimensions are common to multiple business processes, which may themselves be implemented using a mixture of purchased enterprise software packages and bespoke in-house applications. It is not uncommon for several operational systems to independently maintain common reference data (sometimes called master data), such as Customers, Products, and Employees: the most valuable candidate conformed dimensions. You may need to profile systems that are outside of your present prioritized event scope to find the best source for a conformed dimension and spot any conflicts that would hamper conformance and reuse in the future. There may be no single best source for a dimension!
Conforming ETL processes may have to merge sources to obtain all the necessary keys and attributes.
If you are fortunate, one system will have been declared as the system-of-record for each conformed dimension. But even then, facts (events) from other systems may use alternate business keys and carry additional dimensional attributes. If so, conforming ETL processes will need to match the keys from the systems-of-record and these other sources to create the "perfect" set of conformed dimensional attributes for the next sprint, and ultimately to be able to load facts from these sources in the future.
Master data management systems help dimensional conformance.
If you are extremely fortunate, you may have a Master Data Management (MDM) system that can help you identify the sources to profile for the most important conformed dimensions. MDM captures, cleanses, and synchronizes reference data across operational systems and can provide the cross-referenced business keys that ETL needs to conform multiple sources.
Profile early, before the data warehouse model exists or is updated; then any data quality issues you discover must be inherent in the established system-of-record, not a problem with the newcomer database or ETL process used to build it. Do it the other way round and see who the stakeholders (subconsciously) blame.
Missing Values
The first, best test for any source is to count missing values and calculate the percentage missing.
Nothing (literally) illustrates the value of data profiling more than the early discovery of missing data that the stakeholders have deemed mandatory. Profile for missing values by counting the occurrence of Nulls or blanks in each candidate source column/field and calculating the percentage missing. Knowing how often the source data is Null is essential for any column—but especially for columns that have been identified as mandatory (MD) by the stakeholders. The SQL for counting Null values in a column is:
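For example, a minimal sketch (the PRODUCT table and PRODUCT_WEIGHT column names are illustrative):

-- Count the Nulls in one candidate source column
SELECT COUNT(*) AS missing_count
FROM   product
WHERE  product_weight IS NULL;

-- Count and calculate the percentage missing in a single pass
SELECT COUNT(*) AS total_rows,
       SUM(CASE WHEN product_weight IS NULL THEN 1 ELSE 0 END) AS missing_count,
       100 * SUM(CASE WHEN product_weight IS NULL THEN 1 ELSE 0 END) / COUNT(*) AS percent_missing
FROM   product;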
When you are working with non-database sources, such as flat files, you can map the source data as external tables, or perform basic ETL, with minimal transformations, to move it into DBMS tables, so that it can be profiled using SQL queries and BI tools.
A source column with 100% unique values may be a good candidate for a business key, while progressively lower percentage uniqueness can suggest that a set of columns represents a viable hierarchy. The SQL for ranking each value in a column by its frequency is:
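For example (again, the table and column names are illustrative):

-- Rank each PRODUCT_TYPE value by how frequently it occurs
SELECT product_type,
       COUNT(*) AS frequency,
       RANK() OVER (ORDER BY COUNT(*) DESC) AS frequency_rank
FROM   product
GROUP  BY product_type
ORDER  BY frequency_rank;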
Graph source column values by frequency to discover poor quality content.
Source column value frequency can be graphed to help you spot columns that have no informational content in spite of not being Null. For example:

- Columns where values are (almost) all the same (equal to the default)
- Empty or spaces-only strings: the logical equivalent of Null
- Favorite dates for lazy data entry staff, such as "1/1/01"
Data profiling requires full table scans, making some of the queries involved very resource intensive. You should avoid profiling a live operational system directly, because transactional performance can be adversely affected. This is clearly not the ideal first impression that any data warehouse team wants to make on operational support! Instead, use snapshots (off-line copies) of the candidate data sources held on your own server, or wait until after-hours.
If source data is reliably time-stamped, try grouping your data profiling queries by the Month, Quarter, or Year that the data was inserted/updated to see how data quality changes over time. The worst quality issues may be older than the historical scope of the warehouse—if you're lucky.
-- Generate a missing-value profiling INSERT for every column listed in the
-- DBMS catalog (Oracle-style: SYS.All_Tab_Columns names each table and column).
-- Each generated statement counts the Nulls in one column and stores the
-- result in a PROFILING_RESULTS table.
SELECT
'INSERT INTO PROFILING_RESULTS(TABLE_NAME, COLUMN_NAME,
MISSING_COUNT) SELECT '''
|| Table_Name
|| ''', '''
|| Column_Name
|| ''', COUNT(*) FROM '
|| Table_Name
|| ' WHERE '
|| Column_Name
|| ' IS NULL;'
FROM SYS.All_Tab_Columns
WHERE …
Search online for ready-made profiling scripts.
Search online for "SQL data profiling script" and you should be able to find ready-made scripts, adaptable for your database platform, that will create all the tests recommended above and more, and store the results in table form.
For in-depth coverage of data profiling, data quality measurement, and ETL
techniques for continuously addressing data quality read:
Use no source as an opportunity to define the perfect BI data source.
ETL development is especially challenging when source data definitions are still fluid (non-existent), but this does present an opportunity for the agile data warehouse team to negotiate a better "data deal." The BEAM✲ model can be used to provide a detailed specification of business intelligence data requirements to the operational development team.
When source database development lags behind data warehouse design, you can
avoid delaying ETL development by defining extract file layouts, based on your
BEAM✲ tables, and getting the operational development team to agree to their
scheduled delivery. The agile ETL team can then get on with mapping these
initially empty files to their star schema targets.
Profile sources as soon as they are available.
Once data take-on has begun for a new operational system, you should profile the initial data and the previously agreed-upon extract files as early as possible to help the operational team keep to their promises. Trust no one!
BEAM✲ tables are extended to hold profiling metrics and annotated to highlight source data issues.
In Figure 5-2, the PRODUCT dimension has been extended with data profiling results showing counts and percentages for missing, unique, minimum and maximum values for each column. These simple profiling measures are a great start for highlighting potential issues, and can be augmented with more sophisticated measures and graphics generated by data profiling tools. The Figure 5-2 table has also been annotated to show data sources, unavailable details, new attributes and definition mismatches. The following sections describe the model review notation used.
reference numbers within the braces and expand upon the mapping rules in hyperlinked supporting documentation or footnotes. If there are conflicting sources for the same data, slash (/) delimit the choices. As well as identifying column sources you should also record their data type and length using the codes in Table 5-1.
Figure 5-2: Dimension table annotated with data sources and profiling results
Place data source references on new rows in the table header, with the column types (as in Figure 5-2), so they can be hidden when not needed; e.g., during a model review, if the source names are not meaningful to the stakeholders.
Additional Data
Use italics to highlight additional data.
While profiling the candidate sources, it is extremely likely that you will discover relevant data that the stakeholders didn't request. If any of it looks like potential facts, or dimensional attributes for the currently prioritized events, you should add them to the model for review. Additional business keys that represent further reuse of conformed dimensions are especially interesting. Use bold italics to highlight new columns.
Unavailable Data
Use strikethrough to highlight unavailable data. If an entire table is unavailable, highlight this on the matrix too.
If you cannot find a data source, or the only available source conveys little or no information, use bold strikethrough on the unavailable column and its examples. Figure 5-2 shows that PRODUCT WEIGHT is unavailable. If an entire event or dimension is unavailable you should strikethrough the whole table and the appropriate row or column on the matrix (and inform the stakeholders as soon as possible). Figure 5-3 shows the (thankfully unlikely) situation that there is no reliable source for a product dimension. If this really was the case you would also strikethrough all PRODUCT details in event tables—making them non-viable.
Figure 5-3: BEAM✲ diagram showing missing data source for an entire dimension
Highlight mandatory source data as NN: Not Null.
It can be very useful to point out 'not Null' sources for any event details and dimensional attributes that the stakeholders did not explicitly identify as mandatory, by highlighting them as NN. These rare cases, where data is more reliably available than stakeholders thought, may open up new areas of analysis that they previously didn't consider.
Figure 5-4: Initial planning meeting
Team Estimating
Estimating is an agile DW/BI team activity; every team member should be involved to bring them up to speed with the emerging design. Everyone can usefully contribute; e.g., BI developers can often help with ETL estimates if they are familiar with the data sources, from having had to report directly off them in the past.
Play planning poker to get unbiased team estimates.
A downside of team estimating is that one person, who "knows best" or has the loudest voice, can influence everyone's estimate. A great way to avoid this is to play planning poker: using a special deck of cards, everyone reveals their estimate for a task simultaneously, and the team learns a lot from the differing opinions.
Dimension and event estimates are added to the event matrix.
When task estimates have been agreed, the totals for each table are added to the event matrix so that star schema estimates can be calculated by summing the relevant dimension and event totals. These estimates, used in conjunction with the team's velocity (work delivered per iteration), will give stakeholders an idea of what could be prototyped after the next sprint or delivered in the next release.
Review the annotated data model and task estimates with stakeholders as soon as
possible. Delaying the review can allow unrealistic expectations for the data
warehouse to grow. Don’t let the stakeholders dream for too long!
Figure 5-5: Model review
Concentrate on severe data issues: missing sources and conflicting sources for conformed dimensions.
Start with the most severe issues (see Table 5-2) and work your way down by reviewing any major missing sources first. Completely missing event or dimension sources—strikethrough tables like Figure 5-3—may cause a serious rethink of priorities. Missing individual details are generally less disruptive, but some may be indispensable. Problems with conflicting conformed dimension sources are highly significant as they can have a knock-on effect on future iterations and have the potential to build up the greatest technical debt.
Revise the model with the help of the stakeholders.
As you go through each table or column issue, update the model (the individual tables and the matrix) with the stakeholders' assistance. Ask them to help you decide how each issue should be resolved.
If stakeholders want to reprioritize events, revise the matrix accordingly.
You should finish the review by asking the stakeholders if they want to reprioritize events in light of the data issues and task estimates—bearing in mind that the task estimates may also need to be adjusted based on the changes they have just agreed to. If the stakeholders do want to alter their priorities, revise the matrix by replaying the event rating game described in Chapter 4.
Sprint Planning
Use the revised model, estimates and priorities to define the sprint backlog.
Following the model review, you hold a sprint planning meeting (Figure 5-6) where the DW/BI team will revise their estimates based on the model amendments and the product owner will decide on the data items that will make their way onto the sprint backlog: the list of data and user stories (tables and BI reports/dashboards) to be implemented in the next sprint. To help the team revise their estimates you may need to draw some quick star schemas. It is at this point you would introduce some of the design patterns described later in this book.
Figure 5-6: Sprint planning meeting
Figure 5-7: Creating a (logical) dimensional model
If you are using the BEAM✲Modelstormer spreadsheet, copy your BEAM✲ model
to a graphical modeling tool for star schema layout by using the customizable
SQL DDL it generates. If you haven’t done so already, download the spreadsheet
template and find full instructions for using it at modelstorming.com
Natural key (NK): A key that is used to uniquely identify something in the “real-
world” outside of a database; e.g., a barcode printed on a product package or a
Social Security number on an ID card. Natural key values are sometimes known
by stakeholders and used directly in reports and queries. The Employee ID 007
belonging to our favorite salesperson James Bond, has taken on a life of its own
beyond the HR system and become a natural key.
Business key (BK): A primary key from a source system. This can be a meaningful natural key or a meaningless system-generated surrogate within the source system, but by the time it reaches the data warehouse it has meaning to the business outside of the warehouse and so is referred to as a business key.
Reserve the surrogate key value zero for the default Missing row in each dimension. You use this row zero to hold the stakeholders' Missing labels recorded in the BEAM✲ dimension table example data. If different types of missing are needed you can add additional special rows, using negative surrogate keys, to represent "Unknown", "Not Applicable", "Error" etc., leaving the normal dimension values to use positive integers. Being consistent in the use of special value surrogate keys can greatly simplify ETL processing.
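As a sketch, the special rows might be seeded like this (the PRODUCT dimension column names are illustrative):

-- Row zero holds the stakeholders' Missing labels
INSERT INTO product_dim (product_key, product_id, product_name)
VALUES (0, 'N/A', 'Missing');

-- Negative keys for other special rows
INSERT INTO product_dim (product_key, product_id, product_name)
VALUES (-1, 'N/A', 'Unknown');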
Don't worry about the size of dimension tables…
Adding surrogate keys to dimensions in addition to their business keys slightly increases the size of the dimensions, but this is insignificant. In general, do not worry about the size of dimensions. Although dimension rows can be long, with dozens of descriptive attributes, dimensions typically only contain hundreds or thousands of rows in total, whereas fact tables can record millions of rows per day. If you want to see where you should concentrate on saving storage, take a look at the Figure 5-8 "scale" diagram of a star schema. In a dimensional data warehouse, fact tables along with their indexes and their staging tables account for 80%–90% of the storage requirements.
…except customer dimensions – they can be big!
Of course not every dimension is small. Customer dimensions that contain individual consumers can contain tens or hundreds of millions of rows. These need to be treated with the same respect as fact tables and will benefit too from a primary key index based on a compact 4-byte integer rather than a longer "smarter" alphanumeric Customer ID that contains a check digit. Chapter 6 covers specific techniques for handling very large dimensions (VLDs), also known as "monster dimensions".
Figure 5-8: Scale diagram of a star schema by space used
For fixed length surrogate keys, a 4-byte integer is suitable for most dimension
populations. If you are expecting a crowd (more than 2.1 billion whos or whats) or
have specialized calendar dimension requirements (discussed in Chapter 7), you
should use an 8-byte integer surrogate key.
RI prevents bad keys getting into good fact tables.
Referential integrity (RI) means that every foreign key has a matching primary key. Without RI checks, facts could be loaded into fact tables with corrupt dimension foreign keys that do not match any existing dimensional values. If this happens, any query that uses these bad keys will fail to include those facts because they will not join to the appropriate dimensions. If these "bent needles" find their way into the giant haystack of fact tables, the (SQL "NOT IN") queries needed to find them are prohibitively expensive.
DBMS constraints can enforce RI, but ETL SK lookups can often do this more efficiently.
RI can be enforced by defining foreign key constraints in the database. However, in practice, DBMSs can be too slow at loading data warehousing quantities of data with RI switched on. In contrast, ETL processes are optimized for performing the type of in-memory lookups required to check foreign keys against primary keys—this is exactly what ETL does when translating business keys into surrogate keys prior to loading fact tables. Effectively, the surrogate key processing provides "free" procedural referential integrity, which allows DBMS RI checking to be safely disabled.
DBMS query optimizers often benefit from having fact table RI constraints defined.
You can retain these optimization clues but still boost ETL performance by setting
the constraints to unenforced (you might call this “trust me I know what my ETL
process is doing” mode). This tells the optimizer what it needs to know about the
relationships between facts and dimensions to speed up queries, but avoids
unnecessary insert and update checks that would slow down ETL.
Enable DBMS-enforced RI for fact tables during ETL development and initial data take-on to provide "belt and braces" data integrity assurance and test ETL surrogate key lookups. If no DBMS RI errors are raised, the ETL processes are assigning valid surrogate keys to facts and the additional DBMS checks are unnecessary. You can then disable the DBMS RI (drop the constraints or set them to unenforced) if it is having an adverse effect on load times.
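As a sketch, in Oracle-style syntax (keywords vary by DBMS; table and column names are illustrative), an unenforced but optimizer-visible fact table constraint looks like this:

ALTER TABLE orders_fact
  ADD CONSTRAINT orders_product_fk
  FOREIGN KEY (product_key) REFERENCES product_dim (product_key)
  RELY DISABLE NOVALIDATE; -- visible to the optimizer, not checked on load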
Figure 5-9: Slowly changing EMPLOYEE dimension
Tracking history within a dimension means you cannot rely on the business key alone as a primary key.
Creating new rows to track change presents an issue for uniquely keying a dimension. For example, Bond's business key "007" no longer uniquely identifies a single Employee row and must be combined with the effective date of the change to provide a valid primary key. Unfortunately a composite key such as EMPLOYEE ID, EFFECTIVE DATE would ruinously complicate the joining of historical facts to the correct historical descriptions. Prior to tracking history, a simple equi-join would locate Bond's expenses:
Employee.Employee_ID = Expenses_Fact.Employee_ID
Composite keys involving effective dates require complex joins to fact tables.
With a composite key involving effective date, this becomes a far more difficult to optimize complex (or theta) join:

Employee.Employee_ID = Expenses_Fact.Employee_ID and
Expenses_Fact.Expense_Date between Employee.Effective_date and Employee.End_date

Without the between join on the dates, all of Bond's expenses would be joined to each historical version of him, triple counting his total based on the three versions of Bond in Figure 5-9. If the above join looks complex, imagine now that EMPLOYEE isn't the only HV dimension that must be joined to the facts; each join would be just as complex. This would not be a viable query strategy against typical data warehousing quantities of facts.
A Type 2 SCD surrogate key partitions history by using a simple equi-join.
Instead, Type 2 SCDs use a surrogate key as an efficient minimal primary key that uniquely identifies each historical version of a dimension member. Figure 5-9 shows the surrogate key EMPLOYEE KEY being added to the dimension. This would become a foreign key in all employee related fact tables. For Bond, his earliest expense claims and sales transactions would have an EMPLOYEE KEY of 1010 while his most recent will have 2120. A Type 2 SCD surrogate key guarantees that efficient equi-joins will automatically join historical facts to the correct historical descriptions and the most recent facts to current descriptions. They also have the effect of making reports "repeatable"; for example, Bond's 1968 expenses will always be reported as incurred by a single man, never a widower.
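With the surrogate key in place, the historical join collapses back to a simple equi-join on the Figure 5-9 key column:

Employee.Employee_Key = Expenses_Fact.Employee_Key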
Treat CV as a reporting default rather than an ETL instruction. Store dimensional history if possible, and enable CV reporting by providing a hybrid SCD.
As discussed in Chapter 3, you should often treat CV codes added to the model by stakeholders as reporting directives rather than storage decisions. CV tells you that the stakeholders would prefer their reports to initially default to current values (because that is what they are used to). When their analysis requirements become more sophisticated they may change their minds. With modern DW/BI hardware you have the luxury of storing and processing dimensional history for most dimensions as standard practice. And just because you track history you don't have to give it to BI users who don't want it (yet). Chapter 6 covers the hybrid SCD pattern for providing both current value ("as is") reporting and historic value ("as was") reporting without further complicating the model or ETL processes.
The Data Warehouse ETL Toolkit, Ralph Kimball, Joe Caserta (Wiley, 2004),
Chapter 5, pages 183–196 provides information on the ETL processing needed to
support slowly changing dimensions. Pages 194–196 describe the complexities of
handling late-arriving dimensional history.
Figure 5-10: Updated PRODUCT dimension
Use unique example ranges for the most common surrogate keys; e.g., 1–1000 for customers, 2000–3000 for products. This can help the DW team read the foreign key examples in fact tables (stakeholders would never look at these values). This convention is just for human readability; reserving value ranges for specific dimension keys in the physical database is not recommended.
Effective dating attributes support point in time (HV) dimension queries.
EFFECTIVE DATE and END DATE define the valid date range for each dimension row. For example, in Figure 5-9 employee Bond's three MARITAL STATUS (HV) changes have unique effective date ranges—with no overlaps or gaps. For the current version of each EMPLOYEE there is no END DATE. But rather than leaving END DATE as Null, make sure ETL processes set it to the maximum date supported by the database. This allows query tools to use simple BETWEEN logic when asking questions about the dimension population at a specific point in time. For example, a query to count the number of employees in each city at the close of 2011 would be:
SELECT city, count(*)
FROM employee
WHERE TO_DATE('31/12/2011','DD/MM/YYYY') BETWEEN
effective_date AND end_date
GROUP BY city
Queries should use a CURRENT flag rather than the max DBMS date.
The CURRENT flag indicates whether a row version is current (Y). This could be inferred from the value of END DATE, but the flag saves stakeholders and query tools from remembering the otherwise meaningless maximum date value, which can vary by DBMS.
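For example, a current-population version of the previous employee count query needs only the flag (assuming the flag column is physically named CURRENT_FLAG, as CURRENT is a reserved word in some DBMSs):

SELECT city, count(*)
FROM employee
WHERE current_flag = 'Y'
GROUP BY city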
SCD administrative attributes should be Not Null.
SCD administrative attributes should all be defined as Not Null. END DATE should have a default value of the maximum database date, and CURRENT should default to "Y".
EFFECTIVE DATE and END DATE in Figure 5-10 are shown as dates. This would allow the dimension to track one set of changes per day, because the minimum effective range for a historical version is one day. Multiple changes on the same day (if they could be detected from the source system feed) would have to be batched into a single update to the dimension. This is a reasonable approach if multiple changes to the same attribute on the same day are corrections. If intra-day changes are more significant and must be tracked to match intra-day facts, EFFECTIVE DATE and END DATE need to be stored as full timestamps.
Additional ETL attributes can be useful for identifying special "missing" rows and providing dimension audit information.
There are five additional administrative attributes in Figure 5-10 that you should also consider adding to every dimension:

- MISSING
- CREATED DATE
- CREATED BY
- UPDATED DATE
- UPDATED BY
The MISSING flag "Y" indicates that the row is a special "Missing" dimensional record (usually with a zero or negative surrogate key). "N" indicates a normal dimension record. This can be useful for filtering out all forms of missing (e.g. "N/A" and "Error") without exposing their surrogate key values to BI users.

The CREATED and UPDATED attributes provide basic ETL audit information on the date/time and ETL version used to create and update the dimension. Chapter 9 provides more details on audit techniques.
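Pulled together, the administrative attributes might be declared as follows (a sketch; the names, types and maximum date literal are illustrative and DBMS-dependent):

CREATE TABLE employee (
  employee_key   INTEGER PRIMARY KEY, -- Type 2 SCD surrogate key
  -- business key and descriptive attributes go here
  effective_date DATE NOT NULL,
  end_date       DATE DEFAULT DATE '9999-12-31' NOT NULL,
  current_flag   CHAR(1) DEFAULT 'Y' NOT NULL,
  missing_flag   CHAR(1) DEFAULT 'N' NOT NULL,
  created_date   DATE NOT NULL,
  created_by     VARCHAR(50) NOT NULL,
  updated_date   DATE NOT NULL,
  updated_by     VARCHAR(50) NOT NULL
);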
Time Dimensions
Model time dimensionally as separate CALENDAR and CLOCK dimensions.
If you haven't already done so, you should model when dimensionally—just like any other W-type. A CALENDAR dimension is essential to the data warehouse because it provides the conformed time hierarchies (discussed in Chapter 3) and descriptions that stakeholders need to analyze every business process. You should also model time of day to discover if stakeholders have custom descriptions for periods during a day, such as peak/off peak or shift names. If so, these should be implemented as attributes of a separate minute granularity CLOCK dimension (just 1,440 rows) to avoid a single monster TIME dimension that would contain 2.6 million minutes (365 days × 1,440 minutes × 5) for every 5 years of warehouse history.
Datetime details become separate date and time surrogate keys.
Figure 5-11 shows how a single time of day granularity when event detail, CALL TIME, is replaced by two surrogate keys, CALL DATE KEY and CALL TIME KEY, in a fact table. Both CALENDAR and CLOCK are role-playing (RP) dimensions that will be used to replace the when details of every event. Chapter 7 provides full details on time dimensions and their special property surrogate date keys.
Figure 5-11: Splitting a when detail into separate Calendar and Clock dimensions
Rename fact tables and record their fact type.
With the event table copies saved, you can replace event names with fact table names and change story types to fact table types. In Figure 5-12 the CUSTOMER ORDERS discrete event has been renamed ORDERS FACT and its story type DE replaced with the fact table type TF for transaction fact. Chapter 8 describes each of the fact table types in detail.
The following table codes are used to identify fact table type:
[TF] : Transaction Fact table, the physical version of discrete events
[PS] : Periodic Snapshot, the physical version of recurring events
[AS] : Accumulating Snapshot, the physical version of evolving events
Leave descriptive examples for readability, or change them to integers to explain SK techniques.
Replacing the examples is an optional step; you might change them to integer sequence numbers, as we have here, if you are using a BEAM✲ table to explain a surrogate key technique to the team. Alternatively, you can leave the descriptive examples from the original event modeling unaltered so that you don't have to keep referring to the separate dimension tables to understand the event stories behind the facts. Regardless of what you do to the examples, a column type of SK documents that a fact column is an integer dimension foreign key in the physical database schema.
Figure 5-12: Creating the ORDERS FACT table
Modeling Facts
The remaining how many details are converted into facts that can be aggregated.
The remaining quantity columns in the fact table are defined as facts. Facts should be modeled in their most additive form, so that they can be easily aggregated at query time. Additivity describes how easy or possible it is to sum up a fact and get a meaningful result. The ideal facts are fully additive (FA) ones that can be summed using any of their dimensions.
Fully additive (FA) facts are ideal because they can be summed using any dimension.
The three order facts in Figure 5-12 have all been defined as fully additive (FA). To convert the raw how many details to (fully) additive facts they must be stored using consistent additive units of measure (UOM). ORDER QUANTITY can use the product units from the original business events, but REVENUE and DISCOUNT, which originally showed examples in numerous currencies, must be transformed
into dollars during ETL, otherwise they would be non-additive (NA) facts. In the case of DISCOUNT some of the source figures were percentages. You could create a consistent UOM by transforming all discounts into percentages, but that would not be an additive UOM: a $5 discount and a $10 discount on two order lines sum to a meaningful $15, whereas adding discount rates of 10% and 20% to get 30% is meaningless. Additive fact design is covered in detail in Chapter 8.
Don’t attempt to create one single ER diagram showing all the fact tables and
dimensions in the data warehouse. Even for a small subset of stars this quickly
becomes a mess of overlapping lines. ER notation is best restricted to viewing
one star at a time. Instead, develop a data warehouse matrix (covered shortly) to
provide a more useful overview of multiple stars or the entire model.
Enhanced star schema = star + consistent layout + BEAM✲ codes.
You can do two simple things to turn a standard star schema into an enhanced star schema. The first is to consistently place dimensions based on their W-type. The second is to add BEAM✲ short codes to the tables and columns to describe their dimensional properties.
Figure 5-13: Consistent dimensional layout based on W-type

Discover the BI Model Canvas at modelstorming.com to help you model collaboratively using this layout.
If your modeling tool allows you to include table and column comments or extended attributes on ER diagrams, you can use these to display BEAM✲ codes. Alternatively, if this feature isn't available, you may be able to display the codes by appending them to the business or logical table and column names in your model and setting the tool's model view to conceptual or logical. The BEAM✲Modelstormer spreadsheet contains configurable options for exporting names and codes as comments or extended database attributes that can be imported by many modeling tools.
Figure 5-14: Enhanced star schema for customer orders
Resist the urge to snowflake. For most dimensions there are no advantages.
Once the model is in a familiar ER modeling tool, you (or the DBAs) may be tempted to introduce snowflaking to reduce data redundancy and simplify dimension maintenance. However, snowflake schemas are not generally recommended. They are too complex for user presentation (if required by your BI tool), offer no significant space savings (see Figure 5-8), exhibit poor dimension browsing performance, and negate the advantages of bitmap indices. There are legitimate reasons for snowflaking very large dimensions, covered in Chapter 6, but resist any 3NF (third normal form) urges brought on solely by using an ER modeling tool.
Figure 5-15: Snowflake schema
Use hierarchy charts rather than snowflakes to define hierarchies.
Do not use snowflake schema outriggers to document hierarchies if your database or BI toolset doesn't need them explicitly defined as 1:M relationships—most don't. Draw hierarchy charts instead. Do be pragmatic: if your toolset works better with snowflake schemas, create them as a physical optimization.
Define rollups by copying and editing their base dimensions.
Rollup dimensions are often not explicitly modeled as BEAM✲ tables because they do not contain any attributes or values that are not present in their base dimensions. If a rollup dimension is as yet undefined, you should create it at the star schema level by copying its base dimension and removing all the attributes below the rollup level in the base dimension hierarchy. This is analogous to the ETL process that should build the rollup from its base dimension data, rather than source data, to keep the two in sync and guarantee conformance.
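A sketch of this copy-and-trim approach in SQL (names are illustrative; the rollup's own surrogate key would be assigned during the load):

-- Keep only PRODUCT TYPE and the conformed attributes above it
CREATE TABLE product_type_dim AS
SELECT DISTINCT product_type, subcategory, category
FROM   product_dim;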
If you are using the BEAM✲Modelstormer spreadsheet, you can edit its DDL
template to generate custom SQL for your DBMS.
A common naming convention is to prefix all dimension tables with DIM_ so that
they sort together. What do stakeholders and developers (subconsciously) think
every time they see DIM_CUSTOMER or DIM_EMPLOYEE? Instead, reserve a
common schema or owner name, perhaps “DIMENSION”, for creating dimensions,
to achieve the same grouping and avoid such pejorative table names.
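A sketch of the schema-based alternative (names are illustrative):

CREATE SCHEMA dimension;

-- Grouped by schema, without the pejorative DIM_ prefix
CREATE TABLE dimension.customer (
  customer_key  INTEGER PRIMARY KEY,
  customer_id   VARCHAR(20) NOT NULL,
  customer_name VARCHAR(100)
);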
Stakeholders will be ready to define their reports using the 7Ws.
Turn end of sprint demos into prototyping workshops; have BI developers help the original modelstormers (real stakeholders) get their "hands dirty" using their design with real data and real BI tools, as in Figure 5-16. These workshops can be remarkably productive because the stakeholders—having used the 7Ws to modelstorm their data requirements—will already be thinking about their business questions and report layouts in terms of these 7W dimensional interrogatives.
Figure 5-16: DW/BI Prototyping
Load prototype stars with 10,000 recent facts and similar samples from previous time periods.
For prototypes, avoid test data generation—it proves nothing. Instead, validate the ETL process by sampling small amounts of real data, extracted from the actual sources documented in the model. 10,000 recent facts with matching dimensional descriptions, plus similar samples from one or two previous years, is usually just enough representative data for stakeholders to get a true feel for what the final solution will be like. Use data profiling to set realistic expectations of the prototype before any queries are run. Make sure stakeholders understand that counts and totals will be low because a small percentage of the data has been sampled.
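A sketch of such a sampling extract, in Oracle-style SQL (the source table and date logic are illustrative):

-- Roughly 10,000 facts from the most recent three months of the source
SELECT *
FROM   src_order_lines
WHERE  order_date >= ADD_MONTHS(SYSDATE, -3)
AND    ROWNUM <= 10000;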
Speed up ETL prototyping by not indexing the data. BI prototyping with unindexed sample data on modest hardware will also help to set realistic expectations for query performance against complete data, fully indexed on specialist DW/BI hardware.
The event matrix is for planning. The physical matrix documents the current live model.
This physical matrix and the event matrix should be kept in sync as much as possible but will diverge at times because of their distinct functions. The event matrix is a modeling and planning tool that reflects the stakeholders' requirements, whereas the data warehouse matrix is a management tool that reflects the current state of the data warehouse—including any conformance failures.
Document conformance failure on the matrix by using dimension version numbers.
If you have to compromise within a sprint and postpone conforming a dimension, or you inherit a warehouse that has evolved without conformed dimensions, you should record these conformance failures on the data warehouse matrix by using dimension version numbers. Rather than create a separate column for each non-conformed version of a dimension, continue to use a single column for each planned conformed dimension but number each different version in use, rather than just tick usage. Reserve the highest number for the best version of each dimension (usually the most recently developed). For example, Figure 5-17 shows that Pomegranate has failed to conform product across manufacturing, sales and customer support; instead there are three different versions of PRODUCT (perhaps it really should be called DIM_PRODUCT). Thankfully, PRODUCT is partially conformed; the best version is already the most widely used and only two stars (Customer Orders and Customer Complaints) need to be refactored.
Update the event matrix with any conformance issues and address them with stakeholders.
After each sprint, the event matrix should be updated with conformance failures (planned conformance that did not happen) and non-conformed realities (planned conformance that could not happen because it was wrong) so that these issues can be addressed with the stakeholders during the next modelstorm. Use the updated event matrix to plan the refactoring of older stars with newer, more conformed dimensions as part of your iterative development approach.
A live version of the matrix, showing up-to-date volumetrics and the current ETL
status for each star, is the ideal dashboard for a DW/BI team. You could develop
one by using BI tools to summarize ETL and DBMS catalogue metadata.
Figure 5-17: Data warehouse matrix
Summary
Agile data profiling targets the data sources implicated by the BEAM✲ model. It is done early as a data-driven modeling activity to validate the stakeholders' data requirements before detailed star schemas are designed. When data sources don't yet exist, proactive DW/BI designs based on the BEAM✲ model can help define better BI data feeds from new operational systems.
Annotated models present data profiling results in a format stakeholders are familiar with. An
annotated table contains source names, data types and summary data profiling metrics. Data
source issues such as missing data and mismatched definitions are highlighted using
strikethrough. Additional data is highlighted using italics.
The DW/BI team uses the annotated model and detailed data profiling results to provide initial
task estimates for building and loading the proposed facts and dimensions. These ETL estimates
are added to the event matrix for use during model review and sprint planning.
During a model review, stakeholders use the annotated model and the DW/BI team estimates to
agree amendments to the design and reprioritize their requirements in light of the data realities
and available development resources.
BEAM✲ models are easily translated into logical dimensional models and star schemas.
Dimension tables are updated by adding primary keys and administrative attributes. Event
tables are converted into fact tables by replacing dimensional details with foreign keys and
changing quantities (how many details) into fully-additive (FA), semi-additive (SA), or non-
additive (NA) facts with standardized (conformed) units of measure.
(Data warehouse) surrogate keys are used as dimension primary keys to insulate the data
warehouse from business keys, provide dimensional flexibility (manage SCDs, missing values,
multi-levels, etc.) and improve query efficiency.
Chapter 6: Who and What: People and Organizations, Products and Services
Who’s on first?
— Bud Abbott and Lou Costello
What’s next?
— President Jed Bartlet, “The West Wing”
Who and what are the most important dimensions.

Who and what dimensions such as CUSTOMER, EMPLOYEE and PRODUCT represent some of the most interesting, highly scrutinized, and complex dimensions of a data warehouse. Modeling these dimensions and their inherent hierarchies presents a number of challenges that can be addressed by design patterns.
This chapter describes design patterns for defining flexible, high performance who and what dimensions.

In the first of our W-themed design pattern chapters we begin by describing mini-dimensions and snowflaking for handling large, volatile customer dimensions, swappable dimensions for mixed customer type models, and hierarchy maps for recursive customer relationships. We then move on to employee dimensions to cover hybrid SCD views for current value/historic value (CV/HV) reporting requirements and multi-valued hierarchy maps for multi-parent HR hierarchies with dotted-line relationships. We finish by looking at product and service dimension issues and introduce multi-level dimensions for variable detail facts and reverse hierarchy maps for component analysis.
Customer Dimensions
Customer dimensions are typically very large dimensions (VLDs).

Customer dimensions are particularly challenging because of their size. Business-to-consumer (B2C) customer dimensions can be deep (millions of customers), wide (many interesting descriptive attributes), and volatile (people are volatile). This combination of high data volumes and high volatility is the reason customer dimensions are often referred to as “monster dimensions”—they're a little scary.
How large is a very large dimension (or table of any type)? Everything is relative.
Any absolute figure we quote will be trumped by future hardware and that
trumped again by unimagined requirements for capturing big data. The only
definition that stands the test of time is: “a very large table is one that does not
perform as well as you wish it to.”
Mini-Dimension Pattern
Problem/Requirement
Customer attributes can be too volatile to track using the Type 2 SCD technique.

Stakeholders are very interested in tracking descriptive changes to the customer base to help to explain changes in customer behavior. So they have defined many historic value (HV) customer attributes. Unfortunately, using the Type 2 SCD technique for each HV attribute is likely to cause explosive growth in the customer dimension; for example, a 10 million row CUSTOMER [HV] dimension with an AGE [HV] attribute will grow by 10 million rows per year. Obviously AGE is a poor choice as an HV attribute; it alone would turn CUSTOMER into a rapidly changing dimension. This issue is quickly solved by replacing AGE in the dimension with the fixed value (FV) attribute DATE OF BIRTH [FV] and calculating the historically correct age, at the time of the facts, in the BI query layer. Sadly, very few customer dimension historical value requirements are as easy to solve as age.
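For example, assuming an Oracle database and illustrative column names, the historically correct age can be derived in the BI query layer from the fixed date of birth and the date of the fact:

-- Whole years between the FV date of birth and the fact date
floor(months_between(order_date, date_of_birth) / 12) as age_at_order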
Customer HV attributes must be carefully chosen. Not every change should be tracked, can be tracked, or is worth tracking.

It only takes 5 HV attributes (that cannot be calculated) that change on average once every two years for each customer, for an initial 10M row CUSTOMER dimension to grow by up to 25 million rows per year. With growth like that, you will have to be careful about which attributes you define as HV, and what types of change you track. You don't want to track attributes that have no historical significance—they should be defined as current value (CV). Nor do you want to track a history of corrections that should be handled as simple updates. Corrections are easy to spot for FV attributes, such as date of birth (cannot change, can only be corrected), but may require group change rules (described in Chapter 3) that look for combinations of HV and CV attributes changing together to detect genuine change. You may also want to avoid tracking macro-level changes.
Most Type 2 SCDs can cope well with individual micro-level changes.

Micro-level change occurs when individual dimension members experience change that is unique to them; for example, a customer changes CUSTOMER CATEGORY and goes from being a “Good Customer” to a “Great Customer”. If CUSTOMER CATEGORY is an HV attribute, one row will be updated to give the old value an end date and one row will be inserted with the new value. Hierarchically, this is “change from below”.

Macro-level, global changes are more challenging. Should they be handled as changes or corrections?

Macro-level change occurs when many dimension members are changed at once; for example, every “Great Customer” becomes a “Wonderful Customer”. Rows affected: 1,000,000 updated, 1,000,000 inserted. Hierarchically, this is “change from above”: it's not customers who have changed but CUSTOMER CATEGORY itself. The category “Great Customer” has changed to “Wonderful Customer”.
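A minimal sketch of the micro-level Type 2 response in SQL, assuming illustrative surrogate key, effective date and current flag housekeeping columns on CUSTOMER:

-- Expire the current version of the changed customer...
update customer
set    end_date = :change_date, current_flag = 'N'
where  customer_id = :customer_id
and    current_flag = 'Y';

-- ...and insert a new version with a new surrogate key
insert into customer (customer_key, customer_id, customer_category,
                      effective_date, end_date, current_flag)
values (customer_key_seq.nextval, :customer_id, 'Great Customer',
        :change_date, date '9999-12-31', 'Y');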
Solution
Volatile HV attributes have a high cardinality M:M with their dimension.

Rapidly changing HV customer attributes have a high cardinality many-to-many (M:M) relationship with customer. One possible solution for tracking these attributes is to model them just as you would model other customer M:M relationships, such as the products they consume, or the sales locations they visit. Products and locations are of course modeled as separate dimensions and related to customers through fact tables. The same can be done with volatile customer attributes by moving them to their own mini-dimension.
They can be stored in a separate mini-dimension with its own surrogate key and related to the dimension through fact tables.

Figure 6-1 shows CUSTOMER DEMOGRAPHICS, a customer mini-dimension formed by relocating the volatile HV CUSTOMER attributes relating to location, family size, income, and credit score. This mini-dimension has its own surrogate key (DEMO KEY) which is added to customer-related fact tables to describe the historic demographic values at the time of each fact. With fact relationships used to track history, all the problematic HV attributes can be removed from CUSTOMER, or changed to CV only. This would leave you with an entirely CV CUSTOMER dimension that only grows as new customers are acquired.
Figure 6-1 Removing volatile attributes from CUSTOMER
Poorly designed mini-dimensions can be almost as large and volatile as the original dimension.

So, problem solved? Unfortunately, it might just be a case of problem moved. If the mini-dimension contains several high cardinality attributes, the number of unique demographic profiles may be almost as high as the number of customers, and customer changes will create new profiles rather than reuse existing ones because they are too specific. The CV customer dimension might not grow but the so-called “mini-dimension” will, to become the new “monster dimension”.
Create stable mini-dimensions by removing high cardinality attributes or reducing their cardinality by banding.

Mini-dimensions need to be mini and stay mini if they are to solve the VLD HV problem. Figure 6-2 shows a redesign of CUSTOMER DEMOGRAPHICS where some of the original high cardinality attributes (CITY, POST CODE) have been removed and the continuously valued attributes (INCOME, CREDIT SCORE) have been converted into low cardinality discrete bands. This dramatically reduces the number of unique profiles and increases the chances of reusing them when customers change.
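Banding like this is typically applied during ETL with a simple CASE expression; the band boundaries below are purely illustrative:

-- Reduce a continuously valued attribute to a few reusable bands
case
  when income <  20000 then 'Low'
  when income <  50000 then 'Medium'
  when income < 100000 then 'High'
  else 'Very High'
end as income_band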
When you have defined a small stable customer mini-dimension you may be able
to add additional frequently queried, low cardinality (GENDER, AGE BAND) attrib-
utes without significantly increasing its size. These would increase the filtering
power of the mini-dimension and reduce the need for many queries to access the
much larger CUSTOMER dimension at all.
Add mini-dimension keys to their main dimensions, to support efficient ETL processing.

Figure 6-2 also shows that CURRENT DEMO KEY, a CV foreign key to CUSTOMER DEMOGRAPHICS, has been added to CUSTOMER. This creates a single table containing the customer business key (CUSTOMER ID) and the two customer surrogate keys (CUSTOMER KEY and CURRENT DEMO KEY) needed to load customer facts. ETL processes would use this to build an efficient lookup.
Figure 6-2 Creating a mini-dimension
A mini-dimension foreign key (CV, FK) allows the mini-dimension to be used to answer current value questions.

The CURRENT DEMO KEY also allows queries to ask questions using current demographic descriptions; for example, the stakeholder question “how many high income customers are there?” (with no further qualification it must mean currently high income) can be answered by joining CUSTOMER and CUSTOMER DEMOGRAPHICS directly without having to go through the fact table. This uses a shortcut join which is not compatible with fact related queries. For BI tools that do not support shortcut joins, or for queries that need both current and historic demographics, a view can be created on the mini-dimension to play the role of CURRENT DEMOGRAPHICS as in Figure 6-2. This customer dimension outrigger could be used to answer interesting questions like “How many customers who are currently married, last purchased from us when they were single?”
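That last question might look like this in SQL. The table and column names are illustrative, and the “last purchased” qualification (which needs an extra correlated max-date condition) is ignored for brevity:

-- Currently married customers who purchased while single
select count(distinct c.customer_id)
from   sales_fact f
join   customer c
       on f.customer_key = c.customer_key
join   customer_demographics hist        -- demographics at the time of the fact
       on f.demo_key = hist.demo_key
join   customer_demographics curr        -- current demographics outrigger
       on c.current_demo_key = curr.demo_key
where  curr.marital_status = 'Married'
and    hist.marital_status = 'Single';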
A mini-dimension surrogate key should be added to all fact tables associated with
the main dimension, where it represents a historical value foreign key (HV, FK).
The mini-dimension surrogate key should also be added to the main dimension as
a current value foreign key (CV, FK) to support ETL processing and CV “as is”
reporting.
Consequences
Mini-dimensions increase the size of fact tables by adding foreign keys. If many
high cardinality HV attributes must be tracked, they may need to be separated into
multiple mini-dimensions, to control both main and mini-dimension size. Each
mini-dimension that you create will contribute an extra foreign key and index to
the fact tables.
Solution
Generally, it’s a good idea to denormalize as many descriptive attributes as possible
into a dimension to simplify the model and improve query performance. But
CUSTOMER dimensions, because of their size, are exceptional and can often
benefit from sensible normalization or “snowflaking”. The FIRST PURCHASE
DATE and GEODEMOGRAPHICS outriggers, shown in Figure 6-3, represent
“Snowflake” the sensible snowflaking because they avoid a large number of much lower cardinality
CUSTOMER date and geodemographic attributes being embedded in the CUSTOMER dimen-
dimension. Move sion. Keeping these attributes separate will make a worthwhile storage saving that
large collections of will improve query performance—especially for all the queries that are not inter-
low cardinality ested in the first purchase dates or geodemographics. In this specific case the use of
attributes to an outrigger for FIRST PURCHASE DATE makes even more sense as it can be
outriggers implemented as a role-playing view of the standard CALENDAR dimension.
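Assuming a standard CALENDAR dimension with illustrative column names, such a role-playing outrigger need be nothing more than a renaming view:

-- FIRST PURCHASE DATE outrigger as a role-playing view of CALENDAR
create view first_purchase_date as
select date_key      as first_purchase_date_key,
       calendar_date as first_purchase_date,
       month_name    as first_purchase_month,
       year_number   as first_purchase_year
from   calendar;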
Attributes that are administered differently may need to be snowflaked.

For commercially supplied geodemographic information there may be additional administrative or legal reasons for snowflaking. It may be supplied on a periodic basis and updated independently of customer data, or there may be licensing restrictions on the number of users who can access it, in which case it cannot be held in a customer dimension available to the entire BI user community.
Figure 6-3 Useful customer “snowflaking”
The outriggers in Figure 6-3 do not track history. This is not a problem for first
purchase attributes as they are fixed values that can only be corrected. For
geodemographic attributes, history could be tracked by defining CUSTOMER as
HV but this would lead to uncontrolled growth in the dimension. Alternatively, the
GEODEMOGRAPHICS outrigger could be used as a mini-dimension by adding
GEOCODE KEY to existing fact tables or a newly created customer demographics
fact table.
Consequences
Outriggers complicate dimensional models and are generally unnecessary for most
dimensions. Once you have introduced useful outriggers to one dimension, your
colleagues, especially those with a 3NF bias, may be tempted to define less useful
outriggers in other dimensions that might not have such a positive effect.
You should only model outriggers that have far fewer records than the monster dimensions they are associated with. If any attributes of a proposed outrigger have a cardinality that approaches that of the dimension, leave them in the dimension.
Solution
Dimensions that contain large groups of exclusive attributes (based on one or more
defining characteristic (DC) attributes) can be modeled as sets of swappable dimen-
sions to improve usability and performance. Swappable dimensions are so named
because they can be swapped into a query in place of (or in addition to) another
swappable dimension that shares the same surrogate key. Figure 6-4 shows swap-
pable sets of customer and product dimensions that would be useful for a mixed
business model data warehouse. The main CUSTOMER dimension contains attributes that are common across the entire customer population; this includes the defining characteristic CUSTOMER TYPE [DC1,2] which identifies which of two
exclusive groups of attributes are relevant: X1 consumer attributes or X2 business
attributes. The swappable CONSUMER and BUSINESS CUSTOMER dimensions
contain these common attributes and the exclusive attributes relevant to just their
customer type.
More efficient swappable subset dimensions can be created based on defining characteristics.

Swappable subset dimensions are easier for many BI users to navigate because they contain only the rows and columns that are relevant to them. For example, BI users working in corporate sales would use the BUSINESS CUSTOMER version of CUSTOMER—using database synonyms, it can be renamed to be their default CUSTOMER dimension. Because BUSINESS CUSTOMER only contains corporate customers they would see only the business attributes they want, and would not have to add where Customer_Type='Business' to every query. If businesses made up only 10% of the customer base, the corporate sales analysts would see a significant performance boost too.
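A sketch of how this might be implemented on Oracle; the view definition, schema name and the business (X2) attribute names are illustrative:

-- Swappable subset: business customers and their exclusive attributes only
create view business_customer as
select customer_key, customer_id, customer_name,   -- common attributes
       industry_sector, employee_count             -- X2 business attributes
from   customer
where  customer_type = 'Business';

-- In the corporate sales users' schema, make it their default CUSTOMER
create synonym customer for dw.business_customer;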
Hot swappable dimensions can be used in place of each other without rewriting queries.

Swappable dimensions that have identical column names are referred to as hot swappable, because they can be used in place of each other without rewriting any queries. Hot swappable dimensions can be used to implement restricted access (row-level security), study groups, sample populations, national language translation and alternative CV/HV reporting views (see the Hybrid SCD pattern covered later in this chapter).
Figure 6-4 Swappable dimensions
Figure 6-5 CUSTOMER dimension with embedded who attributes
Figure 6-6 Modeling an embedded who as a separate HV dimension and a CV outrigger
Recursive Relationship
A who within a who of the same type is a recursive relationship.

Looking at the PARENT COMPANY examples in Figure 6-5, you can see that it contains companies that are present as customers in the CUSTOMER dimension. This represents a recursive relationship which would be drawn in ER notation, as in Figure 6-7, with a M:1 relationship between the customer entity and itself. The relationship documents that each customer may own one or more customers and each customer may be owned by one customer.
Figure 6-7 M:1 recursive relationship or “head scratcher”
Figure 6-8 BEAM✲ recursive relationship
Variable-Depth Hierarchies
Customer ownership is a classic example of a variable-depth hierarchy, best illustrated by an organization chart.

Customer ownership is a classic example of a variable-depth hierarchy. Some business customers will be self-contained or privately-owned companies, representing a hierarchy of only a single level. But other customers may be the top, middle, or bottom of a deep hierarchy of corporate ownership (stretching all the way to Liechtenstein or Delaware!). The Figure 6-9 organization chart reveals that all the Figure 6-8 example customers are ultimately owned by Pomegranate.
Figure 6-9 Variable-depth hierarchy
Profile the data to see if it represents a simple balanced hierarchy.

Because of the challenges involved in making recursive relationships report-friendly, the first thing to do, when you spot one, is to profile the data to see whether it actually represents a variable-depth hierarchy or not. If the data itself represents a balanced hierarchy with a fixed number of levels, or a hierarchy that is only “slightly” ragged, then the design can be kept simple by flattening (denormalizing) the data into a fixed number of well-named hierarchical attributes within a standard dimension.
Check that variable-depth is necessary and the hierarchy cannot be simplified, before implementing a complex solution.

If data profiling confirms that there is a variable-depth hierarchy, it is worth double-checking that the variable-depth is truly required for analysis purposes. If it is, and it cannot be simplified, then the following hierarchy map techniques will help, but they should also motivate you to, whenever possible, balance and fix the depth of all hierarchies that are under your control! For customer ownership analysis, there is no opportunity to simplify the hierarchies involved. You cannot tell a customer like Pomegranate that its ownership hierarchy is more complex than your other clients' and ask it to sort itself out. This hierarchy is external—beyond your control—and must be represented as is.
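Depth profiling is straightforward with Oracle's CONNECT BY syntax; the recursive source table and column names here are illustrative:

-- How deep is the ownership hierarchy, and how ragged is it?
select level as hierarchy_level, count(*) as members
from   customer_source
start with parent_customer_id is null
connect by prior customer_id = parent_customer_id
group by level
order by level;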
Solution
Hierarchy maps store variable-depth hierarchies in a BI-friendly format.

A hierarchy map is an additional table that resolves a recursive relationship by storing all the distant parent-descendent relationships it represents. Recursive relationships record only immediate parent-child relationships, whereas a hierarchy map stores every parent-parent, parent-child, parent-grandchild, parent-great-grandchild relationship, and so on, no matter how distant. Its structure is best explained by looking at the BEAM✲ diagram in Figure 6-10 which shows COMPANY STRUCTURE, a hierarchy map for the Figure 6-9 customer ownership hierarchy. It is documented as CV, HM to denote that it is a current value hierarchy map: it records only the current ownership hierarchy because it is based on the CV definition of PARENT COMPANY in Figure 6-5.
Figure 6-10 Hierarchy map table
Hierarchy maps explode all the hierarchical relationships (it's not a big bang).

The first thing you notice about COMPANY STRUCTURE is that it contains far more rows than the original CUSTOMER dimension. This may explain why the technique is sometimes referred to as a hierarchy explosion. But don't worry—it's not a very big bang! The row count is rarely an order of magnitude higher, and hierarchy maps are quite narrow—made up of a pair of surrogate keys and just a few useful counters and flags. Table 6-1 describes these attributes for COMPANY STRUCTURE.
A hierarchy map treats each dimension member as a parent and records all its child, grandchild etc. relationships.

COMPANY STRUCTURE contains 11 rows where Pomegranate is the parent: one for each subsidiary customer on the organization chart in Figure 6-9. Explicitly storing a relationship between all Pomegranate subsidiaries and their topmost parent makes it easy to answer any Pomegranate-related parent questions. If they were the only questions, these would be the only rows needed in the map, but to support fully flexible ad-hoc reporting to any level of ownership the map needs to contain additional rows where each of the subsidiary customers is treated as the parent of its own small hierarchy.
Lowest Subsidiary [Y/N] Flag: Y indicates that the Subsidiary Key is the lowest company in an ownership hierarchy; it is not the owner of any other customer.

Highest Parent [Y/N] Flag: Y indicates that the Parent Key is the highest company in an ownership hierarchy; it is not owned by any other customer.
PARENT KEY and SUBSIDIARY KEY in Figure 6-11 are documented as SK. They
contain company names for model readability (in true BEAM✲ fashion). The
physical database columns will contain integer surrogate keys.
If you know how many members there are at each level you can calculate the size of a hierarchy.

You can calculate the number of hierarchy map rows needed for a complete hierarchy by summing the number of members at each level times their level. For the data shown on the organization chart in Figure 6-9 that would be 1×1 + 3×2 + 3×3 + 2×4 = 24 rows. COMPANY STRUCTURE has three more rows to handle slowly changing customer descriptions for two customers (Pomegranate and PicCzar Movies) in the HV CUSTOMER dimension (Figure 6-8). They make the calculation 2×1 + 4×2 + 3×3 + 2×4 = 27 rows.
For the 11 Pomegranate related dimension members (only 9 customers but there
are 2 additional versions of the slowly changed customers) with 4 levels the
estimate would be 33. This simple formula always gives you an overestimate,
which is good! You will be pleasantly surprised when the map is populated.
Figure 6-11 Hierarchy map with Type 2 SCD surrogate keys
When HV customer attributes change, their new surrogate key values must also
be inserted into the company ownership hierarchy map, as new SUBSIDIARY KEY
values even if their ownership remains unchanged.
Figure 6-12 Using a hierarchy map table to rollup revenue to the parent customer level
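The Figure 6-12 rollup boils down to a join like the following (table and column names are illustrative): the map links the chosen parent to every one of its subsidiaries, whose keys in turn join to the facts:

-- Total revenue for Pomegranate and everything it owns, at any depth
select p.customer_name as parent_customer,
       sum(f.revenue)  as total_revenue
from   company_structure cs
join   customer   p on cs.parent_key     = p.customer_key
join   sales_fact f on cs.subsidiary_key = f.customer_key
where  p.customer_name = 'Pomegranate'
group  by p.customer_name;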
Descendent levels can be filtered using the LEVEL and LOWEST columns in the hierarchy map.

Queries can be further refined using COMPANY LEVEL and LOWEST SUBSIDIARY. For example:

To get the total revenue just for customers that are directly owned by Pomegranate, change the constraint to Company_Level = 2.

To get the total revenue for only the Pomegranate companies that do not own other customers, add Lowest_Subsidiary = 'Y'.
One of the strengths of the hierarchy map is that all of these questions can be
answered without knowing (or caring) how many subsidiaries or levels there are.
A CV hierarchy map such as COMPANY STRUCTURE that does not track parent
history is not symmetrical for query purposes if its matching dimension contains
Type 2 SCD surrogate keys. You cannot reverse the joins in Figure 6-12 and use it
to roll up all the historical revenue for the parents of a selected subsidiary, because the map only contains the current surrogate key values for each parent. If
there is a requirement to roll up historical parent facts using current subsidiary
descriptions a different version of the hierarchy map must be built that contains
the full history of parent surrogate keys.
Displaying a Hierarchy
A hierarchy map can be used to display a hierarchy by joining a parent view of a dimension to the dimension.

The example queries described so far use the hierarchy map to aggregate facts to the parent level. But the hierarchy map can also be used to display all the levels of a hierarchy on a report. To do this you join the customer dimension to the parent customer view through the hierarchy map, as shown in Figure 6-13. This gives you both a parent customer name and a (subsidiary) customer name to group on and display in your reports—allowing reports to display facts for each level in the hierarchy. However, to make sense of the hierarchy itself on such reports, the subsidiaries have to be displayed in the correct hierarchy sequence.
Figure 6-13 Using a hierarchy map to browse a customer hierarchy and report facts at the subsidiary level
Hierarchy Sequence
To display a hierarchy correctly the hierarchy map must contain a hierarchy sequence number that sorts top to bottom before left to right.

Sorting on company name would destroy the hierarchical order. But sorting by hierarchy level is no better, because this would display all the level 1 customers, followed by all level 2 customers, then all level 3 customers, and so on. You would not be able to tell which level 2 customer owns which level 3 customers. To solve this problem the hierarchy map needs a Sequence Number attribute that sorts the nodes in the hierarchy correctly, “top to bottom before left to right”, as shown in Figure 6-14. The Sequence Number can then be used to sort the descendants of a customer (top to bottom) ahead of the next customer (left to right) at the same level; i.e., it ensures that all the level 3 subsidiaries of a level 2 customer will be displayed before the next level 2 customer is displayed.
Figure 6-14 Hierarchy sequence numbers
Sort hierarchy reports by sequence number and indent using level.

The report in Figure 6-15 shows how you use SEQUENCE NUMBER with COMPANY LEVEL to display the hierarchy, by sorting down the page on SEQUENCE NUMBER and indenting across the page on COMPANY LEVEL. The following snippet of Oracle SQL shows how an indenting Company Name could be defined in a BI tool:
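-- Illustrative reconstruction: indent three spaces per level below the top
-- (LPAD of a zero-length pad returns NULL, which Oracle concatenates harmlessly)
lpad(' ', 3 * (Company_Level - 1)) || Company_Name as Indented_Company_Name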
This will indent each level 2 customer name by three spaces, each level 3 by six spaces, and so on. A level 1 customer would display on the left margin (indented by zero spaces).
Figure 6-15 Hierarchy report, indented to show subsidiary level
Consequences
To populate the sequence number column correctly you have to build the hierarchy map in hierarchy sequence order. This precludes the use of SQL techniques
that populate the table “a whole level at a time”. It also means that maintenance is
more complicated—if nodes are moved their sequence numbers and the sequence
numbers of many others around them need to be updated. Because this involves
complex coding it is often easier (time permitting) to rebuild (truncate and reload)
the hierarchy map than update it.
Use the HIGHEST PARENT flag to filter out partial hierarchies and avoid over-counting.

MegaHard £27M, and so on; i.e., a report showing total revenue for all the top level clients without listing any of their subsidiaries. This is where the HIGHEST PARENT flag (see Figure 6-10) is useful. By constraining on Highest_Parent = 'Y' a query will include only the full hierarchy for each topmost customer, and the revenue figures for each of its subsidiaries will be summarized only once.
For example SQL that handles subsidiaries distinctly while querying multiple parents, see The Data Warehouse Toolkit, Second Edition, Ralph Kimball and Margy Ross (Wiley, 2002), page 166.
Parent history is tracked by adding every HV parent key value to the hierarchy map.

Parent History: tracking changes to the HV dimensional attributes of a parent level; for example, when a parent company is reclassified or a manager's salary grade changes. This involves populating the hierarchy map with every surrogate key value for a parent, and adding new rows with the new parent key value for every level in its hierarchy every time a parent HV attribute changes.
Hierarchy history can be tracked by adding effective dates to each hierarchy map record.

Hierarchy History: tracking changes to the hierarchical relationships; for example, a parent company sells a subsidiary or an employee starts reporting to a new manager. This involves the effective dating of all the rows in the hierarchy map. New rows are added to the hierarchy map with the appropriate effective date when new children are added to a hierarchy, and end dates are adjusted on existing hierarchy relationships when they are changed or deleted. A change/move—for
Child history is tracked by adding every HV child key value to the hierarchy map. This should be the default for most hierarchy maps.

Child History: tracking changes to HV attributes of a child level; for example, the location of a subsidiary company or an employee's marital status changes. This involves populating the hierarchy map with every surrogate key value for a child, and adding new rows with the new child key value for every parent level above it, every time a child's HV attribute changes. A hierarchy map built from an HV dimension, such as COMPANY STRUCTURE, must at least track child history to correctly join to child level facts and roll up all their history.
Ripple effect growth can be manageable if a dimension contains a small number of small hierarchies.

Using an HV recursive key to track every parent or child change will cause a dimension to grow more quickly, but the technique is still viable if hierarchies make up a small amount of the data. For example, if only a small percentage of customers are owned by another customer (PARENT KEY is mainly NULL) and ownership hierarchies are typically only a couple of levels deep, the resulting additional growth would be manageable.
Figure 6-16 Recursive key ripple effect
Employee Dimensions
HV Employee dimensions are typically Type 2 SCDs.

After customers, employees are the next most interesting who for BI. Thankfully, because there are usually far fewer employees than customers, the Type 2 SCD technique can work well for tracking the majority of employee HV attributes. But employee dimensions are not without their challenges. More descriptive information may be known and recorded about employees and the departments they work in. If that information is tracked historically it can lead to additional BI requirements to analyze the organization, as represented by its employees, using current, previous, historical and year-end descriptions.
Chapter 8, “Human Resource Management”, of The Data Warehouse Toolkit, Second Edition, Ralph Kimball and Margy Ross (Wiley, 2002), covers many of the basic issues involved in supporting Type 2 SCD Employee dimensions.
What are the total expenses over the last 5 years for every employee currently based in the London office?

This question doesn't care where employees were based in the last 5 years; it only needs their CV location to filter on.
Solution
Create separate HV and CV swappable dimensions.

Create and maintain an HV version of EMPLOYEE using Type 2 SCD ETL processing to satisfy the default reporting requirements. Create a separate current value swappable dimension (CV SD) from the HV employee dimension. Figure 6-17 shows how a CV swappable version of EMPLOYEE can be defined by joining the HV EMPLOYEE dimension to a copy of itself constrained to current employee definitions. In the example every version of James Bond is joined to the current James Bond. The resulting CV SD dimension initially appears rather wasteful, containing 3 identical Bonds, but on closer examination you notice that each Bond has a different surrogate key value.
A CV hot swappable dimension can be built as a self-join view of an HV dimension.

The self-join in Figure 6-17 picks the historical EMPLOYEE KEYs and the current descriptive values. When this identical-looking EMPLOYEE dimension is used instead of the original HV version it will roll up all of Bond's facts from his 3 different eras (EMPLOYEE KEY 1010, 2099 and 2120) to a single location of London or a single status of “Widowed”. Because the CV and HV swappable dimensions are identically described they can be “hot swapped” for each other to change the temporal focus of a query without rewriting any SQL.
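A sketch of the Figure 6-17 self-join as a view definition; the descriptive column names are illustrative, and CURRENT_FLAG stands in for the CURRENT housekeeping column:

-- Every historical surrogate key, described by today's attribute values
create view employee_cv as
select hv.employee_key,        -- historical keys: 1010, 2099, 2120...
       cv.employee_name,
       cv.city,                -- ...all carry the current city and status
       cv.marital_status
from   employee hv
join   employee cv
       on hv.employee_id = cv.employee_id   -- business key self-join
where  cv.current_flag = 'Y';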
Figure 6-17 Defining a CV swappable dimension
CV swappable dimensions can be built as views but for better query performance
store them as real tables or materialized views. The small amount of “wasted”
space avoids having to do the self-join inside every query.
HV and CV swappable dimensions can be used in the same query to provide current and historical values.

The HV and CV swappable dimensions are not mutually exclusive; both can be joined to a fact table in the same query, to group or filter on current and historical values simultaneously. If this is a common requirement, you can build a more query efficient hybrid HV/CV dimension by selecting both the CV and HV versions of attributes and co-locating them in the same swappable dimension, to provide side by side attributes for easy comparisons and more ambitious queries; for example, a hybrid EMPLOYEE dimension would allow a query to group by HISTORICAL CITY while filtering on CURRENT CITY.
The self-join pattern can be used to create Year-End dimensions for “as at” reporting.

As well as creating CV swappable dimensions, the self-join view technique can be used to create Year-End dimensions for “as at” reporting; for example, a Year-End dimension for 4 April 2011 can be created by replacing the CV.Current = 'Y' constraint in the view definition with:

'4/4/2011' between CV.Effective_Date and CV.End_Date
Consequences
The CV swappable dimension is a “gold star” agile design pattern. As long as you
have tracked history for a dimension from day one, a CV view can be added at any
time, when CV reporting requirements emerge, without increasing ETL programming effort (if implemented as a materialized view) or rewriting any existing
queries, because the view is hot swappable.
Don't trust CV only requirements. Build HV dimensions and deliver CV views.

Having said that, it is often the case that CV reporting is the first choice. Stakeholders will define attributes as CV/HV rather than HV/CV (with the CV default first) because they initially want BI solutions to mimic the CV only perspective of existing operational reporting systems. Or perhaps stakeholders just define CV attributes because they simply cannot see a need for history—yet. Either way, unless you are dealing with a very large dimension (customer), you should default to HV ETL processing, hide the HV dimension and make a CV view available. That way, when stakeholders finally demand HV reports, you can simply swap views rather than reload the entire warehouse.
Solution
Implement separate CV and PV attribute columns.

CV/PV attributes are implemented by defining additional PV columns (also known as Type 3 SCDs). Figure 6-18 shows PREVIOUS TERRITORY PV1 added to the EMPLOYEE dimension. This is marked PV1 to link it to the current value TERRITORY attribute marked CV1. During ETL processing, whenever TERRITORY is updated its existing value is saved into PREVIOUS TERRITORY prior to storing the new value. PV attributes work well for small numbers of attributes that must be tracked but do not change frequently, because they only allow users to go back one version. Multiple PV attributes would allow for more versions; for example, TERRITORY LAST YEAR PV1 and TERRITORY 2YR AGO PV1, all linked to the current TERRITORY CV1, but they can soon become unwieldy.
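The ETL for a PV attribute is a single in-place update; the table and column names are illustrative:

-- Shift the current territory into the PV column before overwriting it
update employee
set    previous_territory = territory,
       territory          = :new_territory
where  employee_id = :employee_id
and    territory <> :new_territory;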
Figure 6-18 Implementing a previous value attribute
PV attributes can be used to hold initial or “as at specific date” values; for exam-
ple, INITIAL TERRITORY PV1 or YE2011 TERRITORY PV1.
Figure 6-19 HR hierarchy with a dotted line relationship
Figure 6-20 M:M recursive relationship
Solution
A multi-valued hierarchy map (MV, HM) is used to represent a multi-parent hierarchy.

M:M recursive relationships can be recorded in a multi-valued hierarchy map (MV, HM) simply by storing additional rows for the multiple parent relationships, but they will require additional attributes (Role Type and FTE) to describe the meaning and value of each parent-descendent relationship correctly. Figure 6-21 shows the multi-valued hierarchy map REPORTING STRUCTURE [CV, MV, HM] populated with all the employee relationships documented on the Figure 6-19 organization chart. The first notable thing about this hierarchy map is the number of Bond records. Hierarchy maps always contain more records for the lowest levels because they are repeated for all the parent levels above them. But in the case of Bond the number is inflated by his dual roles. The easiest way to understand why this occurs is to imagine that there are two Bonds, one directly under each of his managers. This gives you an idea of how the hierarchy map for a large organization with highly interconnected staff and a deep reporting structure can grow (especially if you were to track its history).
Figure 6-21 HR hierarchy map
Multi-parent who hierarchies are by no means limited to HR. The earlier customer
relationship hierarchy would also be multi-parent if it had to support fractional
company ownership or joint ventures. Family trees are multi-parent hierarchies!
Figure 6-22 Rolling up weighting factors
Distant relationship weighting factors are calculated by multiplying the weighting factors of the intermediate direct relationships.

In the recursive source data this change would create one new association record in the EMPLOYEE_EMPLOYEE table (Figure 6-20) for Bond-Moneypenny, and update the existing M-Moneypenny record to 50% FTE. In the unraveled hierarchy map more work is required, because records for each new distant relationship need to be inserted and all the existing distant relationships need to be updated with the appropriate weighting factors. Figure 6-22 shows how the new direct Bond-Moneypenny permanent relationship also creates a distant temporary relationship between Smiley and Moneypenny, with a weighting factor of 10%. This is calculated by multiplying all the weighting factors of the direct relationships between the two distant employees: 0.2 × 0.5 = 0.1, or 10%.
Figure 6-23 Updating a hierarchy map
Effective date, end date and a current flag should be added to HV, MV, HM tables.

Figure 6-24 shows an HV version of the REPORTING STRUCTURE hierarchy map that includes the effective dating attributes EFFECTIVE DATE, END DATE and CURRENT typically found in HV dimensions. These attributes allow BI users to browse both the current hierarchy (where CURRENT = 'Y') and any point-in-time hierarchy (e.g., where '31/3/2011' between EFFECTIVE_DATE and END_DATE). To understand how this hierarchy map records changes to an employee, a manager and their relationship, take a look at the timelines in Figure 6-25 for Bond, M and Bond's HR relationships during the first six months of 2011.
Consequences
With its effective dated relationships, the REPORTING STRUCTURE [HV] map
can be used to rollup employee facts using historically correct hierarchy descrip-
tions, but its join to fact tables is more complex than before. When the hierarchy
map did not contain historical relationships the join to SALES FACT was simply:
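Where Reporting_Structure.Employee_Key = Sales_Fact.Employee_Key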
Figure 6-24 HV hierarchy map with effective dating
Effective dating must be used to correctly join the hierarchy map to fact tables.

Now the multiple historically correct versions of the hierarchical relationships must be joined to the correct historical facts to avoid over-counting. This requires a complex (or theta) join involving the hierarchy map effective dates and the primary time dimension of the fact table:

Where Reporting_Structure.Employee_Key = Sales_Fact.Employee_Key
and Sales_Date Between Reporting_Structure.Effective_Date
    and Reporting_Structure.End_Date
To avoid this, the hierarchy map can be given its own surrogate key which is then added to HR fact tables.

This is likely to be a very expensive join. To get round this problem for HR fact tables that must be constantly joined to the hierarchy map, a surrogate key must be added to the hierarchy map, such as HR HM KEY in Figure 6-24. This surrogate key works like any other dimensional surrogate key to avoid effective dated joins. It can be added to any specialist HR fact tables to simplify the join to:

Where Reporting_Structure.HR_HM_Key = Salary_Fact.HR_HM_Key
This creates a new dependency: the hierarchy map must be built and updated before any HR fact tables.

Implementing this surrogate key would require the REPORTING STRUCTURE table to be built and updated ahead of the SALARY FACT table, like any normal HR dimension, because the HR HM KEY values must be ready before the fact table load begins. The simpler CV version of REPORTING STRUCTURE without its specialist surrogate key does not require this dependency and can be maintained independently of the facts.
Alternatively, large HR hierarchies could be split into a number of smaller hierarchies that can be tracked using surrogate keys.

An alternative approach, to avoid effective dating joins or a hierarchy map surrogate key, is to break the single organization hierarchy into a number of far smaller departmental hierarchies by removing the executive level(s) from the hierarchy map; in this case: Eve Tasks. The smaller hierarchies would be less susceptible to macro change from above and its resulting ripple effect, which would enable the employee surrogate key to be used to track all HV changes to employees and their HR relationships.
Figure 6-25 HR timelines
Product hierarchies are typically the most important hierarchies but they can often be ragged and difficult to conform.

Product hierarchies are important for BI reporting because businesses are often closely organized around them. Yet despite their importance, they may not be well designed from a BI perspective. Thankfully, product hierarchies are fixed-depth rather than variable-depth, but they can still be difficult to define. Established product hierarchies often represent the single biggest conformance issue for agile data warehouse design, because they have become ragged and full of conflicting definitions, through years of ad hoc growth and redefinition by many different departments.
Product “bill-of-materials” is another type of variable-depth hierarchy that the data warehouse may need to support.

Another challenge unique to what dimensions is the need to ask BI questions about what is going on inside a product or service. This is rare for who dimensions—unless you are dealing with medical data. For products, “what is going on inside” may be other products and services, in the case of product bundle sales, or components and parts, in the case of design and manufacturing. To answer questions about these, a data warehouse design must handle “bill-of-materials” information—and that is another example of a variable-depth hierarchy.
Ragged hierarchies have a fixed number of uniquely nameable levels. They can be implemented in a dimension by defining non-mandatory attributes for the hierarchy levels that have missing values.

Ragged hierarchies, as described in Chapter 3, can look similar to variable-depth hierarchies, but the important distinction that makes them easier to deal with is that they have a known maximum number of named levels. This means that they can be implemented in a dimension by simply defining the missing or unused levels as nullable. Figure 6-26 shows an example of a product dimension containing a ragged hierarchy (that matches the hierarchy chart of Figure 3-8). It contains the product “POMServer”, which does not have a subcategory, perhaps because it is the only product of its type. This simple “flattening” of the hierarchy, into a fixed number of columns within the dimension, is in stark contrast to the complexity of building a separate hierarchy map, but it can result in a “Swiss cheese” dimension, full of “NULL holes” that show up as gaps on reports. Even if these holes are filled in with the stakeholders' preferred label, such as “Not Applicable”, they can cause problems for drill-down analysis: all the “Not Applicable” values are grouped together and cannot be further drilled on. Also, it does not inspire stakeholder confidence in the data warehouse when BI applications use such a common level as a product subcategory and return “Mobile”, “Desktop”, and “Not Applicable”.
Figure 6-26 Balancing a ragged hierarchy
Balance slightly ragged hierarchies with the help of stakeholders, by asking them to fill in the missing values.

In most cases, the best approach is to balance a ragged product hierarchy by filling in the missing values with the stakeholders during a modelstorming workshop, as part of conforming these dimensional attributes. Where filling in the gaps with the stakeholders is not possible—stakeholders cannot agree on the appropriate new or existing values, or there are just too many missing values to tackle in the available time—there are three methods for automatically generating usable interim values for the missing levels (that will induce the stakeholders to create their own):
Temporary balancing can be achieved by filling in a missing level with values from the level directly above or below.

Top down balancing: A value is copied down into the missing level from the level directly above it. For example, the POMServer CATEGORY value “Computing” is copied into SUBCATEGORY.

Bottom up balancing: A value is copied up into the missing level from the level directly below it. For example, the POMServer PRODUCT TYPE value “Server” is copied into SUBCATEGORY.

Top down and bottom up balancing: Gaps are filled with new unique values created by concatenating the values directly above and below the missing level. For example, “Computing/Server” might be used to fill the SUBCATEGORY gap for “POMServer”.
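Each of these methods is a one-line expression in ETL SQL; for example, the combined top down and bottom up method, with illustrative column names:

-- Fill subcategory gaps with "above/below" values, e.g. 'Computing/Server'
coalesce(subcategory, category || '/' || product_type) as subcategory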
When a ragged hierarchy is discovered during data profiling, if only a very small
percentage (1-2%) of it is ragged (skips a level) this usually indicates errors in the
data rather than intentional design. The errors should be corrected and a simple
balanced hierarchy defined.
Solution
It is possible to attach a product description to the majority of page visits recorded on Pomegranate's websites, especially the online store pages. But not every page refers to products; some pages describe multiple products: whole product categories, subcategories, or specific brands. You can easily handle non-product pages by using the “special” zero surrogate key that represents “missing product”, as discussed in Chapter 5. In a similar way, you can use other “special” surrogate key values to help you describe the page visits that relate to the higher levels in the product hierarchy by designing a multi-level dimension.
A multi-level dimension (ML) contains additional rows that represent level values within its hierarchy.

Multi-level dimensions contain additional rows that represent all the multiple levels within their hierarchies that are needed to describe mixed-level facts. For example, a multi-level product dimension contains records for each product and additional records for each brand, subcategory, and category, if facts need all these levels. Figure 6-27 shows a multi-level PRODUCT dimension, denoted by the code ML, that contains example additional rows that represent entire categories (SKs -1 and -2), a subcategory (SK -3), and a brand (SK -4). The complete table would contain one additional row for every value at every level needed.
Figure 6-27 Multi-level Product dimension
Multi-level dimensions contain a LEVEL TYPE attribute which documents the meaning of each row.

Multi-level dimensions also contain an additional attribute, LEVEL TYPE, that documents the meaning of each row in the dimension. The majority of rows will be normal members that represent individual products (or employees, in the case of a multi-level employee dimension). Their LEVEL TYPE defaults to the name of the dimension itself, whereas the additional rows will be labeled after the level attribute in the hierarchy they represent. LEVEL TYPE is useful for ETL processes that manage the use of these additional records, and can also be used by queries that want to constrain on specific level facts only. LEVEL TYPE can be ignored by most queries that simply want to roll up all the facts to a particular level. For example, a query using PRODUCT [ML] could group by CATEGORY and count web page visits to get the total pages viewed for each category; the figures would automatically include pages for individual products, brands, and subcategories within each category, as well as pages for the categories themselves.
Do not use a multi-level dimension to describe fixed-level facts.

The additional flexibility of multi-level dimensions can be confusing, so they should never be used where their flexibility is unnecessary. Create separate single and multi-level versions of a dimension to make their usage explicit. If a star schema has a fixed level of dimensional detail, use normal (single-level) dimensions with no LEVEL TYPE attributes. The presence of a LEVEL TYPE in the star implies that facts are multi-level when that is not the case. If a fact table truly needs a multi-level dimension you should explicitly document it by marking the dimensional foreign key as ML in the fact table, as in Figure 6-28.
Figure 6-28 Documenting single and multi-level fact tables
Consequences
Never use a multi-level dimension to create facts with mixed meanings.

You should never use a multi-level dimension to change the meaning of a fact. For example, do not store target revenue at the brand level and actual sales revenue at the product level in the same fact table. Sales and planning are two very different business processes, two different verbs. How would you name and easily describe the resulting fact table? Even sticking to a single business process, do not store summary sales for a category in a product sales fact table. Performance enhancing summaries require their own aggregate fact tables (described in Chapter 8). If you used a multi-level dimension to store targets, summaries and actuals, the resulting revenue fact would not be additive across LEVEL TYPE. To avoid over-counting, every query would have to remember to constrain to a single LEVEL TYPE—a recipe for disaster. The multi-level product dimension works perfectly with the page visit fact table in Figure 6-28, because it does not change the meaning of the facts; they are all page visits whether they are for a product or a category. That is why dwell time and total pages viewed remain additive across LEVEL TYPE.
Figure 6-29 Bill of materials for a POMCar
Solution
A reverse hierarchy map joins to fact tables by its parent key and allows facts to be allocated to child levels.

A BOM can be represented by the PARTS EXPLOSION hierarchy map in Figure 6-30. This is a reverse hierarchy map which joins to product facts (and the product dimension) by its parent key (PRODUCT KEY), as in Figure 6-31, allowing the facts to be rolled down to or filtered on child components. It contains a SUB ASSEMBLY flag that indicates “Y” if a component is made up of other identifiable components, and QUANTITY, which records the number of components that go into the finished product. This is similar to a distant weighting factor in that it needs to be adjusted in the hierarchy map based on its parent quantities. For example, a single defense system contains 4 motion sensors, but a POMCar contains 2 defense systems, so the quantity of motion sensors it contains is 8.
Figure 6-30 PARTS EXPLOSION hierarchy map
Figure 6-31 Component Analysis
Consequences
Don’t try to use the PARTS EXPLOSION hierarchy map pattern to describe the bill
of materials for anything as complex as a real car, submarine or aircraft—unless
you are prepared for a very large table.
Summary
Mini-dimensions track historic values for very large volatile dimensions, like CUSTOMER, that
cannot use the Type 2 SCD technique. Volatile HV attributes are moved to a separate mini-
dimension to keep the size of the main dimension under control and the historical values are
related back to the main dimension via fact table relationships. Mini-dimensions typically band
high cardinality values to control their size and volatility and to provide better report row
headers and filters.
Snowflaking makes sense for very large dimensions when a large set of lower-cardinality,
seldom used attributes can be normalized into outriggers. The calendar dimension can be a
particularly useful outrigger for any dimension that contains embedded dates.
Swappable dimensions (SD) are used to break up large mixed type dimensions into specialist
subsets that are easier to use and faster to query. Swappable dimensions can be swapped into a
star schema in place of one another because they share a common surrogate key.
Hybrid SCD requirements for current value and historical value reporting are best handled by
creating separate hot swappable CV versions of HV dimensions. These CV dimensions can be
created as materialized views using simple self-joins of HV dimensions.
Hierarchy maps (HM) are used to store variable-depth hierarchies in a report-friendly format
and avoid recursive structures that cannot easily be queried by BI tools.
Multi-valued hierarchy maps (MV HM) are used to represent multi-parent hierarchies that are
typically stored in source systems as M:M recursive relationships.
Multi-level dimensions (ML) describe business events that vary in their level of dimensional
detail. A multi-level dimension will contain additional special value members that represent
higher levels in the dimension’s hierarchy.
Chapter 7: When and Where: Time and Location
Time is the most frequently used dimension for BI analysis.

Every business event happens at a point in time or represents an interval of time. Time is the primary way that BI queries group (“show me monthly totals”), filter (“show me sales for Financial Q1”), and compare business events (“How are we doing year to date, versus last year?”). That is why every fact table has at least one time (when) dimension.
Location dimensions and attributes are frequently used too.

Most business events occur at a specific geographical or online location. Many interesting events represent changes of location. Hence, a large number of fact tables have distinct where dimensions in addition to the location attributes that can be found in who and what dimensions, such as customer and product.
Time and location are separate dimensions but can affect one another.

Although when and where are separate dimensions, they can influence one another: time zones, holidays and seasons are all examples of location-specific time attributes that are affected by event geography. Similarly, analytically significant locations, such as the first and last locations in a sequence of events, are timing-specific location dimensions, affected by event chronology.
This chapter describes when and where patterns.

In this chapter, we describe dimensional design patterns for efficiently handling time and location; in particular, patterns for correctly analyzing year-to-date facts, and journeys—facts that represent changes in space and time, that are all about where and when.
Efficient date and time reporting Chapter 7 Design
Correct year-to-date analysis Challenges
Time zones, international holidays and seasons At a Glance
National language support
Trip and journey analysis
Time Dimensions
When details are modeled as physical time dimensions.
Every event contains at least one when detail, which should always be modeled as a time
dimension, rather than left as a timestamp in the fact table. But why do you need a time
dimension when you have datetime data types, date functions and date arithmetic built into data
warehouse databases and BI tools?

Physical date dimensions help to simplify the most common grouping and filtering requirements of
BI queries.
Descriptive time attributes, such as day of week, month, quarter and year, are constantly used
to group and filter the information on BI reports. Deriving them from raw timestamps in every
query is woefully inefficient and puts an unnecessary burden on BI users and BI tools, causing
mistakes and inconsistencies. Why decode the month or day of week every time they are needed,
when they could be stored once in a dimension and reused consistently and efficiently, like any
other dimensional attribute? Also, many commonly used time attributes—such as fiscal periods,
holidays, and seasons of the year—simply cannot be derived from timestamps alone because they
are organization or location specific.
Date and time of day are modeled as separate dimensions to match their dimensional usage, manage
their size, and make the time granularity of facts explicit.
Time is actually best modeled dimensionally by splitting it into date and time of day. This may
seem odd at first, but it does reflect how time is queried. Almost every query will group or
filter on sets of days (years, quarters, months or weeks). Many queries will do the same with
periods within a day (AM/PM, work shift, peak periods). But very few queries will use arbitrary
periods that span dates and times (e.g., "sales totals by periods of 2 days and 8 hours").
Financial queries are grouped by the date-related fiscal calendar, ignoring time of day
altogether, while operational and behavioral queries can group months of data together by time
of day to see peak and average activity levels. In recognition of these query schisms, when
details (logical time dimensions) should be implemented as two distinct and manageable physical
dimensions: a calendar date dimension and a clock time of day dimension, each with its own
surrogate key. Separating date and time like this also makes the time granularity of facts
explicit. If time of day is not significant (or not recorded) for a business event, its fact
table design simply omits the clock dimension and includes only the calendar dimension.
Calendar and clock are role-playing dimensions.
Figure 7-1 shows typical examples of Calendar and Clock dimensions related to an order fact
table. Each of these dimensions plays two roles: representing distinct Order and Delivery dates
and times. Although only one physical instance of each dimension will exist, BI tools should
present the two roles as separate dimensions using views that rename each time attribute based
on its role; for example, ORDER TIME, ORDER DATE, ORDER MONTH and DELIVERY TIME, DELIVERY DATE
and DELIVERY MONTH.
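A sketch of one such role-playing view (the CALENDAR column names here are illustrative, not
taken from Figure 7-1):

    CREATE VIEW delivery_date AS
    SELECT date_key      AS delivery_date_key,
           calendar_date AS delivery_date,
           month_name    AS delivery_month,
           year          AS delivery_year
    FROM   calendar;

A matching ORDER DATE view would rename the same columns with an "order" prefix.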
Figure 7-1
CALENDAR and CLOCK dimensions used to play two roles with ORDER FACT
The ORDERS FACT table in Figure 7-1 documents Delivery Time (duration) as a
fact. This is the elapsed time in hours taken to fulfill the order. This duration
would be difficult to calculate and aggregate using the separate time dimensions
alone, and is best stored as a fact.
Calendar Dimensions
A good calendar dimension should include all the date-related attributes that stakeholders need;
ideally, BI tools should never have to decode a date to provide a good report row header or
filter.
Calendar dimensions should support all the groupings of day, week, month, quarter, year, fiscal
period, and season that are needed as report row headings and query filters; for example,
CALENDAR in Figure 7-1 contains the commonly used calendar attributes: DAY (Sunday–Saturday),
DAY IN WEEK (1–7), MONTH (January–December), MONTH IN YEAR (1–12), and YEAR. It also contains
several "Overall" attributes such as DAY OVERALL and MONTH OVERALL. These are epoch counters
that increment for each new day, week or month from the earliest date in the data warehouse (the
epoch date). Overall values are used for calculating interval constraints that can cross year
boundaries, such as "last 60 days" or "last 4 weeks". The BEAM✲ excerpt of Pomegranate's
CALENDAR, in Figure 7-2, shows that the company has 13 fiscal periods per year, and that its
fiscal year runs February to January—not January to December. The full dimension would make all
of Pomegranate's calendar information available for reporting, so that BI users do not have to
decode any dates or remember which fiscal periods contain 29 days rather than 28, or even the
name of the current period.
Figure 7-2
Pomegranate CALENDAR dimension (excerpt)
The Data Warehouse Toolkit, Second Edition, Ralph Kimball, Margy Ross (Wiley,
2002) pages 38–41 provide further examples of useful calendar attributes.
Date Keys
Date keys are integer surrogates, but they should be in calendar date order.
Calendar dimensions, like every other dimension, should be modeled with integer surrogate keys.
But unlike other surrogate keys, date keys should have a consistent sequence that matches
calendar date order. Sequential date keys have two enormous benefits:

Date key ranges can be used to define the physical partitioning of large fact tables. Chapter 8
discusses the benefits of doing this.

Date keys can be used just like a datetime in a SQL BETWEEN join to constrain facts to a date
range.
Historic value hierarchy maps (HV, HM) that use effective dating to track history can use
sequential date keys (EFFECTIVE DATE KEY and END DATE KEY) rather than datetimes to improve
efficiency when joining to fact tables. For example:

WHERE Reporting_Structure.Employee_Key = Sales_Fact.Employee_Key
  AND Sales_Fact.Sale_Date_Key BETWEEN Reporting_Structure.Effective_Date_Key
                                   AND Reporting_Structure.End_Date_Key
This will join the historically correct version of the reporting structure HR hierarchy to the
sales facts. Joins like this are complex (or theta) joins that are hard to optimize and need all
the help they can get.
ISO keys are easy to generate.
ETL benefits: Date keys can be derived from the source date values directly, rather than with a
surrogate key lookup table. This can be significant when processing events that contain many
when details that all need to be converted into fact table date keys.

ISO keys are easy to read, which can be good (for ETL) and bad (for BI).
DBMS benefits: The readable ISO format makes it easier to set up fact table partition ranges.

BI benefits: None! BI queries should not use the YYYYMMDD format as a quick way of filtering
facts and avoiding joins to the Calendar dimension with its consistent date descriptions. Best
not to tell BI developers or users—keep this little secret between the ETL team and the DBAs.
Create a version of your CALENDAR dimension that is keyed on a date rather than
a surrogate key (as a materialized view). This can be useful as an outrigger that
can be joined to date data type dimensional attributes such as FIRST PURCHASE
DATE in CUSTOMER or HIRE DATE in EMPLOYEE. This allows them to be
grouped and filtered using all the rich CALENDAR attributes for very little extra
ETL effort. A date-keyed CALENDAR can also be useful for prototyping BI queries
using sample data that has not yet been loaded into fact tables and converted to
DATE KEYs. But it should never be used in place of a surrogate key-based
CALENDAR for querying fact tables.
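One way to build the date-keyed version is a materialized view over the surrogate-keyed
CALENDAR; a sketch (materialized view syntax varies by DBMS, and all names are illustrative):

    CREATE MATERIALIZED VIEW calendar_by_date AS
    SELECT calendar_date,  -- the natural date becomes the key
           day_name, day_in_week, month_name, month_in_year,
           quarter, year, fiscal_period, workday_flag
    FROM   calendar
    WHERE  date_key > 0;   -- exclude special rows for unknown/missing dates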
Spreadsheets, database functions, stored procedures, and ETL tools are all appropriate for
populating the calendar—any of these can quickly generate the standard calendar attributes from
any origin date. Search online for "date dimension generator" to find SQL code and spreadsheets
that you can reuse. Table 7-1 includes additional, less automated, sources for enriching the
calendar.
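As a flavor of what such a generator looks like, here is a minimal PostgreSQL-style sketch using
a recursive CTE (the date range and column list are illustrative):

    WITH RECURSIVE dates AS (
        SELECT DATE '2010-01-01' AS d
        UNION ALL
        SELECT d + 1 FROM dates WHERE d < DATE '2019-12-31'
    )
    SELECT TO_CHAR(d, 'YYYYMMDD')::INT AS date_key,   -- sequential, in calendar date order
           d                           AS calendar_date,
           TO_CHAR(d, 'FMDay')         AS day_name,
           EXTRACT(ISODOW FROM d)::INT AS day_in_week,
           TO_CHAR(d, 'FMMonth')       AS month_name,
           EXTRACT(MONTH FROM d)::INT  AS month_in_year,
           EXTRACT(YEAR FROM d)::INT   AS year
    FROM dates;

Organization-specific attributes such as fiscal periods, holidays, and seasons must still be
merged in from the kinds of sources listed in Table 7-1.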
Period Calendars
Do not use the standard day CALENDAR dimension with higher level periodic snapshots and
aggregate fact tables.
A day granularity calendar is not the only calendar you will need. Periodic snapshots and
aggregate fact tables hold weekly, monthly, or quarterly facts and will require rollup (RU)
calendar dimensions. Theoretically you could attach the CALENDAR dimension to these higher
granularity fact tables by using the last date of the period that the facts represent; for
example, a monthly sales snapshot could join to the CALENDAR using the last day of the month—the
date on which the snapshot was taken. However, this is not a good idea, because it does not
explicitly document the time granularity of the facts, and could lead BI users to incorrectly
believe they can analyze the monthly sales facts using any calendar attribute, including
day-level attributes like DAY OF WEEK or WORKDAY flag.
Month Dimensions
Monthly fact tables should use a MONTH dimension to make their time granularity explicit.
Instead of using a day calendar with monthly fact tables you should create a MONTH rollup
dimension similar to the one shown in Figure 7-3, and define monthly fact tables using MONTH KEY
foreign keys. This makes the granularity of these fact tables explicit, and limits queries to
using only the valid monthly attributes. You can build the MONTH dimension from the CALENDAR
dimension using a materialized view. MONTH KEY can be created using the DATE KEY for the last
day of each month: the MAX(DATE_KEY) if the materialized view groups by month.
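A sketch of such a materialized view (PostgreSQL-style; the column names are illustrative):

    CREATE MATERIALIZED VIEW month_dim AS
    SELECT MAX(date_key) AS month_key,   -- the DATE KEY of the last day of each month
           month_name, month_in_year, quarter, year
    FROM   calendar
    WHERE  date_key > 0                  -- skip special "unknown date" rows
    GROUP BY month_name, month_in_year, quarter, year;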
Even though CALENDAR and MONTH dimensions have different time granularities
they are still conformed dimensions because they use common attribute values:
they are conformed at the attribute level.
Figure 7-3
Period calendars
Some BI tools find it difficult to cope with separate day and month calendar tables and prefer
all common date dimension attributes to be defined using a single table. If this is the case,
having a MONTH KEY that matches the last day of the month DATE KEY can be useful. In that way,
BI tools that need to can use the CALENDAR dimension instead of MONTH at query time.
Offset Calendars
An offset calendar dates facts from a fact-specific origin date; e.g., policy facts are dated
from a policy start date.
Events such as insurance claims or policy payments can benefit from having their own specialized
calendar dimension in addition to the standard calendar. A POLICY MONTH dimension like the one
shown in Figure 7-3 would be used to offset the facts from the creation date or last renewal
date of the policy rather than January 1 or the first day of the financial year as the normal
calendar dimension would. For example, if a policy renews on April 1, an August claim fact for
that policy would be labeled as MONTH "August" or MONTH NUMBER 8 by CALENDAR but POLICY MONTH 5
by the POLICY CALENDAR.
An offset calendar, like POLICY MONTH, can be used in conjunction with a standard MONTH
dimension to define a MONTHLY POLICY SNAPSHOT with a granularity of POLICY by MONTH by POLICY
MONTH. This fact table will contain exactly twice as many rows as a standard monthly snapshot
but will allow the facts to be queried by either calendar or policy month, or a combination of
both.
Year-to-Date Comparisons
Problem/Requirement
What is the "year to date" date for valid comparisons with previous years?
To perform year-to-date (YTD) comparisons—such as YTD Sales 2011 versus YTD Sales 2010—the
following needs to be known about the date range:

The "from date" when the year began. This seems obvious, but are we talking about the beginning
of the calendar year, or the organization's fiscal year, or the tax year?

The "to date." Are you running the YTD calculation up to now or to some specific date in the
past? If you are defaulting to "up to now", what does "now" mean? Do you have complete data
loaded right up to today or yesterday?

Which days to include. Should YTD figures from previous years include facts up to the same "to
date" in those years, or the same number of days (this copes with the extra day caused by
February 29 in leap years)? If it is based on the number of days, is that calendar days or
workdays (for example, the same number of weekdays excluding public holidays)?
CALENDAR dimensions support YTD comparisons by providing conformed definitions of workday and
fiscal year.
The CALENDAR dimension can support consistent year-to-date (YTD) calculations by providing
conformed definitions for the beginning of each year (calendar and fiscal) and which workdays to
include. The attributes needed to do this are:

DAY (NUMBER) IN YEAR
DAY (NUMBER) IN FISCAL YEAR
WORKDAY IN YEAR
WORKDAY IN FISCAL YEAR
WORKDAY FLAG
You need to know when the YTD facts were last loaded to make valid comparisons with previous
years.
While these calendar attributes help tremendously, there is still the question of what date the
"year to date" should be. For data warehouses that are loaded nightly, common sense might
suggest a "year to date" of yesterday (SYSDATE – 1). However, not every business process runs on
the same schedule, and therefore not every fact table is loaded nightly. Some fact tables may be
loaded weekly, monthly, or on-demand when source data extracts become available—a common
requirement for external data feeds. This causes problems when trying to compare YTD figures for
this year with YTD figures for last year. YTD figures for this year may not contain data up to
yesterday, whereas the YTD figures for last year will contain data right up to yesterday minus
one year.
Because of ETL errors or "late-arriving data", you also need to know the last complete load for
YTD facts.
Even when fact tables are loaded nightly, they may not be loaded completely. ETL errors will
occur from time to time, and complete data will not be available for reporting until these
errors are fixed. It may also be quite normal for some ETL processes to encounter "late-arriving
data", where the complete set of events for a particular date will not be fully available until
several days (or weeks) after that date; for example, roaming call charges from international
mobile networks, or medical insurance claims submitted long after treatments were given.
Comparisons between the current year and last year are inaccurate whenever data is complete for
last year but the current year is still a work in progress.
Solution
Information about the status of each fact table—when it was last loaded and the last complete
day's worth of data it contains—should be stored in the data warehouse rather than in the heads
of ETL support staff or BI users. It should be available as data in a format that BI tools can
readily use.

A FACT STATE table holds the most recent load date and the last complete load date of each fact
table.
The FACT STATE table (shown in Figure 7-4) supports valid YTD comparisons by storing the recency
and completeness of each fact table in a format that can easily be used with the CALENDAR
dimension. It contains the most recent load date and the last complete load date of each fact
table. The most recent load dates should be updated automatically by all fact-loading ETL
processes. For ETL processes that are subject to unpredictable late-arriving data you may have
to manually set the LAST COMPLETE LOAD DATE.
Figure 7-4
FACT STATE table
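A minimal sketch of such a table (the column names are assumed from the description above):

    CREATE TABLE fact_state (
        fact_table_name          VARCHAR(64) PRIMARY KEY,
        most_recent_load_date    DATE NOT NULL,  -- updated by every fact-loading ETL run
        last_complete_load_date  DATE NOT NULL   -- last day known to be fully loaded
    );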
A FACT STATE table contains all the necessary YTD information, but it can be difficult to use
for BI queries.
To use FACT STATE information, you add FACT STATE to your fact table query and filter it on the
fact table name you are using. You can then use any of its attributes in place of a
SYSDATE-based calculation. Unfortunately, because the FACT STATE table is not "properly" joined
to any other table in the query, many BI tools complain about a possible Cartesian product. Even
if your BI tool doesn't complain, using FACT STATE in this manner can be confusing for both BI
users and developers, not to mention dangerous—if it is not properly constrained to the correct
fact table. To overcome this issue, you can provide the FACT STATE information as part of a
fact-specific calendar dimension.
Figure 7-5
SALE DATE: a fact-specific calendar with added FACT STATE information
A fact-specific calendar makes ETL load dates as easy to use as SYSDATE.
At first sight it seems wrong, or at the very least wasteful, to repeat the same FACT STATE
information on every row in the new calendar, but remember this calendar is still tiny by fact
table standards and now it is simple to compare its attributes to their equivalent FACT STATE
attributes. Because the fact-specific calendar will always be present in every meaningful query
involving its specific fact table, the MOST RECENT and LAST COMPLETE attributes can be used just
as easily as the DBMS system variable SYSDATE, without having to worry about constraining FACT
STATE on the right fact table or a BI tool (or developer) complaining about a missing join. For
example, to compare 2011 (the current year) YTD sales with 2010, based on the most recent load
date, a query might contain a constraint along the following lines (a sketch; the exact
attribute names are assumptions that mirror Figure 7-5):
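WHERE Year IN (2010, 2011)
  AND Day_In_Year <= Most_Recent_Load_Day_In_Year  -- the same "to date" in both years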
To select the last three complete weeks of facts, the constraint would be:

WHERE Week_Overall BETWEEN Last_Complete_Week_Overall - 2
                       AND Last_Complete_Week_Overall
Create a fact-specific calendar view for each fact table used for YTD analysis.
You should create a fact-specific calendar for each fact table that is used for YTD comparisons,
ideally as (materialized) views so that they will be updated automatically whenever the FACT
STATE table is updated. If a fact table has a single time dimension, its fact-specific calendar
can be given a unique role-specific name, such as SALE DATE (shown in Figure 7-5). If a fact
table has multiple date dimensions, each one must use the same (more generically named)
fact-specific calendar as its role-playing (RP) time dimension. It is possible for all
fact-specific calendars to share the same conformed dimension name if each one is defined within
a separate fact-specific database schema (that also contains its matching fact table). The
naming approach you adopt will depend on how your BI toolset qualifies tables when accessing
multiple star schemas simultaneously.
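A sketch of such a view, cross-joining the single relevant FACT STATE row onto every calendar
row (all names are illustrative):

    CREATE VIEW sale_date AS
    SELECT c.*,
           fs.most_recent_load_date,
           fs.last_complete_load_date
    FROM   calendar c
    CROSS JOIN fact_state fs
    WHERE  fs.fact_table_name = 'SALES_FACT';  -- constrained once, here, not in every query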
FACT STATE attributes should mirror calendar attributes to keep view building simple.
To help keep the SQL that builds fact-specific calendars simple, the YTD comparison attributes
within CALENDAR should be mirrored in FACT STATE; for example, if there is a QUARTER IN FISCAL
YEAR attribute in CALENDAR there should be a MOST RECENT LOAD QUARTER IN FISCAL YEAR and a LAST
COMPLETE QUARTER IN FISCAL YEAR in FACT STATE.
You can expand fact-specific calendars to hold additional Y/N indicator flags—
such as MOST RECENT DAY, MOST RECENT MONTH, PRIOR DAY, and PRIOR
MONTH—that are based on the MOST RECENT LOAD DATE. Some BI tools may
also find it useful to have a MOST RECENT DAY LAG column that numbers every
date in the calendar relative to the MOST RECENT LOAD DATE; i.e., the most
recent date is 0, the previous day is –1, the following day is +1.
The bold values above are derived from SYSDATE, MOST RECENT LOAD
DATE, LAST COMPLETE WEEK IN YEAR, and MOST RECENT LOAD WEEK
IN YEAR, respectively. FACT STATE tables can be expanded to hold additional
audit and data quality information, such as whether the latest facts have been
signed off or not. This information, too, is handy stuff to print in a report footer.
Clock Dimensions
A clock dimension contains time of day descriptions, typically at minute granularity.
A clock dimension contains useful time of day descriptions, such as Hour of Day, Work Shift, Day
Period (Morning, Afternoon, Evening, Night), and Peak and Off-Peak periods. Its granularity is
typically one row per minute, half hour, or hour of the day—whatever level of detail is needed
to provide the row headers and filters that BI users need. Figure 7-6 shows a typical CLOCK
dimension with minute granularity. It contains 1,440 rows—one for each minute in a day—plus a
zero TIME KEY row for unknown or not applicable time of day.

Time down to the second is best treated as a fact.
You should avoid defining clock dimensions with a granularity of one row per second unless there
really are useful rollups of less than a minute. For most business processes, time at the
precision of a second or less is not useful as a dimension (as a report row header or filter),
but it may be useful as a fact for calculating exact durations. Storing precise timestamps as
facts allows the time dimensions to remain small and concentrate on being good
dimensions—sources of good descriptions for report row headers and filters.
Figure 7-6
CLOCK dimension
CLOCK in Figure 7-6 is an HV dimension because work shifts and peak time can
change but their historical names and times must still be used to describe historic
facts. A standard CALENDAR is an FV dimension because date descriptions are
fixed and do not change. A fact-specific calendar is a CV dimension because it
must contain the current ETL status dates for its specific fact table.
Problem/Requirement

Clock attributes such as work shifts and peak periods can vary with the type of day: weekend and
holiday shift patterns differ from weekday ones. Does this mean that date and time of day must
be combined into a single dimension after all?

Solution

Thankfully not! Date and time don't have to be combined to solve this problem.
Time of day descriptions, like work shift or peak/off-peak, are seldom dependent on the actual
date (March 27 or March 28) but on the day type (weekday, weekend, holiday, or unusual day). You
can handle this level of variation in the CLOCK dimension by using the TIME KEY to represent
versions of a minute.

A DAY CLOCK contains a version of each minute for each day type; e.g., weekday, weekend.
Figure 7-7 shows a DAY CLOCK dimension, with a granularity of one record per minute, per day of
the week, per day type. It holds 14 versions of each minute—one for each day of the week, plus
an additional version for each day of the week when it falls on a holiday. This results in
20,160 rows in total. If CLOCK attributes vary only by weekday, weekend, and holiday then you
would need just three versions of each minute, cutting the table down to 4,320 rows.
Figure 7-7
DAY CLOCK dimension with weekend and holiday variations
Resist any temptation to combine CALENDAR and CLOCK dimensions into one.
The resultant dimension would be unnecessarily large and difficult to maintain,
having 525,600 records (365×1440) for each year at the granularity of minute.
Don’t even think about it down to the second.
Time of day attributes that vary based on actual dates can be handled by a seasonal or HV CLOCK
dimension.
If work shift start times, or any other CLOCK attributes, change on a specific date rather than
"on Saturdays", infrequent change can be handled by defining CLOCK as an HV dimension with a
Type 2 SCD TIME KEY. If date-specific change is occurring on a more regular basis it may be
seasonal; e.g., summer descriptions and winter descriptions. Check that values don't cycle back
before you treat them as normal HV changes that would grow the dimension year on year. You may
just need a few seasonal versions of a minute as well as day versions.
Clock dimensions that contain special versions of minutes require more complex TIME KEY ETL
lookups. To assign the correct TIME KEY to each fact, ETL processes must match on:

Day type, which can be looked up from the CALENDAR dimension.

Location type, which can come from an explicit where dimension such as STORE, or the implicit
where details embedded within dimensions such as CUSTOMER, EMPLOYEE, or SERVICE.

The current version of the minute, where CLOCK.CURRENT = 'Y', unless the ETL processes are
loading late-arriving facts and older versions of the time descriptions would be valid.
Time Keys
TIME KEYs are normal surrogate keys that are not based on time sequence; this allows them to
cope with change and variation when it arrives.
TIME KEY in Figure 7-7 is a normal surrogate key with no implicit time meaning. Unlike DATE KEY
it is not derived from time and is not in time sequence (though the first 1,440 are). By keeping
time keys "meaningless" you can start with a simple clock dimension and expand it (by creating
new rows) to cope with attribute variations as they arise. For example:

Time of day attributes that vary by location. For example, certain branch types may have longer
operating hours than others, or different TV channels may have different advertising slot names
and lengths.

Time of day descriptions may simply change. The standard attributes of time such as hour and
minute cannot change (unless everyone gets new decimal watches) and are defined as fixed value
(FV) attributes. But an organization may decide to change the start time of its peak service.
You can define the Peak/Off-Peak attribute as HV to preserve the peak/off-peak status of
historical descriptions. The TIME KEY can act like any other HV surrogate key and allow an ETL
process to create new versions of the minutes that are moving from peak to off-peak and vice
versa.
International Time
Problem/Requirement
International events must be analyzed by local and standard time.
To analyze global business events, a data warehouse needs to handle international time
correctly. For customer (or employee) behavioral analysis, local time of day, weekday status,
holiday status, and season are important, while an organization-wide standard time
perspective—irrespective of event location—is equally important for measuring simultaneous
operational activity and accounting for financial transactions in the correct fiscal period.

Converting between time zones is not trivial.
Regardless of how events are originally recorded—using local time or the standard time of a
central application server set to Greenwich Mean Time—converting between the two requires an
understanding of event geography, time zones, and "daylight saving" that is beyond individual
queries. Just how many time zones are there? It's not 24!
Solution
Overload the facts with additional time dimensions to provide dual time perspectives.
If standard organization and local customer time are important, the data warehouse should
provide both as readily available dimension roles to avoid inconsistent and inefficient time
zone calculations within reports. For consistency, a shared ETL process should perform all time
zone conversions, and the results should be used to overload international facts with additional
time dimension keys. Figure 7-8 shows how local time is modeled in a star schema—by overloading
a global sales fact table with extra date and time of day keys (LOCAL DATE KEY and LOCAL TIME
KEY) so that the CALENDAR and CLOCK dimensions can play the dual roles of Standard Sale Time and
Local Sale Time.
Figure 7-8
Sales fact table overloaded with local and standard time dimensions
Consequences
All dimensional overloading patterns require additional ETL processing and make fact tables
larger, but the trade-off is faster, simpler, more consistent BI queries.
A national calendar table holds geopolitical time attributes keyed on a combination of date key
and country, which can lead to over-counting.
However, if the data warehouse is expected to cover more than a few countries, you will need a
more robust solution. NATIONAL CALENDAR in Figure 7-8 attempts to solve the geopolitical
attribute problem by using a composite key of date and country to record holiday information for
each date and country combination as separate rows. Unfortunately, this design demands that BI
users and developers remember to constrain NATIONAL CALENDAR to a single country when querying
the facts, otherwise their answers will be overstated by the number of countries they "let into"
the query. For example, if NATIONAL CALENDAR holds holiday information for ten countries and a
busy sales manager forgets to correctly constrain the calendar, an ad-hoc analysis of holiday
sales revenue will be overstated ten times. The figures would be wrong even if the query filters
sales to just one branch for one holiday, because even a single sales transaction on that date
will be joined to, and over-counted by, the multiple countries that observe that holiday.
Commissioned sales staff may be happy with this oversight—few other BI stakeholders will be so
enthusiastic.
Country-specific calendar views are safer to use, but they limit analysis to one country at a
time; they are not a good match for international facts.
A safer solution for ad-hoc queries is to provide country-specific calendar views that pre-join
CALENDAR to NATIONAL CALENDAR constrained to a single country. BI users can then choose (or be
defaulted to) the most appropriate calendar view. Unfortunately, this solution limits analysis
to one country at a time, and even then, BI users must still take care to constrain the
geography of their queries to precisely match their chosen calendar, otherwise the geopolitical
time attributes they use will not actually match the facts. Country-specific calendar dimensions
are an international data warehousing anti-pattern: they do not match international fact tables.
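For reference, such a pre-joined country-specific view might look like this (a sketch; all names
are illustrative):

    CREATE VIEW uk_calendar AS
    SELECT c.*, nc.holiday_name, nc.season
    FROM   calendar c
    JOIN   national_calendar nc
           ON  nc.date_key = c.date_key
           AND nc.country  = 'UK';  -- hard-wired to a single country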
Solution
To overcome the "one country at a time" query limitation and prevent calendar and fact mismatch
you need a different calendar design that truly matches multinational fact tables. MULTINATIONAL
CALENDAR in Figure 7-9 looks remarkably like a standard calendar dimension, but it handles date
descriptions that vary geographically by storing multiple versions of the dates that have
varying descriptions, each with a unique DATE KEY; for example, Figure 7-9 shows the three
versions of March 17, 2011 needed to support the different combinations of SEASON and HOLIDAY in
the UK, U.S., South Africa, and Ireland on that date.
Figure 7-9
Multinational calendar dimension showing 3 versions of March 17, 2011
A multinational calendar uses a date key that represents a geopolitical version of a date to
match multinational facts.
But how do these multiple versions of a date behave in fact queries? The answer is "just like a
single version of the date" when you ignore multinational attributes. For example, all sales for
March 17, 2011 will roll up to a single line on a report if they are grouped solely on
CALENDAR_DATE. Only if sales are grouped by SEASON or HOLIDAY (attributes that vary
internationally) will the report contain any additional lines, which is exactly what you want.
In this way, the multinational calendar is similar to an HV employee dimension that uses
surrogate key values to represent historical versions of an employee, except here the surrogate
keys represent geopolitical versions of a date.

With a multinational calendar, simple queries can safely cross national boundaries.
The benefit of the multinational calendar is that it keeps both the model and queries simple
while handling the complexity of the geopolitical attributes. BI users are totally unaware of
the multiple versions of a date; they do not have to think about which national calendar to use,
their queries can cross national boundaries, and they can use whatever calendar attributes
interest them.
Consequences
BI user interfaces that provide date lists driven from a multinational calendar must
do a Select Distinct. But that should be the default for all value lists anyway!
ETL fact loading processes must know how to assign the correct DATE KEYs based on the when and
where details of business events. You also need to think carefully about how ETL processes
create DATE KEYs for multiple versions of a day in the first place.
Building version numbers into your date keys is a good idea even if your data
warehouse or data mart will never go international. You never know when an extra
version of a date will come in handy.
You can create a date version for every country, or just one for each variation on a date.
The number of date versions needed depends on your multinational business requirements. You can
create a date version for every country (200+). This might be appropriate if there are many
geopolitical attributes and the combination of possible values is greater than the number of
countries. Alternatively, if the only attribute that varies by location is HOLIDAY (Y or N),
then you need only two versions of a day: one for HOLIDAY = 'Y' and one for 'N'. Only one
version would be needed for any date that is globally a holiday or non-holiday. A financial
organization might use a calendar with six versions of each day, one for each of its global
markets.
Needing a date version for each country is unlikely, because many will share
common geopolitical attribute values. Create a single “00” standard version for
each day, and then add versions as needed when you encounter regional or
international variations.
Consequences
Because CALENDAR is the most commonly occurring role-playing dimension, it
is important to keep DATE KEYs small when modeling for multinational versions.
If you really need more than ten versions of a date and you have chosen
YYYYMMDD format date keys, adding a two digit version number will require an
8-byte integer. If you can live with ten versions or less—or use an epoch-based date
key—a 4-byte integer will suffice. Smaller date keys are always a good thing—
especially for larger fact tables!
International Travel
Events with pairs of when and where details are typically movements with interesting distance,
duration and speed measures.
To enable BI carbon footprint analysis, Pomegranate stakeholders have modeled the national and
international flights taken by their global sales and consulting force. The resulting EMPLOYEE
FLIGHTS event table (Figure 7-10) contains 6 event stories—6 flights taken by employee Bond
during July 2011. These are typical movement stories containing pairs of when and where details
that give rise to interesting when and where related measures, such as distance, duration and
speed, in addition to other explicit facts such as their associated costs; e.g., CO2 emissions.
Figure 7-10
Flight events for employee James Bond
Figure 7-11 shows the flight events modeled as a star schema using the CALENDAR, CLOCK, and
AIRPORT dimensions to play multiple roles of departure and arrival times and locations. This
design can easily be used to answer many of the stakeholders' questions.
The default from and to details may not answer the most important where questions.
ARRIVAL AIRPORT will tell you where airlines are flying employees to—but that's not quite the
same thing as where employees need to go. Figure 7-10 shows that Bond took three flights on July
18th, each with the REASON of attending a conference. He did not, of course, attend three
conferences in one day, nor did he actually have to go to Amsterdam or Minneapolis. He simply
chose that route from London to the one conference he needed to attend in Phoenix. Apparently
that routing had lower CO2 emissions per passenger than a direct alternative because it used a
larger, newer aircraft.
Figure 7-11
Flight star schema
The most interesting Bond’s first multi-flight journey can be worked out manually by browsing all his
where details are flights and spotting the short gaps between the connecting flights and the longer
typically the first and gap that precedes his flight on July 21—which represents a different journey from
last points in a Phoenix to New York for a consulting engagement. But getting a journey-level
journey perspective on all of the flights in a large fact table via BI queries is difficult, be-
cause it involves comparing pairs of flights in the correct order on a per-employee
basis. DW/BI designers don’t want to hear that a query is difficult.
If stakeholders use the prepositions "from" and "to" to connect where details to the main clause
of an event, it is an obvious clue that the event represents movement. Ask stakeholders for
related stories such as those in Figure 7-10 to discover if individual movement events are part
of a sequence that describes a greater journey from an origin to a final destination.
Solution
Overload every fact with the first and last locations within a meaningful sequence.
The FLIGHT FACT table, shown in Figure 7-12, has been modified to contain two extra airport
foreign keys, representing the journey origin and journey destination locations not found in the
original EMPLOYEE FLIGHT event details. With these additional AIRPORT roles, it suddenly becomes
trivial to answer questions about where frequent flyer employees are located (Journey Origin)
and where they really have to go (Journey Destination). These incredibly useful first and last
locations are hidden amongst all the flight information but can be found by applying a
time-based business rule: "all flights taken by the same employee, no more than four hours
apart, are legs of the same journey". This test would be difficult for BI tools using
non-procedural SQL but is relatively simple for ETL processes with access to full procedural
logic.
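Where the ETL itself runs in SQL, window functions offer one way to apply the rule; a
PostgreSQL-flavored sketch (all table and column names are illustrative):

    SELECT employee_id, flight_id, departure_ts, arrival_ts,
           SUM(is_new_journey) OVER (PARTITION BY employee_id
                                     ORDER BY departure_ts) AS journey_number
    FROM (
        SELECT employee_id, flight_id, departure_ts, arrival_ts,
               CASE WHEN departure_ts - LAG(arrival_ts) OVER w
                         <= INTERVAL '4 hours'
                    THEN 0 ELSE 1 END AS is_new_journey  -- a NULL gap (first flight) starts a journey
        FROM   flight_staging
        WINDOW w AS (PARTITION BY employee_id ORDER BY departure_ts)
    ) legs;

The first departure airport and last arrival airport within each employee/journey number group
then become the journey origin and destination keys.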
Figure 7-12
Flight fact table with dimensional overloading
First and last events often contain why and how details that describe the cause and effect of
all the movements within a sequence.
Often, the locations of first and last events represent something even more interesting than
additional where dimensions; they represent why and how. For example, the first web log entry
for a visitor arriving at a website contains the URL previously clicked on—usually a search
engine or banner ad. In which case it represents why the visit took place and contains referral
information, such as the advertising partner or search string. Similarly, the last URL visited
is significant because it can describe the outcome of the visit—how it went. For example, if the
last URL is a purchase checkout confirmation page then the visit was a successful sales
transaction and each click leading up to the purchase can be labeled as such.

Because timing-specific first and last locations are so significant they should be attached to
all the events in a sequence to help describe events more fully. Do this by overloading the fact
table with additional location foreign keys or brand new why and how dimension keys.
Consequences
Adding useful dimensions from related events is another example of dimensional
overloading that requires extra ETL processing and additional fact table storage. In
this case, ETL must make multiple passes of the input data to read ahead, decide
which events are related and then go back and load the facts with this extra infor-
mation. However, this is well worth doing, so that common BI questions can be
answered without resorting to complex and inefficient SQL.
If actual and scheduled dates vary by very little there may be no value in defining the actuals
as dimensions.
If stakeholders asked for flights to be summarized by the ACTUAL ARRIVAL DATE dimension rather
than the SCHEDULED ARRIVAL DATE dimension, it would make little difference to the answers they
saw, unless many flights arrive a day (or more) late. Even then, comparing the two sets of dates
dimensionally would produce skewed measures of airline performance; for example, a flight
scheduled to arrive at 23:59 on March 31st could be only two minutes late but would be reported
as arriving in a different fiscal quarter. In contrast, a flight scheduled to arrive at 8:55
a.m. could be just over 15 hours late and still roll up to the same day, when compared using
ACTUAL ARRIVAL DATE and SCHEDULED ARRIVAL DATE. It would appear that the actual arrival and
departure dates, separated from their time of day components, have no value as dimensions.
Creating and indexing additional foreign keys for them in the fact table would be a waste.
Actual timestamps make good facts that can be used to calculate additional delay and duration
facts.
However, the actual timestamp values themselves could be held in the fact table because they are
valuable for calculating delays that can be used to measure airline performance (perhaps
filtering to ignore two-minute delays but looking for anything over two hours). Better still,
the FLIGHT DELAY could be calculated during ETL and stored as an additive fact along with FLIGHT
DURATION, as shown in Figure 7-12. Both of these facts should be pre-calculated—rather than
force the BI users to perform the time arithmetic—especially because the timestamps involved are
in different time zones!
Fact tables can be usefully overloaded with facts calculated using the next event.
Figure 7-12 shows one more time-related fact called Layover Duration, which is the time spent at
the arrival airport (or city) before taking the next flight. This is an example of fact
overloading, again performed by ETL reading ahead and picking up details from the next related
event.
The actual departure and arrival dates do not make good additional time dimensions in this
particular example because they do not vary significantly from the scheduled dates—they are
usually the same date or one day later. For many other business processes where actuals do vary
significantly from targets or schedules, actual dates would make very useful time dimensions
indeed.
If BI users and developers require national language support for reporting element names while
constructing ad-hoc queries (for example, Italian users want to select "Mese Fiscale" and
"Motivo per il Volo" rather than "Fiscal Month" and "Flight Reason"), attribute name translation
should be handled by the BI tool semantic layer rather than database views. This keeps the SQL
or OLAP query definitions portable across borders.
Solution
Use hot swappable dimensions.
Instead, a more scalable design is to create separate hot swappable dimensions (SD) for each
language. Each language version would be identical in structure (identical table name, identical
column names, and identical surrogate key values) but with its descriptive column contents
translated as required. These language-specific dimensions would then be selected based on the
schema the BI user logs into. For example, Italian user IDs would default to the schema with
Italian versions of the PRODUCT and FLIGHT REASON dimensions.
Create separate hot swappable dimensions for each reporting language.
With this approach standard reports can be developed once and run unaltered (as long as they do
not filter on translated descriptions) in multiple offices with localized results. For example,
a CO2 footprint report in the London office that categorizes travel reasons as "Conference,"
"Consulting," and "Return Home" would display "Congresso," "Consulto," and "Ritorno a Casa" when
in Rome.
Using separate hot swappable dimensions for national languages means that you
can add new languages at any time without affecting the existing schemas and
reports. This allows you to deliver an agile solution with a single language initially
and then go global, without incurring technical debt.
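A sketch of the schema-per-language setup (PostgreSQL-flavored; the schema, table, and role
names are illustrative):

    CREATE SCHEMA italiano;
    CREATE TABLE italiano.flight_reason
        (LIKE public.flight_reason INCLUDING ALL);  -- identical structure, keys and indexes
    -- ...load translated descriptions, preserving every surrogate key value...
    ALTER ROLE italian_user SET search_path = italiano, public;  -- swap by default schema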
Consequences
When translating dimensional attributes, care must be taken to preserve their
cardinality; for example, 50 distinct product descriptions in English must remain
50 distinct product descriptions in French and Italian—so that reports contain the
same number of rows with the same level of aggregation when translated.
Preserve sort order and cardinality.
National language versions of a dimension sort differently. Cryptic business keys (BK) are often
stripped from dimensions if they are never required for display purposes. However, they can be
used (without being displayed) to provide consistent sort order when standard reports are
delivered in multiple languages.
Summary
Time is modeled dimensionally by separating date and time of day into CALENDAR and
CLOCK dimensions which should contain all the descriptive time attributes BI users need.
Period calendars, such as MONTH, are built as rollups of the standard CALENDAR. They are used to
explicitly define the time granularity of higher level fact tables.
Fact-specific calendars, built using ETL fact state information, are used to ensure valid YTD
comparisons.
International facts should be overloaded with additional time keys to support standard and local
time analysis.
Location-specific date descriptions and day-specific time descriptions can be handled by using
the time surrogate keys DATE KEY and TIME KEY to represent versions of a date or minute.
Journey analysis can be enhanced by overloading movement facts with additional location keys
and why and how dimensions based on the first and last locations in a meaningful sequence.
Separate hot swappable language-specific dimensions are used to support national language
reporting.
8
HOW MANY
Design Patterns for High Performance Fact Tables and Flexible Measures
This chapter covers techniques for incrementally designing and developing high-performance fact
tables and flexible measures.
In this chapter we describe how the three fact table patterns—transaction fact tables, periodic
snapshots, and accumulating snapshots—are implemented to efficiently measure discrete, recurring
and evolving business events. We particularly focus on the agile design of accumulating
snapshots, by describing how the requirements for these powerful but complex fact tables can be
visually modeled as evolving events using event timelines, our final BEAM✲ modelstorming tool.
We also describe the BEAM✲ notation for capturing fact additivity and fully documenting the
limitations of semi-additive facts, such as balances. We conclude with techniques for optimizing
fact table performance and multi-fact table reporting by concentrating on design patterns for
aggregates and other derived fact tables that accelerate and simplify BI queries.
Figure 8-1
Transaction fact table
Financial transaction fact tables often have an additional "book date" or applicable financial
period dimension to handle late-arriving transactions and adjustments. The generic version of
this is an audit date dimension, which can be added to any fact table to record when facts are
inserted.
Transaction fact tables are insert only, which speeds up their ETL processing.
Transaction fact tables are insert only because all the information about their transactional
facts is known at the time they are loaded into the data warehouse, and does not change—unless
errors occur. Even then, if the errors are operational rather than ETL, they are often handled
as additional adjustment transactions that must be inserted. This helps to keep ETL processing
as simple and efficient as possible—an important consideration when loading hundreds of millions
of rows per day. Although transaction fact tables can be extremely deep, they are generally
narrow—containing only the small number of facts captured by operational systems on any one
transaction.
Consequences
Transaction fact tables often need to be supplemented with snapshots for BI usability and query
performance.
Transaction fact tables are the bedrock of dimensional data warehouses. Because they do not
summarize operational detail, they provide access to all the dimensions and facts of a business
process. In theory, this means they can be used to calculate any business measure. However, in
practice—due to their size and the complexity of many business measures—they can't be used
directly to answer every question. For example, transaction fact tables are impractical for
repetitively calculating running totals over long periods of time. For efficiency, cumulative
facts, such as balances, are best modeled as recurring events and implemented as periodic
snapshots.
Periodic Snapshot
Periodic snapshots store regularly recurring facts.
Periodic snapshots (PS) are used to store recurring measurements that are taken at regular
intervals. Recurring measurements can be atomic-level facts that are only available on a
periodic basis (such as the minute by minute audience figures for a TV channel), or they can be
derived from more granular transactional facts.

Periodic snapshots can contain atomic-level facts but are typically used to hold measures
derived from more granular transactions.
Most data warehouses use daily or monthly snapshots to store balances and other measures that
would be impractical to calculate at query time from the raw transactions. For example, compare
the cost of calculating product revenue and product stock level for April 1st 2011 using
atomic-level sales and inventory transactions. Product revenue is calculated by summarizing that
one day's worth of sales transactions, whereas the product stock level calculation requires
every inventory transaction prior to April 1st 2011 to be consolidated. To efficiently answer
stock questions you need a periodic snapshot, such as STOCK FACT shown in Figure 8-2. This is a
daily snapshot of in-store product inventory that records the net daily effect of inventory
transactions, rather than the transactions themselves.
Figure 8-2
Periodic snapshot fact table
Periodic snapshots have fewer dimensions than transaction fact tables, but more facts.
Although periodic snapshots share many dimensions with their corresponding transaction fact
tables, they will generally have fewer of them—because some will be lost when transactions are
rolled up to a daily or monthly level. Periodic snapshots will typically have more facts than
transaction fact tables. Their design is more open-ended—limited not by what is captured on a
transaction, but only by the imagination of the BI stakeholders. Adding new facts to a
transaction fact table is rare—the operational systems would have to be updated to capture more
information. But periodic fact tables are more frequently refactored with additional facts as BI
stakeholders become more creative in defining measures and key performance indicators (KPIs).
Periodic snapshots are typically loaded on an insert-only basis.
Like transaction fact tables, periodic snapshots are typically maintained on an insert-only
basis. For example, daily stock levels for each product at each location, shown in Figure 8-2,
will be inserted into the STOCK FACT table at the end of each day. Most monthly snapshots are
maintained the same way—with new facts inserted at the end of each snapshot period (month).

Some monthly periodic snapshots can be updated on a daily basis, to improve ETL processing and
provide period-to-date measures.
However, for some monthly snapshots, such as a customer account snapshot for a bank, there are
benefits in updating them on a nightly basis:

Stagger the ETL Workload: If ETL processing waits until the end of the month it has to aggregate
a whole month's worth of transactions for each account. This makes the last night of the month a
particularly heavy night: if ETL fails, information for the whole of the last month will be
unavailable. However, if ETL is run nightly for the snapshot, it has only to insert or update a
day's worth of transactions for only the accounts that had activity on that day, and if it fails
the table is only one day out of date.

Provide Month-to-Date Facts: Although a monthly snapshot can be useful for trending historical
customer activity, it is on average 15 days out of date. If it contains an extra month-to-date
row for each customer account it can be used to support additional operational reporting
requirements.
Load monthly (and quarterly) snapshots on a nightly basis to improve ETL per-
formance and support period-to-date reporting.
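A sketch of such a nightly month-to-date update using an ANSI-style MERGE (all names are
illustrative):

    MERGE INTO account_snapshot s
    USING (SELECT account_key, month_key, SUM(amount) AS day_amount
           FROM   staged_transactions              -- yesterday's transactions only
           GROUP BY account_key, month_key) d
    ON (s.account_key = d.account_key AND s.month_key = d.month_key)
    WHEN MATCHED THEN
         UPDATE SET s.month_to_date_amount = s.month_to_date_amount + d.day_amount
    WHEN NOT MATCHED THEN
         INSERT (account_key, month_key, month_to_date_amount)
         VALUES (d.account_key, d.month_key, d.day_amount);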
Consequences
Many end-of-period measures are complex and time consuming to calculate from raw transactional
facts. If the necessary measures are already available from a reliable operational system it is
often better to load a periodic snapshot directly from an additional source rather than attempt
to reproduce the operational business logic with ETL processing by loading from a transaction
fact table.
Accumulating Snapshots
Accumulating snapshots store evolving events.
Accumulating snapshots (AS) are used to store evolving events: longer running events that
represent business processes with multiple milestone dates and facts that change over time. They
are so named because each evolving event accumulates additional fact and dimension information
over time, typically taking days, weeks, or months to become complete.

Accumulating snapshots are updated with ongoing event activity.
Unlike transaction fact tables, and most periodic snapshots, accumulating snapshots are designed
specifically to be updated. Facts are inserted into an accumulating snapshot shortly after
events begin and are updated whenever event statuses change. This leaves the fact table
containing the final status of every completed event and the current status of all open events.
Accumulating snapshots have multiple milestone time dimensions.
Figure 8-3 shows an accumulating snapshot for library book lending. It contains examples of
books that have been borrowed and returned (completed events), books that are overdue (evolved
events), and books that have just been borrowed (new events). LENDING FACT has multiple time
dimensions—like all accumulating snapshots—representing the milestones that a book loan can go
through. Only two of these (LOAN DATE and DUE DATE) are available when a loan is created.
Figure 8-3
Accumulating snapshot fact table
Accumulating snapshots can usefully contain duration and state count facts that match their
milestone time dimensions.
LENDING FACT also contains a duration (OVERDUE DAYS) and a state count (OVERDUE COUNT).
Durations are typical accumulating snapshot facts. If there are a small number of interesting
durations, they can be stored as explicit facts. If there are many possible durations because
there are a number of milestone dates, the fact table should physically store the milestones as
timestamp facts and BI applications should access it through a view that calculates the
durations. State counts are another characteristic of an accumulating snapshot fact. They
typically match the milestone dates and simply record a 1 if a milestone has been reached or 0
if it has not. They allow queries to quickly sum the number of events at each milestone in a
single pass without decoding dates or applying complex filters. LENDING FACT could be extended
with additional state counts for returned, lost and on-loan books.
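When the milestones are stored as timestamp facts, the view can derive the durations and state
counts; a PostgreSQL-flavored sketch (names are illustrative):

    CREATE VIEW lending_fact_v AS
    SELECT lf.*,
           GREATEST(COALESCE(return_date, CURRENT_DATE) - due_date, 0) AS overdue_days,
           CASE WHEN COALESCE(return_date, CURRENT_DATE) > due_date
                THEN 1 ELSE 0 END AS overdue_count  -- 1 if the overdue milestone was reached
    FROM   lending_fact lf;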
Consequences
Accumulating snapshots are difficult to build, especially when they merge events from multiple
source systems.
Accumulating snapshots that support end-to-end business process measurement are some of the most
valuable fact tables, and are very popular with stakeholders, but they can be extremely
difficult to build. Many ETL nightmares are caused by trying to merge multiple operational
sources and transaction types in one pass into the perfect accumulating snapshot. The code
involved is complex and difficult to quality assure, often resulting in delays. And when the
snapshot is finally delivered, while it may answer the initial questions perfectly, all too soon
stakeholders can hit a BI brick wall when they need to drill into missing details. This happens
because accumulating snapshots typically summarize a process from the perspective of the initial
event and only record the current status of the overall event. For example, an order processing
snapshot that summarizes deliveries for each order line would help to spot problems with
fulfillment performance, but would lack the delivery details needed to explain why the problems
are occurring.
The Data Warehouse Toolkit, Second Edition, Ralph Kimball, Margy Ross (Wiley, 2002) contains
four interesting accumulating snapshot case studies.
Granularity can be stated in business terms or dimensionally by listing GD columns

Granularity is documented in the model by recording the combination of granularity dimensions (GD) that uniquely identify each fact. For most transaction fact tables and accumulating snapshots the list of GD columns will include a degenerate transaction ID dimension; for example, a call detail fact table with a business granularity of "one row per phone call" can use a degenerate CALL REFERENCE NUMBER [GD] to uniquely identify each row. This succinct granularity definition is very useful for ETL processing, but for BI purposes it can be helpful if the granularity can also be defined using dimensions that are more likely to be queried—such as customer and call timestamp (assuming a customer can only make one call at a time). These alternative granularity definitions can be documented using numbered GD codes. For example, CALL REFERENCE NUMBER [GD1] and CUSTOMER KEY [GD2], CALL DATE KEY [GD2], CALL TIME KEY [GD2].
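One way to put GD definitions to work during ETL is to enforce them as unique constraints; a minimal sketch, with table and constraint names assumed:

-- Sketch only: both granularity definitions of the call detail fact,
-- expressed as unique constraints for ETL data quality checks.
ALTER TABLE call_fact
    ADD CONSTRAINT call_fact_gd1 UNIQUE (call_reference_number);
ALTER TABLE call_fact
    ADD CONSTRAINT call_fact_gd2 UNIQUE (customer_key, call_date_key, call_time_key);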
Adding milestone details is straightforward when there is a 1:1 relationship between events

Whichever route you take to it, modeling an evolving event involves adding multiple milestone details to an event table, as in the Figure 8-4 example which shows shipment and delivery milestones added to CUSTOMER ORDERS. Adding these milestone details is straightforward when there is a 1:1 relationship between all the events, because their granularity is unchanged by merging them; for example, if each order is associated with exactly one shipment, followed by exactly one delivery, all the details would naturally align and no information is lost by aggregating multiple events, or "made up" by allocating portions of events.
When an evolving event can have repeating milestones, the most recent or total milestone details are stored as part of the event

However, if an evolving event story can have repeating milestones (multiple occurrences of a specific milestone) there is a 1:M relationship between events, and something has to be done to bring everything to the same granularity. For example, if a single order line item for 100 units results in 4 staggered shipments from the warehouse that are then batched up in 2 deliveries by the carrier, the 4 shipment events and 2 delivery events need to be reduced to a single record to match the order event. The simplest way to align the multiple milestones is to record the totals for their additive quantities and the most recent values for all other details. For example, DELIVERY DATE and CARRIER, in Figure 8-4, hold the last delivery date and the last carrier (if more than one carrier was used) and DELIVERED QUANTITY holds the running total number of items delivered, so far, for each order line item.
Ask how many questions to discover repeating milestones

You discover the cardinality of milestones by asking how many questions about each milestone verb—these are hidden in the prepositions for the milestone details, particularly in the milestone when details. For the evolving CUSTOMER ORDERS event you would ask the following questions based on SHIP DATE:
When there is more than one ship date for an order, which
one will you use to measure the order process?
The best way to ask this type of question, about multiple milestone values, is to get
stakeholders to fill out the evolving event table with example stories.
Figure 8-4 Evolving orders event
If all the repeated milestone values are needed they must be modeled as discrete events

Typically, BI queries will use the most recent values for a repeating milestone, but if stakeholders say they need all the values then you will have to model the milestone as a separate discrete event at its atomic level of detail. If you have already done so you can point the stakeholders at its event table. Either way, you still want to push for a single value definition for each repeating detail so that you can add it to the evolving event. To help stakeholders to understand why the most recent value would be useful, remind them that the role of the new event table is to summarize the current progress or final state of each evolving event story.
If milestone events have a M:M relationship it may not be appropriate to combine them in the same evolving event

If stakeholders continue to struggle to give you a single definitive value for a detail, then it probably does not belong in the evolving event. This can happen where there is a M:M relationship between milestones and more complex allocations are needed, in which case it may not be appropriate to combine any of the details from the milestone. If the initial event and a milestone turn out to have a M:1 relationship, this is not so problematic, but some allocation of additive quantities will be needed. For example, if 2 different orders for 100 units are partially fulfilled by a single shipment of 190 units, a SHIPPED QUANTITY of 100 must be assigned to the first evolving order event and 90 assigned to the second.
Tell process stories that describe the typical, min and max intervals between milestones

If you determine that a milestone's detail belongs in the event, you should use its examples to tell interesting process stories. For milestone dates, ask stakeholders to give you examples that will represent typical, minimum, and maximum intervals between milestones. If a detail has already been modeled as part of a discrete event, you may be able to reuse values from its event table, but these must make sense in combination with the examples already present in the evolving event. For example, if you have used a relative time value like "Today" for ORDER DATE you might leave SHIP DATE as missing to show that initially there is no shipment yet for an order loaded into the data warehouse today.
Use missing values to describe the initial state of an event. You also want to describe its final states, completed or otherwise

As you add new details, you may have to alter some of the existing example dates to bring out interesting scenarios, such as the initial and final states of the event. The initial state will have missing values for all the milestone details that have not happened yet. This means some details that are mandatory in discrete events must become optional in the evolving event. For example, CARRIER is always present on a CARRIER DELIVERY event but will be "Unassigned" in the evolving CUSTOMER ORDERS event if an order has not been shipped yet. If there can be more than one final state, try to capture additional process stories for each possible outcome; for example, ask stakeholders to give you stories of successfully completed orders and cancelled orders.
When you have finished modeling an evolving event, it is a good idea to reorder
the details after the main clause in W and process order—keeping all the whens,
whos, whats, wheres, and so on together in the order they appear chronologically.
Doing so can make a complex evolving event much easier to read.
Event Counts
Event counts record the number of repeated milestones

When an evolving event has a 1:M relationship with its milestone events, define additional event measures—such as (number of) SHIPMENTS or (number of) DELIVERIES, in Figure 8-5—to record the number of aggregated/repeated events.
Figure 8-5 Event and status counts
State Counts
State counts record if an event has completed a milestone. They are useful because repeated milestones mean you cannot use milestone dates alone to evaluate progress

Each milestone date or embedded verb within an evolving event represents a state that the event can reach. Stakeholders will often have questions about how many orders, applicants, claims, etc. have reached a particular state. Answering these questions can be greatly simplified by adding state counts, such as SHIPPED and DELIVERED in Figure 8-5. These counts are 1 or 0 depending on whether a state has been achieved or not. They can be incredibly useful because state logic can often be more complex than you think; for example, you might imagine that count(DELIVERY_DATE) would be an efficient way to count order items that have reached the DELIVERED state but, due to partial deliveries, it's not quite that simple. Instead, you have to test that DELIVERED QUANTITY = ORDERED QUANTITY.
Event status business rules can become complex. They should be evaluated once
during ETL processing and the results stored as additive state counts to provide
simple, consistent answers for all BI queries.
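For example, a minimal sketch of how the DELIVERED state count might be evaluated once during ETL (table and column names assumed):

-- Sketch only: evaluate the state rule once and store the result as an
-- additive 0/1 fact, so BI queries can simply SUM(delivered).
UPDATE customer_orders_fact
SET delivered = CASE WHEN delivered_quantity = ordered_quantity
                     THEN 1 ELSE 0 END;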
Durations
Milestone timestamps can be used in pairs to create duration facts. These should be named by modeling them with stakeholders

The multiple when details of an evolving event can be used dimensionally like any other details (for grouping and filtering), but they can also be used in pairs to calculate the elapsed time between milestones. Some of these durations will be key measures of process performance. It's not always obvious which ones are significant or what they should be called by looking at the raw when details. You find out by using timelines to modelstorm the durations with stakeholders. You should also discover the appropriate unit of time measurement (day, hour, or minute), and the acceptable minimum and maximum intervals between events that can be used as alert thresholds to drive conditional reporting applications.
Event Timelines
Use event timelines to visually model milestones and durations

The best way to discover the important milestone when details and duration measures of an evolving event is to use an event timeline—like Figure 8-6. You should draw a timeline showing each of the milestone dates of an evolving event in chronological order, so that you can examine each milestone pairing visually, and ask stakeholders for business names for the intervals between them. The most important intervals are likely to have pre-existing names—a sure sign that they have business value and should be modeled as facts—but new and significant intervals can quickly be discovered and named in this way too.
You should try to get a name for each significant duration, but when you have several milestone dates you can end up with a lot of potential durations—too many to name in some cases. For n timestamps, the number of possible durations is n × (n – 1) / 2. So if an evolving event has six milestones, you have 6 × 5 / 2 = 15 possible durations.
Start by modeling the fixed points on the timeline: the initial when detail and any target dates

Typically, the most interesting durations will be those measured from the initial event date (Order Date) or from a target date (Delivery Due Date). Start by adding these fixed points to the timeline. These are the fixed value (FV) dimensions of the evolving event. With these in place, use the white space on the timeline to prompt stakeholders for the other milestone events and their chronology.
Model durations by naming the gaps between milestone events

Once you have all the events on the timeline (you may have copied the event sequence from your event matrix), you can then start discovering durations by pointing at the milestone pairs and asking stakeholders to name the gaps between them. Any meaningful duration you discover should be added to the timeline, as in Figure 8-6, which shows three important durations for an evolving order.
Figure 8-6 CUSTOMER ORDERS event timeline showing repeating milestones
Add durations to the evolving event table as derived facts, to document their UoM and range of values

After naming the durations on the timeline, add them to the evolving event table with example values. As mentioned in Chapter 2, you may question the wisdom of adding so many derived facts (DF) to the event table, but the event table is still a BI requirements model, not yet a physical accumulating snapshot design. Its purpose is to document the measures stakeholders will need, not dictate a physical structure. By adding a duration to the event table you are documenting its name, unit of measurement (UoM), and value range. You are not making a decision about how, if at all, it will be physically stored. Duration definitions can be implemented as database views or report items in BI tool metadata layers.
Number milestone dates DT1-DTn to reference them in duration definitions

Figure 8-7 shows three durations—PACKING TIME, DELIVERY TIME, and DELIVERY DELAY—added to the event as derived facts. Their definitions can be recorded using simple spreadsheet-like formulas by numbering the event milestones DT1 to DT4. For example, PACKING TIME is defined as DT2 – DT1: the interval between ORDER DATE [DT1] and SHIP DATE [DT2]. The DTn numbering can also be used to record the chronological order of the milestones.
Figure 8-7 Duration measures
All the durations within an event should be defined using the same unit of measure; for example, all [days] or all [hours]. This avoids errors when durations are compared or used in calculations.
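A minimal sketch of how such duration definitions might be implemented as a database view (table and column names are assumed; only PACKING TIME's DT2 – DT1 formula is taken from Figure 8-7, the other DTn pairings are illustrative):

-- Sketch only: durations derived from milestone dates, all in days.
CREATE VIEW customer_orders_durations_v AS
SELECT
    order_id,
    (ship_date - order_date)            AS packing_time_days,   -- DT2 - DT1
    (delivery_date - ship_date)         AS delivery_time_days,  -- assumed pairing
    (delivery_date - delivery_due_date) AS delivery_delay_days  -- assumed pairing
FROM customer_orders_fact;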
Timelines are part of the definition of an accumulating snapshot. They also make great training material for BI users

Timelines are to evolving events as hierarchy charts are to dimensions. Dimension tables need hierarchy charts to document the levels of their conformed hierarchies. Accumulating snapshots need timelines to document their event sequences and durations. Just as you can use relative spacing on a hierarchy chart to show relative aggregation, spacing on a timeline can show the relative durations of stages within a process, and highlight the most time-consuming events that must be carefully monitored. Timelines are an essential part of the training material for stakeholders who need to work with complex evolving events.
The Back of the Napkin, Dan Roam (Portfolio, 2008) Chapter 12, “When can we fix
things” contains some great ideas on drawing timelines to solve when problems.
Other chapters describe how to draw pictures to solve other 7Ws (who, what,
where, how many, why and how) related business problems.
Figure 8-8 Timeline dashboard
If milestone events have more complex relationships or are handled by different operational systems, accumulating snapshots should be developed incrementally

ORDER FACT [AS] has a more complex 1:M or M:M relationship between its milestones, which are handled by multiple sales and logistics systems managed by Pomegranate, its distributors and carriers. Attempting to populate this fact table directly is a high risk, "big development up front" strategy (another form of BDUF that you should avoid). It is unlikely that an accumulating snapshot with such complex sourcing issues could be successfully delivered in one or two normal length development sprints. So, while it's being developed, what would be demonstrated, or validated by stakeholder prototyping? Nothing? That's not particularly agile. Instead, each of the accumulating snapshot's milestones can initially be developed and delivered, on a more agile basis, as separate transaction fact tables.
Transaction fact tables are used to stage accumulating snapshot data, validate the design and deliver early BI value

If an evolving event is modeled retrospectively, you will already have all (or most) of the discrete event definitions for its milestones (you may have discovered additional milestones while modeling the evolving event). These are the blueprints for transaction fact tables that can stage the milestone events prior to merging them in the ultimate accumulating snapshot. If you don't yet have these, you can model them using the techniques described in Chapters 2-4 by pulling out their verbs from the milestone prepositions and asking: "who does each of these?"
Figure 8-9 ORDER FACT accumulating snapshot
Staged accumulating snapshot ETL processing will need to be streamlined if real-time DW/BI requirements exist

For real-time DW/BI, the latency introduced by staging each milestone in its own fact table first may prevent an accumulating snapshot being updated urgently enough for current day reporting requirements. If streamlining the ETL process becomes paramount and the milestone fact tables are not needed for queries, they can become un-indexed staging tables that are truncated at the end of every load cycle, or be replaced by ETL processes that act as virtual tables, piping their inserts or updates directly to the inputs of the accumulating snapshot process. If a real-time snapshot and queryable detailed fact tables are required, the staging tables can be implemented as un-indexed real-time partitions (covered shortly) that are fully indexed and merged with their fact tables by conventional overnight ETL.
Fact Types
Additivity describes how easy or difficult it is to sum up a fact and get meaningful results

If the most important property of a fact table is its granularity, the most important property of a fact is its additivity—which tells you whether or not its values can be summed to produce meaningful answers. This is important because stakeholders almost never want to see individual fact values. Instead they want to summarize them, and the easiest way to do that is to sum them. Facts are divided into three types based on their additivity: fully additive, non-additive and semi-additive.
Additive facts must use a single standard unit of measure

The first rule for defining an additive fact is to use a single unit of measure. For example, while modeling an event you may identify a quantity that is recorded in multiple currencies that are documented as [£, $, ¥]. The corresponding fact needs to be converted into a standard currency, otherwise the fact will not be additive across currency.
Store facts in a single unit of measure to make them additive and avoid aggregation errors. If BI applications need to view facts in different units of measure—e.g., report sales in local and standard currency, or product movements in shipping crates rather than retail units—provide conversion factors. These should be stored centrally in the data warehouse (as facts) rather than in the BI applications—because they can change.
Non-Additive Facts
Non-additive facts can never be summed to produce meaningful answers

Non-additive (NA) facts cannot be summed, even if they are in the same unit of measure. For example, UNIT PRICE cannot be summed to produce a meaningful total—even if all unit prices are recorded in dollars. Instead UNIT PRICE can be averaged, or used to create an additive SALE VALUE (UNIT PRICE × SALE QUANTITY) fact. BI users will likely want to use this additive measure more often than UNIT PRICE, so it should be stored in the fact table, and if storage is an overriding concern, the non-additive fact should be derived, at query time, by just the reports that need it.
Percentages are non-additive. Only their additive components should be stored as facts

Percentages are non-additive; two product purchases with a discount of 50% do not equate to a 100% discount. Because of this, percentages make terrible facts, but they do make great measures and KPIs that BI users will want to see on reports and dashboards. Facts like DISCOUNT should be stored as an additive monetary amount (as in Figure 8-1), allowing BI tools to calculate the correct percentages within the context of a report.
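For example, a report-level percentage calculation might look like the following sketch (table and column names assumed):

-- Sketch only: the percentage is calculated from additive components
-- within the context of the query, never stored as a fact.
SELECT product_key,
       100 * SUM(discount) / SUM(sale_value) AS discount_pct
FROM sales_fact
GROUP BY product_key;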
Non-additive facts can be aggregated using other functions such as min, max or average

Percentages and unit prices can easily be converted into additive facts, but other quantities cannot. These facts have to be clearly documented as non-additive along with their compatible alternative methods of aggregation for creating useful measures. For example, TEMPERATURE NA is a non-additive fact that can be aggregated using functions such as min, max, and average.
Semi-Additive Facts
Semi-additive facts are harder to work with than additive or non-additive facts

Additive facts are easy to work with—they can be summed with impunity. Non-additive facts require a little more creativity to aggregate, but after an appropriate measure formula has been found they too are relatively straightforward to deal with: you simply never sum them up. Semi-additive facts are more problematic.
Semi-additive facts can be summed but not across their non-additive dimension(s)

A semi-additive (SA) fact can be summed up some of the time but you can't sum it up all of the time. To be more precise: a semi-additive fact cannot be summed across at least one dimension: its non-additive dimension. For example, yesterday's STOCK LEVEL cannot be added to today's STOCK LEVEL. It is non-additive across the time dimension. But STOCK LEVEL is additive across other dimensions. It can be summed for all stores and/or all products (apples and pomegranates?) to give a correct total stock level, as long as the query is constrained to a single day—a single value of the non-additive dimension.
To fully document a semi-additive fact the SA fact code is used in conjunction with at least one NA dimension code

Semi-additive facts are fully documented by marking them as SA and their non-additive dimension(s) as NA. If there is a single semi-additive fact in a fact table, or if all semi-additive facts have the same non-additive dimension(s), this is sufficient. However, if there are multiple semi-additive facts with differing non-additive dimensions, the SA and NA codes are linked by numbering, to pair each SA fact with its NA dimension(s). For example, Figures 8-2 and 8-10 show the BEAM✲ table and matching enhanced star schema for STOCK FACT, a daily periodic snapshot of in-store inventory. Both show STOCK LEVEL SA1 is non-additive across STOCK DATE KEY NA1, whereas ORDER COUNT SA2 is non-additive across PRODUCT KEY NA2. This semi-additive fact documentation can be used to correctly define measures in BI tools and some multidimensional databases. SQL doesn't natively understand that some numbers are semi-additive, which can cause averaging and counting issues for the unwary BI developer.
Averaging Issues
Semi-additive facts can be averaged but not by using AVG( )

Although semi-additive facts cannot be summed over their non-additive dimensions, they can often be averaged (carefully) over them. Unfortunately the SQL AVG() function may not be up to the job; for example, suppose stakeholders ask for the average daily stock level of a product category, by region, for last week,
Figure 8-10 Periodic snapshot containing semi-additive facts
and the stock data is as follows: the product category "Advanced Laptop" contains two products, POMBook Air and POMBook Pro; the SW region contains 10 stores; each day last week, every SW store stocked 20 POMBook Airs and 60 POMBook Pros (let's keep it simple); and last week had 7 days (like every other week).
Periodic semi-additive facts, such as balances, must use a time average which divides the total by the number of non-additive time periods in the query

AVG(Stock_Level) will return 40, which is the wrong answer to the stakeholders' question. 40 is the average of 60 and 20, which is what you get when half the data has a value of 60 and the other half has a value of 20. The AVG() function—the equivalent of SUM(Stock_Level)/COUNT(*)—sums up 70 store/day records with 20 laptops and 70 with 60 (5,600) and divides by the number of records (140). To get the correct average for a category in a region, you must not divide by the number of products (in the category) or the number of stores (in the region). Instead you must divide only by the number of non-additive dimensional values (7 days). The correct SQL for this is: SUM(Stock_Level)/COUNT(DISTINCT Stock_Date). The correct answer is 800.
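A minimal sketch of the full time-average query (table, column, and filter values assumed):

-- Sketch only: a time average for a semi-additive balance divides the
-- additive total by the number of distinct non-additive time periods.
SELECT p.product_category,
       s.region,
       SUM(f.stock_level) / COUNT(DISTINCT f.stock_date_key) AS avg_daily_stock
FROM stock_fact f
JOIN product p ON p.product_key = f.product_key
JOIN store s   ON s.store_key = f.store_key
WHERE p.product_category = 'Advanced Laptop'
  AND s.region = 'SW'
  AND f.stock_date_key BETWEEN 20100705 AND 20100711  -- "last week", key values assumed
GROUP BY p.product_category, s.region;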
Counting Issues
Unique counts are semi-additive or non-additive facts

ORDER COUNT in Figures 8-2 and 8-10 is yet another example of a semi-additive fact that you must handle carefully. As long as queries are constrained to a single product, ORDER COUNT can be summed across days and locations to give a total number of unique orders. But if a query needs the total number of orders for the "Advanced Laptop" category it's in trouble, because it will over-count any orders that contain both POMBook Airs and POMBook Pros. Unfortunately, there is no way to get the correct answer from STOCK FACT.
Atomic-level fact tables are required to provide fully additive unique counts

When counts are stored in a snapshot they may not be as additive as you hope, often turning out to be semi-additive or non-additive when you try to sum them further. If so, the only way to calculate a correct unique count is to go back to the transactions and count them distinctly within the context of the query. The status counts SHIPPED and DELIVERED, in Figures 8-5 and 8-9, do not suffer from this problem because they count order item states uniquely at their atomic level of detail, whereas the event counts SHIPMENTS and DELIVERIES do, because they count shipments and deliveries aggregated to the order item level. If stakeholders want the total number of deliveries this month vs. last month they cannot get the answer from ORDER FACT [AS] using sum(Deliveries). Instead they need to use DELIVERY FACT [TF] to count(distinct Delivery_Numbers).
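A sketch of that distinct count against the transaction-level fact table (column names assumed):

-- Sketch only: unique deliveries are counted distinctly at the atomic level,
-- never summed from the order-level accumulating snapshot.
SELECT c.year_month,
       COUNT(DISTINCT f.delivery_number) AS deliveries
FROM delivery_fact f
JOIN calendar c ON c.date_key = f.delivery_date_key
GROUP BY c.year_month;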
Problem/Requirement
Heterogeneous products that have heterogeneous facts can give rise to inefficient "one size fits all" fact table designs

However, in certain businesses—such as banking—heterogeneous products will have heterogeneous facts: very different ways of being measured. This can make fact table designs that attempt to provide an integrated view of the business very inefficient. For example, Figure 8-11 shows a small portion of a monthly account snapshot that will allow all major product types (checking, savings, mortgage, loan, and credit card accounts) to be analyzed. Unfortunately this "one size fits all" fact table will be very wide and sparsely populated. The dimensional keys and a small set of core facts that measure all account types (ACCOUNT BALANCE and TRANSACTION COUNT) would always be present, but the majority of the facts are marked as exclusive (Xn) with their validity based upon the defining characteristic PRODUCT CATEGORY [DC]. These will be null most of the time, making the table "fact rich but data poor". Depending on the database technology used, the null facts may take up far less storage space than valid facts, but if there are hundreds of facts in total across all lines of business this design will still be extremely difficult to manage and likely to perform poorly.
Figure 8-11 Exclusive facts
Solution
Create a core fact table for cross-product analysis and custom fact tables for each exclusive fact set

Limit MONTHLY ACCOUNT SNAPSHOT to the common facts (and possibly a few frequently used specialist facts like INTEREST CHARGED and CHARGES) and create a small set of custom fact tables, one for each major product family, based on the exclusive fact sets as in Figure 8-11. The core fact table will contain a row for every account each month, and the custom fact tables—such as MONTHLY CHECKING FACT and MONTHLY MORTGAGE FACT shown in Figure 8-12—would contain rows for their account types only. The custom fact tables will contain the common facts, too—so that BI users don't have to query multiple fact tables.
Figure 8-12 Core and custom fact tables for exclusive facts
When heterogeneous products have many heterogeneous facts, even if they share
a common granularity, a monolithic fact table design may not be ideal. If the facts
come from different operational systems with different access methods and
maintenance cycles, separate core and custom fact tables will be easier to build
and maintain, and can be a better fit with BI user groups.
Figure 8-13 Factless fact table
Factless fact tables can be used as coverage tables to record dimensional relationships in the absence of other events

Factless fact tables are also used as coverage tables to track dimensional relationships in the absence of other events; for example, a promotion coverage table that records products on promotion—regardless of sales, or a monthly healthcare eligibility snapshot that records the fact that a person is covered by a medical plan that month. Coverage fact tables are often used in combination with transaction fact tables to answer questions about what didn't happen (but should have); e.g., "Which products were promoted but didn't sell?" or "How many people were covered but didn't claim?"
To answer the “Who didn’t attend but should have?” question about seminars, If the number of
there is a case for making SEMINAR ATTENDANCE FACT a normal fact table by non-events is not
adding an ATTENDANCE fact. This would be 1 if an invited prospect attends and too high, a 0/1 fact
0 for a “no show”. Normally fact tables don’t record events that didn’t happen, can be added to
because there are just too many of them. Airlines don’t record all the flights you count “what didn’t
didn’t take today—even if you are their best frequent flyer. But in the case of sales happen”
seminars Pomegranate didn’t invite the whole world, so the number of extra
records for invitees that did not attend would be manageable.
A dummy additive fact (equal to 1) can be added to support aggregate navigation

A dummy fact (always equal to 1) can be added to factless fact tables to provide an additive fact that can be summed. This makes it easier to build aggregates of large factless fact tables that can be used "invisibly" by aggregate navigation (see Aggregation later in this chapter). The aggregate will have the same fact but it will hold values other than 1. Also, some BI tools only recognize a table as a fact table if it has at least one fact.
Downsizing
Improve fact table performance by reducing row width

The first way to improve performance is to design fact tables that are as compact as possible without compromising their usability. The following checklist gathers together techniques for reducing fact table row width:
Use integer surrogate keys as dimensional foreign keys. Keep business keys in
dimensions.
Reduce the number of dimension keys—combine small why and how dimensions (see Chapter 9).
Move free text comments and lengthy sets of degenerate flags into their own
physical dimensions and replace them with short foreign keys (See Chapter 9).
Don’t store a large number of facts that can easily be calculated intra-record;
e.g., don’t store all the durations that can be calculated from a smaller number
of milestone timestamps.
Limit fact history to only the data that is useful for BI

The next thing to consider is the length of each fact table. You should try to limit history to what the BI users really need. Don't use fact tables as an expensive archival strategy. If the auditors need more history than the BI users, they should get that from the operational system of record, not the data warehouse. Regulatory requirements are not analytical requirements, so don't automatically load 20 years of transactional history just because it exists. If the business has changed substantially in that time, how far can queries go back and make valid comparisons? Also, the further back you go the harder it becomes to load the data, because data quality challenges tend to increase.
When modeling business events with stakeholders, ask for event stories describing the earliest when details that BI users will need to work with.
The most interesting data is the most recent. If you have years of history to load,
start with the current year and work backwards—partitioning can help to do this
efficiently. Don’t bother loading the oldest data until stakeholders ask for it.
Indexing
Create query indexes on foreign keys to support "star join optimization"

After you have done all that you can to control the size of a fact table, the next issue to consider is how to index it for query performance. Here you should seek your DBMS vendor's advice on defining some form of "star join index." This generally involves creating a bitmap index on each dimensional foreign key—but techniques vary by DBMS and by version, with new data warehousing index strategies being added all the time (we hope).
More query indexes can improve BI performance but slow down ETL

Whatever indexing techniques you use, there is inevitably a trade-off between query performance and ETL processing time. Your priorities should be heavily biased towards query performance—but BI users can only query what you can manage to load in the available time—so index thoughtfully!
Accumulating and period-to-date snapshots also need an ETL update index

In addition to query indexes, accumulating snapshots and period-to-date (PTD) periodic snapshots need an ETL index to support efficient updates. This will be an OLTP-style unique index using GD columns such as the ORDER ID degenerate dimension in CUSTOMER ORDERS. Transaction fact tables and most periodic snapshots are insert-only, so they do not require a unique index, as long as ETL processes can guarantee fact uniqueness.
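A minimal sketch of such an ETL update index (table and index names assumed):

-- Sketch only: an OLTP-style unique index on the degenerate GD column
-- lets ETL locate and update each open order's row efficiently.
CREATE UNIQUE INDEX customer_orders_etl_idx
    ON customer_orders_fact (order_id);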
If a fact is frequently used for ranking or range banding you should consider
indexing it to speed up sorting, and joining to a Range Band dimension (described
in Chapter 9).
Partitioning
Large fact tables can be partitioned on date key ranges

Partitioning allows large tables to be stored as a number of smaller physical datasets based on value ranges. If your DBMS supports table partitioning you should consider partitioning large fact tables on the surrogate key of their primary date dimension. Partitioning on date can be made simpler by carefully designing your calendar dimension surrogate keys (see Chapter 7, Date Keys for details). Partitioning has a number of benefits for ETL, query performance and administration (a DDL sketch appears at the end of this section):
ETL performance: Partitions with local indexes that can be dropped and rebuilt independently allow ETL processes to use bulk/fast mode loads into an empty partition while they are un-indexed. If only the most recent partitions of accumulating and PTD snapshots are being updated, unique update indexes (that are used for ETL, not queries) can be dropped on historic partitions. Partition swapping allows ETL to update the data warehouse while queries continue to run, enabling 24/7 BI access.
Fact table pruning: Many fact tables need a fixed amount of history (24 months, 36 months). Monthly partitions allow older data to be efficiently removed by truncating a partition rather than row-by-row deletion of millions of records.
Real-time support: Fact tables that need to be refreshed frequently throughout the day can be implemented using real-time "hot partitions". These are special un-indexed in-memory partitions that are trickle-fed from the operational source. During the day queries use these like any normal partition, and at night their data is merged with the fully indexed historical partitions.
Query performance: DBMS optimizers will ignore partitions that are outside of a query's date range (partition pruning), and some can read multiple partitions in parallel. But splitting a table into too many small partitions can also hurt performance, especially for broad queries that must "stitch" many partitions together. This can be avoided by creating aggregates to answer the broad queries.
Some DBMSs allow you to partition on more than one dimension. This can be
useful when a particular dimension is frequently used to constrain queries or
represents the way source data extracts are organized for ETL processing; for
example, by organization, geography, or data provider.
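The DDL sketch promised above (PostgreSQL-style range partitioning syntax; table, column, and key values are assumed):

-- Sketch only: a fact table partitioned monthly on its surrogate date key.
-- YYYYMMDD-style date keys (see Chapter 7) make the ranges readable.
CREATE TABLE sales_fact (
    date_key    INTEGER NOT NULL,
    product_key INTEGER NOT NULL,
    store_key   INTEGER NOT NULL,
    sale_value  NUMERIC(12,2)
) PARTITION BY RANGE (date_key);

CREATE TABLE sales_fact_2010_07 PARTITION OF sales_fact
    FOR VALUES FROM (20100701) TO (20100801);
CREATE TABLE sales_fact_2010_08 PARTITION OF sales_fact
    FOR VALUES FROM (20100801) TO (20100901);

-- Pruning old history is then a fast metadata operation:
-- DROP TABLE sales_fact_2010_07;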
Aggregation
Aggregates act as group by indexes for existing fact tables

An aggregate (AG) fact table is a stored summary of a base fact table. It acts like a group by index on the base facts—speeding up queries that do not need to return detailed figures. They are an essential complement to traditional where clause indexes. A star-join index optimizes highly constrained queries that need to summarize smaller quantities of data, whereas aggregates optimize broad, loosely constrained queries that need to summarize large quantities of data. Aggregates are derived fact tables that are very similar to periodic snapshots, dimensionally and in terms of granularity. They differ from periodic snapshots in that they do not provide any new facts. Instead, they simply contain summarized versions of the additive facts from base fact tables.
DBMS aggregate navigation automates aggregate usage

Historically, data warehouse queries were written to use specific aggregates in the form of summary data marts. Today, many DBMSs provide aggregate navigation that automatically redirects queries to the best (smallest) aggregate. When this happens, the aggregates are invisible to the BI users and query tools.
Small high performance aggregates can be designed using the lost, shrunken and collapsed patterns

Aggregates must be designed so that they match the GROUP BY and WHERE clauses of the most popular queries, or they will not be used. They must also be designed so that they are many times smaller than existing fact tables—to provide the performance improvements that justify the cost of maintaining them. Twenty times smaller is a useful guideline—which can lead to a corresponding query performance boost. The three types of aggregate design in a dimensional data warehouse are lost, shrunken, and collapsed.
Figure 8-14 Lost dimension aggregate
Lost dimension aggregates are the easiest aggregate type to build, because no dimensional joins are needed; for example, a lost aggregate can be built by simply grouping the base fact table by the dimension keys that are kept, as in the sketch below.
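A minimal sketch, assuming a base sales fact with customer, product, store, and date keys, where the customer dimension is the one being "lost":

-- Sketch only: the lost dimension aggregate omits the customer key and
-- sums the additive facts; no dimension table joins are needed.
CREATE TABLE sales_by_product_store_date_agg AS
SELECT product_key,
       store_key,
       date_key,
       SUM(sale_value)    AS sale_value,
       SUM(sale_quantity) AS sale_quantity
FROM sales_fact
GROUP BY product_key, store_key, date_key;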
Figure 8-15 Shrunken dimension aggregate
Aggregates can shrink and lose dimensions

Notice that the Customer dimension has been dropped. This is not uncommon for shrunken dimension aggregates—because dropping the most granular dimension is often needed to significantly reduce the aggregate's size. Sales by Month, Region, Product, and Customer would contain nearly as many rows as the base fact table—thereby negating its performance benefits.
Materialized views can be used to build shrunken aggregates and their matching rollup dimensions

Materialized views can be used to build aggregates and their matching rollup dimensions. When building rollup dimensions, you can reuse their base dimension keys to create rollup keys if you carefully select the first or last key value that matches the rollup dimension granularity. For example, you can use the DATE KEY of the last day of each month as the MONTH dimension's MONTH KEY, or use the STORE KEY of the first store in a region as the REGION dimension's REGION KEY. The actual surrogate key value selected does not matter, as long as it is used consistently in the rollup dimension.
Figure 8-16 Collapsed dimension aggregate
Aggregation Guidelines
The following guidelines will help you get a good set of aggregates in place:
Create aggregates that are approximately 20 times smaller than their base fact tables. Spread aggregates by designing aggregates of aggregates (400 times smaller than the base fact tables).
Design invisible aggregates that the DBMS will automatically redirect queries
to. Don’t allow BI users, reports, or dashboards to become directly dependent
on an aggregate. Hide them from query and reporting tools.
Trust base star schemas to handle highly constrained queries using star join
indexes—and focus aggregates on addressing broad summary queries.
Monitor aggregate utilization, drop those that are seldom used, and add new
aggregates as query patterns change.
Make sure that you initially build aggregates that will speed up comparisons
against budgets, targets, and forecasts. These are the most obvious quick-win
aggregates.
Figure 8-17 Querying multiple fact tables
Joining two fact tables – don't try this at home!

Because the two fact tables SALARY FACT and ABSENCES FACT share conformed EMPLOYEE and CALENDAR dimensions, it appears straightforward to join them using their common surrogate keys, as in the following query, sketched with assumed column names:
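-- Sketch of the problematic fact-to-fact join (column names assumed).
-- It looks valid, but the join fans out rows before GROUP BY sums them.
SELECT e.employee_name,
       SUM(s.salary_paid) AS total_salary,
       SUM(a.days_absent) AS total_absence
FROM salary_fact s
JOIN absences_fact a ON a.employee_key = s.employee_key
JOIN employee e      ON e.employee_key = s.employee_key
WHERE e.employee_id = '007'
GROUP BY e.employee_name;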
While the above SQL appears perfectly valid, it will not produce the correct totals for James Bond, or any other employee, even if the "007" constraint was removed.
Queries that attempt to directly join fact tables using single SQL select clauses can overstate the facts!

Report 3: 2011 Employee Analysis, in Figure 8-18, shows the results of the previous query—but first take a look at the two smaller reports that preceded it. Report 1 shows that employee James Bond has received three salary payments totalling £160,000. Report 2 shows that he has been absent 6 days. Now look at Report 3. It shows that James earned £320,000 and was absent 18 days. Something is clearly not right here: his salary has doubled and his absences have tripled!
Figure 8-18 Overstating the facts
Joining across a M:M relationship causes over-counting because SQL joins first, then aggregates the "too many rows" created by the join

This over-counting is known as the "many to one to many problem", "fan trap" or "chasm trap". It occurs when the tables being joined have a M:M relationship. SQL has to evaluate the WHERE clause, which performs the joins, ahead of the GROUP BY clause, so in the example the many Bond salaries (3) are joined to the many Bond absences (2), creating too many rows, which are then summed up. Even if the fact tables have a 1:M relationship, any facts from the 1 side of the relationship will be overcounted. This is an insidious problem because the aggregation that's inherent in most BI queries will hide the "too many rows". The only totally safe join between fact tables is when there is a 1:1 relationship. This is very rare and hard to guarantee. Even then, performance can be poor when millions of facts are joined.
Solution
Multiple fact tables should be accessed using drill-across queries that issue multi-pass SQL

BI applications can avoid the M:M problem by performing drill-across queries. Drilling across means lining up measures from different business processes using conformed row headers. A drill-across query does this by issuing multi-pass SQL: sending separate SELECT statements to each star schema. These separate queries aggregate the facts to the same conformed row-header level before they are merged to produce a single answer set. Drilling across would provide the correct answer to Report 3 by running a query to summarize salaries by Employee ID and another to summarize absences by Employee ID and then merging (full outer join) the two correctly aggregated answer sets.
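A minimal sketch of that drill-across logic expressed as a single statement (column names assumed):

-- Sketch only: each star is aggregated separately to the conformed
-- row-header level (employee) before the answer sets are merged.
WITH salary AS (
    SELECT employee_key, SUM(salary_paid) AS total_salary
    FROM salary_fact
    GROUP BY employee_key
),
absence AS (
    SELECT employee_key, SUM(days_absent) AS total_absence
    FROM absences_fact
    GROUP BY employee_key
)
SELECT COALESCE(s.employee_key, a.employee_key) AS employee_key,
       s.total_salary,
       a.total_absence
FROM salary s
FULL OUTER JOIN absence a ON a.employee_key = s.employee_key;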
Multi-pass SQL summarizes fact tables one at a time, then joins the results

Drill-across or multi-pass query support is a key feature of BI tools. It helps to manage query performance by keeping individual queries simple. By accessing fact tables one at a time, the queries can be optimized as star joins and take advantage of aggregate navigation. They may also be run in parallel by the DBMS.
Drill-across enables distributed data warehousing. Stars can be placed on different DBMSs

Drill-across also supports distributed data warehousing. You can scale a dimensional data warehouse by placing star schemas and OLAP cubes on multiple database platforms in multiple locations. Multi-pass queries allow these to be accessed as a single data warehouse. Distributed data warehouses can use different hardware, operating systems, and DBMSs for each database server—as long as they contain stars or cubes with conformed dimensions that can be queried by a common BI toolset using drill-across techniques.
As a general rule, fact tables shouldn't be directly joined. Most fact tables have a 1:M or M:M relationship, which results in the facts being overstated when measures are calculated. Instead they should be queried by drilling across.
Drill-across queries work well for summary-level process comparisons

Drill-across works very well when queries need to combine summarized facts; for example, when business processes are compared at a monthly or quarterly level, individual multi-pass queries will access millions of facts but the answer sets will be aggregated to conformed row-header levels before they are returned, and BI tools will only have to merge reports' worth of data—a few hundred rows.
Consequences
Drill-across queries become inefficient when cross-process analysis involves atomic-level comparisons

However, drill-across doesn't work for every type of cross-process or multi-event analysis. For example, Figure 8-19 shows the sad state of a BI user who is trying to compare orders and shipments. He is trying to ask questions such as "What was the average delay on shipping an order item over the last six months?" and "How many unshipped items are there YTD this year vs. last year?" but his queries never seem to finish, or perhaps even start. The problem is that these questions require individual line items from each transaction fact table to be compared before they are aggregated. This can result in multi-pass SQL that returns millions of rows that the BI tools must attempt to merge. Even when smart BI tools can construct the correct in-database joins, performance can still be poor.
Figure 8-19 Unhappy BI user: difficult drill-across analysis
Solution
Figure 8-20 shows what the user really needs: an orders accumulating snapshot
that can be queried using simple single-pass SQL. Following the agile approach to
Developing Accumulating Snapshots, outlined earlier in this chapter, this snapshot is
delivered as a derived fact table (DF), by merging the two existing order and
shipments transaction fact tables.
Figure 8-20 Happy BI user: derived fact table to the rescue
Derived fact tables solve difficult BI with simple ETL rather than complex SQL

Derived fact tables are built from existing fact tables to simplify queries. They use additional ETL processing and DBMS storage, rather than more complex BI and SQL, to answer difficult analytical questions. In addition to aggregates, there are three other types of derived fact table: sliced, pivoted, and merged.
A sliced fact table contains a subset of a base fact table

Sliced fact tables contain subsets of base fact tables; for example, UK sales derived from a global sales fact table. Sliced fact tables can support restricted row-level access and data distribution needs as well as enhanced performance for users who only need a subset of the data. They are often used in conjunction with swappable dimensions (SD) that contain matching subsets of dimensional values.
A pivoted fact table transposes base fact table rows into fact columns

Pivoted fact tables transpose row values in a base fact table into columns; for example, a fact table with nine facts derived from a base transaction fact table with a single fact that records nine transaction types. Pivoted fact tables make fact comparisons and calculations simpler. The same rows-to-columns approach can also be used to create bitmap dimensions (see Chapter 9) that support combination constraint queries. A sketch of the pivot follows.
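A minimal sketch of a pivoted fact table build (names assumed; three of the nine transaction types shown):

-- Sketch only: one additive fact column per transaction type.
CREATE TABLE account_txn_pivot_fact AS
SELECT account_key,
       date_key,
       SUM(CASE WHEN txn_type = 'DEPOSIT'    THEN amount ELSE 0 END) AS deposit_amount,
       SUM(CASE WHEN txn_type = 'WITHDRAWAL' THEN amount ELSE 0 END) AS withdrawal_amount,
       SUM(CASE WHEN txn_type = 'FEE'        THEN amount ELSE 0 END) AS fee_amount
FROM account_txn_fact
GROUP BY account_key, date_key;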
A merged fact table combines multiple base fact tables, summarized to a common granularity

Merged fact tables combine facts and dimensions from two or more base fact tables, summarized to a common granularity; for example, a fact table that combines targets with summarized actual sales, or an accumulating snapshot derived from milestone transaction fact tables. Merged fact tables simplify cross-process analysis by replacing complex drill-across queries and expensive joins with single star queries.
DF: used as a table code to identify a derived fact table constructed from one or
more existing fact tables. Used as a column code to identify derived facts that can
be calculated (possibly in a view) from other facts.
DW/BI retrospectives should re-examine the design periodically, to see if additional ETL or derived fact tables can simplify difficult queries

Data warehouse designs routinely fail to take full advantage of derived fact tables. Often there is a false impression that once the fact tables on the matrix have been loaded: "That's the data stored now, the major ETL development is done, and everything from here on out is BI". This can leave BI users and developers struggling to answer increasingly complex business questions. In all forms of agile development, project teams hold end of sprint meetings, known as retrospectives, to discuss what was successful and what could be improved. For DW/BI, retrospectives should include BI developers sharing their most common reporting complexities with the team to see whether these queries can be simplified by derived fact tables and other ETL enhancements.
Consolidated data marts are the periodic equivalent of accumulating snapshots

Merged fact tables are often referred to as consolidated data marts when they are used to combine and summarize facts from several different business processes on a periodic basis. These "one-stop shop" data marts are incredibly popular with stakeholders because they provide high performance fact access in a format suitable for simpler BI tools. Common consolidated data marts include:
Profitability data marts that combine revenue with all the elements of cost, to
support product or service profitability analysis.
Consequences
Don't attempt to build consolidated data marts before you have loaded atomic detailed star schemas

There is often pressure from business stakeholders to dispense with the details and build highly summarized consolidated data marts directly from operational sources to provide "quick win" key performance indicator (KPI) dashboards. Unfortunately, data marts that summarize many different business processes and consolidate multiple operational sources are literally the last thing you should build. Apart from the ETL risks, the lack of detail rapidly undermines confidence in the KPIs when users cannot drill down deep enough to explain the figures and view actionable information. Instead, consolidated data marts should be developed incrementally as derived fact tables: derived from atomic-level fact tables.
Summary
Transaction fact (TF) tables record the atomic-level, point-in-time facts associated with discrete
events.
Periodic snapshots (PS) provide additional atomic-level facts by sampling continuous business
processes and new aggregated facts by summarizing atomic transactional facts at regular
intervals.
Accumulating snapshots (AS) bring together the milestone events of a business process and combine their transactional facts to provide additional performance measures.
Both periodic and accumulating snapshots provide high performance access to measures that
would be impossible or impractical to calculate at query time, from atomic transaction fact
tables alone.
Apart from its type, the most important definition of a fact table is its granularity, which must
precisely state, in business terms or dimensionally (GD), the meaning and uniqueness of each
fact table row.
Event timelines are used to visually model the milestone events and duration measures of
evolving events that can be implemented as accumulating snapshots.
Accumulating snapshots that need to be sourced from multiple operational systems or contain
repeating milestones (with 1:M or M:M relationships) should be developed incrementally—by
first implementing transaction fact tables for their individual milestone events.
Fact additivity describes any restrictions on how a fact can be summed to produce a meaningful
value. Fully additive (FA) facts can be summed with no restrictions, using any combination of
available dimensions. Semi-additive (SA) facts must not be summed across their non-additive
(NA) dimension(s). Non-additive (NA) facts must not be summed.
Fact tables can be optimized by appropriate downsizing, indexing, partitioning and aggregation.
Cross-process analysis should be handled by drilling-across multiple fact tables one at a time
using multi-pass SQL or by building derived fact (DF) tables that merge commonly compared
fact tables.
WHY AND HOW
9
Dimensional Design Patterns for Cause and Effect
How am I doing?
— Ed Koch, Mayor of New York 1978–1989
Why and how dimensions are closely linked: they describe cause and effect

Some of the most valuable dimensions in a data warehouse attempt to explain why and how events occur. Why dimensions are used to describe direct and indirect causal factors. They are often closely linked to the how dimensions that provide all the remaining event descriptions that are not related to the major who, what, when and where dimension types. Together why and how represent cause and effect and complete the 7W dimensional description of a business event.
This chapter describes why and how dimension design patterns

In our final chapter we cover dimensional design patterns for describing how events occur and why facts vary. We focus particularly on bridge table patterns for representing multiple causal factors and multi-valued dimensions in general. We describe how bridge table weighting factors are used to preserve atomic fact granularity and avoid ETL-time fact allocations. We also describe how bridge tables can be augmented with multi-level dimensions and pivoted dimensions to efficiently handle barely multi-valued reporting and complex combination constraints. We conclude with step, range band and audit dimension techniques for analyzing sequential events, grouping by facts and handling ETL metadata.
Why Dimensions
Why details become causal dimensions that help to explain why facts occur in the way they do

The why details of an event become causal dimensions, such as promotion, weather, or just reason. Causal dimensions explain why business events occur when they do, in the way that they do. They describe what stakeholders believe are the influential factors for a business event; for example, price discounts driving up sales transactions, or storms triggering home insurance claims. Causal factors fall into two categories: direct and indirect.
Direct causal factors have a recorded influence on the facts

Promotional discounts are examples of causal factors that are directly related to the facts. You know with absolute certainty when they are or are not related to a sale because the promotional code (or discounted product code) and the discounted price are recorded or not recorded as part of the sale transaction.
Indirect causal factors may or may not have influenced the facts

Other causal factors—such as weather conditions, sporting events, or advertisement campaigns—are only indirectly related to facts. Stakeholders may know that these took place at the same time, in the same location as the facts they want to measure, but they can only speculate that they had an effect on them.
Causal factors can be internal: under the control of the organization, or external: beyond its control

Causal factors can also be described as external or internal. Weather and sporting events are examples of external causes that an organization has no control over (unless it is sponsoring the sporting event), whereas price discounts and advertising are examples of internal causal factors which the organization does control. Some internal causes—like seminars, sales calls and advertising—can be significant business events in their own right, and warrant dedicated fact tables to analyze their associated costs and activities by who, what, where, and when. In these cases causal dimensions may be conformed across multiple cause and effect star schemas that typically represent process sequences.
The special “No Promotion” record (PROMOTION KEY zero) will be the most used
record in the PROMOTION dimension, if most products are not on promotion
every day.
Figure 9-1 PROMOTION dimension
Indirect causal values are often more difficult to source than direct causal values

While a PROMOTION dimension may be a small dimension, with only a few hundred promotional condition combinations, it can be challenging to build and assign to the facts because of its mix of direct and indirect causal factors. Direct causal factors are usually straightforward to assign because they are captured by the operational system, but many of the interesting indirect causes may not be. For example, a sales system will not (reliably) record whether discounts are also promoted by TV ads or special (in-store or on-website) product displays, because this information is not needed to complete each sale transaction and print a valid invoice/receipt (which must show any direct discount details). A richly descriptive promotion dimension will require this information to be sourced from elsewhere—typically from less formal data sources, like spreadsheets and word processing documents—and its ETL processing will need to be sophisticated enough to assign the full combination of promotional conditions correctly.
The DW/BI team may have to build small data entry applications to capture causal
descriptions and timetables when this information is “known to business but not
known to any system.”
If BI users need to analyze promotion return on investment (ROI), the data warehouse will need an additional Promotion Spend fact table, using the same conformed PROMOTION dimension. BI users can then run drill-across queries against both PROMOTION SPEND FACT and SALES FACT to compare promotion costs to sales revenue uplift.
Figure 9-2: COMMENT dimension
In the absence of structured why details, a simple COMMENT dimension will still
allow BI users to search events using causal keywords and display comments on
reports when they find exceptional transactions. COMMENT dimensions can be
improved in future iterations by adding additional attributes and using "text-mining" ETL routines to tag comments.
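For example, a BI user hunting for weather-related claims could run a simple keyword search; a sketch (the fact and comment table and column names are illustrative):

SELECT comment_text, claim_id, claim_amount
FROM claim_fact, comment_dim
WHERE claim_fact.comment_key = comment_dim.comment_key
AND LOWER(comment_text) LIKE '%storm%'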
Figure 9-3: WEATHER dimension
Multi-Valued Dimensions
One of the challenges of causal dimensions—especially with external indirect causes—is that there may be more than one cause for any given fact. For example, Figure 9-4 shows an EVENT CALENDAR table that documents several sporting events that may have influenced product sales in July 2010. This table makes it easy for BI users to answer the question "How much did we make during the World Cup?", because they don't have to remember the dates, just pick the single event from a drop-down list. As such, this table is WHERE clause friendly, and could be used to store other event types with dynamic date ranges for which consistent date range filters would be useful; for example, "business" events like "Last 60 days, Current Year", "Last 90 days, Current Year" and the same ranges from the previous year.
Figure 9-4: Event Calendar
Solution
For BI users who need to group by multiple events, Figure 9-5 shows an alternative version of the sporting schedule that is more GROUP BY friendly. EVENT DAY CALENDAR stores each (sporting) event and date combination—a 14 day event like Wimbledon will be stored as 14 rows. This may appear a little wasteful but it has two benefits:

– Each date/event combination can be given a weighting factor to allow facts to be allocated amongst the multiple events that occur on the same day.
– It simplifies the fact table join—this is now a simple inner join on the single EVENT DATE KEY, just like a standard CALENDAR dimension, instead of a BETWEEN join on START DATE KEY and END DATE KEY.
Figure 9-5: Event Day Calendar
If you take a look at the example data in Figure 9-5 you will see that the weighting factors for any one date add up to 1 (100%). For example, on June 21 2010 both the World Cup and Wimbledon are taking place, so they both receive 50% of the sales activity by giving them a weighting factor of 0.5 (50%). On July 14 2010, by contrast, the Tour de France is the only significant sporting event taking place (perhaps this is the only event that Pomegranate has sponsored that day), so it gets a weighting factor of 1 (100%). Now when sales are grouped by EVENT, sales revenue can be "correctly weighted" by multiplying each atomic revenue fact by the sporting event weighting factor for the day it was recorded, as shown in the SQL below:
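A sketch of the query (physical table and column names are illustrative, following Figure 9-5):

SELECT event, SUM(revenue * weighting_factor) AS weighted_revenue
FROM sales_fact, event_day_calendar
WHERE sales_fact.date_key = event_day_calendar.event_date_key
GROUP BY event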
Consequences
Of course these may not be the "correctly weighted" figures at all—if a business sells more tennis rackets than soccer balls, the Wimbledon/World Cup split should be quite different. Allocation is usually problematic because different stakeholder groups have different ideas about how the atomic facts should be split. However, the one thing no one can argue about is the weighted total. If the weighting factors always add up to 1 for any day, the grand total for all the days covered by a report will be correct—so long as no events are filtered out.
Useful impact reports can be constructed by querying both the weighted and unweighted versions of the facts. The unweighted facts can be displayed in the body of the report for each row header; for example, World Cup $30M and Wimbledon $10M. The weighted facts can be aggregated within the BI tool to produce a correct grand total for the report; for example, $30M for the two events (because they completely overlap).
By working through the 7Ws you discover that a DOCTOR (who) claims an
amount for a TREATMENT (what) given to a PATIENT [Employee] (who) on a
specific TREATMENT DATE (when), as shown in the BEAM✲ event table, Figure
9-6. These are all single-valued details that convert readily to dimensions. But a
problem arises when you come to the why question:
When you ask for examples for the resulting DIAGNOSIS why detail, you discover that a claim contains multiple diagnoses—there is typically more than one thing wrong with a patient—and the diagnosis codes (ICD10 codes) submitted as part of every claim are not linked to the specific treatments.
Figure 9-6: MEDICAL TREATMENTS event table containing group themed examples
You capture this business knowledge about multiple diagnoses by marking DIAGNOSIS as MV to denote a multi-valued detail. Generally you discover MV details by getting stakeholders to tell group themed stories. You do this after they have told all their other themed stories (typical, different, missing, repeat) by pointing at each detail and asking stakeholders if they can give you an example of the same type of event that would contain groups of that detail; e.g. a group of customers or a group of products. For most events and most details they won't be able to, because multi-valued details (and multi-level (ML) details, which you also find this way) are the exception rather than the rule—thankfully.
Solution
Fact allocation problems can be avoided by leaving the fact table granularity unaltered, and using a multi-valued bridge table (MV) instead, to resolve the M:M relationship. For example, DIAGNOSIS GROUP [MV], shown in Figure 9-7, can be used to join unaltered claim facts to a DIAGNOSIS dimension. It does this by storing the multiple DIAGNOSIS KEYs of a claim as separate rows of a diagnosis group, each with a now familiar WEIGHTING FACTOR. Diagnosis groups are created and assigned a surrogate key (DIAGNOSIS GROUP KEY) as unique claim diagnosis combinations are observed during ETL. These bridge table keys are added to the facts as they are loaded, so that the tables can be joined as in Figure 9-8.
Figure 9-7: DIAGNOSIS GROUP multi-valued bridge table
Not only does the bridge table resolve the technical problem of the M:M relationship, it sidesteps the political issues of how to split the atomic facts and provides greater reporting performance and flexibility. By not increasing the number of facts or altering their values, queries that stick to the normal single-value dimensions to analyze busiest doctors, sickest patients or most expensive treatments run as fast as possible and produce "unarguable" answers. Even queries that filter on one specific ICD10 code produce similarly fast "unarguable" answers. Only when BI users want to analyze by multiple diagnoses do they have to consider the weighting factors and argue about allocations. When they do, they can choose to ignore the weighting factors and look at the unweighted treatment costs, use the default (crude) weighting factors in DIAGNOSIS GROUP (which add up to 100% for each diagnosis group), or model their own weighting factors in swappable versions of DIAGNOSIS GROUP and use those instead.
Figure 9-8: Using a multi-valued bridge table
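In SQL, the Figure 9-8 joins might look like this sketch (physical names are illustrative):

SELECT diagnosis, SUM(claim_amount * weighting_factor) AS weighted_claim_amount
FROM claim_fact, diagnosis_group, diagnosis
WHERE claim_fact.diagnosis_group_key = diagnosis_group.diagnosis_group_key
AND diagnosis_group.diagnosis_key = diagnosis.diagnosis_key
GROUP BY diagnosis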
Consequences
When you discover a potential multi-valued dimension you should first check that the granularity of the facts is correct before complicating the design with a bridge table. If the fact table is an aggregation of the available operational details you may be able to turn the multi-valued dimension into a normal dimension by going down to the atomic level of detail. For example, if you modeled an invoice fact table with a granularity of one row per invoice then PRODUCT would be a MV dimension. Modeling the atomic invoice line items easily solves this. However, if you are already at the atomic level you can avoid "splitting the atom" and creating meaningless (unstable subatomic) measures by using a bridge table.
Solution
When a dimension is barely multi-valued, a bridge table can be avoided by making the dimension multi-leveled (ML) so that it contains additional records for the small number of multi-valued groups needed. For example, the multi-level EMPLOYEE [HV, ML] dimension, in Figure 9-9, holds normal employee records for sales consultants and additional records for sales teams made up of two or more consultants. It contains example dimension members for two employees, Holmes and Watson, and handles facts where they have worked together (when the game is afoot) by treating their team "Holmes & Watson" as a pseudo-employee. This allows EMPLOYEE to join directly to the sales fact table (as in Figure 9-10) and rollup all their individual and joint sales to the appropriate branch at the time of sale. For example, Watson's individual sales will be rolled up to Afghanistan or London depending on when they occurred. His joint sales with Sherlock Holmes will always be rolled up to London.
Figure 9-9: Multi-level EMPLOYEE dimension containing additional team rows
All SK column example values will be replaced by integer surrogate keys in the physical model. The (2) records show the effect of an HV attribute change for employee John Watson (moving back to London).
Figure 9-10: Joining the multi-level dimension directly to the facts
For most queries that need to total sales the efficient direct join will be ideal, but those queries that calculate team sales splits will still need to treat EMPLOYEE as multi-valued. They can do so by joining through an optional bridge table, such as TEAM shown in Figure 9-11, that provides the team split percentage (the equivalent of a weighting factor) for each team member. The presence of the optional bridge effectively makes the direct join a shortcut that can be used whenever PERCENTAGE is not needed.
Figure 9-11: Joining through the TEAM optional bridge table
To be able to optionally join through a bridge, or directly to the facts, both the optional bridge table and the ML dimension must use the same surrogate keys, effectively making them swappable dimensions. For example, the Figure 9-12 BEAM✲ diagram for TEAM shows that the bridge table key TEAM KEY is actually a foreign key role of EMPLOYEE KEY. TEAM uses the special pseudo-employee key values, shown in Figure 9-9, to record the members and percentage splits for each team. It also uses normal employee key values on the records where TEAM KEY and MEMBER KEY are the same and PERCENTAGE is 100. These act like teams of one, allowing the bridge table to join employees to 100% of their individual sales facts—the equivalent of a direct join.
Figure 9-12: Multi-valued and multi-leveled bridge table
TEAM contains an additional attribute MEMBERSHIP TYPE to describe these "Team Split" and "Employee" records. It also records a third membership type of "Team" for some of the 100% records (highlighted in bold in Figure 9-12). These records allow the bridge to join facts to the team level records (e.g. "Holmes & Watson") in EMPLOYEE as well as normal employee records. This makes TEAM [HV, MV, ML] a multi-level (ML) as well as multi-valued bridge table, enabling it to be used to flexibly query both team sales and employee sales in a single pass. For example, consider the following query:
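A sketch of such a query (physical names are illustrative):

SELECT employee_name, SUM(revenue * percentage / 100) AS sales
FROM sales_fact, team, employee
WHERE sales_fact.employee_key = team.team_key
AND team.member_key = employee.employee_key
GROUP BY employee_name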
This returns both team sales and employee total sales (including their team sales)—a very useful report—but care must be taken not to add a grand total because it would double-count team sales. Filtering on MEMBERSHIP TYPE removes this limitation and makes the following additive reports available:
– Team sales:
WHERE Membership_Type = 'Team'
– Team sales and employees' individual sales (excluding their team sales):
WHERE Membership_Type <> 'Team Split'
This last filter is the equivalent of the shortcut join that avoids the bridge table.
Consequences
Multi-level bridge tables are complex. Including the multiple levels provides complete reporting flexibility, but at a price. Queries must filter the bridge table correctly to avoid double-counting or misinterpreting the results. Keeping the multiple levels in the dimension and bridge synchronized also requires significant additional ETL processing. For example, one change to Watson's location requires 2 new rows in EMPLOYEE and 4 new rows in TEAM to keep Watson and the Holmes & Watson team in sync. All these new rows have been marked (2) in Figures 9-9 and 9-12 to highlight the second versions of Watson, his team, and his team splits in the two tables created by his return from Afghanistan to London.
Figure 9-13: OPTION PACK bridge table added to ORDER FACTS
Unfortunately, even with the role-playing bridge table and a unique degenerate ID, the proposed design does not easily answer their third question type. The AND logic of the option combinations complicates matters. The users cannot answer this question with simple SQL that might contain "WHERE Option=2 AND Option=3…" because OPTION can be equal to both 2 and 3 at the same time! Instead they must construct a subquery for each option condition. These 9 subqueries can be executed as a single SQL SELECT, but users would not be able to construct them (or other combination questions) using simple ad-hoc query tools. Even if they could, the queries would not necessarily perform well.
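To give a flavor of the set logic involved, a fragment that finds option packs containing options 2 and 3 but none of options 4, 5 or 190 might look like this sketch (bridge table and column names are illustrative):

SELECT o2.option_pack_key
FROM option_pack o2, option_pack o3
WHERE o2.option_pack_key = o3.option_pack_key
AND o2.option_key = 2
AND o3.option_key = 3
AND NOT EXISTS
(SELECT * FROM option_pack o4
WHERE o4.option_pack_key = o2.option_pack_key
AND o4.option_key IN (4, 5, 190))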
Solution
If the number of options available across all customizable products is limited (e.g. 200 in total) and relatively static (e.g. new options are only added once a year), this row problem can be turned into a column solution with a bit of lateral thinking. Figure 9-14 shows an OPTION PACK FLAG dimension. This is a pivoted dimension (denoted by the code PD) that stores the same option combinations as the bridge table, but as columns rather than rows. It requires 200 columns to do so, but these columns are just bit (or single byte) flags and they make combination constraints very easy to build in SQL. For example, the filter for the previous user question becomes:
WHERE Option2 = 'Y' AND Option3 = 'Y' AND Option14 = 'Y'
AND Option4 = 'N' AND Option5 = 'N' AND Option190 = 'N'
The example data in Figure 9-14 shows that the option pack the users are looking for is OPTION PACK KEY 1. This is the same value as the more complex set-based queries would eventually find in the bridge table, because the pivoted dimension and the bridge table use the same surrogate key—they are swappable versions of each other. This means that the fact table does not need to be altered to add the pivoted dimension if the bridge table key is already present. There is value in having both tables because the bridge table and OPTION dimension combination is GROUP BY friendly and single-value WHERE clause friendly, while the pivoted dimension is combination WHERE clause friendly. To make the pivoted dimension user-friendly as well, it should be built with meaningful names for each option column; for example, MEMORY UPGRADE, CPU UPGRADE, RAID CONFIGURATION etc.
Figure 9-14: OPTION PACK pivoted dimension
In Figure 9-14, the pivoted dimension has an additional OPTION PACK attribute containing comma separated lists of option codes. This can be used in a query GROUP BY clause or displayed in a report header/footer to describe the filters that have been applied. In the user-friendly version of the pivoted dimension this would be a long text column containing a list of descriptive option names (sorted in alphabetic order). It can be useful to provide both versions; e.g., an OPTION PACK NUMBER list of codes and an OPTION PACK list of descriptions.
If you need to build a column flag dimension, create the multi-valued bridge table version first. Maintaining this type of table is easier with standard ETL routines and simple SQL. After the bridge table is in place you can then create more elaborate ETL routines that pivot its rows to create and maintain the column-oriented version, with meaningful column names generated from the descriptive row values using SQL-generated SQL.
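For example, a generator query along these lines (names are illustrative) could emit one flag expression per option, ready to be wrapped in a SELECT that groups the bridge table rows by OPTION PACK KEY:

SELECT 'MAX(CASE WHEN option_key = ' || option_key ||
' THEN ''Y'' ELSE ''N'' END) AS ' || option_column_name || ','
FROM option_dim
ORDER BY option_key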
While bridge tables and pivoted dimensions often go together, the need for a pivoted table is not limited to multi-valued dimensions. For example, if the granularity of ORDERS FACT was one record per product option order line item (the product plus each of its custom options as fact rows), this would avoid the multi-valued bridge but the pivoted dimension would still be needed to easily answer the combination questions. It would just be more work to build the pivoted dimension from scratch without the bridge, and the fact table would still need to be altered to add the OPTION PACK KEY.
If the business problem was more complex and varying quantities of each option could be chosen to configure a product, the OPTION PACK bridge table would need to contain an OPTION QUANTITY attribute and the OPTION PACK [PD] pivoted dimension would contain option count columns rather than [Y/N] flags. Similarly, if small numbers of options and option quantities were handled as separate fact rows (to avoid a bridge table) and comparisons or combinations were constantly used, then a pivoted fact table might be created with option count facts.
Consequences
Pivoted dimensions are limited by the maximum number of columns available in a database table (usually between 256 and 1024), and the ETL involved in automating the maintenance of volatile combination values is complex. A pivoted dimension works well for Pomegranate because there are only a few hundred relatively stable options (with several new ones being added manually each year), but it could not cope with the possible 155,000 ICD10 diagnosis codes.
How Dimensions
How dimensions document any additional information about facts that is not captured by other dimensions. The most common how dimensions are degenerate (DD) transaction identifiers stored in fact tables. These dimensions describe how facts come to exist by tying them back to the original source system transactions. They can also be invaluable for providing unique transaction counts. For example, an ORDER ID in an ORDERS FACT table can be used to count how many orders contained at least one laptop product line item. Using COUNT(DISTINCT Order_ID) ensures that individual orders with several line items for different laptops will not be over-counted. As mentioned earlier, a degenerate ID that can be uniquely counted is essential if a star schema has one or more multi-valued dimensions.
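For example, the laptop order count might be written as (names are illustrative):

SELECT COUNT(DISTINCT order_id) AS laptop_orders
FROM orders_fact, product
WHERE orders_fact.product_key = product.product_key
AND product_type = 'Laptop'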
Look out for conformed degenerate dimensions and add them to the event matrix
to help you discover event sequences and milestone dependencies that can be
modeled as evolving events.
– If a degenerate [Y/N] flag will be frequently counted, it can be remodeled as a low cardinality additive fact with the values 0 and 1 that can be summed. This is especially useful as aggregates can be built that use this fact.
– If the degenerate is high cardinality and will be counted distinctly, it should remain in the fact table, where it will act as a non-additive fact.
– If a degenerate flag describes the type of value in an adjacent fact, it may represent data that would be better modeled as separate additive fact columns without the flag. For example, a REVENUE fact and a flag REVENUE TYPE with the values 'E' for estimate and 'A' for actual should instead be modeled as two facts: ESTIMATED REVENUE and ACTUAL REVENUE (see the sketch following this list).
– Sometimes a degenerate meets more than one of these criteria. For example, a flag may be frequently counted, and used for grouping and constraining. In that case, you can model it as both a fact and a dimensional attribute.
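A sketch of the REVENUE TYPE remodeling as it might appear in an ETL select (source names are illustrative):

SELECT order_id,
CASE WHEN revenue_type = 'E' THEN revenue ELSE 0 END AS estimated_revenue,
CASE WHEN revenue_type = 'A' THEN revenue ELSE 0 END AS actual_revenue
FROM source_order_lines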
Don’t tell stakeholders that any of their data is “junk”, especially when you are
modelstorming with them. If you are looking for a less pejorative term for your
non-conformed how dimensions, call them miscellaneous dimensions.
Solution
Numeric range band dimensions such as RANGE BAND, shown in Figure 9-15, are another type of how dimension. They are how many dimensions, or "How do you turn a fact into a dimension?" dimensions, that convert continuously valued high cardinality facts into better discrete row headers. Chapter 6 described how high cardinality dimensional attributes should be stored as range band labels that are more useful for grouping by. Range band dimensions allow this to be done dynamically at query time, to facts and other numeric dimensional attributes.
Figure 9-15: RANGE BAND dimension
Figure 9-15 is an example of a general-purpose range band dimension that can store any number of range band groups. The example data shows two groups: "5 Money Bands", which would be used to group REVENUE into 5 bands, and "4 Age Bands", which can be joined to a customer or employee age to group a population into 4 bands. Figure 9-16 shows how the RANGE BAND dimension is joined to SALES FACT to count the number of products sold in each of the 5 revenue ranges—effectively converting the REVENUE fact into a dimension on-the-fly. The SQL for the query would be:
SELECT range_band, SUM(quantity_sold)
FROM sales_fact, range_band
WHERE range_band_group = '5 Money Bands'
AND revenue BETWEEN low_bound AND high_bound
GROUP BY range_band
Range band dimensions allow BI users to define new bandings at any time—by simply adding or changing dimension rows. The price for this flexibility will be slower query performance, because SQL BETWEEN joins are difficult to optimize. If certain facts are frequently used for range banding they can be indexed to improve join and sort processing. Normally only the dimensional foreign keys are indexed. Facts are usually not indexed because indexes do not speed up their aggregation. But for range banding queries, the facts are acting like dimensional foreign keys.
Figure 9-16: Range banding a fact
Consequences
RANGE BAND GROUP, LOW BOUND, and HIGH BOUND form the primary key (PK) of the RANGE BAND dimension, and must therefore be unique. You should set up the LOW BOUND and HIGH BOUND values for each range band with care: they should not overlap, and no gaps should exist. In addition, the RANGE BAND names must be unique within each RANGE BAND GROUP. The short code ND1 (No Duplicates) in Figure 9-15 has been added to these columns to indicate that they form a no duplicates group (number 1)—the combination of column values within the group must be unique.
Solution
The humble looking STEP dimension, in Figure 9-17, helps BI users understand sequential behavior. It allows ETL processes to explicitly label events with their position in a sequence (from its beginning and from its end), along with the length of the sequence. For example, a web browsing session of four page views by the same visitor (IP address) within an agreed timeframe would be represented as four rows in a PAGE VIEWS FACT table. The first page view event would be labeled as step 1 of 4 by assigning it a STEP KEY of 7 (see Figure 9-17). The next page view would be labeled as step 2 of 4 using STEP KEY 8, and so on.
Figure 9-17: STEP dimension
BI users can use the STEP dimension to easily identify page views belonging to sessions of any length, rank pages by position within sessions, and answer questions about the beginning, midpoint and ending of sessions for any interesting subset of customers, time, and products. They can quickly find the good and bad ("session killer") last page visits of a session (LAST STEP = 'Y'), or those that precede session killers (STEPS UNTIL LAST = 1), using simple, single-pass SQL. Answering questions like these without a STEP dimension would be too difficult for all but the most SQL-savvy BI users.
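For example, ranking candidate "session killer" pages needs only a simple single-pass query like this sketch (names are illustrative):

SELECT page_name, COUNT(*) AS session_exits
FROM page_views_fact, step, page
WHERE page_views_fact.step_key = step.step_key
AND page_views_fact.page_key = page.page_key
AND last_step = 'Y'
GROUP BY page_name
ORDER BY session_exits DESC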
A STEP dimension can also play multiple roles for an event; for example, Figure 9-18 shows a PAGE VIEWS FACT table with two STEP dimension roles: STEP IN SESSION, which describes page position within the overall session, and STEP IN PURCHASE, which describes how close each page is to a purchase decision. Each time a visitor clicks on a link to place a product in a shopping cart, STEP IN PURCHASE would be reset and the next mini-sequence length calculated.
Figure 9-18: Using the STEP dimension to describe web page visits
The STEP IN PURCHASE dimension role lets BI users analyze page visit sequences that lead to product purchases and ones that don't: page views that don't lead to a purchase would have a STEP IN PURCHASE KEY that points to the "Not Applicable" row 0 in STEP.
Consequences
STEP dimensions are relatively simple to populate from spreadsheets, but they grow surprisingly quickly as the maximum number of steps increases. The formula for calculating the number of rows needed for n total steps is: n(n+1)/2. Therefore, 200 steps = 20,100 rows, and 1,000 steps would be more than half a million rows! If 200 steps are more than adequate for 99% of all sequences, pre-populate your STEP dimension accordingly, and create special step number records greater than 200 if/when they are needed. These records would use special STEP KEY values (e.g. the negative step number) and would contain the STEP NUMBER but have missing values for the other attributes, to denote that they are steps in "exceptionally long" sequences. Often exceptionally long sequences are the result of ETL processing errors or poorly defined business rules that fail to spot the end of a normal sequence.
Although designing and creating a STEP dimension is straightforward, attaching it to the facts can require significant additional ETL processing. The events that belong to the same sequence have to be identified by an appropriate business rule (for example, all the page visits from the same IP address that are no more than 10 minutes apart) and counted in a first pass of the data; only then can the correct STEP KEYs be assigned to each fact row in a second pass.
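On databases that support window functions, the counting pass can be expressed compactly once the sessions themselves have been identified; a sketch (assuming a prior step has derived session_id using the 10-minute rule):

SELECT session_id, page_view_time,
ROW_NUMBER() OVER (PARTITION BY session_id ORDER BY page_view_time) AS step_number,
COUNT(*) OVER (PARTITION BY session_id) AS sequence_length
FROM staged_page_views

The step_number and sequence_length pairs can then be used to look up the matching STEP KEYs.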
Overloading facts with STEP information and other richly descriptive why and how
dimensions takes significant additional effort from the ETL team. You should
make sure you take them to lunch—on a regular basis.
Too often the answers to these questions are locked away in an ETL tool metadata repository—inaccessible to the BI users who need this information the most.
Solution
Figure 9-19 shows an AUDIT dimension that presents ETL statistics and data quality descriptions in a dimensional form—tied directly to the facts—where they can be queried by BI users, and used to provide additional context within the body of reports or as header or footer information. The AUDIT dimension surrogate key—AUDIT KEY—represents each execution of an ETL process. For example, if there are five different ETL modules that support the nightly refresh of the data warehouse, there would be at least five new rows added to the AUDIT dimension each night. Each of these rows would have a unique AUDIT KEY, which would appear in the fact table (and dimension) rows that were created or updated by the given ETL execution—providing basic data lineage information on each fact (and dimension): where it came from, and how it was extracted and loaded or last updated.
Figure 9-19: AUDIT dimension
Figure 9-19 also shows additional indicator attributes (in bold) that describe data quality and completeness. The AUDIT dimension would contain additional rows for each ETL module so that unusual fact records can be explicitly flagged if they contain out of bounds (defined by example data, data profiling, or historical norms), missing, adjusted or allocated values.
Audit dimensions leverage the value of ETL metadata. By making it available within each star schema they elevate metadata to the position of "real" data—another how or why dimension that BI users can use to group or filter their reports to help explain the figures they see.
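For example, BI users could profile facts by data quality with a query like this sketch (indicator names are illustrative):

SELECT out_of_bounds_indicator, COUNT(*) AS fact_rows, SUM(revenue) AS revenue
FROM sales_fact, audit_dim
WHERE sales_fact.audit_key = audit_dim.audit_key
GROUP BY out_of_bounds_indicator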
You can find additional information on creating and populating audit dimensions in The Data Warehouse ETL Toolkit, by Ralph Kimball and Joe Caserta (Wiley, 2004), pages 128–131.
Summary
Why dimensions are used to store direct and indirect causal reasons. Direct causal factors such
as price discounts are typically easier to implement and attribute to facts than indirect factors
because they are captured as part of a business event and do not need to be inferred from addi-
tional internal or external sources.
Unstructured why details are often captured as free text comments. These should be stored in a
COMMENT why dimension rather than as degenerate dimensions within fact tables.
Multi-valued (MV) bridge tables are used to resolve multiple causal factors and other multi-valued dimension relationships. Bridge tables avoid having to change the natural atomic granularity of a fact table and hard-coding fact allocations at ETL time. Using a bridge table allows BI users to choose how to weight the facts at query time. They also avoid multi-valued issues altogether when queries do not use the multi-valued dimension.
Optional bridge tables and multi-level dimensions that share common surrogate keys can be
used to efficiently handle barely multi-valued dimensions. Queries that do not need to deal with
a multi-valued dimension level and its weighting factor can attach the multi-level dimension di-
rectly to the facts to rollup to single-valued hierarchy levels.
Pivoted dimensions (PD) are built by transposing row values into column flags or column counts. They are used to simplify combination constraints that would otherwise be difficult to place across multiple rows. Pivoted dimensions are often implemented as swappable versions of multi-valued bridge tables. For query flexibility it is useful to have both the row-oriented bridge table for grouping and the column-oriented pivoted dimension for combination filtering. It is also easier to build a pivoted dimension once the bridge table is in place.
Degenerate how dimension (DD) transaction IDs ensure that facts are traceable back to source
systems. They also provide unique event counts for use in multi-valued queries.
Physical how dimensions are typically non-conformed dimensions that are specific to a single
fact table. These miscellaneous dimensions provide a home for the unique combinations of de-
generate dimensions that are too numerous to leave in the fact table. They reduce the size of fact
tables and make it easier for users to browse the dimensional values combinations.
Range Band dimensions support the ad-hoc conversion of continuously variable facts and di-
mensional attributes into report-friendly discrete bands for grouping and filtering.
Step dimensions allow facts to be analyzed using their relative position within event sequences. They enable BI users to discover events that closely follow or precede other significant cause and effect events. They help the data warehouse tell better stories.
Audit dimensions make ETL data lineage and data quality metadata available within star sche-
mas so that it can easily be used with BI reports.
I keep six honest serving-men
(They taught me all I knew);
Their names are What and Why and When
And How and Where and Who.
I send them over land and sea,
I send them east and west;
But after they have worked for me,
I give them all a rest.
…
That is, while there is value in the items on the right, we value the items on the left more.
Our highest priority is to satisfy the customer through early and continuous delivery of
valuable software.
Welcome changing requirements, even late in development. Agile processes harness change
for the customer's competitive advantage.
Deliver working software frequently, from a couple of weeks to a couple of months, with a
preference to the shorter timescale.
Business people and developers must work together daily throughout the project.
Build projects around motivated individuals. Give them the environment and support they
need, and trust them to get the job done.
The most efficient and effective method of conveying information to and within a
development team is face-to-face conversation.
Working software is the primary measure of progress.
Agile processes promote sustainable development. The sponsors, developers, and users
should be able to maintain a constant pace indefinitely.
Continuous attention to technical excellence and good design enhances agility.
Simplicity—the art of maximizing the amount of work not done—is essential.
The best architectures, requirements, and designs emerge from self-organizing teams.
At regular intervals, the team reflects on how to become more effective, then tunes and
adjusts its behavior accordingly.
© 2001 Agile Alliance
APPENDIX B: BEAM✲ NOTATION AND SHORT CODES
Table Codes
Event Story and Fact Table Types
[DE] Discrete Event. Event represents a point in time or short duration transaction that has completed. Implemented as a transaction fact table. (Chapters 2, 8)
[RE] Recurring Event. Event represents measurements taken at predictable regular intervals. Implemented as a periodic snapshot fact table. (Chapters 2, 8)
[EE] Evolving Event. Event represents a process that takes time to complete. Implemented as an accumulating snapshot fact table. (Chapters 2, 4, 8)
[TF] Transaction Fact table. Physical equivalent of a discrete event [DE]. Typically maintained by insert only. (Chapters 5, 8)
[AS] Accumulating Snapshot. Physical fact table equivalent of an evolving event [EE]. Maintained by insert and update. Typically contains multiple milestone date/time dimensions with matching duration and state count facts. (Chapter 8)
[PS] Periodic Snapshot. Physical fact table equivalent of a recurring event [RE]. Typically contains semi-additive facts. (Chapter 8)
[AG] Aggregate. Fact table that summarizes an existing fact table. (Chapter 8)
[DF] Derived Fact table. Fact table that is constructed by merging, slicing, or pivoting existing fact tables. (Chapter 8)
{Source} Data source. Default source system table or filename. (Chapter 5)
Column Codes
General Column Codes
MD Mandatory. Column value should be present under normal conditions. Column is defined as nullable so it can handle errors. (Chapter 2)
NN Not Null. Column does not allow nulls. All SK and FK columns are not null by default. (Chapter 5)
ND, NDn No Duplicates. Column must not contain duplicate values. The numbered version is used to define a combination of columns that must be unique. (Chapter 9)
Xn Exclusive. A dimensional attribute that is not valid for all members of a dimension. Used in conjunction with a DC defining characteristic. Number coded to identify mutually exclusive attributes or attribute groups and the defining characteristics they are paired with. Also used to denote exclusive facts that are only valid for certain dimensional values. (Chapters 3, 8)
DC, DCn,n Defining Characteristic. Column value dictates which exclusive attributes or facts are valid. For example, Product Type DC defines which Product attributes are valid. Number coded when multiple defining characteristics exist in the same table. (Chapters 3, 8)
[W-type], [dimension] Dimension type or name. The W (who, what, when, where, why, how) type of an event detail, or the dimension name when a detail is a role; for example, Salesperson [Employee] where Salesperson is a role of the Employee dimension. Also used to show a recursive relationship within a detail table. (Chapters 4, 6)
{Source} Data source. The name of a column or field in a source system. Can be qualified with a table or filename if necessary (when different from the table default). (Chapter 5)
Unavailable Unavailable or incorrect. A struck-out column name or column code denotes that source data is unavailable or does not comply with the current column type definition. For example, a struck-out MD denotes that the source system does not treat the data as mandatory as it contains null or missing values; a struck-out Gender denotes that Gender is not available. (Chapter 5)
Data Types
C, Cn Character data type. The numbered version is used to define the maximum length. Overrides the default length. (Chapter 5)
N, Nn.n Numeric data type. The numbered version is used to define precision. Overrides the default precision. (Chapter 5)
DT, DTn Date/Time data type. The numbered version is used in duration formulas for derived facts; for example, Delivery Delay DF=DT2-DT1. Numbering can also denote the default chronological order of milestones within an evolving event. (Chapters 4, 5, 8)
D, Dn Date data type. The numbered version is used in duration formulas for derived facts; for example, Project Duration DF=D2-D1. Numbering can also denote chronological order of milestones within an evolving event. (Chapters 5, 8)
T, Tn Text. Long character data used to hold free format text. The numbered version is used to define the maximum length. Overrides the default length. (Chapter 5)
B Blob. Binary large object used to hold documents, images, sound, objects, and so on. (Chapter 5)
Key Types
PK Primary Key. Column or group of columns that uniquely identify each row in a table. (Chapter 5)
SK Surrogate Key. Integer assigned by the data warehouse as the primary key for a dimension table. Used as a foreign key in fact tables. Also used to denote that example data in a BEAM✲ table column will be replaced by an integer foreign key in the physical model. (Chapter 5)
BK Business Key. A source system key. (Chapters 3, 5)
NK Natural Key. A (source system) key used in the real world. (Chapter 5)
FK Foreign Key. A column that references the primary key of another table. (Chapter 5)
RK Recursive Key. A foreign key that references the primary key of its own table. Often used to represent variable-depth hierarchies. Stores information needed to build hierarchy maps; for example, Parent Company Key in Company. (Chapter 6)
FV Fixed Value attribute. A dimensional attribute that should not change; for example, Date of Birth. FV attributes can, however, be corrected. When FV attributes are corrected they behave like CV attributes: the previous incorrect value is not preserved. (Chapter 3)
PVn Previous Value attribute. A dimensional attribute that records the previous value of another current value attribute. Also known as a type 3 slowly changing dimensional attribute. PVn is always used in conjunction with a matching CVn to relate the previous value to the current value; for example, Previous Territory PV1 and Territory CV1. PV attributes can also be used to hold initial or "as at specific date" values; for example, Initial Territory PV1 or YE2010 Territory PV1. (Chapter 6)
NA, NAn Non-Additive fact. A fact that cannot be aggregated using sum; for example, Temperature NA. Non-additive facts can be aggregated using other functions such as min, max, and average. Also denotes a non-additive dimension of a semi-additive fact. The numbered version is used to relate non-additive dimensions to specific semi-additive facts when multiple SAs exist in the same table. (Chapter 8)
DF, DF=formula Derived Fact. A fact that can be derived from other columns within the same table. May be followed by a simple formula referencing other facts or date/time details by number; for example, Unit Price DF=Revenue/Quantity. (Chapter 8)
[UoM], [UoM1, UoM2, ...] Unit of measure. Unit of measure symbol or description; for example, Order Revenue [$] or Delivery Delay [days]. The list form gives multiple units of measure required for reporting, with the default standard unit (UoM1) first. All quantities are stored in the standard UoM to produce an additive fact. (Chapters 2, 4)
APPENDIX C: RESOURCES FOR AGILE DIMENSIONAL MODELERS
Here is our list of recommended resources to help you implement the ideas contained in the book.
The Wi-Fi–enabled cameras in smartphones and tablets can be tremendously useful for capturing
modelstorming results and quickly transferring them to a laptop for further review (stick to black ink on
your whiteboards and flipcharts to help with that). There are many apps that can automate the workflow
of cleaning up whiteboard images and moving them to shared folders for group viewing.
Go Large
Digital projectors are our number one high-tech collaborative modeling tool. It's amazing how quickly
everyone can spot opportunities to improve a data model when it’s blown up large on the wall. Invite
your colleagues to the screening of your latest data model. Perhaps they can stay for a movie afterwards!
Books
Agile Software Development
Scrum and XP from the Trenches, Henrik Kniberg (InfoQ.com, 2007)
Not why do agile (like so many books) but how Henrik did agile.
Gamestorming: A Playbook for Innovators, Rulebreakers, and Changemakers, Dave Gray, Sunni
Brown, James Macanufo (O’Reilly Media, 2010)
Visual Meetings: How Graphics, Sticky Notes and Idea Mapping Can Transform Group Productivity,
David Sibbet (Wiley, 2010)
Books to help you facilitate modelstorming sessions and improve upon the collaborative techniques
we have introduced.
Business Model Generation: A Handbook for Visionaries, Game Changers, and Challengers,
Alexander Osterwalder, Yves Pigneur et al. (Wiley, 2010)
Check out the Business Model Canvas for more high-level collaborative modeling ideas.
Dimensional Modeling
Star Schema: The Complete Reference, Christopher Adamson (McGraw-Hill, 2010)
ETL
The Data Warehouse ETL Toolkit, Ralph Kimball, Joe Caserta (Wiley, 2004)
Mastering Data Warehouse Aggregates, Christopher Adamson (Wiley, 2006). Chapters 5 and 6 also provide excellent coverage of dimensional ETL.
Websites
decisionone.co.uk : DecisionOne Consulting, Lawrence Corr’s training and consulting firm.
modelstorming.com : The companion website to this book where you can download the BEAM✲
Modelstormer spreadsheet, the BI Model Canvas (inspired by the Business Model Canvas) plus other
useful BEAM✲ tools and example models from the book and beyond. It also contains links to our rec-
ommended books, articles, websites, and training courses.