Ais Prof 1 Chapter 5

The document discusses the importance of dimensional modeling in data warehousing and business intelligence, highlighting its advantages over normalized 3NF models in terms of user understandability and query performance. It outlines the goals of DW/BI systems, the structure of dimensional models including fact and dimension tables, and the ETL process for data integration. Additionally, it addresses various architectures for data warehousing, including the Kimball and Inmon approaches, and dispels common myths about dimensional modeling.


CHAPTER 1 - DATA WAREHOUSING, BUSINESS INTELLIGENCE, AND DIMENSIONAL MODELING PRIMER

One of the most important assets of any organization is its information. This asset is almost always used for two purposes: operational record keeping and analytical decision making.

Goals of Data Warehousing and Business Intelligence
1. The DW/BI system must make information easily accessible.
2. The DW/BI system must present information consistently.
3. The DW/BI system must adapt to change.
4. The DW/BI system must present information in a timely way.
5. The DW/BI system must be a secure bastion that protects the information assets.
6. The DW/BI system must serve as the authoritative and trustworthy foundation for improved decision making.
7. The business community must accept the DW/BI system to deem it successful.

DIMENSIONAL MODELING
- is widely accepted as the preferred technique for presenting analytic data because it addresses two simultaneous requirements:
 Deliver data that's understandable to the business users.
 Deliver fast query performance.
- is a longstanding technique for making databases simple.

Note: Simplicity is critical because it ensures that users can easily understand the data, as well as allows software to navigate and deliver results quickly and efficiently.

Albert Einstein captured the basic philosophy driving dimensional design when he said, "Make everything as simple as possible, but not simpler."

Although dimensional models are often instantiated in relational database management systems, they are quite different from third normal form (3NF) models, which seek to remove data redundancies.

Normalized 3NF structures divide data into many discrete entities, each of which becomes a relational table. A database of sales orders might start with a record for each order line but turn into a complex spider web diagram as a 3NF model, perhaps consisting of hundreds of normalized tables.

The industry sometimes refers to 3NF models as entity-relationship (ER) models. Entity-relationship diagrams (ER diagrams or ERDs) are drawings that communicate the relationships between tables. Both 3NF and dimensional models can be represented in ERDs because both consist of joined relational tables; the key difference between 3NF and dimensional models is the degree of normalization. Because both model types can be presented as ERDs, we refrain from referring to 3NF models as ER models; instead, we call them normalized models to minimize confusion.

Normalized 3NF structures are immensely useful in operational processing because an update or insert transaction touches the database in only one place.

Note: A dimensional model contains the same information as a normalized model, but packages the data in a format that delivers user understandability, query performance, and resilience to change.

Star Schemas Versus OLAP Cubes
Dimensional models implemented in relational database management systems are referred to as star schemas because of their resemblance to a star-like structure. Dimensional models implemented in multidimensional database environments are referred to as online analytical processing (OLAP) cubes.

OLAP Deployment Considerations
Here are some things to keep in mind if you deploy data into OLAP cubes:
- A star schema hosted in a relational database is a good physical foundation for building an OLAP cube, and is generally regarded as a more stable basis for backup and recovery.
- OLAP cubes have traditionally been noted for extreme performance advantages over RDBMSs, but that distinction has become less important with advances in computer hardware, such as appliances and in-memory databases, and RDBMS software, such as columnar databases.
- OLAP cube data structures are more variable across different vendors than relational DBMSs, thus the final deployment details often depend on which OLAP vendor is chosen. It is typically more difficult to port BI applications between different OLAP tools than to port BI applications across different relational databases.
- OLAP cubes typically offer more sophisticated security options than RDBMSs, such as limiting access to detailed data but providing more open access to summary data.
- OLAP cubes offer significantly richer analysis capabilities than RDBMSs, which are saddled by the constraints of SQL. This may be the main justification for using an OLAP product.
- OLAP cubes gracefully support slowly changing dimension type 2 changes (which are discussed in Chapter 5: Procurement), but cubes often need to be reprocessed partially or totally whenever data is overwritten using alternative slowly changing dimension techniques.
- OLAP cubes gracefully support transaction and periodic snapshot fact tables, but do not handle accumulating snapshot fact tables because of the limitations on overwriting data described in the previous point.
- OLAP cubes typically support complex ragged hierarchies of indeterminate depth, such as organization charts or bills of material, using native query syntax that is superior to the approaches required for RDBMSs.
- OLAP cubes may impose detailed constraints on the structure of dimension keys that implement drill-down hierarchies compared to relational databases.
- Some OLAP products do not enable dimensional roles or aliases, thus requiring separate physical dimensions to be defined.

FACT TABLES FOR MEASUREMENTS

The fact table in a dimensional model stores the performance measurements resulting from an organization's business process events. The term fact represents a business measure.

The data on each row is at a specific level of detail, referred to as the grain, such as one row per product sold on a sales transaction. One of the core tenets of dimensional modeling is that all the measurement rows in a fact table must be at the same grain.

NOTE: The idea that a measurement event in the physical world has a one-to-one relationship to a single row in the corresponding fact table is a bedrock principle for dimensional modeling. Everything else builds from this foundation.

 The most useful facts are numeric and additive, such as dollar sales amount.
 Additivity is crucial because BI applications rarely retrieve a single fact table row.
 Semi-additive facts, such as account balances, cannot be summed across the time dimension.
 Non-additive facts, such as unit prices, can never be added.

A textual measurement is a description of something and is drawn from a discrete list of values. A true text fact is rare because the unpredictable content of a text fact, like a freeform text comment, makes it nearly impossible to analyze.

All fact tables have two or more foreign keys (refer to the FK notation in Figure 1-2) that connect to the dimension tables' primary keys. When all the keys in the fact table correctly match their respective primary keys in the corresponding dimension tables, the tables satisfy referential integrity.

The fact table generally has its own primary key composed of a subset of the foreign keys. This key is often called a composite key. Every table that has a composite key is a fact table. Fact tables express many-to-many relationships. All others are dimension tables.
Dimension Tables for Descriptive Context
Dimension tables are integral companions to a fact table. The dimension tables contain the textual context associated with a business process measurement event. They describe the "who, what, where, when, how, and why" associated with the event.

It is not uncommon for a dimension table to have 50 to 100 attributes; although, some dimension tables naturally have only a handful of attributes. Dimension tables tend to have fewer rows than fact tables, but can be wide with many large text columns. Each dimension is defined by a single primary key (refer to the PK notation in Figure 1-3), which serves as the basis for referential integrity with any given fact table to which it is joined.

Dimension attributes serve as the primary source of query constraints, groupings, and report labels.

NOTE: Dimensions provide the entry points to the data, and the final labels and groupings on all DW/BI analyses.

NOTE: The designer's dilemma of whether a numeric quantity is a fact or a dimension attribute is rarely a difficult decision. Continuously valued numeric observations are almost always facts; discrete numeric observations drawn from a small list are almost always dimension attributes.

You should resist the perhaps habitual urge to normalize data by storing only the brand code in the product dimension and creating a separate brand lookup table, and likewise for the category description in a separate category lookup table. This normalization is called snowflaking. Instead of third normal form, dimension tables typically are highly denormalized with flattened many-to-one relationships within a single dimension table.

The dimension and fact terminology originated from a joint research project conducted by General Mills and Dartmouth University in the 1960s. In the 1970s, both AC Nielsen and IRI used the terms consistently to describe their syndicated data offerings and gravitated to dimensional models for simplifying the presentation of their analytic information.

Facts and Dimensions Joined in a Star Schema
- The star-like structure is often called a star join, a term dating back to the earliest days of relational databases.
- The first thing to notice about the dimensional schema is its simplicity and symmetry.
- Every dimension is equivalent; all dimensions are symmetrically equal entry points into the fact table. The dimensional model has no built-in bias regarding expected query patterns.
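To make the star join concrete, here is a minimal sketch in SQL. The table and column names (sales_fact, date_dim, product_dim, and so on) are illustrative assumptions, not anything prescribed by the text; a real retail schema would carry many more dimensions and attributes.

-- Dimension tables: one row per member, wide and denormalized
CREATE TABLE date_dim (
    date_key      INTEGER PRIMARY KEY,
    full_date     DATE,
    month_name    VARCHAR(20),
    fiscal_period VARCHAR(10)
);

CREATE TABLE product_dim (
    product_key   INTEGER PRIMARY KEY,
    sku           VARCHAR(20),
    brand         VARCHAR(50),
    category      VARCHAR(50)
);

-- Fact table: one row per product sold on a sales transaction (the grain),
-- with a foreign key to every dimension and additive numeric facts
CREATE TABLE sales_fact (
    date_key      INTEGER REFERENCES date_dim (date_key),
    product_key   INTEGER REFERENCES product_dim (product_key),
    quantity      INTEGER,
    dollar_amount DECIMAL(12,2)
);

-- A typical star join: constrain and group by dimension attributes,
-- then sum the additive facts
SELECT d.month_name, p.brand, SUM(f.dollar_amount) AS sales
FROM sales_fact f
JOIN date_dim d ON f.date_key = d.date_key
JOIN product_dim p ON f.product_key = p.product_key
GROUP BY d.month_name, p.brand;

Because every dimension is an equally valid entry point, the same pattern works no matter which attributes are used to constrain or group.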

KIMBALL'S DW/BI ARCHITECTURE
There are four separate and distinct components to consider in the DW/BI environment: operational source systems, ETL system, data presentation area, and business intelligence applications.

Operational Source Systems
- operational systems of record that capture the business's transactions.
- The main priorities of the source systems are processing performance and availability.
- Source systems maintain little historical data; a good data warehouse can relieve the source systems of much of the responsibility for representing the past.
- the source systems are special purpose applications without any commitment to sharing common data such as product, customer, geography, or calendar with other operational systems in the organization.

EXTRACT, TRANSFORMATION, AND LOAD SYSTEM
- consists of a work area, instantiated data structures, and a set of processes.
- is everything between the operational source systems and the DW/BI presentation area.

Extraction is the first step in the process of getting data into the data warehouse environment. Extracting means reading and understanding the source data and copying the data needed into the ETL system for further manipulation.

After the data is extracted to the ETL system, there are numerous potential transformations, such as cleansing the data (correcting misspellings, resolving domain conflicts, dealing with missing elements, or parsing into standard formats), combining data from multiple sources, and de-duplicating data.

The final step of the ETL process is the physical structuring and loading of data into the presentation area's target dimensional models.

NOTE: It is acceptable to create a normalized database to support the ETL processes; however, this is not the end goal. The normalized structures must be off-limits to user queries because they defeat the twin goals of understandability and performance.

PRESENTATION AREA TO SUPPORT BUSINESS INTELLIGENCE
- The DW/BI presentation area is where data is organized, stored, and made available for direct querying by users, report writers, and other analytical BI applications.

NOTE: Data in the queryable presentation area of the DW/BI system must be dimensional, atomic (complemented by performance-enhancing aggregates), business process-centric, and adhere to the enterprise data warehouse bus architecture. The data must not be structured according to individual departments' interpretation of the data.

BUSINESS INTELLIGENCE APPLICATIONS
- The final major component of the Kimball DW/BI architecture is the business intelligence (BI) application.
- The term BI application loosely refers to the range of capabilities provided to business users to leverage the presentation area for analytic decision making.

ALTERNATIVE DW/BI ARCHITECTURES

Independent Data Mart Architecture
- analytic data is deployed on a departmental basis without concern to sharing and integrating information across the enterprise.
- Typically, a single department identifies requirements for data from an operational source system.
- The ETL system is typically dominated by the simple activities of sorting and sequential processing.
- In many cases, the ETL system is not based on relational technology but instead may rely on a system of flat files.

Hub-and-Spoke Corporate Information Factory Inmon Architecture
The hub-and-spoke Corporate Information Factory (CIF) approach is advocated by Bill Inmon and others in the industry.
- With the CIF, data is extracted from the operational source systems and processed through an ETL system sometimes referred to as data acquisition.
- The atomic data that results from this processing lands in a 3NF database; this normalized, atomic repository is referred to as the Enterprise Data Warehouse (EDW) within the CIF architecture.

NOTE: The process of normalization does not technically speak to integration. Normalization simply creates physical tables that implement many-to-one relationships. Integration, on the other hand, requires that inconsistencies arising from separate sources be resolved. Separate incompatible database sources can be normalized to the hilt without addressing integration. The Kimball architecture based on conformed dimensions reverses this logic and focuses on resolving data inconsistencies without explicitly requiring normalization.

NOTE: The most extreme form of a pure CIF architecture is unworkable as a data warehouse, in our opinion. Such an architecture locks the atomic data in difficult-to-query normalized structures, while delivering departmentally incompatible data marts to different groups of business users. But before being too depressed by this view, stay tuned for the next section.

HYBRID HUB-AND-SPOKE AND KIMBALL ARCHITECTURE

- The final architecture warranting discussion is the marriage of the Kimball and Inmon CIF architectures.
- this architecture populates a CIF-centric EDW that is completely off-limits to business users for analysis and reporting. It's merely the source to populate a Kimball-esque presentation area in which the data is dimensional, atomic (complemented by aggregates), process-centric, and conforms to the enterprise data warehouse bus architecture.

Dimensional Modeling Myths

Myth 1: Dimensional Models are Only for Summary Data
Myth 2: Dimensional Models are Departmental, Not Enterprise
Myth 3: Dimensional Models are Not Scalable
Myth 4: Dimensional Models are Only for Predictable Usage
Myth 5: Dimensional Models Can't Be Integrated.


CHAPTER 2 - KIMBALL DIMENSIONAL MODELING TECHNIQUES OVERVIEW

Gather Business Requirements and Data Realities
 Before launching a dimensional modeling effort, the team needs to understand the needs of the business, as well as the realities of the underlying source data.

COLLABORATIVE DIMENSIONAL MODELING WORKSHOPS
 Dimensional models should be designed in collaboration with subject matter experts and data governance representatives from the business.

FOUR-STEP DIMENSIONAL DESIGN PROCESS
The four key decisions made during the design of a dimensional model include:
1. Select the business process.
2. Declare the grain.
3. Identify the dimensions.
4. Identify the facts.

BUSINESS PROCESSES
 are the operational activities performed by your organization, such as taking an order, processing an insurance claim, registering students for a class, or snapshotting every account each month.
 Business process events generate or capture performance metrics that translate into facts in a fact table.

GRAIN
 Declaring the grain is the pivotal step in a dimensional design.
 The grain establishes exactly what a single fact table row represents.
 The grain must be declared before choosing dimensions or facts because every candidate dimension or fact must be consistent with the grain.
 Atomic grain refers to the lowest level at which data is captured by a given business process.

DIMENSIONS FOR DESCRIPTIVE CONTEXT
 Dimensions provide the "who, what, where, when, why, and how" context surrounding a business process event.
 Dimension tables contain the descriptive attributes used by BI applications for filtering and grouping the facts.
 Dimension tables are sometimes called the "soul" of the data warehouse because they contain the entry points and descriptive labels that enable the DW/BI system to be leveraged for business analysis.

Facts for Measurements
 Facts are the measurements that result from a business process event and are almost always numeric.
 A single fact table row has a one-to-one relationship to a measurement event as described by the fact table's grain.
 Thus, a fact table corresponds to a physical observable event, and not to the demands of a particular report.

Star Schemas and OLAP Cubes
 Star schemas are dimensional structures deployed in a relational database management system (RDBMS). They consist of fact tables linked to associated dimension tables via primary/foreign key relationships.
 An online analytical processing (OLAP) cube is a dimensional structure implemented in a multidimensional database; it can be equivalent in content to, or more often derived from, a relational star schema. An OLAP cube contains dimensional attributes and facts, but it is accessed through languages with more analytic capabilities than SQL, such as XMLA and MDX.
 OLAP cubes are included in this list of basic techniques because an OLAP cube is often the final step in the deployment of a dimensional DW/BI system, or may exist as an aggregate structure based on a more atomic relational star schema.

Graceful Extensions to Dimensional Models
 Dimensional models are resilient when data relationships change.
 All the following changes can be implemented without altering any existing BI query or application, and without any change in query results:
- Facts consistent with the grain of an existing fact table can be added by creating new columns.
- Dimensions can be added to an existing fact table by creating new foreign key columns, presuming they don't alter the fact table's grain.
- Attributes can be added to an existing dimension table by creating new columns.
- The grain of a fact table can be made more atomic by adding attributes to an existing dimension table, and then restating the fact table at the lower grain, being careful to preserve the existing column names in the fact and dimension tables.
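As a rough illustration of the first two kinds of graceful extension, the following sketch reuses the hypothetical sales_fact and an assumed promotion_dim table; exact ALTER TABLE syntax varies by database, and the key value 0 assumes a "No Promotion" row already exists in promotion_dim.

-- Add a new fact that is consistent with the existing grain
ALTER TABLE sales_fact ADD COLUMN discount_amount DECIMAL(12,2) DEFAULT 0;

-- Add a new dimension by creating a new foreign key column, defaulted to
-- the "No Promotion" row so existing fact rows and queries remain valid
ALTER TABLE sales_fact ADD COLUMN promotion_key INTEGER DEFAULT 0
    REFERENCES promotion_dim (promotion_key);

Existing BI queries keep returning the same results because they never reference the new columns.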
BASIC FACT TABLE TECHNIQUES

Fact Table Structure
 fact table contains the numeric measures produced by an operational measurement event in the real world.
 At the lowest grain, a fact table row corresponds to a measurement event and vice versa.
 Thus, the fundamental design of a fact table is entirely based on a physical activity and is not influenced by the eventual report.
 In addition to numeric measures, a fact table always contains foreign keys for each of its associated dimensions, as well as optional degenerate dimension keys and date/time stamps.
 Fact tables are the primary target of computations and dynamic aggregations arising from queries.

Additive, Semi-Additive, Non-Additive Facts

Fully Additive
 The most flexible and useful facts are fully additive; additive measures can be summed across any of the dimensions associated with the fact table.

Semi-Additive
 Semi-additive measures can be summed across some dimensions, but not all; balance amounts are common semi-additive facts because they are additive across all dimensions except time.

Non-Additive
 such as ratios
 A good approach for non-additive facts is, where possible, to store the fully additive components of the non-additive measure and sum these components into the final answer set before calculating the final non-additive fact.

Nulls in Fact Tables
 Null-valued measurements behave gracefully in fact tables.
 The aggregate functions (SUM, COUNT, MIN, MAX, and AVG) all do the "right thing" with null facts.
 However, nulls must be avoided in the fact table's foreign keys because these nulls would automatically cause a referential integrity violation.

Conformed Facts
 If the same measurement appears in separate fact tables, care must be taken to make sure the technical definitions of the facts are identical if they are to be compared or computed together. If the separate fact definitions are consistent, the conformed facts should be identically named; but if they are incompatible, they should be differently named to alert the business users and BI applications.

Transaction Fact Tables
 A row in a transaction fact table corresponds to a measurement event at a point in space and time.
 Atomic transaction grain fact tables are the most dimensional and expressive fact tables.
 Transaction fact tables may be dense or sparse because rows exist only if measurements take place.
 These fact tables always contain a foreign key for each associated dimension, and optionally contain precise time stamps and degenerate dimension keys. The measured numeric facts must be consistent with the transaction grain.

Periodic Snapshot Fact Tables
 A row in a periodic snapshot fact table summarizes many measurement events occurring over a standard period, such as a day, a week, or a month.
 The grain is the period, not the individual transaction.
 Periodic snapshot fact tables often contain many facts because any measurement event consistent with the fact table grain is permissible.
 These fact tables are uniformly dense in their foreign keys because even if no activity takes place during the period, a row is typically inserted in the fact table containing a zero or null for each fact.

Accumulating Snapshot Fact Tables
 A row in an accumulating snapshot fact table summarizes the measurement events occurring at predictable steps between the beginning and the end of a process.
 Pipeline or workflow processes, such as order fulfillment or claim processing, that have a defined start point, standard intermediate steps, and defined end point can be modeled with this type of fact table.
 As pipeline progress occurs, the accumulating fact table row is revisited and updated.
 In addition to the date foreign keys associated with each critical process step, accumulating snapshot fact tables contain foreign keys for other dimensions and optionally contain degenerate dimensions. They often include numeric lag measurements consistent with the grain, along with milestone completion counters.

Factless Fact Tables
 Factless fact tables can also be used to analyze what didn't happen. These queries always have two parts: a factless coverage table that contains all the possibilities of events that might happen and an activity table that contains the events that did happen. When the activity is subtracted from the coverage, the result is the set of events that did not happen.
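A rough sketch of the "what didn't happen" pattern follows. The promotion_coverage_fact (what was on promotion in each store each week) and sales_fact tables, and their key columns, are illustrative assumptions rather than names taken from the text.

-- Products on promotion in a store that recorded no sales that week:
-- subtract the activity (sales) from the coverage (promotions)
SELECT c.week_key, c.store_key, c.product_key
FROM promotion_coverage_fact c
WHERE NOT EXISTS (
    SELECT 1
    FROM sales_fact s
    WHERE s.week_key    = c.week_key
      AND s.store_key   = c.store_key
      AND s.product_key = c.product_key
);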

Aggregate Fact Tables or OLAP Cubes
 Aggregate fact tables are simple numeric rollups of atomic fact table data built solely to accelerate query performance.
 These aggregate fact tables should be available to the BI layer at the same time as the atomic fact tables so that BI tools smoothly choose the appropriate aggregate level at query time.
 This process, known as aggregate navigation, must be open so that every report writer, query tool, and BI application harvests the same performance benefits.
 aggregate OLAP cubes with summarized measures are frequently built in the same way as relational aggregates, but the OLAP cubes are meant to be accessed directly by the business users.

Consolidated Fact Tables
 It is often convenient to combine facts from multiple processes together into a single consolidated fact table if they can be expressed at the same grain.
 Consolidated fact tables add burden to the ETL processing, but ease the analytic burden on the BI applications.
 They should be considered for cross-process metrics that are frequently analyzed together.

BASIC DIMENSION TABLE TECHNIQUES

Dimension Table Structure
 Every dimension table has a single primary key column.
 This primary key is embedded as a foreign key in any associated fact table where the dimension row's descriptive context is exactly correct for that fact table row.
 Dimension tables are usually wide, flat denormalized tables with many low-cardinality text attributes.

Dimension Surrogate Keys
 A dimension table is designed with one column serving as a unique primary key. This primary key cannot be the operational system's natural key because there will be multiple dimension rows for that natural key when changes are tracked over time.
 dimension surrogate keys are simple integers, assigned in sequence, starting with the value 1, every time a new key is needed.
 The date dimension is exempt from the surrogate key rule; this highly predictable and stable dimension can use a more meaningful primary key.

Natural, Durable, and Supernatural Keys
 Natural keys created by operational source systems are subject to business rules outside the control of the DW/BI system. For instance, an employee number (natural key) may be changed if the employee resigns and then is rehired.
 a new durable key must be created that is persistent and does not change in this situation. This key is sometimes referred to as a durable supernatural key.
 The best durable keys have a format that is independent of the original business process and thus should be simple integers assigned in sequence beginning with 1.

Drilling Down
 Drilling down is the most fundamental way data is analyzed by business users. Drilling down simply means adding a row header to an existing query; the new row header is a dimension attribute appended to the GROUP BY expression in an SQL query.
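A minimal sketch of drilling down, reusing the illustrative sales_fact and product_dim tables from the earlier star schema sketch: the second query simply adds one more dimension attribute as a row header.

-- Before drilling down: sales by brand
SELECT p.brand, SUM(f.dollar_amount) AS sales
FROM sales_fact f
JOIN product_dim p ON f.product_key = p.product_key
GROUP BY p.brand;

-- After drilling down: the category attribute is appended to the
-- SELECT list and the GROUP BY expression
SELECT p.brand, p.category, SUM(f.dollar_amount) AS sales
FROM sales_fact f
JOIN product_dim p ON f.product_key = p.product_key
GROUP BY p.brand, p.category;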
Degenerate Dimensions
 Sometimes a dimension is defined that has no content except for its primary key.
 This degenerate dimension is placed in the fact table with the explicit acknowledgment that there is no associated dimension table.
 Degenerate dimensions are most common with transaction and accumulating snapshot fact tables.

Denormalized Flattened Dimensions
 Dimension denormalization supports dimensional modeling's twin objectives of simplicity and speed.

Multiple Hierarchies in Dimensions
 Many dimensions contain more than one natural hierarchy.

Flags and Indicators as Textual Attributes
 Cryptic abbreviations, true/false flags, and operational indicators should be supplemented in dimension tables with full text words that have meaning when independently viewed.

Null Attributes in Dimensions
 Null-valued dimension attributes result when a given dimension row has not been fully populated, or when there are attributes that are not applicable to all the dimension's rows.
 Nulls in dimension attributes should be avoided because different databases handle grouping and constraining on nulls inconsistently.

Calendar Date Dimensions
 Calendar date dimensions are attached to virtually every fact table to allow navigation of the fact table through familiar dates, months, fiscal periods, and special days on the calendar.
 The calendar date dimension typically has many attributes describing characteristics such as week number, month name, fiscal period, and national holiday indicator.
 The date/time stamp is not a foreign key to a dimension table, but rather is a standalone column.

Role-Playing Dimensions
 It is essential that each foreign key refers to a separate view of the date dimension so that the references are independent.
 These separate dimension views (with unique attribute column names) are called roles.
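A hedged sketch of role-playing date views follows. The order_date_dim and ship_date_dim names, and the idea of an orders fact table carrying one foreign key per role, are assumptions chosen for illustration.

-- One physical date dimension exposed through two role views,
-- each with uniquely named attribute columns
CREATE VIEW order_date_dim AS
    SELECT date_key   AS order_date_key,
           full_date  AS order_date,
           month_name AS order_month_name
    FROM date_dim;

CREATE VIEW ship_date_dim AS
    SELECT date_key   AS ship_date_key,
           full_date  AS ship_date,
           month_name AS ship_month_name
    FROM date_dim;

-- The fact table then carries one foreign key per role (for example,
-- order_date_key and ship_date_key), each joined to its own view so the
-- two date references stay independent in queries.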
Junk Dimensions
 transactional business processes typically produce a number of miscellaneous, low cardinality flags and indicators.
 This dimension, frequently labeled as a transaction profile dimension in a schema, does not need to be the Cartesian product of all the attributes' possible values, but should only contain the combination of values that actually occur in the source data.

Snowflaked Dimensions
 When this process is repeated with all the dimension table's hierarchies, a characteristic multilevel structure is created that is called a snowflake.
 Although the snowflake represents hierarchical data accurately, you should avoid snowflakes because it is difficult for business users to understand and navigate snowflakes.

Outrigger Dimensions
 These secondary dimension references are called outrigger dimensions.
 are permissible, but should be used sparingly.

INTEGRATION VIA CONFORMED DIMENSIONS

Conformed Dimensions
 Dimension tables conform when attributes in separate dimension tables have the same column names and domain contents.
 defined once in collaboration with the business's data governance representatives, are reused across fact tables; they deliver both analytic consistency and reduced future development costs because the wheel is not repeatedly re-created.

Shrunken Dimensions
 are conformed dimensions that are a subset of rows and/or columns of a base dimension.
 Shrunken rollup dimensions are required when constructing aggregate fact tables.
 Another case of conformed dimension subsetting occurs when two dimensions are at the same level of detail, but one represents only a subset of rows.

Drilling Across
 simply means making separate queries against two or more fact tables where the row headers of each query consist of identical conformed attributes.

Value Chain
 identifies the natural flow of an organization's primary business processes.
 Operational source systems typically produce transactions or snapshots at each step of the value chain.

Enterprise Data Warehouse Bus Architecture
 The enterprise data warehouse bus architecture provides an incremental approach to building the enterprise DW/BI system.
 This architecture decomposes the DW/BI planning process into manageable pieces by focusing on business processes, while delivering integration via standardized conformed dimensions that are reused across processes.
 The bus architecture is technology and database platform independent; both relational and OLAP dimensional structures can participate.

Detailed Implementation Bus Matrix
 The detailed implementation bus matrix is a more granular bus matrix where each business process row has been expanded to show specific fact tables or OLAP cubes.

Opportunity/Stakeholder Matrix
 The opportunity/stakeholder matrix helps identify which business groups should be invited to the collaborative design sessions for each process-centric row.

DEALING WITH SLOWLY CHANGING DIMENSION ATTRIBUTES

Type 0: Retain Original
 With type 0, the dimension attribute value never changes, so facts are always grouped by this original value.
 Type 0 is appropriate for any attribute labeled "original," such as a customer's original credit score or a durable identifier.

Type 1: Overwrite
 With type 1, the old attribute value in the dimension row is overwritten with the new value; type 1 attributes always reflect the most recent assignment, and therefore this technique destroys history.

Type 2: Add New Row
 Type 2 changes add a new row in the dimension with the updated attribute values.

Type 3: Add New Attribute
 Type 3 changes add a new attribute in the dimension to preserve the old attribute value; the new value overwrites the main attribute as in a type 1 change.
 This kind of type 3 change is sometimes called an alternate reality.

Type 4: Add Mini-Dimension
 The type 4 technique is used when a group of attributes in a dimension rapidly changes and is split off to a mini-dimension.
 This situation is sometimes called a rapidly changing monster dimension.

Type 5: Add Mini-Dimension and Type 1 Outrigger
 The type 5 technique is used to accurately preserve historical attribute values, plus report historical facts according to current attribute values.
 Type 5 builds on the type 4 mini-dimension by also embedding a current type 1 reference to the mini-dimension in the base dimension.

Type 6: Add Type 1 Attributes to Type 2 Dimension
 Like type 5, type 6 also delivers both historical and current dimension attribute values.
 Type 6 builds on the type 2 technique by also embedding current type 1 versions of the same attributes in the dimension row so that fact rows can be filtered or grouped by either the type 2 attribute value in effect when the measurement occurred or the attribute's current value.

Type 7: Dual Type 1 and Type 2 Dimensions
 Type 7 is the final hybrid technique used to support both as-was and as-is reporting.
 A fact table can be accessed through a dimension modeled both as a type 1 dimension showing only the most current attribute values, or as a type 2 dimension showing correct contemporary historical profiles.
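To make the type 2 technique above concrete, here is a hedged sketch of the usual expire-and-insert handling. The customer_dim columns (row_effective_date, row_expiration_date, current_row_indicator) mirror the timespan-tracking columns mentioned later in this summary, and every table name, key value, and attribute is an illustrative assumption.

-- Expire the current row for the customer whose attribute changed...
UPDATE customer_dim
SET row_expiration_date  = DATE '2013-06-01',
    current_row_indicator = 'Expired'
WHERE durable_customer_id = 12345
  AND current_row_indicator = 'Current';

-- ...then add a new row, with a new surrogate key, carrying the updated
-- attribute value and a fresh effective date
INSERT INTO customer_dim
    (customer_key, durable_customer_id, city,
     row_effective_date, row_expiration_date, current_row_indicator)
VALUES
    (98765, 12345, 'New City',
     DATE '2013-06-01', DATE '9999-12-31', 'Current');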
DEALING WITH DIMENSION HIERARCHIES

Fixed Depth Positional Hierarchies
 A fixed depth hierarchy is a series of many-to-one relationships, such as product to brand to category to department.
 A fixed depth hierarchy is by far the easiest to understand and navigate as long as the above criteria are met. It also delivers predictable and fast query performance.

Slightly Ragged/Variable Depth Hierarchies
 Slightly ragged hierarchies don't have a fixed number of levels, but the range in depth is small.

Ragged/Variable Depth Hierarchies with Pathstring Attributes
 The use of a bridge table for ragged variable depth hierarchies can be avoided by implementing a pathstring attribute in the dimension.

ADVANCED FACT TABLE TECHNIQUES

Fact Table Surrogate Keys
 Surrogate keys are used to implement the primary keys of almost all dimension tables.
 are not associated with any dimension, are assigned sequentially during the ETL load process and are used 1) as the single column primary key of the fact table; 2) to serve as an immediate identifier of a fact table row without navigating multiple dimensions for ETL purposes; 3) to allow an interrupted load process to either back out or resume; 4) to allow fact table update operations to be decomposed into less risky inserts plus deletes.

Centipede Fact Tables
 Some designers create separate normalized dimensions for each level of a many-to-one hierarchy, such as a date dimension, month dimension, quarter dimension, and year dimension, and then include all these foreign keys in a fact table. This results in a centipede fact table with dozens of hierarchically related dimensions.
 Centipede fact tables should be avoided.

Numeric Values as Attributes or Facts
 Designers sometimes encounter numeric values that don't clearly fall into either the fact or dimension attribute categories.
 If the numeric value is used primarily for calculation purposes, it likely belongs in the fact table.

Lag/Duration Facts
 Accumulating snapshot fact tables capture multiple process milestones, each with a date foreign key and possibly a date/time stamp.
 Business users often want to analyze the lags or durations between these milestones; sometimes these lags are just the differences between dates, but other times the lags are based on more complicated business rules.

Header/Line Fact Tables
 Operational transaction systems often consist of a transaction header row that's associated with multiple transaction lines.
 With header/line schemas (also known as parent/child schemas), all the header-level dimension foreign keys and degenerate dimensions should be included on the line-level fact table.

Allocated Facts
 It is quite common in header/line transaction data to encounter facts of differing granularity, such as a header freight charge.

Profit and Loss Fact Tables Using Allocations
 Fact tables that expose the full equation of profit are among the most powerful deliverables of an enterprise DW/BI system.
 Fact tables ideally implement the profit equation at the grain of the atomic revenue transaction and contain many components of cost.

Multiple Currency Facts
 Fact tables that record financial transactions in multiple currencies should contain a pair of columns for every financial fact in the row.
 This fact table also must have a currency dimension to identify the transaction's true currency.

Multiple Units of Measure Facts
 Some business processes require facts to be stated simultaneously in several units of measure.
 If the fact table contains a large number of facts, each of which must be expressed in all units of measure, a convenient technique is to store the facts once in the table at an agreed standard unit of measure, but also simultaneously store conversion factors between the standard measure and all the others.
 This fact table could be deployed through views to each user constituency, using an appropriately selected conversion factor.

Year-to-Date Facts
Business users often request year-to-date (YTD) values in a fact table. It is hard to argue against a single request, but YTD requests can easily morph into "YTD at the close of the fiscal period" or "fiscal period to date."

Multi-pass SQL to Avoid Fact-to-Fact Table Joins
A BI application must never issue SQL that joins two fact tables together across the fact table's foreign keys. For instance, if two fact tables contain customer's product shipments and returns, these two fact tables must not be joined directly across the customer and product foreign keys.

Timespan Tracking in Fact Tables
 There are three basic fact table grains: transaction, periodic snapshot, and accumulating snapshot.
 In isolated cases, it is useful to add a row effective date, row expiration date, and current row indicator to the fact table, much like you do with type 2 slowly changing dimensions, to capture a timespan when the fact row was effective.

Late Arriving Facts
 A fact row is late arriving if the most current dimensional context for new fact rows does not match the incoming row.
 This happens when the fact row is delayed.

ADVANCED DIMENSION TECHNIQUES

Dimension-to-Dimension Table Joins
 Dimensions can contain references to other dimensions.
 Although these relationships can be modeled with outrigger dimensions, in some cases, the existence of a foreign key to the outrigger dimension in the base dimension can result in explosive growth of the base dimension because type 2 changes in the outrigger force corresponding type 2 processing in the base dimension.
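Returning to the multi-pass SQL technique above, here is a hedged sketch of the pattern: each fact table is aggregated in its own pass, and only the summarized result sets are joined on the conformed attribute. The shipments_fact and returns_fact names and columns are assumptions for illustration.

-- Multi-pass approach: aggregate each fact table separately, then join
-- the two result sets on the conformed dimension attribute
WITH ship AS (
    SELECT product_key, SUM(quantity) AS shipped_qty
    FROM shipments_fact
    GROUP BY product_key
),
ret AS (
    SELECT product_key, SUM(quantity) AS returned_qty
    FROM returns_fact
    GROUP BY product_key
)
SELECT s.product_key, s.shipped_qty, r.returned_qty
FROM ship s
LEFT JOIN ret r ON r.product_key = s.product_key;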
Multivalued Dimensions and Bridge Tables
 In a classic dimensional schema, each dimension attached to a fact table has a single value consistent with the fact table's grain. But there are a number of situations in which a dimension is legitimately multivalued.

Time Varying Multivalued Bridge Tables
 A multivalued bridge table may need to be based on a type 2 slowly changing dimension.

Behavior Tag Time Series
 Almost all text in a data warehouse is descriptive text in dimension tables.
 Data mining customer cluster analyses typically result in textual behavior tags, often identified on a periodic basis.

Behavior Study Groups
 Complex customer behavior can sometimes be discovered only by running lengthy iterative analyses.
 The results of the complex behavior analyses, however, can be captured in a simple table, called a study group, consisting only of the customers' durable keys.

Aggregated Facts as Dimension Attributes
 Business users are often interested in constraining the customer dimension based on aggregated performance metrics, such as filtering on all customers who spent over a certain dollar amount during last year or perhaps over the customer's lifetime.
 Selected aggregated facts can be placed in a dimension as targets for constraining and as row labels for reporting.

Dynamic Value Bands
 A dynamic value banding report is organized as a series of report row headers that define a progressive set of varying-sized ranges of a target numeric fact.

Text Comments Dimension
 Rather than treating freeform comments as textual metrics in a fact table, they should be stored outside the fact table in a separate comments dimension (or as attributes in a dimension with one row per transaction if the comments' cardinality matches the number of unique transactions) with a corresponding foreign key in the fact table.

Multiple Time Zones
 To capture both universal standard time, as well as local times in multi-time zone applications, dual foreign keys should be placed in the affected fact tables that join to two role-playing date (and potentially time-of-day) dimension tables.

Measure Type Dimensions
 Sometimes when a fact table has a long list of facts that is sparsely populated in any individual row, it is tempting to create a measure type dimension that collapses the fact table row down to a single generic fact identified by the measure type dimension.
 Although it removes all the empty fact columns, it multiplies the size of the fact table by the average number of occupied columns in each row, and it makes intra-column computations much more difficult.
 This technique is acceptable when the number of potential facts is extreme (in the hundreds), but less than a handful would be applicable to any given fact table row.

Step Dimensions
 Sequential processes, such as web page events, normally have a separate row in a transaction fact table for each step in a process.
 A step dimension is used that shows what step number is represented by the current step and how many more steps were required to complete the session.

Hot Swappable Dimensions
 Hot swappable dimensions are used when the same fact table is alternatively paired with different copies of the same dimension.

Abstract Generic Dimensions
 Some modelers are attracted to abstract generic dimensions. For example, their schemas include a single generic location dimension rather than embedded geographic attributes in the store, warehouse, and customer dimensions.

Audit Dimensions
 When a fact table row is created in the ETL back room, it is helpful to create an audit dimension containing the ETL processing metadata known at the time.
 A simple audit dimension row could contain one or more basic indicators of data quality, perhaps derived from examining an error event schema that records data quality violations encountered while processing the data.
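A small, hedged sketch of what such an audit dimension might look like; all names and columns here are assumptions chosen to illustrate the idea, not a prescribed design.

-- Illustrative audit dimension: one row per ETL batch / quality profile
CREATE TABLE audit_dim (
    audit_key         INTEGER PRIMARY KEY,
    etl_batch_id      INTEGER,
    load_timestamp    TIMESTAMP,
    data_quality_flag VARCHAR(30),   -- e.g. 'Clean' or 'Screen failures present'
    missing_data_flag VARCHAR(30)
);

-- The fact table then simply carries audit_key as one more foreign key,
-- so any query result can be constrained or grouped by data quality.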

Late Arriving Dimensions
 Sometimes the facts from an operational business process arrive minutes, hours, days, or weeks before the associated dimension context.
 Late arriving dimension data also occurs when retroactive changes are made to type 2 dimension attributes. In this case, a new row needs to be inserted in the dimension table, and then the associated fact rows must be restated.

SPECIAL PURPOSE SCHEMAS

Supertype and Subtype Schemas for Heterogeneous Products
 Financial services and other businesses frequently offer a wide variety of products in disparate lines of business.
 Supertype and subtype fact tables are also called core and custom fact tables.

Real-Time Fact Tables
 Real-time fact tables need to be updated more frequently than the more traditional nightly batch process.

Error Event Schemas
 When a data quality screen detects an error, this event is recorded in a special dimensional schema that is available only in the ETL back room.
 This schema consists of an error event fact table whose grain is the individual error event and an associated error event detail fact table whose grain is each column in each table that participates in an error event.
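To illustrate the two grains just described, here is a minimal sketch of an error event schema; every table and column name is an illustrative assumption.

-- One row per error event raised by a data quality screen
CREATE TABLE error_event_fact (
    error_event_key INTEGER PRIMARY KEY,
    date_key        INTEGER,
    screen_key      INTEGER,      -- which data quality screen fired
    batch_key       INTEGER,
    error_count     INTEGER
);

-- One row per column, per table, participating in an error event
CREATE TABLE error_event_detail_fact (
    error_event_key INTEGER REFERENCES error_event_fact (error_event_key),
    table_name      VARCHAR(100),
    column_name     VARCHAR(100),
    offending_value VARCHAR(200)
);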
CHAPTER 3 - RETAIL SALES

Four-Step Dimensional Design Process

Step 1: Select the Business Process
A business process is a low-level activity performed by an organization, such as taking orders, invoicing, receiving payments, handling service calls, registering students, performing a medical procedure, or processing claims. Several common characteristics:
 Business processes are frequently expressed as action verbs because they represent activities that the business performs. The companion dimensions describe descriptive context associated with each business process event.
 Business processes are typically supported by an operational system, such as the billing or purchasing system.
 Business processes generate or capture key performance metrics. Sometimes the metrics are a direct result of the business process; the measurements are derivations at other times. Analysts invariably want to scrutinize and evaluate these metrics by a seemingly limitless combination of filters and constraints.
 Business processes are usually triggered by an input and result in output metrics. In many organizations, there's a series of processes in which the outputs from one process become the inputs to the next. In the parlance of a dimensional modeler, this series of processes results in a series of fact tables.

Step 2: Declare the Grain
- Declaring the grain means specifying exactly what an individual fact table row represents.
- The grain conveys the level of detail associated with the fact table measurements.
- It provides the answer to the question, "How do you describe a single row in the fact table?"

Step 3: Identify the Dimensions
- Dimensions fall out of the question, "How do business people describe the data resulting from the business process measurement events?"

Step 4: Identify the Facts
- Facts are determined by answering the question, "What is the process measuring?"
- Facts that clearly belong to a different grain must be in a separate fact table.
- Typical facts are numeric additive figures, such as quantity ordered or dollar cost amount.

Figure 3-1: Key input to the four-step dimensional design process.

RETAIL CASE STUDY

Note: The point-of-sale (POS) system scans product barcodes at the cash register, measuring consumer takeaway at the front door of the grocery store.

Step 1: Select the Business Process
The first step in the design is to decide what business process to model by combining an understanding of the business requirements with an understanding of the available source data.

NOTE: The first DW/BI project should focus on the business process that is both the most critical to the business users, as well as the most feasible. Feasibility covers a range of considerations, including data availability and quality, as well as organizational readiness.

Step 2: Declare the Grain
After the business process has been identified, the design team faces a serious decision about the granularity. What level of data detail should be made available in the dimensional model? Atomic data provides maximum analytic flexibility because it can be constrained and rolled up in every way possible.

NOTE: You should develop dimensional models representing the most detailed, atomic information captured by a business process.
NOTE: A DW/BI system almost always demands data expressed at the lowest possible grain, not because queries want to see individual rows but because queries need to cut through the details in very precise ways.

Step 3: Identify the Dimensions
After the grain of the fact table has been chosen, the choice of dimensions is straightforward.

NOTE: A careful grain statement determines the primary dimensionality of the fact table. You then add more dimensions to the fact table if these additional dimensions naturally take on only one value under each combination of the primary dimensions. If the additional dimension violates the grain by causing additional fact rows to be generated, the dimension needs to be disqualified or the grain statement needs to be revisited.

Step 4: Identify the Facts
The fourth and final step in the design is to make a careful determination of which facts will appear in the fact table.

DERIVED FACTS

Non-Additive Facts
 Gross margin can be calculated by dividing the gross profit by the extended sales dollar revenue.
 Gross margin is a non-additive fact because it can't be summarized along any dimension.

NOTE: Percentages and ratios, such as gross margin, are non-additive. The numerator and denominator should be stored in the fact table. The ratio can then be calculated in a BI tool for any slice of the fact table by remembering to calculate the ratio of the sums, not the sum of the ratio. Unit price is another non-additive fact.
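A small sketch of the ratio-of-the-sums rule for gross margin, assuming illustrative gross_profit_dollar_amount and extended_sales_dollar_amount columns on the sales fact table.

-- Correct: sum the additive numerator and denominator first, then divide
SELECT p.brand,
       SUM(f.gross_profit_dollar_amount)
         / SUM(f.extended_sales_dollar_amount) AS gross_margin
FROM sales_fact f
JOIN product_dim p ON f.product_key = p.product_key
GROUP BY p.brand;

-- Averaging a precomputed per-row margin (the sum or average of the
-- ratios) would give a different, misleading answer.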

DIMENSION TABLE DETAILS

Date Dimension

 a special dimension because it is the one


dimension nearly guaranteed to be in
every dimensional model since virtually
every business process
 captures a time series of performance
metrics.

NOTE: Dimensional models always need an


DERIVED FACTS explicit date dimension table. There are many date
attributes not supported by the SQL date function,
Non-Additive Facts
including week numbers, fiscal periods, seasons,
 Gross margin can be calculated by dividing holidays, and weekends. Rather than attempting to
the gross profit by the extended sales dollar determine these nonstandard calendar calculations
revenue. in a query, you should look them up in a date
 Gross margin is a non-additive fact because it dimension table.
can’t be summarized along any dimension.
Flags and Indicators as Textual Attributes
NOTE: Percentages and ratios, such as gross
 Like many operational flags and
margin, are non-additive. The numerator and
indicators, the date dimension’s holiday
denominator should be stored in the fact table. The
indicator is a simple indicator with two Numeric Values as Attributes or Facts
potential values.
 You will sometimes encounter numeric
Current and Relative Date Attributes values that don’t clearly fall into either
the fact or dimension attribute categories.
Most date dimension attributes are not subject to A classic example is the standard list
updates. June 1, 2013 will always roll up to June, price for a product. It’s definitely a
Calendar Q2, and 2013. However, there are numeric value, so the initial instinct is to
attributes you can add to the basic date dimension place it in the fact table. But, typically the
that will change over time, including standard price changes infrequently,
IsCurrentDay, IsCurrentMonth, IsPrior60Days, unlike most facts that are often differently
and so on. valued on every measurement event.
Time-of-Day as a Dimension or Fact  Sometimes numeric values serve both
calculation and filtering/grouping
 Although date and time are comingled in an functions.
operational date/time stamp, time-of day is
typically separated from the date dimension to NOTE Data elements that are used both for fact
avoid a row count explosion in the date dimension calculations and dimension constraining,
grouping, and labeling should be stored in both
Product Dimension locations, even though a clever programmer could
write applications that access these data elements
 The product dimension describes every
from a single location. It is important that dimensional models be as consistent as possible and that application development be predictably simple. Data involved in calculations should be in fact tables, and data involved in constraints, groups, and labels should be in dimension tables.

SKU in the grocery store.
 The product dimension is almost always sourced from the operational product master file.
 It is headquarters' responsibility to define the appropriate product master record (and unique SKU number) for each new product.

Flatten Many-to-One Hierarchies

 The product dimension represents the many descriptive attributes of each SKU. The merchandise hierarchy is an important group of attributes.
 Each of these is a many-to-one relationship.

NOTE: Keeping the repeated low cardinality values in the primary dimension table is a fundamental dimensional modeling technique. Normalizing these values into separate tables defeats the primary goals of simplicity and performance, as discussed in "Resisting Normalization Urges" later in this chapter.

Attributes with Embedded Meaning

 Often operational product codes, identified in the dimension table by the NK notation for natural key, have embedded meaning, with different parts of the code representing significant characteristics of the product.

Drilling Down on Dimension Attributes

 Drilling down is nothing more than asking for a row header from a dimension that provides more information.
 Drilling down in a dimensional model is nothing more than adding row header attributes from the dimension tables. Drilling up is removing row headers. You can drill down or up on attributes from more than one explicit hierarchy and with attributes that are part of no hierarchy.

Note: The product dimension is a common dimension in many dimensional models.

Store Dimension

 The store dimension describes every store in the grocery chain. Unlike the product master file that is almost guaranteed to be available in every large grocery business, there may not be a comprehensive store master file.

Multiple Hierarchies in Dimension Tables

 The store dimension is the case study's primary geographic dimension. Each store can be thought of as a location. You can roll stores up to any geographic attribute, such as ZIP code, county, and state in the United States. Contrary to popular belief, cities and states within the United States are not a hierarchy. Since many states have identically named cities, you'll want to include a City-State attribute in the store dimension.

NOTE: It is not uncommon to represent multiple hierarchies in a dimension table. The attribute names and values should be unique across the multiple hierarchies.

Dates Within Dimension Tables

 The first open date and last remodel date in the store dimension could be date type columns.
 These date dimension copies are declared in SQL by the view construct and are semantically distinct from the primary date dimension.

Now the system acts as if there is another physical copy of the date dimension table called FIRST_OPEN_DATE.

The first open date view is a permissible outrigger to the store dimension.
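The view-based date outrigger described above can be sketched in SQL as follows; this is a minimal illustration, and the table and column names (date_dim, date_key, full_date, and so on) are assumptions rather than the case study's actual schema.

    CREATE VIEW first_open_date AS
    SELECT
        date_key    AS first_open_date_key,
        full_date   AS first_open_full_date,
        month_name  AS first_open_month,
        year_number AS first_open_year
    FROM date_dim;

    -- The store dimension joins to this view through its own
    -- first_open_date_key column, keeping the outrigger's labels distinct
    -- from the primary date dimension attached to the fact table.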
Promotion Dimension

 is potentially the most interesting dimension in the retail sales schema.
 describes the promotion conditions under which a product is sold.
 this dimension is often called a causal dimension because it describes factors thought to cause a change in product sales.

Promotions are judged on one or more of the following factors:

 Whether the products under promotion experienced a gain in sales, called lift, during the promotional period. The lift can be measured only if the store can agree on what the baseline sales of the promoted products would have been without the promotion. Baseline values can be estimated from prior sales history and, in some cases, with the help of sophisticated models.
 Whether the products under promotion showed a drop in sales just prior to or after the promotion, canceling the gain in sales during the promotion (time shifting). In other words, did you transfer sales from regularly priced products to temporarily reduced priced products?
 Whether the products under promotion showed a gain in sales but other products nearby on the shelf showed a corresponding sales decrease (cannibalization).
 Whether all the products in the promoted category of products experienced a net overall gain in sales taking into account the time periods before, during, and after the promotion (market growth).
 Whether the promotion was profitable. Usually, the profit of a promotion is taken to be the incremental gain in profit of the promoted category over the baseline sales, taking into account time shifting and cannibalization, as well as the costs of the promotion.

The trade-offs in favor of keeping the four causal dimensions together include the following:

 If the four causal mechanisms are highly correlated, the combined single dimension is not much larger than any one of the separated dimensions would be.
 The combined single dimension can be browsed efficiently to see how the various price reductions, ads, displays, and coupons are used together. However, this browsing only shows the possible promotion combinations. Browsing in the dimension table does not reveal which stores or products were affected by the promotion; this information is found in the fact table.

The trade-offs in favor of separating the causal mechanisms into four distinct dimension tables include the following:

 The separated dimensions may be more understandable to the business community if users think of these mechanisms separately. This would be revealed during the business requirement interviews.
 Administration of the separate dimensions may be more straightforward than administering a combined dimension.

Keep in mind there is no difference in the content between these two choices.

Note: The inclusion of a promotion cost attribute in the promotion dimension should be done with careful thought. This attribute can be used for constraining and grouping. However, this cost should not appear in the POS transaction fact table representing individual product sales because it is at the wrong grain; this cost would have to reside in a fact table whose grain is the overall promotion.
Null Foreign Keys, Attributes, and Facts

 The promotion dimension must include a row, with a unique key such as 0 or –1, to identify the no promotion condition and avoid a null promotion key in the fact table.
 Referential integrity is violated if you put a null in a fact table column declared as a foreign key to a dimension table. In addition to the referential integrity alarms, null keys are the source of great confusion to users because they can't join on null keys.

WARNING: You must avoid null keys in the fact table. A proper design includes a row in the corresponding dimension table to identify that the dimension is not applicable to the measurement.

Null values essentially disappear in pull-down menus of possible attribute values or in report groupings; special syntax is required to identify them.

Finally, we can also encounter nulls as metrics in the fact table. We generally leave these null so that they're properly handled in aggregate functions such as SUM, MIN, MAX, COUNT, and AVG, which do the "right thing" with nulls. Substituting a zero instead would improperly skew these aggregated calculations.

 Data mining tools may use different techniques for tracking nulls.
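As a minimal sketch of the "no promotion" row discussed above (table and column names are illustrative assumptions, not the case study's schema):

    -- Seed the special row once so fact rows never carry a null promotion key.
    INSERT INTO promotion_dim (promotion_key, promotion_name, price_reduction_type, ad_type)
    VALUES (0, 'No Promotion', 'None', 'None');

    -- During the fact load, map missing promotions to key 0 instead of null.
    INSERT INTO sales_fact (date_key, product_key, store_key, promotion_key, sales_quantity)
    SELECT date_key, product_key, store_key, COALESCE(promotion_key, 0), sales_quantity
    FROM staging_sales;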
Other Retail Sales Dimensions

 Any descriptive attribute that takes on a single value in the presence of a fact table measurement event is a good candidate to be added to an existing dimension or be its own dimension.

Degenerate Dimensions for Transaction Numbers

 Although the POS transaction number looks like a dimension key in the fact table, the descriptive items that might otherwise fall in a POS transaction dimension have been stripped off. Because the resulting dimension is empty, we refer to the POS transaction number as a degenerate dimension.
 Degenerate dimensions are very common when the grain of a fact table represents a single transaction or transaction line because the degenerate dimension represents the unique identifier of the parent.
 Degenerate dimensions often play an integral role in the fact table's primary key.
 Order numbers, invoice numbers, and bill-of-lading numbers almost always appear as degenerate dimensions in a dimensional model.

NOTE: Operational transaction control numbers such as order numbers, invoice numbers, and bill-of-lading numbers usually give rise to empty dimensions and are represented as degenerate dimensions in transaction fact tables. The degenerate dimension is a dimension key without a corresponding dimension table.

The predictable symmetry of dimensional models enables them to absorb some rather significant changes in source data and/or modeling assumptions without invalidating existing BI applications, including:

 New dimension attributes
 New dimensions
 New measured facts

Factless Fact Tables

 This fact table enables you to see the relationship between the keys as defined by a promotion, independent of other events, such as actual product sales.
 it has no measurement metrics; it merely captures the relationship between the involved keys.
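To show how the degenerate dimension described above is used in practice, here is a hedged sketch that computes the average number of line items per POS transaction; the table and column names are assumptions.

    SELECT AVG(line_count) AS avg_lines_per_transaction
    FROM (
        -- Group on the degenerate dimension carried directly in the fact table.
        SELECT pos_transaction_number, COUNT(*) AS line_count
        FROM sales_fact
        GROUP BY pos_transaction_number
    ) per_transaction;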
DIMENSION AND FACT TABLE KEYS

Dimension Table Surrogate Keys

 The unique primary key of a dimension table should be a surrogate key rather than relying on the operational system identifier, known as the natural key.

Surrogate keys

 also called meaningless keys, integer keys, non-natural keys, artificial keys, and synthetic keys.
 Surrogate keys are simply integers that are assigned sequentially as needed to populate a dimension.
 The actual surrogate key value has no business significance.
 The surrogate keys merely serve to join the dimension tables to the fact table.
 column names with a key suffix, identified as a primary key (PK) or foreign key (FK), imply a surrogate.

NOTE: Every join between dimension and fact tables in the data warehouse should be based on meaningless integer surrogate keys. You should avoid using a natural key as the dimension table's primary key.

Here are several advantages:

 Buffer the data warehouse from operational changes
 Integrate multiple source systems
 Improve performance - The surrogate key is as small an integer as possible while ensuring it will comfortably accommodate the future anticipated cardinality (number of rows in the dimension).
 Handle null or unknown conditions
 Support dimension attribute change tracking

Dimension Natural and Durable Supernatural Keys

Natural Keys

 assigned and used by operational source systems; go by other names, such as business keys, production keys, and operational keys.
 often modeled as an attribute in the dimension table.
 Operational natural keys are often composed of meaningful constituent parts, such as the product's line of business or country of origin; these components should be split apart and made available as separate attributes.

Supernatural keys

 If the dimension's natural keys are not absolutely protected and preserved over time, the ETL system needs to assign permanent durable identifiers.
 controlled by the DW/BI system and remain immutable for the life of the system.

Degenerate Dimension Surrogate Keys

 Although surrogate keys aren't typically assigned to degenerate dimensions, each situation needs to be evaluated to determine if one is required.
 A surrogate key is necessary if the transaction control numbers are not unique across locations or get reused.

Date Dimension Smart Keys

 The date dimension has unique characteristics and requirements.
 Calendar dates are fixed and predetermined; you never need to worry about deleting dates or handling new, unexpected dates on the calendar.
 More commonly, the primary key of the date dimension is a meaningful integer formatted as yyyymmdd.

yyyymmdd key

 is not intended to provide business users and their BI applications with an intelligent key so they can bypass the date dimension and directly query the fact table.
 useful for partitioning fact tables.
 Partitioning enables a table to be segmented into smaller tables under the covers. Partitioning a large fact table on the basis of date is effective because it allows old data to be removed gracefully and new data to be loaded and indexed in the current partition without disturbing the rest of the fact table.
 Using a smart yyyymmdd key provides the benefits of a surrogate, plus the advantages of easier partition management.
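A minimal sketch of the yyyymmdd smart key and date-based partitioning; partitioning syntax varies by DBMS (PostgreSQL-style declarative partitioning shown), and all names are illustrative.

    CREATE TABLE date_dim (
        date_key    INTEGER PRIMARY KEY,   -- e.g., 20240131 in yyyymmdd form
        full_date   DATE NOT NULL,
        month_name  VARCHAR(20),
        year_number SMALLINT
    );

    CREATE TABLE sales_fact (
        date_key      INTEGER NOT NULL REFERENCES date_dim (date_key),
        product_key   INTEGER NOT NULL,
        store_key     INTEGER NOT NULL,
        sales_dollars NUMERIC(12,2)
    ) PARTITION BY RANGE (date_key);

    -- Old partitions can be dropped gracefully; new data loads into the
    -- current partition without disturbing the rest of the fact table.
    CREATE TABLE sales_fact_2024_01 PARTITION OF sales_fact
        FOR VALUES FROM (20240101) TO (20240201);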
Fact Table Surrogate Keys

 Fact table surrogate keys typically only make sense for back room ETL processing.
 a fact table surrogate key is a simple integer, devoid of any business content, that is assigned in sequence as fact table rows are generated.

Although the fact table surrogate key is unlikely to deliver query performance advantages, it does have the following benefits:

 Immediate unique identification
 Backing out or resuming a bulk load
 Replacing updates with inserts plus deletes
 Using the fact table surrogate key as a parent in a parent/child schema

Snowflake Schemas with Normalized Dimensions

 The flattened, denormalized dimension tables with repeating textual values make data modelers from the operational world uncomfortable.
 In a snowflake schema, redundant attributes are removed from the flat, denormalized dimension table and placed in separate normalized dimension tables.
 Snowflaking is a legal extension of the dimensional model; however, we encourage you to resist the urge to snowflake given the two primary design drivers: ease of use and performance.

A multitude of snowflaked tables makes for a much more complex presentation. Business users inevitably will struggle with the complexity; simplicity is one of the primary objectives of a dimensional model.

Bitmap indexes are useful when indexing low-cardinality columns, such as the category and department attributes in the product dimension table.

NOTE: Fixed depth hierarchies should be flattened in dimension tables. Normalized, snowflaked dimension tables penalize cross-attribute browsing and prohibit the use of bitmapped indexes. Disk space savings gained by normalizing the dimension tables typically are less than 1 percent of the total disk space needed for the overall schema. You should knowingly sacrifice this dimension table space in the spirit of performance and ease of use advantages.

Outriggers

 Although we generally do not recommend snowflaking, there are situations in which it is permissible to build an outrigger dimension that attaches to a dimension within the fact table's immediate halo.
 the "once removed" outrigger is a date dimension snowflaked off a primary dimension.
 The outrigger date attributes are descriptively and uniquely labeled to distinguish them from the other dates associated with the business process.

WARNING: Though outriggers are permissible, a dimensional model should not be littered with outriggers given the potentially negative impact. Outriggers should be the exception rather than the rule.

Centipede Fact Tables with Too Many Dimensions

 The fact table in a dimensional schema is naturally highly normalized and compact. There is no way to further normalize the extremely complex many-to-many relationships among the keys in the fact table because the dimensions are not correlated with each other.
 Fact tables with too many dimensions are called centipede fact tables because they appear to have nearly 100 legs.

NOTE: A very large number of dimensions typically is a sign that several dimensions are not completely independent and should be combined into a single dimension. It is a dimensional modeling mistake to represent elements of a single hierarchy as separate dimensions in the fact table.
CHAPTER 4 – INVENTORY

Value Chain Introduction

 Most organizations have an underlying value chain of key business processes.
 The value chain identifies the natural, logical flow of an organization's primary activities.

Operational source systems typically produce transactions or snapshots at each step of the value chain. The primary objective of most analytic DW/BI systems is to monitor the performance results of these key processes. Because each process produces unique metrics at unique time intervals with unique granularity and dimensionality, each process typically spawns one or more fact tables.

Inventory Models

 The first is the inventory periodic snapshot, where product inventory levels are measured at regular intervals and placed as separate rows in a fact table.
 These periodic snapshot rows appear over time as a series of data layers in the dimensional model, much like geologic layers represent the accumulation of sediment over long periods of time.

Inventory Periodic Snapshot

 The dimensions immediately fall out of this grain declaration: date, product, and store.
 The periodic snapshot is the most common inventory schema.

Semi-Additive Facts

 are additive across some dimensions but not all.
 the semi-additive nature of inventory balance facts is even more understandable if you think about your checking account balances.

NOTE: All measures that record a static level (inventory levels, financial account balances, and measures of intensity such as room temperatures) are inherently non-additive across the date dimension and possibly other dimensions. In these cases, the measure may be aggregated across dates by averaging over the number of time periods.

OLAP products provide the capability to define aggregation rules within the cube, so semi-additive measures like balances are less problematic if the data is deployed via OLAP cubes.

Enhanced Inventory Facts

 Notice that quantity on hand is semi-additive, but the other measures in the enhanced periodic snapshot are all fully additive.

Inventory Transactions

Each inventory transaction identifies the date, product, warehouse, vendor, transaction type, and in most cases, a single amount representing the inventory quantity impact caused by the transaction.

NOTE: Remember there's more to life than transactions alone. Some form of a snapshot table to give a more cumulative view of a process often complements a transaction fact table.

If performance measurements have different natural granularity or dimensionality, they likely result from separate processes that should be modeled as separate fact tables.

Inventory Accumulating Snapshot

 The final inventory model is the accumulating snapshot.
 Accumulating snapshots are used for processes that have a definite beginning, definite end, and identifiable milestones in between.
 The accumulating snapshot fact table provides an updated status of the lot as it moves through standard milestones represented by multiple date-valued foreign keys.
 Each accumulating snapshot fact table row is updated repeatedly until the products received in a lot are completely depleted from the warehouse.
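Tying back to the semi-additive discussion above, an inventory balance can be summed across products or stores but should be averaged across dates; a hedged sketch with assumed table and column names:

    -- Average daily quantity on hand per product for January 2024: sum the
    -- balances, then divide by the number of snapshot dates rather than
    -- summing across time.
    SELECT p.product_name,
           SUM(f.quantity_on_hand) * 1.0 / COUNT(DISTINCT f.date_key) AS avg_quantity_on_hand
    FROM inventory_snapshot_fact f
    JOIN product_dim p ON p.product_key = f.product_key
    WHERE f.date_key BETWEEN 20240101 AND 20240131
    GROUP BY p.product_name;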
Fact Table Types

There are just three fundamental types of fact tables: transaction, periodic snapshot, and accumulating snapshot.

Figure 4-7: Fact table type comparisons.

Transaction Fact Tables

 the most fundamental view of the business's operations is at the individual transaction or transaction line level.
 These fact tables represent an event that occurred at an instantaneous point in time.

Periodic Snapshot Fact Tables

 Periodic snapshots are needed to see the cumulative performance of the business at regular, predictable time intervals.
 with the periodic snapshot, you take a picture (hence the snapshot terminology) of the activity at the end of a day, week, or month, then another picture at the end of the next period, and so on.
 represents an aggregation of the transactional activity that occurred during a time period.
 The fact tables share many dimension tables; the snapshot usually has fewer dimensions overall. Conversely, there are usually more facts in a summarized periodic snapshot table than in a transactional table because any activity that happens during the period is fair game for a metric in a periodic snapshot.

Accumulating Snapshot Fact Tables

 the third type of fact table.
 represent processes that have a definite beginning and definite end together with a standard set of intermediate process steps.
 are most appropriate when business users want to perform workflow or pipeline analysis.
 always have multiple date foreign keys, representing the predictable major events or process milestones; sometimes there's an additional date column that indicates when the snapshot row was last updated.

Lags Between Milestones and Milestone Counts

 Because accumulating snapshots often represent the efficiency and elapsed time of a workflow or pipeline, the fact table typically contains metrics representing the durations or lags between key milestones.
 Sometimes the lag metrics are simply the raw difference between the milestone dates or date/time stamps.
 In other situations, the lag calculation is made more complicated by taking workdays and holidays into consideration.

Accumulating Snapshot Updates and OLAP Cubes

 Unlike the periodic snapshot where the prior snapshots are preserved, the accumulating snapshot merely reflects the most current status and metrics.
 Accumulating snapshots do not attempt to accommodate complex scenarios that occur infrequently.

Complementary Fact Table Types

 Sometimes accumulating and periodic snapshots work in conjunction with one another, such as when you incrementally build the monthly snapshot by adding the effect of each day's transactions to a rolling accumulating snapshot while also storing 36 months of historical data in a periodic snapshot.
 Transactions and snapshots are the yin and yang of dimensional designs. Used together, companion transaction and snapshot fact tables provide a complete view of the business. Both are needed because there is often no simple way to combine these two contrasting perspectives in a single fact table.

Value Chain Integration

 Both business and IT organizations are typically interested in value chain integration.
 Business management needs to look across the business's processes to better evaluate performance.
Enterprise Data Warehouse Bus Architecture

 For long-term DW/BI success, you need to use an architected, incremental approach to build the enterprise's warehouse.

Understanding the Bus Architecture

 the word bus is not shorthand for business; it's an old term from the electrical power industry that is now used in the computer industry.
 is a common structure to which everything connects and from which everything derives power.

NOTE: By defining a standard bus interface for the DW/BI environment, separate dimensional models can be implemented by different groups at different times. The separate business process subject areas plug together and usefully coexist if they adhere to the standard.

Figure 4-9: Enterprise data warehouse bus with shared dimensions.

The enterprise data warehouse bus architecture provides a rational approach to decomposing the enterprise DW/BI planning task.

Enterprise Data Warehouse Bus Matrix

 Others have renamed the bus matrix, such as the conformance or event matrix, but these are merely synonyms for this fundamental Kimball concept first introduced in the 1990s.

Figure 4-10: Sample enterprise data warehouse bus matrix for a retailer.

Profitability is a classic example of a consolidated process in which separate revenue and cost factors are combined from different processes to provide a complete view of profitability.

The columns of the bus matrix represent the common dimensions used across the enterprise. It is often helpful to create a list of core dimensions before filling in the matrix to assess whether a given dimension should be associated with a business process.

Multiple Matrix Uses

 Creating the enterprise data warehouse bus matrix is one of the most important DW/BI implementation deliverables.
 The matrix enables you to communicate effectively within and across data governance and DW/BI teams.
 The matrix is a succinct deliverable that visually conveys the master plan.
Opportunity/Stakeholder Matrix

 Based on each function's requirements, the matrix cells are shaded to indicate which business functions are interested in which business processes (and projects).

Common Bus Matrix Mistakes

 Departmental or overly encompassing rows
 Report-centric or too narrowly defined rows

When defining the matrix columns, architects naturally fall into the similar traps of defining columns that are either too broad or too narrow:

 Overly generalized columns - A "person" column on the bus matrix may refer to a wide variety of people, from internal employees to external suppliers and customer contacts.
 Separate columns for each level of a hierarchy - The columns of the bus matrix should refer to dimensions at their most granular level.
Retrofitting Existing Models to a Bus Matrix

 It is unacceptable to build separate dimensional models that ignore a framework tying them together. Isolated, independent dimensional models are worse than simply a lost opportunity for analysis. They deliver access to irreconcilable views of the organization and further enshrine the reports that cannot be compared with one another. Independent dimensional models become legacy implementations in their own right; by their existence, they block the development of a coherent DW/BI environment.

Conformed Dimensions

 Conformed dimensions go by many other aliases: common dimensions, master dimensions, reference dimensions, and shared dimensions.
 Conformed dimensions should be built once in the ETL system and then replicated either logically or physically throughout the enterprise DW/BI environment.

Drilling Across Fact Tables

 The full outer-join ensures all rows are included in the combined report, even if they only appear in one set of query results. This linkage, often referred to as drill across, is straightforward if the dimension table attribute values are identical.
 Drilling across is supported by many BI products and platforms.

Identical Conformed Dimensions

 conformed dimensions mean the same thing with every possible fact table to which they are joined.
 Identical conformed dimensions have consistent dimension keys, attribute column names, attribute definitions, and attribute values (which translate into consistent report labels and groupings).
 two dimensional models may be the same physical table within the database.

Shrunken Rollup Conformed Dimension with Attribute Subset

 Shrunken rollup dimensions are required when a fact table captures performance metrics at a higher level of granularity than the atomic base dimension.

NOTE: Shrunken rollup dimensions conform to the base atomic dimension if the attributes are a strict subset of the atomic dimension's attributes.

Shrunken Conformed Dimension with Row Subset

 Another case of conformed dimension sub-setting occurs when two dimensions are at the same level of detail, but one represents only a subset of rows. By using a subset of rows, they aren't encumbered with the corporation's entire product set. Of course, the fact table joined to this sub-setted dimension must be limited to the same subset of products. If a user attempts to use a shrunken subset dimension while accessing a fact table consisting of the complete product set, they may encounter unexpected query results because referential integrity would be violated.
 Conformed date and month dimensions are a unique example of both row and column dimension sub-setting.

Shrunken Conformed Dimensions on the Bus Matrix

 The bus matrix identifies the reuse of common dimensions across business processes. Typically, the shaded cells of the matrix indicate that the atomic dimension is associated with a given process.

There are two viable approaches to represent the shrunken dimensions within the matrix:

 Mark the cell for the atomic dimension, but then textually document the rollup or row subset granularity within the cell.
 Subdivide the dimension column to indicate the common rollup or subset granularities, such as day and month if processes collect data at both of these grains.
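To make the drill-across linkage described under "Drilling Across Fact Tables" above concrete, here is a minimal sketch: each fact table is queried separately on a conformed product attribute, and the result sets are merged with a full outer join. The table and column names are assumptions.

    SELECT COALESCE(s.product_name, i.product_name) AS product_name,
           s.sales_dollars,
           i.avg_quantity_on_hand
    FROM (SELECT p.product_name, SUM(f.sales_dollars) AS sales_dollars
          FROM sales_fact f
          JOIN product_dim p ON p.product_key = f.product_key
          GROUP BY p.product_name) s
    FULL OUTER JOIN
         (SELECT p.product_name, AVG(f.quantity_on_hand) AS avg_quantity_on_hand
          FROM inventory_snapshot_fact f
          JOIN product_dim p ON p.product_key = f.product_key
          GROUP BY p.product_name) i
      ON s.product_name = i.product_name;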
Importance of Data Governance and Stewardship

 In many organizations, business rules and data definitions have traditionally been established departmentally. The consequences of this commonly encountered lack of data governance and control are the ubiquitous departmental data silos that perpetuate similar but slightly different versions of the truth. Business and IT management need to recognize the importance of addressing this shortfall if you stand any chance of bringing order to the chaos; if management is reluctant to drive change, the project will never achieve its goals.

Business-Driven Governance

 Leading a cross-organizational governance program is not for the faint of heart. The governance resources identified by business leadership should have the following characteristics:
 Respect from the organization
 Broad knowledge of the enterprise's operations
 Ability to balance organizational needs against departmental requirements
 Gravitas and authority to challenge the status quo and enforce policies
 Strong communication skills
 Politically savvy negotiation and consensus building skills

Governance Objectives

 One of the key objectives of the data governance function is to reach agreement on data definitions, labels, and domain values so that everyone is speaking the same language.
 the data governance function also establishes policies and responsibilities for data quality and accuracy, as well as data security and access controls.
 A strong data governance function is a necessary prerequisite for conforming information regardless of technical approach.

Conformed Dimensions and the Agile Movement

 Conformed dimensions allow a dimension table to be built and maintained once rather than re-creating slightly different versions during each development cycle. Reusing conformed dimensions across projects is where you get the leverage for more agile DW/BI development.
 If you fail to focus on conformed dimensions because you're under pressure to deliver something yesterday, the departmental analytic data silos will likely have inconsistent categorizations and labels.

Conformed Facts

 Revenue, profit, standard prices and costs, measures of quality and customer satisfaction, and other key performance indicators (KPIs) are facts that must also conform.
 Sometimes a fact has a natural unit of measure in one fact table and another natural unit of measure in another fact table.

NOTE: You must be disciplined in your data naming practices. If it is impossible to conform a fact exactly, you should give different names to the different interpretations so that business users do not combine these incompatible facts in calculations.
CHAPTER 5 – PROCUREMENT

Figure 5-1: Procurement fact table with multiple transaction types.

 The procurement transaction type dimension enables grouping or filtering on transaction types, such as purchase orders.
 The contract number is a degenerate dimension; it could be used to determine the volume of business conducted under each negotiated contract.

Single Versus Multiple Transaction Fact Tables

 A single fact table may be the most appropriate solution in some situations, whereas multiple fact tables are most appropriate in others. When faced with this design decision, the following considerations help sort out the options:
What are the users' analytic requirements?
Are there really multiple unique business processes?
Are multiple source systems capturing metrics with unique granularities?
What is the dimensionality of the facts?

Slowly Changing Dimension Basics

NOTE: The business's data governance and stewardship representatives must be actively involved in decisions regarding the handling of slowly changing dimension attributes; IT shouldn't make determinations on its own.

 Since Ralph Kimball first introduced the notion of slowly changing dimensions in 1995, some IT professionals in a never-ending quest to speak in acronym-ese termed them SCDs. The acronym stuck.
 For each dimension table attribute, you must specify a strategy to handle change. In other words, when an attribute value changes in the operational world, how will you respond to the change in the dimensional model?

Type 0: Retain Original

 the dimension attribute value never changes, so facts are always grouped by this original value.
 is appropriate for any attribute labeled "original," such as customer original credit score. It also applies to most attributes in a date dimension.
 Persistent durable keys are always type 0 attributes.

Type 1: Overwrite

 with the slowly changing dimension type 1 response, you overwrite the old attribute value in the dimension row, replacing it with the current value; the attribute always reflects the most recent assignment.
 The type 1 response is the simplest approach for dimension attribute changes.
 The problem with a type 1 response is that you lose all history of attribute changes.

NOTE: The type 1 response is easy to implement, but it does not maintain any history of prior attribute values.

WARNING: Even though type 1 changes appear the easiest to implement, remember they invalidate relational tables and OLAP cubes that have aggregated data over the affected attribute.

Type 2: Add New Row

 A type 2 response is the predominant technique for supporting this requirement when it comes to slowly changing dimension attributes.
 With type 2 changes, the fact table is again untouched; you don't go back to the historical fact table rows to modify the product key.
 Unlike the type 1 approach, there is no need to revisit preexisting aggregation tables when using the type 2 technique. Likewise, OLAP cubes do not need to be reprocessed if hierarchical attributes are handled as type 2.
NOTE: The type 2 response is the primary workhorse technique for accurately tracking slowly changing dimension attributes. It is extremely powerful because the new dimension row automatically partitions history in the fact table.

 Type 2 is the safest response if the business is not absolutely certain about the SCD business rules for an attribute.

Type 2 Effective and Expiration Dates

 The effective and expiration dates refer to the moment when the row's attribute values become valid or invalid.
 Effective and expiration dates or date/time stamps are necessary in the ETL system because it needs to know which surrogate key is valid when loading historical fact rows.
 The effective and expiration dates support precise time slicing of the dimension; however, there is no need to constrain on these dates in the dimension table to get the right answer from the fact table.

Type 1 Attributes in Type 2 Dimensions

 When type 1 and type 2 are both used in a dimension, sometimes a type 1 attribute change necessitates updating multiple dimension rows.

Type 3: Add New Attribute

 With a type 3 response, you do not issue a new dimension row, but rather add a new column to capture the attribute change.
 Type 3 is distinguished from type 2 because the pair of current and prior attribute values are regarded as true at the same time.

NOTE: The type 3 slowly changing dimension technique enables you to see new and historical fact data by either the new or prior attribute values, sometimes called alternate realities.

 Type 3 is not useful for attributes that change unpredictably, such as a customer's home state.
 Type 3 is most appropriate when there's a significant change impacting many rows in the dimension table, such as a product line or sales force reorganization.

Type 4: Add Mini-Dimension

 The solution is to break off frequently analyzed or frequently changing attributes into a separate dimension, referred to as a mini-dimension.
 the attributes in the mini-dimension are typically forced to take on a relatively small number of discrete values.
 The mini-dimension delivers performance benefits by providing a smaller point of entry to the facts.

Type 5: Mini-Dimension and Type 1 Outrigger

 An embellishment to this technique is to add a current mini-dimension key as an attribute in the primary dimension. This mini-dimension key reference is a type 1 attribute, overwritten with every profile change.
 The type 5 technique is useful if you want a current profile count in the absence of fact table metrics or want to roll up historical facts based on the customer's current profile.

NOTE: The type 4 mini-dimension terminology refers to when the demographics key is part of the fact table composite key. If the demographics key is a foreign key in the customer dimension, it is referred to as an outrigger.

Type 6: Add Type 1 Attributes to Type 2 Dimension

 With type 6, you would have two department attributes on each row.
 The current department column represents the current assignment; the historic department column is a type 2 attribute representing the historically accurate department value.
 An engineer at a technology company suggested we refer to this combo approach as type 6 because both the sum and product of 1, 2, and 3 equals 6.

Type 7: Dual Type 1 and Type 2 Dimensions

 In this final hybrid technique, the dimension natural key (assuming it's durable) is included as a fact table foreign key, in addition to the surrogate key for type 2 tracking.
 This approach delivers the same functionality as type 6.
 Type 7 invariably requires less ETL effort because the current type 1 attribute table could easily be delivered via a view of the type 2 dimension table, limited to the most current rows.
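A minimal sketch of the type 2 mechanics described above, using effective/expiration dates and a current row indicator; the table layout, dates, and key values are illustrative assumptions, not the book's actual design.

    CREATE TABLE product_dim (
        product_key           INTEGER PRIMARY KEY,     -- surrogate key
        product_natural_key   VARCHAR(20) NOT NULL,    -- durable natural key
        department_name       VARCHAR(50),
        row_effective_date    DATE NOT NULL,
        row_expiration_date   DATE NOT NULL,           -- e.g., 9999-12-31 for current rows
        current_row_indicator CHAR(1) NOT NULL         -- 'Y' or 'N'
    );

    -- When a tracked attribute changes, expire the current row ...
    UPDATE product_dim
    SET row_expiration_date = DATE '2024-01-31', current_row_indicator = 'N'
    WHERE product_natural_key = 'ABC922' AND current_row_indicator = 'Y';

    -- ... and insert a new row with a new surrogate key and the changed value.
    INSERT INTO product_dim
    VALUES (12784, 'ABC922', 'Strategy', DATE '2024-02-01', DATE '9999-12-31', 'Y');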
CHAPTER 6 – ORDER MANAGEMENT

Order Management Bus Matrix

 The order management function is composed of a series of business processes.

Order Transactions

 The natural granularity for an order transaction fact table is one row for each line item on an order.
 The dimensions associated with the orders business process are order date, requested ship date, product, customer, sales rep, and deal.

Fact Normalization

 some designers want to further normalize the fact table so there's a single, generic fact amount along with a dimension that identifies the type of measurement.
 In this scenario, the fact table granularity is one row per measurement per order line, instead of the more natural one row per order line event.
 This technique may make sense when the set of facts is extremely lengthy, but sparsely populated for a given fact row, and no computations are made between facts.

Dimension Role Playing

 you now have two unique logical date dimensions that can be used as if they were independent with completely unrelated constraints.
 This is referred to as role playing because the date dimension simultaneously serves different roles in a single fact table.

NOTE: Role playing in a dimensional model occurs when a single dimension simultaneously appears several times in the same fact table. The underlying dimension may exist as a single physical table, but each of the roles should be presented to the BI tools as a separately labeled view.
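The separately labeled views mentioned in the note can be sketched roughly as follows; the underlying date_dim table and its columns are assumptions.

    CREATE VIEW order_date AS
    SELECT date_key AS order_date_key, full_date AS order_full_date, month_name AS order_month
    FROM date_dim;

    CREATE VIEW requested_ship_date AS
    SELECT date_key AS requested_ship_date_key, full_date AS requested_ship_full_date, month_name AS requested_ship_month
    FROM date_dim;

    -- The order fact table joins to each view through its own foreign key,
    -- so the two date roles can be constrained independently in one query.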
Role Playing and the Bus Matrix

 The most common technique to document role playing on the bus matrix is to indicate the multiple roles within a single cell.
 This method is especially appropriate for the date dimension on the bus matrix given its numerous logical roles.

Product Dimension Revisited

 The product dimension is one of the most common and most important dimension tables.

Most product dimension tables share the following characteristics:

1. Numerous verbose, descriptive columns.
2. One or more attribute hierarchies, plus non-hierarchical attributes.
3. Remap the operational product code to a surrogate key.
4. Add descriptive attribute values to augment or replace operational codes.
5. Quality check the attribute values to ensure no misspellings, impossible values, or multiple variations.

Customer Dimension

 The customer dimension contains one row for each discrete location to which you ship a product.
 Customer dimension tables can range from moderately sized (thousands of rows) to extremely large (millions of rows) depending on the nature of the business.

NOTE: It is natural and common, especially for customer-oriented dimensions, for a dimension to simultaneously support multiple independent hierarchies. The hierarchies may have different numbers of levels. Drilling up and drilling down within each of these hierarchies must be supported in a dimensional model.

Single Versus Multiple Dimension Tables

 Designers sometimes question whether sales organization attributes should be modeled as a separate dimension or added to the customer dimension.
 In many scenarios, this two-dimension table is unnecessary. There is no reason to avoid the fact table to respond to this relationship inquiry. Fact tables are incredibly efficient because they contain only dimension keys and measurements, along with the occasional degenerate dimension. The fact table is created specifically to represent the correlations and many-to-many relationships between dimensions.

Factless Fact Table for Customer/Rep Assignments

 The coverage table would provide a complete map of the historical assignments of sales reps to customers, even if some of the assignments never resulted in a sale. This factless fact table contains dual date keys for the effective and expiration dates of each assignment.

Deal Dimension

 The deal dimension is similar to the promotion dimension.
 describes the incentives offered to customers that theoretically affect the customers' desire to purchase products.
 This dimension is also sometimes referred to as the contract.
 describes the full combination of terms, allowances, and incentives that pertain to the particular order line item.

Degenerate Dimension for Order Number

 Each line item row in the order fact table includes the order number as a degenerate dimension.
 Unlike an operational header/line or parent/child database, the order number in a dimensional model is typically not tied to an order header table.
 It enables you to group the separate line items on the order and answer questions such as "What is the average number of line items on an order?"

NOTE: Degenerate dimensions typically are reserved for operational transaction identifiers. They should not be used as an excuse to stick cryptic codes in the fact table without joining to dimension tables for descriptive decodes.

Junk Dimensions

 when modeling complex transactional source data, you often encounter a number of miscellaneous indicators and flags that are populated with a small range of discrete values.
 We typically refer to the junk dimension as a transaction indicator or transaction profile dimension when talking with the business users.

NOTE: A junk dimension is a grouping of low-cardinality flags and indicators. By creating a junk dimension, you remove the flags from the fact table and place them into a useful dimensional framework.

Header/Line Pattern to Avoid

 Pattern to avoid: treating the transaction header as a dimension.

Multiple Currencies

 The most common analytic requirement is that order transactions be expressed in both the local transaction currency and the standardized corporate currency.
 The metrics in standard currency would be fully additive. The local currency metrics would be additive only for a single specified currency.
 a currency dimension is needed even if the location of the transaction is otherwise known because the location does not necessarily guarantee which currency was used.

Transaction Facts at Different Granularity

 The designer's first response should be to try to force all the facts down to the lowest level. This procedure is broadly referred to as allocating.
 Allocating the parent order facts to the child line-item level is critical if you want the ability to slice and dice and roll up all order facts by all dimensions, including product.

WARNING: You shouldn't mix fact granularities such as order header and order line facts within a single fact table. Instead, either allocate the higher-level facts to a more detailed level or create two separate fact tables to handle the differently grained facts. Allocation is the preferred approach.

Design teams sometimes attempt to devise alternative techniques for handling header/line facts at different granularity, including the following:

 Repeat the unallocated header fact on every line.
 Store the unallocated amount on the transaction's first or last line.
 Set up a special product key for the header fact.

Another Header/Line Pattern to Avoid

 Pattern to avoid: not inheriting header dimensionality in line facts.

Invoice Transactions

 invoicing typically occurs when products are shipped from your facility to the customer.
 In the invoice fact table, you can see all the company's products, customers, contracts and deals, off-invoice discounts and allowances, revenue generated by customers, variable and fixed costs associated with manufacturing and delivering products (if available), money left over after delivery of product (profit contribution), and customer satisfaction metrics such as on-time shipment.

NOTE: For any company that ships products to customers or bills customers for services rendered, the optimal place to start a DW/BI project typically is with invoices. We often refer to invoicing as the most powerful data because it combines the company's customers, products, and components of profitability.

Profit and Loss Facts

 It is traditional to arrange these revenues and costs in sequence from the top line, which represents the undiscounted value of the products shipped to the customer, down to the bottom line, which represents the money left over after discounts, allowances, and costs. This list of revenues and costs is referred to as a profit and loss (P&L) statement.
 the bottom line in the P&L statement is referred to as contribution.

The elements of the P&L statement shown in Figure 6-14 have the following interpretations:

 Quantity shipped: Number of cases of the particular line item's product.
 Extended gross amount: Also known as extended list price because it is the quantity shipped multiplied by the list unit price.
 Extended allowance amount: Amount subtracted from the invoice line gross amount for deal-related allowances. The allowances are described in the adjoined deal dimension. The allowance amount is often called an off-invoice allowance.
 Extended discount amount: Amount subtracted for volume or payment term discounts. The discount descriptions are found in the deal dimension.
 Extended net amount: Amount the customer is expected to pay for this line item before tax. It is equal to the gross invoice amount less the allowances and discounts.

The following cost amounts, leading to a bottom-line contribution, are for internal consumption only:

 Extended fixed manufacturing cost: Amount identified by manufacturing as the pro rata fixed manufacturing cost of the invoice line's product.
 Extended variable manufacturing cost: Amount identified by manufacturing as the variable manufacturing cost of the product on the invoice line.
 Extended storage cost: Cost charged to the invoice line for storage prior to being shipped to the customer.
 Extended distribution cost: Cost charged to the invoice line for transportation from the point of manufacture to the point of shipment. This cost is notorious for not being activity-based.
 Contribution amount: Extended net invoice less all the costs just discussed. This is not the true bottom line of the overall company because general and administrative expenses and other financial adjustments have not been made, but it is important nonetheless. This column sometimes has alternative labels, such as margin, depending on the company culture.
Audit Dimension

 The invoice line-item design is one of the most powerful because it provides a detailed look at customers, products, revenues, costs, and bottom-line profit in one schema.
 the audit dimension is added to the fact table by including an audit dimension foreign key. The audit dimension itself contains the metadata conditions encountered when processing fact table rows. It is best to start with a modest audit dimension design.
 because the audit dimension is now just an ordinary dimension, you can just add the out-of-bounds indicator to your standard report.

Accumulating Snapshot for Order Fulfillment Pipeline

 The order management process can be thought of as a pipeline, especially in a build-to-order manufacturing business.
 Periodic snapshots would provide insight into the amount of product sitting in the pipeline, such as the backorder or finished goods inventories, or the amount of product flowing through a pipeline spigot during a predefined interval. The accumulating snapshot helps you better understand the current state of an order, as well as product movement velocities to identify pipeline bottlenecks and inefficiencies.
 The fundamental difference between accumulating snapshots and other fact tables is that you can revisit and update existing fact table rows as more information becomes available.

NOTE: Accumulating snapshot fact tables typically have multiple dates representing the major milestones of the process. However, just because a fact table has several dates doesn't dictate that it is an accumulating snapshot. The primary differentiator of an accumulating snapshot is that you revisit the fact rows as activity occurs.

 The accumulating snapshot technique is especially useful when the product moving through the pipeline is uniquely identified, such as an automobile with a vehicle identification number, electronics equipment with a serial number, lab specimens with an identification number, or process manufacturing batches with a lot number. The accumulating snapshot helps you understand throughput and yield.

Accumulating Snapshots and Type 2 Dimensions

 Accumulating snapshots present the latest state of a workflow or pipeline. If the dimensions associated with an accumulating snapshot contain type 2 attributes, the fact table should be updated to reference the most current surrogate dimension key for active pipelines.

Lag Calculations

 represent basic measures of fulfillment efficiency.
 You could build a view on this fact table that calculated a large number of these date differences and presented them as if they were stored in the underlying table. These view columns could include metrics such as order to manufacturing release lag, manufacturing release to finished goods lag, and order to shipment lag, depending on the date spans monitored by the organization.

Multiple Units of Measure

 designers are tempted to bury the unit-of-measure conversion factors, such as ship case factor, in the product dimension.

NOTE: Packaging all the facts and conversion factors together in the same fact table row provides the safest guarantee that these factors will be used correctly. The converted facts are presented in a view(s) to the users.

Beyond the Rearview Mirror

 People sometimes refer to these as rearview mirror metrics because they enable you to look backward and see where you've been.
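The lag view described under "Lag Calculations" above might be sketched like this; the milestone column names are assumptions, and the date arithmetic is shown PostgreSQL-style (DATE minus DATE yields a day count).

    CREATE VIEW order_fulfillment_lags AS
    SELECT order_number,
           manufacturing_release_date - order_date          AS order_to_release_lag,
           finished_goods_date - manufacturing_release_date AS release_to_finished_goods_lag,
           shipment_date - order_date                       AS order_to_shipment_lag
    FROM order_fulfillment_accumulating_fact;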
CHAPTER 7 – ACCOUNTING

Accounting Case Study and Bus Matrix

 Financial analysts are some of the most data-literate and spreadsheet-savvy individuals.
 The DW/BI system can provide a single source of usable, understandable financial information, ensuring everyone is working off the same data with common definitions and common tools.

General Ledger Data

 The general ledger (G/L) is a core foundation financial system that ties together the detailed information collected by subledgers or separate systems for purchasing, payables (what you owe to others), and receivables (what others owe you).

General Ledger Periodic Snapshot

 The grain of this periodic snapshot is one row per accounting period for the most granular level in the general ledger's chart of accounts.
 The two most important dimensions in the proposed general ledger design are account and organization. The account dimension is carefully derived from the uniform chart of accounts in the enterprise. The organization dimension describes the financial reporting entities in the enterprise.

Chart of Accounts

 The cornerstone of the general ledger.
 is the epitome of an intelligent key because it usually consists of a series of identifiers.
 charts of accounts vary from organization to organization.
 this kind of conformed dimension has an old and familiar name in financial circles: the uniform chart of accounts.

Period Close

 At the end of each accounting period, the finance organization is responsible for finalizing the financial results so that they can be officially reported internally and externally. It typically takes several days at the end of each period to reconcile and balance the books before they can be closed with finance's official stamp of approval.
 Financial analysts are constantly looking to streamline the processes for period end closing, reconciliation, and reporting of general ledger results.

Year-to-Date Facts

 Designers are often tempted to store "to-date" columns in fact tables.

NOTE: In general, "to-date" totals should be calculated, not stored in the fact table.

Multiple Currencies Revisited

 You may want to represent the facts both in terms of the local currency and a standardized corporate currency.

General Ledger Journal Transactions

 the grain of the fact table is now one row for every general ledger journal entry transaction. The journal entry transaction identifies the G/L account and the applicable debit or credit amount.
 The journal entry number is likely a degenerate dimension with no linkage to an associated dimension table. If the journal entry numbers from the source are ordered, then this degenerate dimension can be used to order the journal entries because the calendar date dimension on this fact table is too coarse to provide this sorting.
 If the journal entry numbers do not easily support the sort, then an effective date/time stamp must be added to the fact table.
WARNING: The ledger dimension is a convenient and intuitive dimension that enables multiple ledgers to be stored in the same fact table. However, every query that accesses this fact table must constrain the ledger dimension to a single value (for example, Final Approved Domestic Ledger) or the queries will double count values from the various ledgers in this table. The best way to deploy this schema is to release separate views to the business users with the ledger dimension pre-constrained to a single value.

 Very large enterprises or government agencies may have multiple ledgers arranged in an ascending hierarchy, perhaps by enterprise, division, and department. At the lowest level, department ledger entries may be consolidated to roll up to a single division ledger entry.
 One way to model this hierarchy is by introducing the parent snapshot's fact table surrogate key in the fact table.

Multiple Fiscal Accounting Calendars

 the data is captured by posting date, but users may also want to summarize the data by fiscal accounting period. Unfortunately, fiscal accounting periods often do not align with standard Gregorian calendar months.
 The most common approach is to create a date dimension outrigger with a multipart key consisting of the date and subsidiary keys.
 A second approach for tackling the subsidiary-specific calendars would be to create separate physical date dimensions for each subsidiary calendar, using a common set of surrogate date keys.
 This approach simplifies user access but puts additional strain on the ETL system because it must insert the appropriate fiscal period key during the transformation process.

Financial Statements

 One of the primary functions of a general ledger system is to produce the organization's official financial reports, such as the balance sheet and income statement.
 the operational system typically handles the production of these reports.
 In this manner, managers could easily look at performance trends for a given line in the financial statement over time for their organization. Similarly, key performance indicators and financial ratios may be made available at the same level of detail.

Budgeting Process

 Most modern general ledger systems include the capability to integrate budget data into the general ledger.
 Within most organizations, the budgeting process can be viewed as a series of events.
 Budgets are becoming more dynamic because there are budget adjustments as the year progresses, reflecting changes in business conditions or the realities of actual spending versus the original budget.
 The facts in such a "status report" are all semi-additive balances, rather than fully additive facts.
 The account dimension is also a reused dimension.
 The budget line item identifies the purpose of the proposed spending, such as employee wages or office supplies.
 The budget fact table has a single budget amount fact that is fully additive.

Dimension Attribute Hierarchies

 a hierarchy is defined by a series of many-to-one relationships.

Fixed Depth Positional Hierarchies

 In the budget chain, the calendar levels are familiar fixed depth position hierarchies. As the name suggests, a fixed position hierarchy has a fixed set of levels, all with meaningful labels.

Drilling Down Through a Multilevel Hierarchy

 One calendar hierarchy may be day ➪ fiscal period ➪ year. Another could be day ➪ month ➪ year.
 In a fixed position hierarchy, it is important that each level have a specific name.

WARNING: Avoid fixed position hierarchies with abstract names such as Level-1, Level-2, and so on. This is a cheap way to avoid correctly modeling a ragged hierarchy. When the levels have abstract names, the business user has no way of knowing where to place a constraint, or what the attribute values in a level mean in a report. If a ragged hierarchy attempts to hide within a fixed position hierarchy with abstract names, the individual levels are essentially meaningless.

Slightly Ragged Variable Depth Hierarchies

 The simple location has four levels: address, city, state, and country.
 The medium complex location adds a zone level, and the complex location adds both district and zone levels.
 If you need to represent all three types of locations in a single geographic hierarchy, you have a slightly variable hierarchy.

Ragged Variable Depth Hierarchies

 In the budget use case, the organization structure is an excellent example of a ragged hierarchy of indeterminate depth.
 we often refer to the hierarchical structure as a "tree" and the individual organizations in that tree as "nodes."
 the classic way to represent a parent/child tree structure is by placing recursive pointers in the organization dimension from each row to its parent.
 The highest parent flag in the map table means the particular path comes from the highest parent in the tree. The lowest child flag means the particular path ends in a "leaf node" of the tree.

NOTE: The article "Building Hierarchy Bridge Tables" (available at www.kimballgroup.com under the Tools and Utilities tab for this book title) provides a code example for building the hierarchy bridge table described in this section.

Time Varying Ragged Hierarchies

 The ragged hierarchy bridge table can accommodate slowly changing hierarchies with the addition of two date/time stamps.

WARNING: When using the bridge, the query must always constrain to a single date/time to "freeze" the bridge table to a single consistent view of the hierarchy. Failing to constrain in this way would result in multiple paths being fetched that could not exist at the same time.
Modifying Ragged Hierarchies
NOTE When facts from multiple business
 The organization map bridge table can easily processes are combined in a consolidated fact
be modified. table, they must live at the same level of
 In the bridge table, only the paths directly granularity and dimensionality. Because the
involved in the change are affected. All other separate facts seldom naturally live at a common
paths are untouched. grain, you are forced to eliminate or aggregate
some dimensions to support the one-to-one
Alternative Ragged Hierarchy Modeling Approaches
correspondence, while retaining the atomic data in
 there are at least two other ways to model a separate fact tables. Project teams should not create
ragged hierarchy, both involving clever artificial facts or dimensions in an attempt to force-
columns placed in the organization dimension. fit the consolidation of differently grained fact data.

There are two disadvantages to these schemes: Role of OLAP and Packaged Analytic Solutions

1. the definition of the hierarchy is locked into  OLAP products have been used
the dimension and cannot easily be replaced. extensively for financial reporting,
2. both of these schemes are vulnerable to a budgeting, and consolidation
relabeling disaster in which a large part of the applications.
tree must be relabeled due to a single small  OLAP cubes can deliver fast query
change. performance that is critical for executive
usage.
 Another similar scheme, known to computer  OLAP is well suited to handle
scientists as the modified preordered tree complicated organizational rollups, as
traversal approach, numbers the tree. well as complex calculations, including
 Leaf nodes can be found where Left and Right inter-row manipulations.
diff er by 1, meaning there aren’t any children.  OLAP cubes often also readily support
complex security models, such as
Advantages of the Bridge Table Approach for limiting access to detailed data while
Ragged Hierarchies providing more open access to summary
metrics.
In particular, the bridge table allows:

■ Alternative rollup structures to be selected at query


time

■ Shared ownership rollups

■ Time varying ragged hierarchies

■ Limited impact when nodes undergo slowly changing


dimension (SCD) type 2 changes

■ Limited impact when the tree structure is changed

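To make the bridge approach concrete, here is a minimal sketch of an organization map bridge table and a rollup query against the budget fact table described earlier. The table and column names are illustrative assumptions, not definitions from the text; the Kimball Group article referenced above gives the complete treatment.

CREATE TABLE organization_map_bridge (
    parent_org_key       INT NOT NULL,   -- surrogate key of the ancestor organization
    child_org_key        INT NOT NULL,   -- surrogate key of the descendant (or the same node)
    depth_from_parent    INT NOT NULL,   -- 0 when parent and child are the same node
    highest_parent_flag  CHAR(1) NOT NULL,
    lowest_child_flag    CHAR(1) NOT NULL,
    effective_datetime   TIMESTAMP NOT NULL,  -- supports time varying hierarchies
    expiration_datetime  TIMESTAMP NOT NULL
);

-- Roll up budget amounts for every organization under a chosen parent,
-- frozen to a single point in time as the WARNING above requires.
SELECT SUM(f.budget_amount) AS total_budget
FROM budget_fact f
JOIN organization_map_bridge b ON f.organization_key = b.child_org_key
JOIN organization_dim p        ON b.parent_org_key = p.organization_key
WHERE p.organization_name = 'Corporate Division X'
  AND TIMESTAMP '2013-06-30 00:00:00' >= b.effective_datetime
  AND TIMESTAMP '2013-06-30 00:00:00' <  b.expiration_datetime;

Because the bridge carries one row for every parent/child path, the same query rolls up to any node in the tree simply by changing the parent constraint.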
Consolidated Fact Tables

 Fact tables that combine metrics from multiple business processes at a common granularity are referred to as consolidated fact tables.
 Although consolidated fact tables can be useful, both in terms of performance and usability, they often represent a dimensionality compromise as they consolidate facts at the “least common denominator” of dimensionality. One potential risk associated with consolidated fact tables is that project teams sometimes base designs solely on the granularity of the consolidated fact table, while failing to meet user requirements that demand the ability to dive into more granular data.

NOTE When facts from multiple business processes are combined in a consolidated fact table, they must live at the same level of granularity and dimensionality. Because the separate facts seldom naturally live at a common grain, you are forced to eliminate or aggregate some dimensions to support the one-to-one correspondence, while retaining the atomic data in separate fact tables. Project teams should not create artificial facts or dimensions in an attempt to force-fit the consolidation of differently grained fact data.

Role of OLAP and Packaged Analytic Solutions

 OLAP products have been used extensively for financial reporting, budgeting, and consolidation applications.
 OLAP cubes can deliver fast query performance that is critical for executive usage.
 OLAP is well suited to handle complicated organizational rollups, as well as complex calculations, including inter-row manipulations.
 OLAP cubes often also readily support complex security models, such as limiting access to detailed data while providing more open access to summary metrics.
CHAPTER 8 - CUSTOMER RELATIONSHIP MANAGEMENT

CRM Overview

 The goal of CRM is to maximize relationships with your customers over their lifetime.
 It entails focusing all aspects of the business, from marketing, sales, operations, and service, on establishing and sustaining mutually beneficial customer relations.
 CRM is like a stick of dynamite that knocks down the silo walls. It requires the right integration of business processes, people resources, and application technology to be effective.
 CRM involves brand new ways of interacting with customers and often entails radical changes to the sales channels.
 CRM requires new information flows based on the complete acquisition and dissemination of customer “touch point” data.

Operational and Analytic CRM

 Effective CRM relies on the collection of data at every interaction you have with a customer and then leveraging that breadth of data through analysis.
 On the operational front, CRM calls for the synchronization of customer-facing processes.
 Often operational systems must either be updated or supplemented to coordinate across sales, marketing, operations, and service.
 Analytic CRM is enabled via accurate, integrated, and accessible customer data in the DW/BI system.

Customer Dimension Attributes

 The conformed customer dimension is a critical element for effective CRM.
 The customer dimension is typically the most challenging dimension for any DW/BI system.

Name and Address Parsing

 Regardless of whether you deal with individual human beings or commercial entities, customers' name and address attributes are typically captured.
 The operational handling of name and address information is usually too simplistic to be very useful in the DW/BI system.
 Commercial customers typically have multiple addresses, such as physical and shipping addresses; each of these addresses would follow much the same logic as the address structure.

International Name and Address Considerations

 International display and printing typically require representing foreign characters, including not just the accented characters from western European alphabets, but also Cyrillic, Arabic, Japanese, Chinese, and dozens of other less familiar writing systems.
 The basic English character set is usually encoded in the American Standard Code for Information Interchange (ASCII), an 8-bit (single-byte) encoding with a maximum of 256 possible characters. Only approximately 100 of these characters have a standard interpretation that can be invoked from a normal English keyboard, but this is usually enough for English-speaking computer users.
 An international body of system architects, the Unicode Consortium, defined a standard known as Unicode for representing characters and alphabets in almost all the world's languages and cultures.
 The Unicode Standard, version 6.2.0, has defined specific interpretations for 110,182 possible characters and now covers the principal written languages of the Americas, Europe, the Middle East, Africa, India, Asia, and Pacifica.
 Unicode is the foundation you must use for addressing international character sets.
 The most current releases of all the major operating systems are Unicode-compliant.
 Data warehouse back-room tools must be Unicode-compliant, including sort packages, programming languages, and automated ETL packages.

NOTE Customer dimensions sometimes include a full address block attribute. This is a specially crafted column that assembles a postally valid address for the customer, including mail stop, ZIP code, and other attributes needed to satisfy postal authorities. This attribute is useful for international locations where addresses have local idiosyncrasies.

International DW/BI Goals

After committing to a Unicode foundation, you need to keep the following goals in mind:

1. Universal and consistent. All the BI tool messages and prompts need to be translated for the benefit of the business user; this process is known as localization.
2. End-to-end data quality and downstream compatibility.
3. Cultural correctness.
4. Real-time customer response.
5. Other kinds of addresses.

Customer-Centric Dates

 Customer dimensions often contain dates, such as the date of the first purchase, date of last purchase, and date of birth.
 These date dimension roles are declared as semantically distinct views, such as a First Purchase Date dimension table with unique column labels.
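A minimal sketch of such a role-playing view over a generic date dimension follows; the view and column names are assumptions for illustration, not definitions from the text.

CREATE VIEW first_purchase_date AS
SELECT
    date_key        AS first_purchase_date_key,
    full_date       AS first_purchase_date,
    calendar_month  AS first_purchase_month,
    calendar_year   AS first_purchase_year
FROM date_dim;

The customer dimension's first purchase date key would then join to this view rather than directly to the shared date dimension, so every role keeps uniquely labeled columns in BI tools.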
Aggregated Facts as Dimension Attributes

 Business users are often interested in constraining the customer dimension based on aggregated performance metrics, such as filtering on all customers who spent more than a certain dollar amount during last year.
 Providing aggregated facts as dimension attributes is sure to be a crowd-pleaser with the business users.

Segmentation Attributes and Scores

 Some of the most powerful attributes in a customer dimension are segmentation classifications.

For an individual customer, they may include:

 Gender
 Ethnicity
 Age or other life stage classifications
 Income or other lifestyle classifications
 Status (such as new, active, inactive, and closed)
 Referring sources
 Business-specific market segment (such as a preferred customer identifier)

Statistical segmentation models typically generate these scores, which cluster customers in a variety of ways, such as based on their purchase behavior, payment behavior, propensity to churn, or probability to default.

Behavior Tag Time Series

 One popular approach for scoring and profiling customers looks at the recency (R), frequency (F), and intensity (I) of the customer's behavior.
 These are known as the RFI measures; sometimes intensity is replaced with monetary (M), so it's also known as RFM.
 Recency is how many days it has been since the customer last ordered or visited your site.
 Frequency is how many times the customer has ordered or visited, typically in the past year.
 Intensity is how much money the customer has spent over the same time period.

The data mining professional may come back with a list of behavior tags like the following, which is drawn from a slightly more complicated scenario that includes credit behavior and returns:

A: High volume repeat customer, good credit, few product returns
B: High volume repeat customer, good credit, many product returns
C: Recent new customer, no established credit pattern
D: Occasional customer, good credit
E: Occasional customer, poor credit
F: Former good customer, not seen recently
G: Frequent window shopper, mostly unproductive
H: Other

John Doe: C C C D D A A A B B

 This time series of behavior tags is unusual because although it comes from a regular periodic measurement process, the observed “values” are textual. The behavior tags are not numeric and cannot be computed or averaged, but they can be queried.
 Behavior tags should not be stored as regular facts. The main use of behavior tags is formulating complex query patterns like the example in the previous paragraph.

In addition to the separate columns for each behavior tag time period, it would be a good idea to create a single attribute with all the behavior tags concatenated together, such as CCCDDAAABB. This column would support wild card searches for exotic patterns, such as “D followed by a B.”

NOTE In addition to the customer dimension's time series of behavior tags, it would be reasonable to include the contemporary behavior tag value in a minidimension to analyze facts by the behavior tag in effect when the fact row was loaded.
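As a minimal sketch of the wild card search on the concatenated tag attribute mentioned above (the column name is an assumption):

SELECT customer_key, customer_name
FROM customer_dim
WHERE behavior_tag_series LIKE '%D%B%';   -- any D eventually followed by a B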
Relationship Between Data Mining and DW/BI System

 The data mining team can be a great client of the data warehouse, and especially great users of customer behavior data.

Counts with Type 2 Dimension Changes

 Businesses frequently want to count customers based on their attributes without joining to a fact table. If you used type 2 to track customer dimension changes, you need to be careful to avoid overcounting because you may have multiple rows in the customer dimension for the same individual.
 Doing a COUNT DISTINCT on a unique customer identifier is a possibility, assuming the attribute is indeed unique and durable.
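For example, a sketch of such a count, assuming the dimension carries a durable customer identifier column alongside its type 2 surrogate keys:

SELECT customer_segment,
       COUNT(DISTINCT durable_customer_id) AS customer_count
FROM customer_dim
GROUP BY customer_segment;   -- each individual counted once despite multiple type 2 rows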
Outrigger for Low Cardinality Attribute Set

 Generally, snowflaking is not recommended in a DW/BI environment because it almost always makes the user presentation more complex, in addition to negatively impacting browsing performance.
 The dimension outrigger is a set of data from an external data provider consisting of 150 demographic and socio-economic attributes regarding the customers' county of residence.
 Rather than repeating this large block of data for every customer within a county, opt to model it as an outrigger.

WARNING Dimension outriggers are permissible, but they should be the exception rather than the rule. A red warning flag should go up if your design is riddled with outriggers; you may have succumbed to the temptation to overly normalize the design.

Customer Hierarchy Considerations

 One of the most challenging aspects of dealing with commercial customers is modeling their internal organizational hierarchy.

Bridge Tables for Multivalued Dimensions

 A fundamental tenet of dimensional modeling is to decide on the grain of the fact table, and then carefully add dimensions and facts to the design that are true to the grain.
 When faced with a multivalued dimension, there are two basic choices: a positional design or a bridge table design.
 Positional designs are very attractive because the multivalued dimension is spread out into named columns that are easy to query.
 The positional design approach isn't very scalable.
 The bridge table approach to multivalued dimensions is powerful but comes with a big compromise. The bridge table removes the scalability and null value objections.

WARNING Be aware that complex queries using bridge tables may require SQL that is beyond the normal reach of BI tools.
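A minimal sketch of the bridge table pattern for a multivalued customer relationship, using a hypothetical account-to-customer example; the tables, columns, and the optional weighting factor are illustrative assumptions rather than structures from the text.

CREATE TABLE account_to_customer_bridge (
    account_key       INT NOT NULL,
    customer_key      INT NOT NULL,
    weighting_factor  DECIMAL(7,6) NOT NULL   -- allocations for one account sum to 1
);

-- Correctly weighted report: allocate each account's balance across its customers
SELECT c.customer_name,
       SUM(f.balance_amount * b.weighting_factor) AS allocated_balance
FROM account_balance_fact f
JOIN account_to_customer_bridge b ON f.account_key = b.account_key
JOIN customer_dim c               ON b.customer_key = c.customer_key
GROUP BY c.customer_name;

Dropping the weighting factor from the SUM turns this into an unweighted "impact" report; either way, the extra join is exactly the kind of SQL the warning above says may be beyond some BI tools.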
Bridge Table for Sparse Attributes

 Organizations are increasingly collecting demographics and status information about their customers, but the traditional fixed column modeling approach for handling these attributes becomes difficult to scale with hundreds of attributes.
 Positional designs can be scaled up to perhaps 100 or so columns before the databases and user interfaces become awkward or hard to maintain. Columnar databases are well suited to these kinds of designs because new columns can be easily added with minimal disruption to the internal storage of the data, and the low-cardinality columns containing only a few discrete values are dramatically compressed.

Bridge Table for Multiple Customer Contacts

 Large commercial customers have many points of contact, including decision makers, purchasing agents, department heads, and user liaisons; each point of contact is associated with a specific role. Because the number of contacts is unpredictable but possibly large, a bridge table design is a convenient way to handle this situation.

Complex Customer Behavior

 Customer behavior can be very complex.

Behavior Study Groups for Cohorts

 In other situations, you may want to capture the set of customers from a query or exception report, such as the top 100 customers from last year, customers who spent more than $1,000 last month, or customers who received a specific test solicitation, and then use that group of customers, called a behavior study group, for subsequent analyses without reprocessing to identify the initial condition.
 To create a behavior study group, run a query (or series of queries) to identify the set of customers you want to further analyze, and then capture the customer durable keys of the identified set as an actual physical table consisting of a single customer key column.

NOTE The secret to building complex behavioral study group queries is to capture the keys of the customers or products whose behavior you are tracking. You then use the captured keys to subsequently constrain other fact tables without having to rerun the original behavior analysis.

The exceptional simplicity of study group tables allows them to be combined with union, intersection, and set difference operations. For example, a set of problem customers this month can be intersected with the set of problem customers from last month to identify customers who were problems for two consecutive months.

Step Dimension for Sequential Behavior

 Most DW/BI systems do not have good examples of sequential processes.
 The step dimension is an abstract dimension defined in advance.
 Using the step dimension, a specific page can immediately be placed into one or more understandable contexts.

Timespan Fact Tables

 Was the customer on fraud alert when denied an extension of credit? How long had he been on fraud alert? How many times in the past two years has he been on fraud alert? How many customers were on fraud alert at some point in the past two years?
 All these questions can be addressed if you carefully manage the transaction fact table containing all customer events. The key modeling step is to include a pair of date/time stamps.
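A sketch of the first point-in-time question, assuming the customer event rows carry a pair of row effective and row expiration date/time stamps (all names here are assumptions):

-- Was customer 12345 on fraud alert at the moment credit was denied?
SELECT fraud_alert_flag
FROM customer_event_fact
WHERE customer_key = 12345
  AND TIMESTAMP '2013-06-15 09:00:00' >= row_effective_datetime
  AND TIMESTAMP '2013-06-15 09:00:00' <  row_expiration_datetime;

The half-open comparison (>= and <) avoids the off-by-one-tick issues discussed in the next section.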
Back Room Administration of Dual Date/Time Stamps

 For a given customer, the date/time stamps on the sequence of transactions must form a perfect unbroken sequence with no gaps.
 It is tempting to make the end effective date/time stamp be one “tick” less than the beginning effective date/time stamp of the next transaction, so the query SQL can use the BETWEEN syntax rather than the uglier constraints shown above.
 Using the pair of date/time stamps requires a two-step process whenever a new transaction row is entered. In the first step, the end effective date/time stamp of the newly entered transaction (now the most current transaction) must be set to a fictitious date/time far in the future.
 In the second step, after the new transaction is entered into the database, the ETL process must retrieve the previous transaction and set its end effective date/time to the date/time of the newly entered transaction.
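A minimal sketch of that two-step maintenance, with hypothetical table and column names:

-- Step 1: load the new most-current row with a fictitious far-future end date/time
INSERT INTO customer_transaction_fact
    (customer_key, begin_effective_datetime, end_effective_datetime)
VALUES
    (12345, TIMESTAMP '2013-06-15 10:30:00', TIMESTAMP '9999-12-31 23:59:59');

-- Step 2: close out the previously current row so the sequence has no gaps
UPDATE customer_transaction_fact
SET end_effective_datetime = TIMESTAMP '2013-06-15 10:30:00'
WHERE customer_key = 12345
  AND end_effective_datetime = TIMESTAMP '9999-12-31 23:59:59'
  AND begin_effective_datetime < TIMESTAMP '2013-06-15 10:30:00';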
Tagging Fact Tables with Satisfaction Indicators

 Although profitability might be the most important key performance indicator in many organizations, customer satisfaction is a close second.
 Textual satisfaction data is generally modeled in two ways, depending on the number of satisfaction attributes and the sparsity of the incoming data.

Tagging Fact Tables with Abnormal Scenario Indicators

 Accumulating snapshot fact tables depend on a series of dates that implement the “standard scenario” for the pipeline process.
Customer Data Integration Approaches

 In typical environments with many customer-facing processes, you need to choose between two approaches: a single customer dimension derived from all the versions of customer source system records, or multiple customer dimensions tied together by conformed attributes.

Master Data Management Creating a Single Customer Dimension

 In some cases, you can build a single customer dimension that is the “best of breed” choice among a number of available customer data sources.
 Some organizations are lucky enough to have a centralized master data management (MDM) system that takes responsibility for creating and controlling the single enterprise-wide customer entity.
 Unfortunately, there's no secret weapon for tackling this data consolidation. The attributes in the customer dimension should represent the “best” source available in the enterprise.
 A national change of address (NCOA) process should be integrated to ensure address changes are captured.

Avoiding Fact-to-Fact Table Joins

 DW/BI systems should be built process-by-process, not department-by-department, on a foundation of conformed dimensions to support integration.
 Because the sales and support tables both contain a customer foreign key, you can further imagine joining both fact tables to a common customer dimension to simultaneously summarize sales facts along with support facts for a given customer.
 Simultaneously joining the solicitations fact table to the customer dimension, which is, in turn, joined to the responses fact table, does not return the correct answer in a relational DBMS due to the cardinality differences. Fortunately, this problem is easily avoided.
 You simply use the drill-across technique to query the solicitations table and responses table in separate queries and then outer join the two answer sets.
 The drill-across approach has additional benefits for better controlling performance parameters, in addition to supporting queries that combine data from fact tables in different physical locations.

WARNING Be very careful when simultaneously joining a single dimension table to two fact tables of different cardinality. In many cases, relational engines return the “wrong” answer.

Low Latency Reality Check

 Generally, data quality suffers as the data is delivered closer to real time.
 Business users may automatically think that the faster the information arrives in the DW/BI system, the better. But decreasing the latency increases the data quality problems.
 Low latency data delivery can be very valuable, but the business users need to be informed about these trade-offs.
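Returning to the drill-across technique described under Avoiding Fact-to-Fact Table Joins above, here is a minimal sketch of the two separate queries and the outer join of their answer sets; the fact table and column names are illustrative assumptions based on the solicitations and responses example.

WITH solicitation_totals AS (
    -- Query 1: summarize the solicitations fact table to the customer grain
    SELECT customer_key, COUNT(*) AS solicitations
    FROM solicitation_fact
    GROUP BY customer_key
),
response_totals AS (
    -- Query 2: summarize the responses fact table to the same grain
    SELECT customer_key, COUNT(*) AS responses
    FROM response_fact
    GROUP BY customer_key
)
-- Outer join the two answer sets on the conformed customer key
SELECT COALESCE(s.customer_key, r.customer_key) AS customer_key,
       s.solicitations,
       r.responses
FROM solicitation_totals s
FULL OUTER JOIN response_totals r ON s.customer_key = r.customer_key;

Each fact table is summarized to the common customer grain on its own, so the cardinality difference between solicitations and responses can never distort the counts.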