Ais Prof 1 Chapter 5
MODELING PRIMER

One of the most important assets of any organization is its information.

because both consist of joined relational tables; the key difference between 3NF and dimensional models is the degree of normalization. Because both model types can

NOTE The designer’s dilemma of whether a numeric quantity is a fact or a dimension attribute is rarely a difficult decision. Continuously valued numeric observations are almost always facts; discrete numeric observations drawn from a small list are almost always dimension attributes.

The dimension and fact terminology originated from a joint research project conducted by General Mills and Dartmouth University in the 1960s. In the 1970s, both AC Nielsen and IRI used the terms consistently to describe their syndicated data offerings and gravitated to dimensional models for simplifying the presentation of their analytic information.

Facts and Dimensions Joined in a Star Schema

- star-like structure is often called a star join, a term dating back to the earliest days of relational databases.

The first thing to notice about the dimensional schema is its simplicity and symmetry.

- analytic data is deployed on a departmental basis without concern to sharing and integrating information across the enterprise.
- Typically, a single department identifies requirements for data from an operational source system.

EXTRACT, TRANSFORMATION, AND LOAD SYSTEM

- consists of a work area, instantiated data structures, and a set of processes.
- is everything between the operational source systems and the DW/BI presentation area.

Extraction is the first step in the process of getting data into the data warehouse environment.

Extracting means reading and understanding the source data and copying the data needed into the ETL system for further manipulation.

After the data is extracted to the ETL system, there are numerous potential transformations, such as cleansing the data (correcting misspellings, resolving domain conflicts, dealing with missing elements, or parsing into standard formats), combining data from multiple sources, and de-duplicating data.

The final step of the ETL process is the physical structuring and loading of data into the presentation area’s target dimensional models.

- The ETL system is typically dominated by the simple activities of sorting and sequential processing.
- In many cases, the ETL system is not based on relational technology but instead may rely on a system of flat files.
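
The transformation step described above (cleansing, combining sources, de-duplicating) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two source extracts (`crm_rows`, `billing_rows`), the field names, and the misspelling map `CITY_FIXES` are hypothetical and do not come from any particular ETL tool.

```python
from datetime import datetime

# Hypothetical extracts from two operational sources (illustrative data only).
crm_rows = [
    {"customer_id": "C001", "city": "Nwe York", "signup": "03/15/2021"},
    {"customer_id": "C002", "city": "Boston", "signup": "2021-04-02"},
]
billing_rows = [
    {"customer_id": "C002", "city": "boston ", "signup": "2021-04-02"},
    {"customer_id": "C003", "city": None, "signup": "07/01/2022"},
]

# Assumed lookup of known misspellings, keyed by lower-cased value.
CITY_FIXES = {"nwe york": "New York"}


def parse_date(text):
    """Parse the two date layouts present in the sample data into ISO format."""
    for layout in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            return datetime.strptime(text, layout).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")


def cleanse(row):
    """Correct misspellings, fill missing elements, and standardize formats."""
    city = (row["city"] or "").strip()
    if city:
        city = CITY_FIXES.get(city.lower(), city.title())
    else:
        city = "Unknown"  # missing element replaced by a descriptive default
    return {
        "customer_id": row["customer_id"],
        "city": city,
        "signup": parse_date(row["signup"]),
    }


def transform(*sources):
    """Combine several sources and de-duplicate on the natural key."""
    seen = {}
    for source in sources:
        for row in source:
            clean = cleanse(row)
            seen.setdefault(clean["customer_id"], clean)  # first occurrence wins
    return [seen[key] for key in sorted(seen)]


staged = transform(crm_rows, billing_rows)
for row in staged:
    print(row)
```

Keeping the first occurrence of each natural key is just one possible survivorship rule; a real ETL system would choose the rule per attribute.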
architecture.
Aggregate Fact Tables or OLAP Cubes

Aggregate fact tables are simple numeric rollups of atomic fact table data built solely to accelerate query performance.
These aggregate fact tables should be available to the BI layer at the same time as the atomic fact tables so that BI tools smoothly choose the appropriate aggregate level at query time.
This process, known as aggregate navigation, must be open so that every report writer, query tool, and BI application harvests the same performance benefits.
Aggregate OLAP cubes with summarized measures are frequently built in the same way as relational aggregates, but the OLAP cubes are meant to be accessed directly by the business users.

Consolidated Fact Tables

It is often convenient to combine facts from multiple processes together into a single consolidated fact table if they can be expressed at the same grain.
Consolidated fact tables add burden to the ETL processing, but ease the analytic burden on the BI applications.
They should be considered for cross-process metrics that are frequently analyzed together.

BASIC DIMENSION TABLE TECHNIQUES

Dimension Table Structure

Every dimension table has a single primary key column.
This primary key is embedded as a foreign key in any associated fact table where the dimension row’s descriptive context is exactly correct for that fact table row.
Dimension tables are usually wide, flat denormalized tables with many low-cardinality text attributes.

Natural, Durable, and Supernatural Keys

Natural keys created by operational source systems are subject to business rules outside the control of the DW/BI system. For instance, an employee number (natural key) may be changed if the employee resigns and then is rehired.
A new durable key must be created that is persistent and does not change in this situation.
This key is sometimes referred to as a durable supernatural key.
The best durable keys have a format that is independent of the original business process and thus should be simple integers assigned in sequence beginning with 1.

Drilling Down

Drilling down is the most fundamental way data is analyzed by business users. Drilling down simply means adding a row header to an existing query; the new row header is a dimension attribute appended to the GROUP BY expression in an SQL query.

Degenerate Dimensions

Sometimes a dimension is defined that has no content except for its primary key.
This degenerate dimension is placed in the fact table with the explicit acknowledgment that there is no associated dimension table.
Degenerate dimensions are most common with transaction and accumulating snapshot fact tables.

Denormalized Flattened Dimensions

Dimension denormalization supports dimensional modeling’s twin objectives of simplicity and speed.

Multiple Hierarchies in Dimensions

Many dimensions contain more than one natural hierarchy.
Dimension Surrogate Keys

A dimension table is designed with one column serving as a unique primary key. This primary key cannot be the operational system’s natural key because there will be multiple dimension rows for that natural key when changes are tracked over time.
Dimension surrogate keys are simple integers, assigned in sequence, starting with the value 1, every time a new key is needed.
The date dimension is exempt from the surrogate key rule; this highly predictable and stable

Flags and Indicators as Textual Attributes

Cryptic abbreviations, true/false flags, and operational indicators should be supplemented in dimension tables with full text words that have meaning when independently viewed.

Null Attributes in Dimensions

Null-valued dimension attributes result when a given dimension row has not been fully populated, or when there are attributes that are not applicable to all the dimension’s rows.
Nulls in dimension attributes should be avoided because different databases handle grouping and constraining on nulls inconsistently.

Calendar Date Dimensions

Calendar date dimensions are attached to virtually every fact table to allow navigation of the fact table through familiar dates, months, fiscal periods, and special days on the calendar.
The calendar date dimension typically has many attributes describing characteristics such as week number, month name, fiscal period, and national holiday indicator.
The date/time stamp is not a foreign key to a dimension table, but rather is a standalone column.
It is essential that each foreign key refers to a separate view of the date dimension so that the references are independent.
These separate dimension views (with unique attribute column names) are called roles.

Conformed dimensions, defined once in collaboration with the business’s data governance representatives, are reused across fact tables; they deliver both analytic consistency and reduced future development costs because the wheel is not repeatedly re-created.

Shrunken Dimensions

are conformed dimensions that are a subset of rows and/or columns of a base dimension.
Shrunken rollup dimensions are required when constructing aggregate fact tables.
Another case of conformed dimension subsetting occurs when two dimensions are at the same level of detail, but one represents only a subset of rows.

Drilling across simply means making separate queries against two or more fact tables where the row headers of each query consist of identical conformed attributes.
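
Drilling across, as described above, can be sketched with Python's built-in sqlite3 module. The tables `product_dim`, `shipment_fact`, and `return_fact`, their columns, and the sample rows are assumptions made up for this illustration. Each fact table is queried separately, grouped by the same conformed attribute (`brand`), and the two answer sets are then merged on that row header.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, brand TEXT);
CREATE TABLE shipment_fact (product_key INTEGER, quantity_shipped INTEGER);
CREATE TABLE return_fact (product_key INTEGER, quantity_returned INTEGER);
INSERT INTO product_dim VALUES (1, 'Acme'), (2, 'Acme'), (3, 'Zenith');
INSERT INTO shipment_fact VALUES (1, 10), (1, 5), (2, 20), (3, 7);
INSERT INTO return_fact VALUES (1, 1), (2, 2), (2, 3);
""")


def rollup_by_brand(fact_table, measure):
    """One pass against a single fact table, grouped by the conformed attribute."""
    # fact_table and measure are fixed names from this script, not user input.
    sql = f"""
        SELECT p.brand, SUM(f.{measure})
        FROM {fact_table} AS f
        JOIN product_dim AS p ON p.product_key = f.product_key
        GROUP BY p.brand
    """
    return dict(conn.execute(sql).fetchall())


shipped = rollup_by_brand("shipment_fact", "quantity_shipped")
returned = rollup_by_brand("return_fact", "quantity_returned")

# Merge the two answer sets on the shared row header (brand).
report = {
    brand: (shipped.get(brand, 0), returned.get(brand, 0))
    for brand in sorted(set(shipped) | set(returned))
}
print(report)
```

Because each pass touches only one fact table, a brand with no returns still appears in the report, and no shipment row is repeated by a join to several return rows.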
Junk Dimensions

Transactional business processes typically produce a number of miscellaneous, low cardinality flags and indicators.
This dimension, frequently labeled as a transaction profile dimension in a schema, does not need to be the Cartesian product of all the attributes’ possible values, but should only contain the combination of values that actually occur in the source data.

Snowflaked Dimension

When this process is repeated with all the dimension table’s hierarchies, a characteristic multilevel structure is created that is called a snowflake.
Although the snowflake represents hierarchical data accurately, you should avoid snowflakes because it is difficult for business users to understand and navigate snowflakes.
These secondary dimension references are called outrigger dimensions.
Outrigger dimensions are permissible, but should be used sparingly.

INTEGRATION VIA CONFORMED DIMENSIONS

Value Chain

identifies the natural flow of an organization’s primary business processes.
Operational source systems typically produce transactions or snapshots at each step of the value chain.

Enterprise Data Warehouse Bus Architecture

The enterprise data warehouse bus architecture provides an incremental approach to building the enterprise DW/BI system.
This architecture decomposes the DW/BI planning process into manageable pieces by focusing on business processes, while delivering integration via standardized conformed dimensions that are reused across processes.
The bus architecture is technology and database platform independent; both relational and OLAP dimensional structures can participate.
The detailed implementation bus matrix is a more granular bus matrix where each business process row has been expanded to show specific fact tables or OLAP cubes.

Opportunity/Stakeholder Matrix
Type 5: Add Mini-Dimension and Type 1 Outrigger

The type 5 technique is used to accurately preserve historical attribute values, plus report historical facts according to current attribute values.
Type 5 builds on the type 4 mini-dimension by also embedding a current type 1 reference to the mini-dimension in the base dimension.

Type 6: Add Type 1 Attributes to Type 2 Dimension

Like type 5, type 6 also delivers both historical and current dimension attribute values.
Type 6 builds on the type 2 technique by also embedding current type 1 versions of the same attributes in the dimension row so that fact rows can be filtered or grouped by either the

Surrogate keys are used to implement the primary keys of almost all dimension tables.
Fact table surrogate keys are not associated with any dimension, are assigned sequentially during the ETL load process and are used 1) as the single column primary key of the fact table; 2) to serve as an immediate identifier of a fact table row without navigating multiple dimensions for ETL purposes; 3) to allow an interrupted load process to either back out or resume; 4) to allow fact table update operations to be decomposed into less risky inserts plus deletes.

Centipede Fact Tables

Some designers create separate normalized dimensions for each level of a many-to-one hierarchy, such as a date dimension, month dimension, quarter dimension, and year dimension, and then include all these foreign keys in a fact table. This results in a centipede fact table with dozens of hierarchically related dimensions.
Centipede fact tables should be avoided.

Numeric Values as Attributes or Facts

Designers sometimes encounter numeric values that don’t clearly fall into either the fact or dimension attribute categories.
If the numeric value is used primarily for calculation purposes, it likely belongs in the fact table.

Multiple Units of Measure Facts

Some business processes require facts to be stated simultaneously in several units of measure.
If the fact table contains a large number of facts, each of which must be expressed in all units of measure, a convenient technique is to store the facts once in the table at an agreed standard unit of measure, but also simultaneously store conversion factors between the standard measure and all the others.
This fact table could be deployed through views to each user constituency, using an appropriately selected conversion factor.
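
The conversion factor technique above can be sketched as follows, using Python's sqlite3 module. The table `inventory_fact`, its columns, and the two views are hypothetical names chosen for this example. The fact is stored once in the standard unit ("each"), and every row carries the factors needed to restate it in cases or pallets.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE inventory_fact (
    product_key     INTEGER,
    quantity_each   INTEGER,  -- fact stored once, in the agreed standard unit
    each_per_case   INTEGER,  -- conversion factors stored in the same row
    each_per_pallet INTEGER
);
INSERT INTO inventory_fact VALUES (1, 480, 12, 240), (2, 900, 30, 300);

-- One view per user constituency, each applying its own conversion factor.
CREATE VIEW inventory_in_cases AS
SELECT product_key, 1.0 * quantity_each / each_per_case AS quantity
FROM inventory_fact;

CREATE VIEW inventory_in_pallets AS
SELECT product_key, 1.0 * quantity_each / each_per_pallet AS quantity
FROM inventory_fact;
""")

cases = conn.execute(
    "SELECT product_key, quantity FROM inventory_in_cases ORDER BY product_key"
).fetchall()
pallets = conn.execute(
    "SELECT product_key, quantity FROM inventory_in_pallets ORDER BY product_key"
).fetchall()
print(cases)
print(pallets)
```

Storing the factors in the fact row, rather than in application code, keeps every constituency's numbers derived from the same stored quantity.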
Lag/Duration Facts

Accumulating snapshot fact tables capture multiple process milestones, each with a date foreign key and possibly a date/time stamp.
Business users often want to analyze the lags or durations between these milestones; sometimes these lags are just the differences between dates, but other times the lags are based on more complicated business rules.

Header/Line Fact Tables

Operational transaction systems often consist of a transaction header row that’s associated with multiple transaction lines.
With header/line schemas (also known as parent/child schemas), all the header-level dimension foreign keys and degenerate dimensions should be included on the line-level fact table.

Allocated Facts

It is quite common in header/line transaction data to encounter facts of differing granularity, such as a header freight charge.

Profit and Loss Fact Tables Using Allocations

Fact tables that expose the full equation of profit are among the most powerful deliverables of an enterprise DW/BI system.
Fact tables ideally implement the profit equation at the grain of the atomic revenue transaction and contain many components of cost.

Year-to-Date Facts

Business users often request year-to-date (YTD) values in a fact table. It is hard to argue against a single request, but YTD requests can easily morph into “YTD at the close of the fiscal period” or “fiscal period to date.”

Multi-pass SQL to Avoid Fact-to-Fact Table Joins

A BI application must never issue SQL that joins two fact tables together across the fact table’s foreign keys.
For instance, if two fact tables contain customer’s product shipments and returns, these two fact tables must not be joined directly across the customer and product foreign keys.

Timespan Tracking in Fact Tables

There are three basic fact table grains: transaction, periodic snapshot, and accumulating snapshot.
In isolated cases, it is useful to add a row effective date, row expiration date, and current row indicator to the fact table, much like you do with type 2 slowly changing dimensions, to capture a timespan when the fact row was effective.

Late Arriving Facts

A fact row is late arriving if the most current dimensional context for new fact rows does not match the incoming row.
This happens when the fact row is delayed.
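
One common way to handle a late arriving fact, sketched here as an assumption rather than something stated in these notes, is to search a type 2 dimension for the row that was in effect on the event date instead of using the current row. The dimension rows, column names, and the helper `key_in_effect` are all hypothetical.

```python
from datetime import date

# Hypothetical type 2 customer dimension: one row per version of a customer.
customer_dim = [
    {"customer_key": 101, "customer_id": "C42", "city": "Austin",
     "effective": date(2020, 1, 1), "expires": date(2022, 6, 30)},
    {"customer_key": 187, "customer_id": "C42", "city": "Denver",
     "effective": date(2022, 7, 1), "expires": date(9999, 12, 31)},
]


def key_in_effect(customer_id, event_date):
    """Return the surrogate key whose effective span contains the event date."""
    for row in customer_dim:
        if (row["customer_id"] == customer_id
                and row["effective"] <= event_date <= row["expires"]):
            return row["customer_key"]
    raise LookupError(f"no dimension row for {customer_id} on {event_date}")


# A fact that arrives long after the event it describes.
late_fact = {"customer_id": "C42", "order_date": date(2022, 5, 14), "amount": 250.0}
late_fact["customer_key"] = key_in_effect(late_fact["customer_id"],
                                          late_fact["order_date"])

# The key of the current row, which would be the wrong context for this fact.
current_key = key_in_effect("C42", date.today())
print(late_fact["customer_key"], current_key)
```

Here the delayed order is tied to the Austin version of the customer (key 101) even though the current row is the Denver version (key 187).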
Multiple Currency Facts

Fact tables that record financial transactions in multiple currencies should contain a pair of columns for every financial fact in the row.
This fact table also must have a currency dimension to identify the transaction’s true currency.

ADVANCED DIMENSION TECHNIQUES

Dimension-to-Dimension Table Joins

Dimensions can contain references to other dimensions.
Although these relationships can be modeled with outrigger dimensions, in some cases, the existence of a foreign key to the outrigger dimension in the base dimension can result in explosive growth of the base dimension because type 2 changes in the outrigger force corresponding type 2 processing in the base dimension.

Multivalued Dimensions and Bridge Tables

In a classic dimensional schema, each dimension attached to a fact table has a single value consistent with the fact table’s grain.
But there are a number of situations in which a dimension is legitimately multivalued.

Time Varying Multivalued Bridge Tables

A multivalued bridge table may need to be based on a type 2 slowly changing dimension.

Behavior Tag Time Series

Almost all text in a data warehouse is descriptive text in dimension tables.
Data mining customer cluster analyses typically result in textual behavior tags, often identified on a periodic basis.

Behavior Study Groups

Complex customer behavior can sometimes be discovered only by running lengthy iterative analyses.
The results of the complex behavior analyses, however, can be captured in a simple table, called a study group, consisting only of the customers’ durable keys.

Aggregated Facts as Dimension Attributes

Business users are often interested in constraining the customer dimension based on aggregated performance metrics, such as filtering on all customers who spent over a certain dollar amount during last year or perhaps over the customer’s lifetime.
Selected aggregated facts can be placed in a dimension as targets for constraining and as row labels for reporting.

Text Comments Dimension

comments’ cardinality matches the number of unique transactions) with a corresponding foreign key in the fact table.

Multiple Time Zones

To capture both universal standard time, as well as local times in multi-time zone applications, dual foreign keys should be placed in the affected fact tables that join to two role-playing date (and potentially time-of-day) dimension tables.

Measure Type Dimensions

Sometimes when a fact table has a long list of facts that is sparsely populated in any individual row, it is tempting to create a measure type dimension that collapses the fact table row down to a single generic fact identified by the measure type dimension.
Although it removes all the empty fact columns, it multiplies the size of the fact table by the average number of occupied columns in each row, and it makes intra-column computations much more difficult.
This technique is acceptable when the number of potential facts is extreme (in the hundreds), but less than a handful would be applicable to any given fact table row.

Step Dimensions

Sequential processes, such as web page events, normally have a separate row in a transaction fact table for each step in a process.
A step dimension is used that shows what step number is represented by the current step and how many more steps were required to complete the session.

Hot Swappable Dimensions

Hot swappable dimensions are used when the same fact table is alternatively paired with different copies of the same dimension.

Abstract Generic Dimensions

Some modelers are attracted to abstract generic dimensions. For example, their schemas include a single generic location dimension rather than embedded geographic attributes in the store, warehouse, and customer dimensions.

Dynamic Value Bands

A dynamic value banding report is organized as a series of report row headers that define a progressive set of varying-sized ranges of a target numeric fact.
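
A dynamic value banding report can be sketched with a small band definition table joined to the fact table by a pair of range comparisons. This is a hedged illustration using Python's sqlite3 module; the tables `account_fact` and `band_definition`, their columns, and the band boundaries are assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE account_fact (account_key INTEGER, balance REAL);
INSERT INTO account_fact VALUES (1, 50), (2, 400), (3, 999.99), (4, 1000), (5, 7500);

-- Each row defines one report row header: a named range of the target fact.
CREATE TABLE band_definition (
    band_group  TEXT,
    band_name   TEXT,
    lower_value REAL,   -- inclusive
    upper_value REAL    -- exclusive
);
INSERT INTO band_definition VALUES
    ('balance bands', '0 to <500',    0,    500),
    ('balance bands', '500 to <1000', 500,  1000),
    ('balance bands', '1000 and up',  1000, 1e12);
""")

rows = conn.execute("""
    SELECT b.band_name, COUNT(*) AS accounts, SUM(f.balance) AS total_balance
    FROM band_definition AS b
    JOIN account_fact AS f
      ON f.balance >= b.lower_value AND f.balance < b.upper_value
    WHERE b.band_group = 'balance bands'
    GROUP BY b.band_name, b.lower_value
    ORDER BY b.lower_value
""").fetchall()

for band_name, accounts, total_balance in rows:
    print(band_name, accounts, total_balance)
```

Because the bands live in a table rather than in the query text, a different set of varying-sized ranges can be reported by adding rows under another `band_group` value.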
NOTE A careful grain statement determines the primary dimensionality of the fact table. You then add more dimensions to the fact table if these additional dimensions naturally take on only one value under each combination of the primary dimensions. If the additional dimension violates the grain by causing additional fact rows to be generated, the dimension needs to be disqualified or the grain statement needs to be revisited.

Step 4: Identify the Facts

The fourth and final step in the design is to make a careful determination of which facts will appear in the fact table.

The grain of atomic transaction fact tables can be succinctly expressed in the context of the transaction, such as one row per transaction or one row per transaction line.
Because these fact tables record a transactional event, they are often sparsely populated. In our case study, we certainly wouldn’t sell every product in every shopping cart.
Even though transaction fact tables are unpredictably and sparsely populated, they can be truly enormous. Most billion and trillion row tables in a data warehouse are transaction fact tables.
Transaction fact tables tend to be highly dimensional.
The metrics resulting from transactional events are typically additive as long as they have been extended by the quantity amount, rather than capturing per unit metrics.
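
The difference between an extended metric and a per unit metric can be shown with a few hypothetical transaction lines (the products, quantities, and prices below are invented for the example). The extended amount can be summed across any set of rows, while summing the unit price gives a number with no business meaning.

```python
# Hypothetical sales transaction lines at the grain of one row per line.
lines = [
    {"product": "widget", "quantity": 3, "unit_price": 2.50},
    {"product": "widget", "quantity": 1, "unit_price": 2.50},
    {"product": "gadget", "quantity": 2, "unit_price": 10.00},
]

# Extend the per unit metric by the quantity before storing it as a fact.
for line in lines:
    line["extended_amount"] = line["quantity"] * line["unit_price"]

# Additive: the extended amounts sum to total sales dollars.
total_sales = sum(line["extended_amount"] for line in lines)

# Not additive: the sum of unit prices does not describe anything real.
summed_unit_price = sum(line["unit_price"] for line in lines)

print(total_sales, summed_unit_price)
```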
Date Dimension
Natural Keys
Value Chain Introduction

Most organizations have an underlying value chain of key business processes.
The value chain identifies the natural, logical flow of an organization’s primary activities.

are additive across some dimensions but not all.
The semi-additive nature of inventory balance facts is even more understandable if you think about your checking account balances.
NOTE: All measures that record a static level
(inventory levels, financial account balances, and
measures of intensity such as room temperatures) are
inherently non-additive across the date dimension and
possibly other dimensions. In these cases, the measure
may be aggregated across dates by averaging over the
number of time periods.
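
The averaging rule in the note above can be illustrated with a small set of hypothetical daily inventory snapshots (the dates, stores, and quantities are invented). Quantities on hand add up across stores within one day, but across dates they have to be averaged over the number of time periods.

```python
from collections import defaultdict

# Hypothetical periodic snapshot rows: (date, store, quantity_on_hand).
snapshots = [
    ("2024-01-01", "A", 100), ("2024-01-01", "B", 40),
    ("2024-01-02", "A", 90),  ("2024-01-02", "B", 50),
    ("2024-01-03", "A", 80),  ("2024-01-03", "B", 60),
]

# Additive across the store dimension: total on hand for each day.
per_day = defaultdict(int)
for day, store, quantity in snapshots:
    per_day[day] += quantity

# Summing across dates overstates the inventory (the same units are counted daily).
naive_total = sum(per_day.values())

# Across dates, average over the number of time periods instead.
average_on_hand = naive_total / len(per_day)

print(dict(per_day), naive_total, average_on_hand)
```

Note that a plain SQL AVG over the raw rows would divide by the number of rows (six here), not by the number of dates (three), so the number of periods has to be counted explicitly.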
Inventory Models
Inventory Periodic Snapshot

The dimensions immediately fall out of this grain declaration: date, product, and store

NOTE Remember there’s more to life than transactions alone. Some form of a snapshot table to give a more cumulative view of a process often complements a transaction fact table.
Opportunity/Stakeholder Matrix
Governance Objectives
Conformed Facts
Type 1: Overwrite
With a type 3 response, you do not issue a new dimension row, but rather add a new column to capture the attribute change.
Type 3 is distinguished from type 2 because the pair of current and prior attribute values are regarded as true at the same time.

Type 7: Dual Type 1 and Type 2 Dimensions
- Quantity shipped: Number of cases of the particular line item’s product.
- Extended gross amount: Also known as extended list price because it is the quantity shipped multiplied by the list unit price.
- Extended allowance amount: Amount subtracted from the invoice line gross amount for deal-related allowances. The allowances are described in the adjoined deal dimension. The allowance amount is often called an off-invoice allowance.
- Extended discount amount: Amount subtracted for volume or payment term discounts. The discount descriptions are found in the deal dimension.
- Extended net amount: Amount the customer is expected to pay for this line item before tax. It is equal to the gross invoice amount less the allowances and discounts.

The following cost amounts, leading to a bottom-line contribution, are for internal consumption only:

- Extended fixed manufacturing cost: Amount identified by manufacturing as the pro rata fixed manufacturing cost of the invoice line’s product.
- Extended variable manufacturing cost: Amount identified by manufacturing as the variable manufacturing cost of the product on the invoice line.
- Extended storage cost: Cost charged to the invoice line for storage prior to being shipped to the customer.
- Extended distribution cost: Cost charged to the invoice line for transportation from the point of manufacture to the point of shipment. This cost is notorious for not being activity-based.
- Contribution amount: Extended net invoice less all the costs just discussed. This is not the true bottom line of the overall company because general and administrative expenses and other financial adjustments have not been made, but it is important nonetheless. This column sometimes has alternative labels, such as margin, depending on the company culture.

Audit Dimension

The invoice line-item design is one of the most powerful because it provides a detailed look at customers, products, revenues, costs, and bottom-line profit in one schema.

Accumulating Snapshot for Order Fulfillment Pipeline

The order management process can be thought of as a pipeline, especially in a build-to-order manufacturing business.
Periodic snapshots would provide insight into the amount of product sitting in the pipeline, such as the backorder or finished goods inventories, or the amount of product flowing through a pipeline spigot during a predefined interval. The accumulating snapshot helps you better understand the current state of an order, as well as product movement velocities to identify pipeline bottlenecks and inefficiencies.
The fundamental difference between accumulating snapshots and other fact tables is that you can revisit and update existing fact table rows as more information becomes available.

NOTE Accumulating snapshot fact tables typically have multiple dates representing the major milestones of the process. However, just because a fact table has several dates doesn’t dictate that it is an accumulating snapshot. The primary differentiator of an accumulating snapshot is that you revisit the fact rows as activity occurs.

The accumulating snapshot technique is especially useful when the product moving through the pipeline is uniquely identified, such as an automobile with a vehicle identification number, electronics equipment with a serial number, lab specimens with an identification number, or process manufacturing batches with a lot number.
The accumulating snapshot helps you understand throughput and yield.

Accumulating Snapshots and Type 2 Dimensions

Accumulating snapshots present the latest state of a workflow or pipeline. If the dimensions associated with an accumulating snapshot contain type 2 attributes, the fact table should be updated to reference the most current surrogate dimension key for active pipelines.

Lag Calculations

represent basic measures of fulfillment efficiency.
You could build a view on this fact table that
calculated a large number of these date differences
and presented them as if they were stored in the
underlying table. These view columns could
include metrics such as orders to manufacturing
release lag, manufacturing release to finished
goods lag, and order to shipment lag, depending on
the date spans monitored by the organization.
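
A view of this kind can be sketched with Python's sqlite3 module. The table `order_fulfillment_fact`, its ISO-formatted date columns, and the view name are assumptions for the illustration; a production schema would more likely carry date dimension foreign keys alongside the date/time stamps. The view computes the lags on the fly so they look like stored columns to the query tool.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE order_fulfillment_fact (
    order_number        TEXT PRIMARY KEY,
    order_date          TEXT,
    mfg_release_date    TEXT,
    finished_goods_date TEXT,
    ship_date           TEXT   -- NULL until the milestone is reached
);
INSERT INTO order_fulfillment_fact VALUES
    ('SO-1', '2024-03-01', '2024-03-04', '2024-03-10', '2024-03-12'),
    ('SO-2', '2024-03-02', '2024-03-03', NULL, NULL);

-- Date differences in whole days, presented as if they were stored facts.
CREATE VIEW order_fulfillment_lags AS
SELECT order_number,
       CAST(julianday(mfg_release_date) - julianday(order_date) AS INTEGER)
           AS order_to_mfg_release_lag,
       CAST(julianday(finished_goods_date) - julianday(mfg_release_date) AS INTEGER)
           AS mfg_release_to_finished_goods_lag,
       CAST(julianday(ship_date) - julianday(order_date) AS INTEGER)
           AS order_to_shipment_lag
FROM order_fulfillment_fact;
""")

lags = conn.execute(
    "SELECT * FROM order_fulfillment_lags ORDER BY order_number"
).fetchall()
for row in lags:
    print(row)
```

Lags for milestones that have not happened yet come back as NULL (None in Python), which keeps open orders out of averages until the accumulating snapshot row is revisited.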
Most modern general ledger systems include the capability to integrate budget data into the general ledger.
Within most organizations, the budgeting process can be viewed as a series of events.
Budgets are becoming more dynamic because there are budget adjustments as the year progresses, reflecting changes in business conditions or the realities of actual spending versus the original budget.
The facts in such a “status report” are all semi-additive balances, rather than fully additive facts.
The account dimension is also a reused dimension.
The budget line item identifies the purpose of the proposed spending, such as employee wages or office supplies.
The budget fact table has a single budget amount fact that is fully additive.

table, while failing to meet user requirements that demand the ability to dive into more granular data.

NOTE When facts from multiple business processes are combined in a consolidated fact table, they must live at the same level of granularity and dimensionality. Because the separate facts seldom naturally live at a common grain, you are forced to eliminate or aggregate some dimensions to support the one-to-one correspondence, while retaining the atomic data in separate fact tables. Project teams should not create artificial facts or dimensions in an attempt to force-fit the consolidation of differently grained fact data.

Role of OLAP and Packaged Analytic Solutions

OLAP products have been used extensively for financial reporting, budgeting, and consolidation applications.
OLAP cubes can deliver fast query performance that is critical for executive usage.
OLAP is well suited to handle complicated organizational rollups, as well as complex calculations, including inter-row manipulations.
OLAP cubes often also readily support complex security models, such as limiting access to detailed data while providing more open access to summary metrics.

Dimension Attribute Hierarchies

A hierarchy is defined by a series of many-to-one relationships.

Fixed Depth Positional Hierarchies

In the budget chain, the calendar levels are familiar fixed depth position hierarchies. As the name suggests, a fixed position hierarchy has a fixed set of levels, all with meaningful labels.

Ragged Variable Depth Hierarchies

In the budget use case, the organization structure is an excellent example of a ragged hierarchy of indeterminate depth.
We often refer to the hierarchical structure as a “tree” and the individual organizations in that tree as “nodes.”
The classic way to represent a parent/child tree structure is by placing recursive pointers in the organization dimension from each row to its parent.
The highest parent flag in the map table means the particular path comes from the highest parent in the tree. The lowest child flag means the particular path ends in a “leaf node” of the tree.

NOTE The article “Building Hierarchy Bridge Tables” (available at www.kimballgroup.com under the Tools and Utilities tab for this book title) provides a code example for building the hierarchy bridge table described in this section.

Time Varying Ragged Hierarchies

The ragged hierarchy bridge table can accommodate slowly changing hierarchies with the addition of two date/time stamps.

WARNING When using the bridge, the query must always constrain to a single date/time to “freeze” the bridge table to a single consistent view of the hierarchy. Failing to constrain in this way would result in multiple paths being fetched that could not exist at the same time.

Modifying Ragged Hierarchies

The organization map bridge table can easily be modified.
In the bridge table, only the paths directly involved in the change are affected. All other paths are untouched.

Alternative Ragged Hierarchy Modeling Approaches

There are at least two other ways to model a ragged hierarchy, both involving clever columns placed in the organization dimension.
There are two disadvantages to these schemes:
1. the definition of the hierarchy is locked into the dimension and cannot easily be replaced.
2. both of these schemes are vulnerable to a relabeling disaster in which a large part of the tree must be relabeled due to a single small change.

Another similar scheme, known to computer scientists as the modified preordered tree traversal approach, numbers the tree.
Leaf nodes can be found where Left and Right differ by 1, meaning there aren’t any children.

Advantages of the Bridge Table Approach for Ragged Hierarchies

In particular, the bridge table allows:
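
To make the bridge table itself concrete, here is a minimal Python sketch that derives bridge rows from recursive parent pointers. It is not the code from the Kimball Group article mentioned earlier; the organization names, the `build_bridge` helper, the inclusion of zero-depth self paths, and the sample budget amounts are all assumptions for illustration.

```python
# Hypothetical organization dimension using recursive parent pointers.
parent_of = {
    "Corporate": None,
    "East": "Corporate",
    "West": "Corporate",
    "East-Retail": "East",
}


def build_bridge(parent_of):
    """Return one row per path from every ancestor (or the node itself) to a node."""
    has_children = {parent for parent in parent_of.values() if parent is not None}
    rows = []
    for node in parent_of:
        ancestor, depth = node, 0
        while ancestor is not None:
            rows.append({
                "parent": ancestor,
                "child": node,
                "depth": depth,
                # Path starts at the top of the tree.
                "highest_parent_flag": parent_of[ancestor] is None,
                # Path ends in a leaf node.
                "lowest_child_flag": node not in has_children,
            })
            ancestor, depth = parent_of[ancestor], depth + 1
    return rows


bridge = build_bridge(parent_of)

# Hypothetical budget facts recorded against individual organizations.
budget_fact = {"Corporate": 10.0, "East": 20.0, "West": 30.0, "East-Retail": 5.0}


def rollup(parent):
    """Sum facts for an organization and everything beneath it via the bridge."""
    return sum(budget_fact[row["child"]] for row in bridge if row["parent"] == parent)


print(len(bridge), rollup("East"), rollup("Corporate"))
```

With the bridge in place, rolling up any subtree is a single constraint on the parent column, regardless of how deep or ragged the tree is.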
Business users are often interested in constraining E: Occasional customer, poor credit
the customer dimension based on aggregated
performance metrics, such as filtering on all F: Former good customer, not seen recently G:
customers who spent more than a certain dollar Frequent window shopper, mostly unproductive
amount during last year.
H: Other
Providing aggregated facts as dimension attributes
is sure to be a crowd-pleaser with the business John Doe: C C C D D A A A B B T
users
This time series of behavior tags is unusual
Segmentation Attributes and Scores because although it comes from a regular
periodic measurement process, the observed
Some of the most powerful attributes in a
“values” are textual. The behavior tags are not
customer dimension are segmentation
numeric and cannot be computed or averaged,
classifications.
but they can be queried.
For an individual customer, they may include: Behavior tags should not be stored as regular
facts. The main use of behavior tags is
Gender Ethnicity formulating complex query patterns like the
Age or other life stage classifications example in the previous paragraph.
Income or other lifestyle classifications
Status (such as new, active, inactive, and In addition to the separate columns for each behavior
closed) tag time period, it would be a good idea to create a
Referring sources single attribute with all the behavior tags concatenated
Business-specific market segment (such as a together, such as CCCDDAAABB. This column would
preferred customer identifier support wild card searches for exotic patterns, such as
“D followed by a B.”
Statistical segmentation models typically generate
these scores which cluster customers in a variety of NOTE In addition to the customer dimension’s time
ways, such as based on their purchase behavior, series of behavior tags, it would be reasonable to
payment behavior, propensity to churn, or probability to include the contemporary behavior tag value in a
default. minidimension to analyze facts by the behavior tag in
effect when the fact row was loaded.
Relationship Between Data Mining and DW/BI System

The data mining team can be a great client of the data warehouse, and
especially great users of customer behavior data.

Counts with Type 2 Dimension Changes

Businesses frequently want to count customers based on their attributes
without joining to a fact table. If you used type 2 to track customer
dimension changes, you need to be careful to avoid overcounting because
you may have multiple rows in the customer dimension for the same
individual.

Behavior Tag Time Series

One popular approach for scoring and profiling customers looks at the
recency (R), frequency (F), and intensity (I) of the customer’s behavior.
These are known as the RFI measures; sometimes intensity is replaced with
monetary (M), so it’s also known as RFM.

Recency is how many days it has been since the customer last ordered or
visited your site.
Frequency is how many times the customer has ordered or visited, typically
in the past year.
Intensity is how much money the customer has spent over the same time
period.
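As a rough sketch, the recency, frequency, and intensity measures can be
derived from order history as shown here. The (customer_id, order_date,
amount) record layout, the function name rfm_measures, the sample orders,
and the fixed 365-day window are assumptions for illustration; the notes
only say that frequency is typically measured over the past year and
intensity over the same period.

```python
from datetime import date, timedelta


def rfm_measures(orders, as_of, window_days=365):
    """Return {customer_id: (recency_days, frequency, monetary)}.

    orders: iterable of (customer_id, order_date, amount) tuples.
    Recency looks at all orders up to as_of; frequency and monetary
    only count orders inside the trailing window.
    """
    window_start = as_of - timedelta(days=window_days)
    last_order = {}
    frequency = {}
    monetary = {}
    for customer_id, order_date, amount in orders:
        if order_date > as_of:
            continue  # ignore anything after the measurement date
        if customer_id not in last_order or order_date > last_order[customer_id]:
            last_order[customer_id] = order_date
        if order_date > window_start:
            frequency[customer_id] = frequency.get(customer_id, 0) + 1
            monetary[customer_id] = monetary.get(customer_id, 0.0) + amount
    return {
        cid: ((as_of - last).days, frequency.get(cid, 0), monetary.get(cid, 0.0))
        for cid, last in last_order.items()
    }


# Invented sample data.
orders = [
    ("C1", date(2023, 12, 1), 120.0),
    ("C1", date(2023, 6, 15), 80.0),
    ("C2", date(2022, 3, 10), 40.0),  # older than the one-year window
]
scores = rfm_measures(orders, as_of=date(2023, 12, 31))
for customer_id, (recency, freq, money) in sorted(scores.items()):
    print(customer_id, recency, freq, money)
```

Scores like these are the kind of numeric input a data mining team might
then cluster into textual behavior tags; that clustering step is not shown.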
The data mining professional may come back with a list of behavior tags
like the following, which is drawn from a slightly more complicated
scenario that includes credit behavior and returns:

A: High volume repeat customer, good credit, few product returns
B: High volume repeat customer, good credit, many product returns
C: Recent new customer, no established credit pattern

When counting customers in a type 2 dimension, doing a COUNT DISTINCT on a
unique customer identifier is a possibility, assuming the attribute is
indeed unique and durable.

Each point of contact for a large commercial customer is associated with a
specific role. Because the number of contacts is unpredictable but
possibly large, a bridge table design is a convenient way to handle this
situation.

Outrigger for Low Cardinality Attribute Set

Generally, snowflaking is not recommended in a DW/BI environment because
it almost always makes the user presentation more complex, in addition to
negatively impacting browsing performance.

In one example, the dimension outrigger is a set of data from an external
data provider consisting of 150 demographic and socio-economic attributes
regarding the customers’ county of residence. Rather than repeating this
large block of data for every customer within a county, opt to model it as
an outrigger.
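The county demographics outrigger can be sketched with sqlite3 as follows.
All table names, column names, and rows are hypothetical, and only two
stand-in attributes are shown in place of the 150 mentioned in the notes.
The point is that the demographic block is stored once per county and the
customer dimension carries only a key to it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE county_demographics_outrigger (
    county_demographics_key INTEGER PRIMARY KEY,
    county_name TEXT,
    median_household_income INTEGER,  -- stand-in for one of the many attributes
    pct_urban_population REAL         -- another stand-in attribute
);
CREATE TABLE customer_dim (
    customer_key INTEGER PRIMARY KEY,
    customer_name TEXT,
    county_demographics_key INTEGER
        REFERENCES county_demographics_outrigger (county_demographics_key)
);
INSERT INTO county_demographics_outrigger VALUES (1, 'Adams', 52000, 0.35);
INSERT INTO county_demographics_outrigger VALUES (2, 'Baker', 71000, 0.80);
INSERT INTO customer_dim VALUES (101, 'Customer A', 1);
INSERT INTO customer_dim VALUES (102, 'Customer B', 1);
INSERT INTO customer_dim VALUES (103, 'Customer C', 2);
""")

# Customers pick up county attributes through the outrigger key.
rows = conn.execute("""
    SELECT c.customer_name, o.county_name, o.median_household_income
    FROM customer_dim AS c
    JOIN county_demographics_outrigger AS o
      ON o.county_demographics_key = c.county_demographics_key
    ORDER BY c.customer_key
""").fetchall()

# The demographic block is stored once per county, not once per customer.
outrigger_row_count = conn.execute(
    "SELECT COUNT(*) FROM county_demographics_outrigger"
).fetchone()[0]

print(rows)
print(outrigger_row_count)
```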
WARNING Dimension outriggers are permissible, but they should be the
exception rather than the rule. A red warning flag should go up if your
design is riddled with outriggers; you may have succumbed to the
temptation to overly normalize the design.

Customer Hierarchy Considerations

One of the most challenging aspects of dealing with commercial customers
is modeling their internal organizational hierarchy.

Bridge Tables for Multivalued Dimensions

A fundamental tenet of dimensional modeling is to decide on the grain of
the fact table, and then carefully add dimensions and facts to the design
that are true to the grain.
When faced with a multivalued dimension, there are two basic choices: a
positional design or bridge table design.

Complex Customer Behavior

Customer behavior can be very complex.

Behavior Study Groups for Cohorts

In other situations, you may want to capture the set of customers from a
query or exception report, such as the top 100 customers from last year,
customers who spent more than $1,000 last month, or customers who received
a specific test solicitation, and then use that group of customers, called
a behavior study group, for subsequent analyses without reprocessing to
identify the initial condition.
To create a behavior study group, run a query (or series of queries) to
identify the set of customers you want to further analyze, and then
capture the customer durable keys of the identified set as an actual
physical table consisting of a single customer key column.
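Creating a behavior study group and then reusing it might look like the
following sqlite3 sketch. The sales_fact and returns_fact tables, the
study_group_big_spenders table name, the column names, and all rows are
invented for illustration; the qualifying rule (more than $1,000 spent in
one month) echoes one of the examples in the notes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales_fact (customer_durable_key INTEGER, sale_month TEXT, amount REAL);
CREATE TABLE returns_fact (customer_durable_key INTEGER, return_month TEXT, amount REAL);
INSERT INTO sales_fact VALUES (1, '2024-01', 700), (1, '2024-01', 600),
                              (2, '2024-01', 300), (3, '2024-01', 1500),
                              (1, '2024-02', 50);
INSERT INTO returns_fact VALUES (1, '2024-02', 40), (2, '2024-02', 25),
                                (3, '2024-02', 10);

-- Step 1: run the qualifying query once and persist only the durable keys.
CREATE TABLE study_group_big_spenders AS
SELECT customer_durable_key
FROM sales_fact
WHERE sale_month = '2024-01'
GROUP BY customer_durable_key
HAVING SUM(amount) > 1000;
""")

group_keys = [
    key
    for (key,) in conn.execute(
        "SELECT customer_durable_key FROM study_group_big_spenders ORDER BY 1"
    )
]

# Step 2: reuse the captured keys to constrain a different fact table,
# without rerunning the original spending analysis.
returns_by_group = conn.execute("""
    SELECT r.customer_durable_key, SUM(r.amount)
    FROM returns_fact AS r
    JOIN study_group_big_spenders AS g
      ON g.customer_durable_key = r.customer_durable_key
    GROUP BY r.customer_durable_key
    ORDER BY r.customer_durable_key
""").fetchall()

print(group_keys)
print(returns_by_group)
```

Customer 2 has returns but never enters the result, because the study
group table acts as the only filter on the second query.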
NOTE The secret to building complex behavioral study group queries is to
capture the keys of the customers or products whose behavior you are
tracking. You then use the captured keys to subsequently constrain other
fact tables without having to rerun the original behavior analysis.

The exceptional simplicity of study group tables allows them to be
combined with union, intersection, and set difference operations. For
example, a set of problem customers this month can be intersected with the
set of problem customers from last month to identify customers who were
problems for two consecutive months.

Positional designs are very attractive because the multivalued dimension
is spread out into named columns that are easy to query.
The positional design approach isn’t very scalable.
The bridge table approach to multivalued dimensions is powerful but comes
with a big compromise. The bridge table removes the scalability and null
value objections.

WARNING Be aware that complex queries using bridge tables may require SQL
that is beyond the normal reach of BI tools.

Bridge Table for Sparse Attributes
Organizations are increasingly collecting demographics and status
information about their customers, but the traditional fixed column
modeling approach for handling these attributes becomes difficult to scale
with hundreds of attributes.
Positional designs can be scaled up to perhaps 100 or so columns before
the databases and user interfaces become awkward or hard to maintain.
Columnar databases are well suited to these kinds of designs because new
columns can be easily added with minimal disruption to the internal
storage of the data, and the low-cardinality columns containing only a few
discrete values are dramatically compressed.

Bridge Table for Multiple Customer Contacts

Large commercial customers have many points of contact, including decision
makers, purchasing agents, department heads, and user liaisons.

The customer dimension should represent the “best” source available in the
enterprise.

Step Dimension for Sequential Behavior

Most DW/BI systems do not have good examples of sequential processes.
The step dimension is an abstract dimension defined in advance.
Using the step dimension, a specific page can immediately be placed into
one or more understandable contexts.

Timespan Fact Tables

Was the customer on fraud alert when denied an extension of credit? How
long had he been on fraud alert? How many times in the past two years has
he been on fraud alert? How many customers were on fraud alert at some
point in the past two years?
All these questions can be addressed if you carefully manage the
transaction fact table containing all customer events. The key modeling
step is to include a pair of date/time stamps.
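A minimal sketch of how a pair of date/time stamps lets a query find the
event in effect at any moment, such as whether a fraud alert was active
when credit was denied. The customer_event_fact table, its columns, the
sample rows, and the far-future end stamp on the most recent row are all
assumptions for illustration, as is the choice to treat each span as
begin-inclusive and end-exclusive so that adjacent spans never overlap.

```python
import sqlite3

FAR_FUTURE = "9999-12-31 00:00:00"  # assumed sentinel for the still-current row

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer_event_fact (
        customer_key INTEGER,
        event_type TEXT,
        begin_effective TEXT,  -- ISO-8601 text compares chronologically
        end_effective TEXT
    )
""")
conn.executemany(
    "INSERT INTO customer_event_fact VALUES (?, ?, ?, ?)",
    [
        (7, "ACCOUNT OPENED", "2023-01-10 09:00:00", "2023-03-01 14:30:00"),
        (7, "FRAUD ALERT SET", "2023-03-01 14:30:00", "2023-03-20 08:00:00"),
        (7, "FRAUD ALERT CLEAR", "2023-03-20 08:00:00", FAR_FUTURE),
    ],
)


def event_in_effect(conn, customer_key, at_time):
    """Return the event whose [begin, end) span covers at_time, or None."""
    row = conn.execute(
        """
        SELECT event_type
        FROM customer_event_fact
        WHERE customer_key = ?
          AND begin_effective <= ?
          AND end_effective > ?
        """,
        (customer_key, at_time, at_time),
    ).fetchone()
    return row[0] if row else None


# Was the customer on fraud alert at the moment credit was denied?
during_alert = event_in_effect(conn, 7, "2023-03-05 12:00:00")
# A time exactly on a boundary belongs to the later span only.
at_boundary = event_in_effect(conn, 7, "2023-03-20 08:00:00")
# Before the first event there is nothing in effect.
before_open = event_in_effect(conn, 7, "2022-12-31 00:00:00")

print(during_alert, at_boundary, before_open)
```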
A national change of address (NCOA) process should be integrated to ensure
address changes are captured.

Back Room Administration of Dual Date/Time Stamps
For a given customer, the date/time stamps on the sequence of transactions
must form a perfect unbroken sequence with no gaps.
It is tempting to make the end effective date/time stamp be one “tick”
less than the beginning effective date/time stamp of the next transaction,
so the query SQL can use the BETWEEN syntax rather than uglier explicit
range constraints.
Using the pair of date/time stamps requires a two-step process whenever a
new transaction row is entered. In the first step, the end effective
date/time stamp of the most current transaction must be set to a
fictitious date/time far in the future.
In the second step, after the new transaction is entered into the
database, the ETL process must retrieve the previous transaction and set
its end effective date/time to the date/time of the newly entered
transaction.

Tagging Fact Tables with Satisfaction Indicators

Although profitability might be the most important key performance
indicator in many organizations, customer satisfaction is a close second.
Textual satisfaction data is generally modeled in two ways, depending on
the number of satisfaction attributes and the sparsity of the incoming
data.

Tagging Fact Tables with Abnormal Scenario Indicators

Accumulating snapshot fact tables depend on a series of dates that
implement the “standard scenario” for the pipeline process.

Customer Data Integration Approaches

In typical environments with many customer-facing processes, you need to
choose between two approaches: a single customer dimension derived from
all the versions of customer source system records or multiple customer
dimensions tied together by conformed attributes.

Avoiding Fact-to-Fact Table Joins

DW/BI systems should be built process-by-process, not
department-by-department, on a foundation of conformed dimensions to
support integration.
Because the sales and support tables both contain a customer foreign key,
you can further imagine joining both fact tables to a common customer
dimension to simultaneously summarize sales facts along with support facts
for a given customer.
Simultaneously joining the solicitations fact table to the customer
dimension, which is, in turn, joined to the responses fact table, does not
return the correct answer in a relational DBMS due to the cardinality
differences. Fortunately, this problem is easily avoided.
You simply use the drill-across technique: query the solicitations table
and responses table in separate queries and then outer join the two answer
sets.
The drill-across approach has additional benefits for better controlling
performance parameters, in addition to supporting queries that combine
data from fact tables in different physical locations.

WARNING Be very careful when simultaneously joining a single dimension
table to two fact tables of different cardinality. In many cases,
relational engines return the “wrong” answer.

Low Latency Reality Check

Generally, data quality suffers as the data is delivered closer to real
time.
Business users may automatically think that the faster the information
arrives in the DW/BI system, the better. But decreasing the latency
increases the data quality problems.
Low latency data delivery can be very valuable, but the business users
need to be informed about these trade-offs.
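Returning to the drill-across technique described under Avoiding
Fact-to-Fact Table Joins, the following sqlite3 sketch contrasts a single
query that joins both fact tables through the customer dimension with two
separate queries whose answer sets are outer joined afterward. The table
names, column names, rows, and the summarize helper are hypothetical, and
the outer join is done in Python here purely for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_dim (customer_key INTEGER PRIMARY KEY, customer_name TEXT);
CREATE TABLE solicitations_fact (customer_key INTEGER, solicitation_count INTEGER);
CREATE TABLE responses_fact (customer_key INTEGER, response_count INTEGER);
INSERT INTO customer_dim VALUES (1, 'Customer A'), (2, 'Customer B');
-- Customer A: three solicitation rows and two response rows.
-- Customer B: one solicitation row and no responses.
INSERT INTO solicitations_fact VALUES (1, 1), (1, 1), (1, 1), (2, 1);
INSERT INTO responses_fact VALUES (1, 1), (1, 1);
""")

# Single query across both fact tables: every solicitation row pairs with
# every response row for the same customer, so the sums are inflated, and
# the inner join drops the customer who never responded.
naive = conn.execute("""
    SELECT c.customer_name, SUM(s.solicitation_count), SUM(r.response_count)
    FROM customer_dim AS c
    JOIN solicitations_fact AS s ON s.customer_key = c.customer_key
    JOIN responses_fact AS r ON r.customer_key = c.customer_key
    GROUP BY c.customer_name
""").fetchall()


def summarize(conn, fact_table, measure):
    """Aggregate one fact table by customer name in its own query."""
    # fact_table and measure are fixed literals in this sketch, not user input.
    sql = (
        f"SELECT c.customer_name, SUM(f.{measure}) "
        f"FROM {fact_table} AS f "
        f"JOIN customer_dim AS c ON c.customer_key = f.customer_key "
        f"GROUP BY c.customer_name"
    )
    return dict(conn.execute(sql).fetchall())


solicited = summarize(conn, "solicitations_fact", "solicitation_count")
responded = summarize(conn, "responses_fact", "response_count")

# Outer join the two answer sets on the shared customer name.
drill_across = {
    name: (solicited.get(name, 0), responded.get(name, 0))
    for name in sorted(set(solicited) | set(responded))
}

print(naive)
print(drill_across)
```

With these rows the single query reports 6 solicitations and 6 responses
for Customer A and omits Customer B, while drill-across reports 3 and 2
for Customer A and 1 and 0 for Customer B.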