First Part 27 Pages
Integration to Analytics
www.biguidebook.com
November 2014
Imprint: Morgan Kaufmann
Print Book ISBN : 9780124114616
eBook ISBN : 9780124115286
Chapter 9
Dimensional Modeling
Outline
• Introduction to dimensional modeling.
• High-level view of a dimensional model
- Facts
- Dimensions
- Schemas
• ER vs dimensional modeling
• Purpose of dimensional modeling
• Advanced dimensional modeling
• Dimensional modeling recap
Introduction to Dimensional Modeling
• The purpose of dimensional modeling is to enable business intelligence (BI)
reporting, query, and analysis.
• It depicts business processes throughout an enterprise and organizes that data and its
structure in a logical way.
• It is much better suited to business intelligence (BI) applications and data
warehousing (DW) than entity-relationship (ER) modeling.
High-Level View of a Dimensional Model
• There are two key entities in a dimensional model:
● Facts (measures).
● Dimensions (context).
• Example:
High-Level View of a Dimensional Model (Example)
[Figure 9.1]
Facts
• A fact is a measurement of a business activity, such as a business event or transaction,
and is generally numeric.
• Facts can be aggregated or derived. For example, you can sum up the total revenue or
calculate the profitability of a set of sales transactions.
• Facts provide the measurements of how well or how poorly the business is
performing. A fact is also referred to as an organizational performance measure.
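The aggregation and derivation described above can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: the `revenue` and `cost` fields and all values are hypothetical.

```python
# Hypothetical sales transactions; field names and values are illustrative only.
sales = [
    {"revenue": 120.0, "cost": 80.0},
    {"revenue": 200.0, "cost": 150.0},
    {"revenue": 80.0, "cost": 50.0},
]

# Aggregated fact: total revenue across the set of transactions.
total_revenue = sum(row["revenue"] for row in sales)

# Derived fact: profit computed from two stored measures, then aggregated.
total_profit = sum(row["revenue"] - row["cost"] for row in sales)

# Derived ratio: computed from the totals, not by averaging per-row ratios.
profit_margin = total_profit / total_revenue

print(total_revenue, total_profit, profit_margin)
```

Storing only the base measures and deriving profitability at query time keeps the fact rows small and lets the same data answer different questions.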
Facts (Cont.)
• Fact tables are normalized and contain little redundancy.
• Fact table record counts can become very large. Ninety percent of the data in a
dimensional model is typically located in the fact tables.
• The key dimensional modeling design concerns when working with the data in fact
tables are how to minimize and standardize it and make it consistent.
• Fact tables are composed of two types of columns: keys and measures.
Fact table – keys
• The key of a fact table consists of a group of foreign keys (FK) that point to the
primary keys of the dimension tables associated with this fact table to enable
business analysis.
• The multipart key may be a subset of the foreign keys such as in our
example: DateKey, StoreKey, ProductKey, and CustomerKey may
uniquely identify each row in the sales fact table.
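A minimal sketch of such a multipart key, using Python's built-in `sqlite3` module. Only DateKey, StoreKey, ProductKey, and CustomerKey come from the example above; the table names, the remaining columns, and the sample rows are assumptions made for illustration.

```python
import sqlite3

# In-memory database. Table names and non-key columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE DimDate     (DateKey     INTEGER PRIMARY KEY, CalendarDate TEXT);
CREATE TABLE DimStore    (StoreKey    INTEGER PRIMARY KEY, StoreName    TEXT);
CREATE TABLE DimProduct  (ProductKey  INTEGER PRIMARY KEY, ProductName  TEXT);
CREATE TABLE DimCustomer (CustomerKey INTEGER PRIMARY KEY, CustomerName TEXT);
CREATE TABLE FactSales (
    DateKey       INTEGER NOT NULL REFERENCES DimDate(DateKey),
    StoreKey      INTEGER NOT NULL REFERENCES DimStore(StoreKey),
    ProductKey    INTEGER NOT NULL REFERENCES DimProduct(ProductKey),
    CustomerKey   INTEGER NOT NULL REFERENCES DimCustomer(CustomerKey),
    SalesRevenue  REAL,
    OrderQuantity INTEGER,
    PRIMARY KEY (DateKey, StoreKey, ProductKey, CustomerKey)
);
INSERT INTO DimDate     VALUES (20141101, '2014-11-01');
INSERT INTO DimStore    VALUES (1, 'Store A');
INSERT INTO DimProduct  VALUES (1, 'Widget');
INSERT INTO DimCustomer VALUES (1, 'Customer X');
""")


def try_insert_sale(row):
    """Insert one fact row; return False if a key constraint rejects it."""
    try:
        conn.execute("INSERT INTO FactSales VALUES (?, ?, ?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


first = try_insert_sale((20141101, 1, 1, 1, 19.99, 2))     # valid keys
duplicate = try_insert_sale((20141101, 1, 1, 1, 5.00, 1))  # same multipart key
orphan = try_insert_sale((20141101, 1, 1, 42, 5.00, 1))    # no such customer
print(first, duplicate, orphan)
```

The composite primary key rejects a second row with the same four key values, and the foreign key constraints reject a row that points at a dimension row that does not exist.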
[Figure 9.4]
Fact table – measures
• The second type of column in a fact table is the
actual measures of the business activity such as
the sales revenue and order quantity.
• Every measurement has a grain, which is the
level of detail in the measurement of an event.
For example, the grain of currency could be to
the dollar amount, or be more granular and
include cents.
• A measure's granularity is determined by its data source.
• It is now an established practice to store facts at the
lowest transactional level of detail that's available
from the transactional or operational systems.
[Figure 9.5]
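The reason for storing the finest available grain can be shown with the currency example above. This is a small sketch with made-up amounts: detail stored in cents can always be rolled up to dollars, but amounts stored only as whole dollars cannot recover the cents.

```python
# Hypothetical sale amounts stored at the finest grain available: cents.
amounts_cents = [1949, 2549, 1049]

# Rolling up from the fine grain keeps the exact total.
total_cents = sum(amounts_cents)
total_dollars_exact = total_cents / 100

# The same sales if only whole dollars had been stored (rounded to nearest).
amounts_dollars = [(cents + 50) // 100 for cents in amounts_cents]
total_from_dollars = sum(amounts_dollars)

# The coarse-grain total drifts from the exact one and cannot be corrected later.
print(total_dollars_exact, total_from_dollars)
```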
Fact table - types of facts
• After you’ve defined the measures and their grain in the facts, you need to determine the
numeric behavior of each measure stored in the fact table. There are three types of
measures: additive, semiadditive, and nonadditive.
• After you define measures and determine whether they are additive,
nonadditive, or semiadditive facts, you need to establish how they can be
analyzed in BI.
• It is the BI team’s responsibility to ensure that the business people performing
the analysis know what type of measure they are accessing to prevent the risk
of using data inappropriately.
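The slides name the three types without defining them, so the sketch below assumes the usual definitions: an additive measure can be summed across every dimension, a semiadditive measure can be summed across some dimensions but not others (typically not across time), and a nonadditive measure, such as a ratio, cannot be meaningfully summed at all. The data and field names are hypothetical.

```python
# Semiadditive: hypothetical daily account balance snapshots.
balances = [
    {"date": "2014-11-01", "account": "A", "balance": 100.0},
    {"date": "2014-11-01", "account": "B", "balance": 50.0},
    {"date": "2014-11-02", "account": "A", "balance": 120.0},
    {"date": "2014-11-02", "account": "B", "balance": 50.0},
]


def total_balance_on(date):
    """Summing across accounts for a single date is meaningful."""
    return sum(r["balance"] for r in balances if r["date"] == date)


# Across time, take the latest snapshot instead of summing the days.
latest_date = max(r["date"] for r in balances)
closing_balance = total_balance_on(latest_date)
summed_over_time = sum(r["balance"] for r in balances)  # not a real balance

# Additive and nonadditive: hypothetical sales rows.
sales = [
    {"revenue": 100.0, "profit": 10.0},
    {"revenue": 300.0, "profit": 90.0},
]
total_revenue = sum(r["revenue"] for r in sales)  # additive: safe to sum
total_profit = sum(r["profit"] for r in sales)    # additive: safe to sum

# Nonadditive: per-row margins must not be summed; recompute from the totals.
summed_margins = sum(r["profit"] / r["revenue"] for r in sales)
overall_margin = total_profit / total_revenue

print(closing_balance, summed_over_time, summed_margins, overall_margin)
```

A BI tool that lets people sum `balance` over dates, or sum a margin column, would produce numbers like `summed_over_time` and `summed_margins` here, which is the inappropriate use the slide warns about.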
Dimensions
• A dimension is an entity that establishes the business context for the
measures (facts) used by an enterprise.
• Dimensions define the who, what, where, and why of the dimensional
model, and group similar attributes into a category or subject area.
• Whereas facts are numeric, dimensions are descriptive in nature
(although some of those descriptions, such as a product list price, may
be numeric).
• Creating a dimension allows attributes to be stored in a single place,
rather than repeating them redundantly across the rows of the fact
tables (i.e., it eliminates redundancy).
Dimensions (cont.)
• From a business perspective, the key purpose of dimensions
is to use their attributes to filter and analyze data based on
performance measures.
[Figure 9.6]
Dimensions (cont.)
To be useful in analysis a dimensional attribute needs these key characteristics:
• Descriptive, so business people and those designing the BI applications can
understand it.
• Complete, with no missing values.
• Unique, because it’s critical that values are uniquely identifiable.
• Valid, so the data is useful to the business.
Dimension Hierarchy
• Another aspect of the business context created by
dimensions is that they are often hierarchical; they group
things together in ways that an enterprise would measure
itself.
• These hierarchies represent many-to-one relationships.
Examples of hierarchies include:
- Organizational structures, such as a marketing or sales
organization.
- Product or service categories.
- Geographic groupings such as sales territories.
- Time. Years break down into quarters, months, weeks,
etc.
• Using BI terminology, dimensions allow you to drill up,
down, and across.
[Figure 9.7]
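Drilling up and down a hierarchy amounts to grouping the same facts by a different level of the dimension. A minimal sketch using a time hierarchy follows; the date dimension rows, the level names, and the revenue values are hypothetical.

```python
from collections import defaultdict

# Hypothetical date dimension: each row maps a key to its hierarchy levels.
# Many months belong to one quarter, and many quarters to one year.
date_dim = {
    1: {"month": "2014-01", "quarter": "2014-Q1", "year": "2014"},
    2: {"month": "2014-02", "quarter": "2014-Q1", "year": "2014"},
    3: {"month": "2014-04", "quarter": "2014-Q2", "year": "2014"},
}

# Hypothetical fact rows: (date key, revenue).
facts = [(1, 100.0), (2, 150.0), (3, 200.0), (1, 50.0)]


def roll_up(level):
    """Total revenue grouped by one level of the time hierarchy."""
    totals = defaultdict(float)
    for date_key, revenue in facts:
        totals[date_dim[date_key][level]] += revenue
    return dict(totals)


by_year = roll_up("year")        # drill up to the coarsest level
by_quarter = roll_up("quarter")  # drill down one level
by_month = roll_up("month")      # drill down to the finest level
print(by_year, by_quarter, by_month)
```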
Dimension keys
Dimensions - surrogate and natural keys
• One of the best practices to emerge for dimensions is
using a surrogate key as the primary key as depicted
in Figure 9.8.
[Figure 9.8]
Dimensions - surrogate and natural keys (Cont.)
• The reasons to create a surrogate key are:
- When gathering dimensional data from multiple source systems, there are often inconsistent
or incompatible primary keys used across these systems.
- Primary keys from source systems often change over time with different naming or
numbering conventions being used at different times. Additionally, over time, source
applications may be replaced by newer systems, or mergers may create the need to replace
systems.
- Primary key consistency may be maintained by source systems for shorter periods than the
enterprise analytical needs dictate.
Dimensions - surrogate and natural keys (Cont.)
What is a smart key?
Operational and transactional systems sometimes define or identify items such as products with
smart keys.
These are alphanumeric strings, maybe 24 or 40 characters in length.
The character string is typically divided into substrings. The substrings have meaning, hence the
term smart key.
For example, the first three characters might designate what manufacturing plant the product was
built in. The next five characters might designate the materials that were used to construct the
product. The next 10 might designate the size or some other characteristics of the construction of
the product, and so on.
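The substring layout in this example can be sketched as a small parser. The segment widths (3, 5, and 10 characters) follow the example above; the field names, the sample key, and the treatment of the remaining characters are assumptions for illustration.

```python
def parse_smart_key(key):
    """Split a smart key into the segments described in the example.

    Layout assumed here: 3 characters for the plant, 5 for the materials,
    10 for size or construction characteristics, then any remainder.
    """
    if len(key) < 18:
        raise ValueError("smart key is shorter than the 18 characters expected")
    return {
        "plant": key[0:3],
        "materials": key[3:8],
        "construction": key[8:18],
        "remainder": key[18:],
    }


# Hypothetical 24-character smart key.
sample_key = "PLT" + "MAT01" + "SIZE000042" + "XXXXXX"
parsed = parse_smart_key(sample_key)
print(parsed)
```

Because the meaning lives inside the string, any change to the source system's coding convention changes the key itself, which is one reason not to use it as the dimension's primary key.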
Dimensions - surrogate and natural keys (Cont.)
• An additional best practice is to maintain the source
system’s primary key as an alternate key in the
dimension. This is also called the source system’s
natural key.
• An additional benefit of the surrogate key is that, being integer-based, it is an efficient data
type to index and join on in a relational model.
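One way to picture surrogate key assignment is a lookup that hands out the next integer for each new source key and keeps the source system's natural key alongside it. The class below is a sketch under that assumption; `DimensionKeyMap`, the source system names, and the sample keys are all hypothetical.

```python
class DimensionKeyMap:
    """Assigns integer surrogate keys and retains each natural key."""

    def __init__(self):
        self._next_key = 1
        self._by_natural = {}  # (source system, natural key) -> surrogate key
        self.rows = []         # dimension rows carrying both keys

    def surrogate_for(self, source_system, natural_key):
        lookup = (source_system, natural_key)
        if lookup not in self._by_natural:
            surrogate = self._next_key
            self._next_key += 1
            self._by_natural[lookup] = surrogate
            self.rows.append({
                "surrogate_key": surrogate,      # primary key of the dimension
                "source_system": source_system,
                "natural_key": natural_key,      # kept as an alternate key
            })
        return self._by_natural[lookup]


customers = DimensionKeyMap()
crm_key = customers.surrogate_for("crm", "1001")
erp_key = customers.surrogate_for("erp", "1001")    # same value, other system
crm_again = customers.surrogate_for("crm", "1001")  # already known
print(crm_key, erp_key, crm_again, customers.rows)
```

Two source systems that happen to use the same primary key value get distinct surrogate keys, while a repeat lookup returns the key already assigned. Deciding whether two source rows describe the same real-world customer is a separate matching step that this sketch does not attempt.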
Dimensions – not null primary keys
• The foreign keys used as the primary key in fact tables should never
contain null values.
• As an example of how a null value arises, suppose a row in a sales
fact table has a null value in the customer identifier column, which is the
foreign key linked to the customer dimension. The null was written
to that column because, while loading the data from the source systems,
the ETL process could not find a customer associated with that sale: the
value was unknown, missing, or invalid.
Dimensions – not null primary keys (Cont.)
• This condition clearly results in misleading analysis that potentially creates business risk. The best
practices that address this potential business risk include:
- Create row(s) in each dimension that are used when dimensional values are unknown,
missing, invalid, or other conditions in which referential integrity is not met.
- Because surrogate keys are conventionally positive integers, use negative
integers such as −999 for the keys of “missing” rows.
- Dimensional rows have surrogate keys along with attributes used for naming and describing
them. Create a standard name and description for these rows used across all dimensions, e.g.,
“Missing,” “Unknown,” or “Invalid.”
- At a minimum, designate one row per dimension table for missing values, but if it is important
to be able to identify different conditions, use multiple rows. If multiple conditions are
handled, use standard numbering and naming for each of these conditions
across all dimensions.
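These practices can be sketched as an ETL-style lookup that never returns null. The key −999 for “Missing” follows the slide; the −998 key for “Invalid”, the dimension contents, the natural key format, and the validity rule are assumptions made for illustration.

```python
MISSING_KEY = -999  # from the slide: negative key for the "Missing" row
INVALID_KEY = -998  # hypothetical second condition with its own standard row

# Hypothetical customer dimension, including the standard placeholder rows.
customer_dim = {
    MISSING_KEY: {"name": "Missing", "description": "Missing"},
    INVALID_KEY: {"name": "Invalid", "description": "Invalid"},
    1: {"name": "Customer X", "description": "Retail customer"},
}

# Natural key from the source system -> surrogate key in the dimension.
natural_to_surrogate = {"C-100": 1}


def lookup_customer_key(natural_key):
    """Return a surrogate key for the fact row; never return None."""
    if natural_key is None or natural_key == "":
        return MISSING_KEY
    if not natural_key.startswith("C-"):  # hypothetical validity rule
        return INVALID_KEY
    # A well-formed key with no dimension row is treated as missing here.
    return natural_to_surrogate.get(natural_key, MISSING_KEY)


keys = [lookup_customer_key(k) for k in ["C-100", None, "???", "C-999"]]
names = [customer_dim[k]["name"] for k in keys]
print(keys, names)
```

Every fact row then joins to some dimension row, so totals grouped by customer include a visible “Missing” or “Invalid” bucket instead of silently losing the affected sales.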