Modeling Dimension Tables in Warehouse - Microsoft Fabric - Microsoft Learn
Modeling Dimension Tables in Warehouse - Microsoft Fabric - Microsoft Learn
7 Note
This article forms part of the Dimensional modeling in Microsoft Fabric Data Warehouse
series of articles. This series focuses on guidance and design best practices related to
dimensional modeling in Microsoft Fabric Warehouse.
This article provides you with guidance and best practices for designing dimension tables in a
dimensional model. It provides practical guidance for Warehouse in Microsoft Fabric, which is an
experience that supports many T-SQL capabilities, like creating tables and managing data in
tables. So, you're in complete control of creating your dimensional model tables and loading
them with data.
7 Note
In this article, the term data warehouse refers to an enterprise data warehouse, which delivers
comprehensive integration of critical data across the organization. In contrast, the standalone
term warehouse refers to a Fabric Warehouse, which is a software as a service (SaaS)
relational database offering that you can use to implement a data warehouse. For clarity, in
this article the latter is mentioned as Fabric Warehouse.
Tip
If you're inexperienced with dimensional modeling, consider this series of articles your first
step. It isn't intended to provide a complete discussion on dimensional modeling design. For
more information, refer directly to widely adopted published content, like The Data
Warehouse Toolkit: The Definitive Guide to Dimensional Modeling (3rd edition, 2013) by Ralph
Kimball, and others.
In a dimensional model, a dimension table describes an entity relevant to your business and
analytics requirements. Broadly, dimension tables represent the things that you model. Things
could be products, people, places, or any other concept, including date and time. To easily identify
dimension tables, you typically prefix their names with d_ or Dim_ .
SQL
--Natural key(s)
EmployeeID VARCHAR(20) NOT NULL,
--Dimension attributes
FirstName VARCHAR(20) NOT NULL,
<...>
--Audit attributes
AuditMissing BIT NOT NULL,
AuditIsInferred BIT NOT NULL,
AuditCreatedDate DATE NOT NULL,
AuditCreatedBy VARCHAR(15) NOT NULL,
AuditLastModifiedDate DATE NOT NULL,
AuditLastModifiedBy VARCHAR(15) NOT NULL
);
Surrogate key
The sample dimension table has a surrogate key, which is named Salesperson_SK . A surrogate key
is a single-column unique identifier that's generated and stored in the dimension table. It's a
primary key column used to relate to other tables in the dimensional model.
Surrogate keys strive to insulate the data warehouse from changes in source data. They also
deliver many other benefits, allowing you to:
A surrogate key column is a recommended practice, even when a natural key (described next)
seems an acceptable candidate. You should also avoid giving meaning to the key values (except
for date and time dimension keys, as described later).
Natural keys
The sample dimension table also has a natural key, which is named EmployeeID . A natural key is
the key stored in the source system. It allows relating the dimension data to its source system,
which is typically done by an Extract, Load, and Transform (ETL) process to load the dimension
table. Sometimes a natural key is called a business key, and its values might be meaningful to
business users.
Sometimes dimensions don't have a natural key. That could be the case for your date dimension
or lookup dimensions, or when you generate dimension data by normalizing a flat file.
Dimension attributes
A sample dimension table also has dimension attributes, like the FirstName column. Dimension
attributes provide context to the numeric data stored in related fact tables. They're typically text
columns that are used in analytic queries to filter and group (slice and dice), but not to be
aggregated themselves. Some dimension tables contain few attributes, while others contain many
attributes (as many as it takes to support the query requirements of the dimensional model).
Tip
A good way to determine which dimensions and attributes you need is to find the right
people and ask the right questions. Specifically, stay alert for the mention of the word by. For
example, when someone says they need to analyze sales by salesperson, by month, and by
product category, they're telling you that they need dimensions that have those attributes.
If you plan to create a Direct Lake semantic model, you should include all possible columns
required for filtering and grouping as dimension attributes. That's because Direct Lake semantic
models don't support calculated columns.
Foreign keys
The sample dimension table also has a foreign key, which is named SalesRegion_FK . Other
dimension tables can reference a foreign key, and their presence in a dimension table is a special
case. It indicates that the table is related to another dimension table, meaning that it might form
part of a snowflake dimension or it's related to an outrigger dimension.
Fabric Warehouse supports foreign key constraints but they can't be enforced. Therefore, it's
important that your ETL process tests for integrity between related tables when data is loaded.
It's still a good idea to create foreign keys. One good reason to create unenforced foreign keys is
to allow modeling tools, like Power BI Desktop, to automatically detect and create relationships
between tables in the semantic model.
For more information, see Manage historical change later in this article.
Audit attributes
The sample dimension table also has various audit attributes. Audit attributes are optional but
recommended. They allow you to track when and how dimension records were created or
modified, and they can include diagnostic or troubleshooting information raised during ETL
processes. For example, you'll want to track who (or what process) updated a row, and when.
Audit attributes can also help diagnose a challenging problem, like when an ETL process stops
unexpectedly. They can also flag dimension members as errors or inferred members.
Big dimensions might be sourced from multiple source systems. In this case, dimension
processing needs to combine, merge, deduplicate, and standardize the data; and assign surrogate
keys.
By comparison, some dimensions are tiny. They might represent lookup tables that contain only
several records and attributes. Often these small dimensions store category values related to
transactions in fact tables, and they're implemented as dimensions with surrogate keys to relate
to the fact records.
Tip
When you have many small dimensions, consider consolidating them into a junk dimension.
Snowflake dimensions
One exception to denormalization is to design a snowflake dimension. A snowflake dimension is
normalized, and it stores the dimension data across several related tables.
The following diagram depicts a snowflake dimension that comprises three related dimension
tables: Product , Subcategory , and Category .
The dimension is extremely large and storage costs outweigh the need for high query
performance. (However, periodically reassess that this still remains the case.)
You need keys to relate the dimension to higher-grain facts. For example, the sales fact table
stores rows at product level, but the sales target fact table stores rows at subcategory level.
You need to track historical changes at higher levels of granularity.
7 Note
Bear in mind that a hierarchy in a Power BI semantic model can only be based on columns
from a single semantic model table. Therefore, a snowflake dimension should deliver a
denormalized result by using a view that joins the snowflake tables together.
Hierarchies
Commonly, dimension columns produce hierarchies. Hierarchies enable exploring data at distinct
levels of summarization. For example, the initial view of a matrix visual might show yearly sales,
and the report consumer can choose to drill down to reveal quarterly and monthly sales.
There are three ways to store a hierarchy in a dimension. You can use:
Hierarchies can be balanced or unbalanced. It's also important to understand that some
hierarchies are ragged.
Balanced hierarchies
Balanced hierarchies are the most common type of hierarchy. A balanced hierarchy has the same
number of levels. A common example of a balanced hierarchy is a calendar hierarchy in a date
dimension that comprises levels for year, quarter, month, and date.
The following diagram depicts a balanced hierarchy of sales regions. It comprises two levels,
which are sales region group and sales region.
Levels of a balanced hierarchy are either based on columns from a single, denormalized
dimension, or from tables that form a snowflake dimension. When based on a single,
denormalized dimension, the columns that represent the higher levels contain redundant data.
For balanced hierarchies, facts always relate to a single level of the hierarchy, which is typically the
lowest level. That way, the facts can be aggregated (rolled up) to the highest level of the
hierarchy. Facts can relate to any level, which is determined by the grain of the fact table. For
example, the sales fact table might be stored at date level, while the sales target fact table might
be stored at quarter level.
Unbalanced hierarchies
Unbalanced hierarchies are a less common type of hierarchy. An unbalanced hierarchy has levels
based on a parent-child relationship. For this reason, the number of levels in an unbalanced
hierarchy is determined by the dimension rows, and not specific dimension table columns.
The following diagram depicts an unbalanced hierarchy. It comprises four levels, and each
member in the hierarchy is a salesperson. Salespeople have a different number of ancestors in the
hierarchy according to who they report to.
Other common examples of unbalanced hierarchies include bill of materials, company ownership
models, and general ledger.
For unbalanced hierarchies, facts always relate to the dimension grain. For example, sales facts
relate to different salespeople, who have different reporting structures. The dimension table
would have a surrogate key (named Salesperson_SK ) and a ReportsTo_Salesperson_FK foreign key
column, which references the primary key column. Each salesperson without anyone to manage
isn't necessarily at the lowest level of any branch of the hierarchy. When they're not at the lowest
level, a salesperson might sell products and have reporting salespeople who also sell products. So,
the rollup of fact data must consider the individual salesperson and all their descendants.
Querying parent-child hierarchies can be complex and slow, especially for large dimensions. While
the source system might store relationships as parent-child, we recommend that you naturalize
the hierarchy. In this instance, naturalize means to transform and store the hierarchy levels in the
dimension as columns.
Tip
If you choose not to naturalize the hierarchy, you can still create a hierarchy based on a
parent-child relationship in a Power BI semantic model. However, this approach isn't
recommended for large dimensions. For more information, see Understanding functions for
parent-child hierarchies in DAX.
Ragged hierarchies
Sometimes a hierarchy is ragged because the parent of a member in the hierarchy exists at a level
that's not immediately above it. In these cases, missing level values repeat the value of the parent.
It's possible that a dimension could support both SCD type 1 and SCD type 2 changes.
SCD type 3 isn't commonly used, in part due to the fact that it's difficult to use in a semantic
model. Consider carefully whether an SCD type 2 approach would be a better fit.
Tip
If you anticipate a rapidly changing dimension, which is a dimension that has an attribute that
changes frequently, consider adding that attribute to the fact table instead. If the attribute is
numeric, like the product price, you can add it as a measure in the fact table. If the attribute
is a text value, you can create a dimension based on all text values and add its dimension key
to the fact table.
SCD type 1
SCD type 1 changes overwrite the existing dimension row because there's no need to keep track
of changes. This SCD type can also be used to correct errors. It's a common type of SCD, and it
should be used for most changing attributes, like customer name, email address, and others.
The following diagram depicts the before and after state of a salesperson dimension member
where their phone number has changed.
This SCD type doesn't preserve historical perspective because the existing row is updated. That
means SCD type 1 changes can result in different higher-level aggregations. For example, if a
salesperson is assigned to a different sales region, an SCD type 1 change would overwrite the
dimension row. The rollup of salespeople historic sales results to region would then produce a
different outcome because it now uses the new current sales region. It's as if that salesperson was
always assigned to the new sales region.
SCD type 2
SCD type 2 changes result in new rows that represent a time-based version of a dimension
member. There's always a current version row, and it reflects the state of the dimension member
in the source system. Historical tracking attributes in the dimension table store values that allow
identifying the current version (current flag is TRUE ) and its validity time period. A surrogate key is
required because there will be duplicate natural keys when multiple versions are stored.
It's a common type of SCD, but it should be reserved for attributes that must preserve historical
perspective.
For example, if a salesperson is assigned to a different sales region, an SCD type 2 change
involves an update operation and an insert operation.
1. The update operation overwrites the current version to set the historical tracking attributes.
Specifically, the end validity column is set to the ETL processing date (or a suitable
timestamp in the source system) and the current flag is set to FALSE .
2. The insert operation adds a new, current version, setting the start validity column to the end
validity column value (used to update the prior version) and the current flag to TRUE .
It's important to understand that the granularity of related fact tables isn't at the salesperson
level, but rather the salesperson version level. The rollup of their historic sales results to region will
produce correct results but there will be two (or more) salesperson member versions to analyze.
The following diagram depicts the before and after state of a salesperson dimension member
where their sales region has changed. Because the organization wants to analyze salespeople
effort by the region they're assigned to, it triggers an SCD type 2 change.
Tip
When a dimension table supports SCD type 2 changes, you should include a label attribute
that describes the member and the version. Consider an example when the salesperson Lynn
Tsoflias from Adventure Works changes assignment from the Australian sales region to the
United Kingdom sales region. The label attribute for the first version could read "Lynn Tsoflias
(Australia)" and the label attribute for the new, current version could read "Lynn Tsoflias
(United Kingdom)." If helpful, you might include the validity dates in the label too.
You should balance the need for historic accuracy versus usability and efficiency. Try to avoid
having too many SCD type 2 changes on a dimension table because it can result in an
overwhelming number of versions that might make it difficult for analysts to comprehend.
Also, too many versions could indicate that a changing attribute might be better stored in
the fact table. Extending the earlier example, if sales region changes were frequent, the sales
region could be stored as a dimension key in the fact table rather than implementing an SCD
type 2.
SQL
<...>
);
The RecChangeDate_FK column stores the date when the change came into effect. It allows
you to query when changes took place.
The RecValidFromKey and RecValidToKey columns store the effective dates of validity for the
row. Consider storing the earliest date found in the date dimension for RecValidFromKey to
represent the initial version, and storing 01/01/9999 for the RecValidToKey of the current
versions.
The RecReason column is optional. It allows documenting the reason why the version was
inserted. It could encode which attributes changed, or it could be a code from the source
system that states a particular business reason.
The RecIsCurrent column makes it possible to retrieve current versions only. It's used when
the ETL process looks up dimension keys when loading fact tables.
7 Note
Some source systems don't store historical changes, so it's important that the dimension is
processed regularly to detect changes and implement new versions. That way, you can detect
changes shortly after they occur, and their validity dates will be accurate.
SCD type 3
SCD type 3 changes track limited history with attributes. This approach can be useful when there's
a need to record the last change, or a number of the latest changes.
This SCD type preserves limited historical perspective. It might be useful when only the initial and
current values should be stored. In this instance, interim changes wouldn't be required.
For example, if a salesperson is assigned to a different sales region, an SCD type 3 change
overwrites the dimension row. A column that specifically stores the previous sales region is set as
the previous sales region, and the new sales region is set as the current sales region.
The following diagram depicts the before and after state of a salesperson dimension member
where their sales region has changed. Because the organization wants to determine any previous
sales region assignment, it triggers an SCD type 3 change.
ノ Expand table
-3 Error
Calendar and time
Almost without exception, fact tables store measures at specific points in time. To support analysis
by date (and possibly time), there must be calendar (date and time) dimensions.
It's uncommon that a source system would have calendar dimension data, so it must be
generated in the data warehouse. Typically, it's generated once, and if it's a calendar dimension,
it's extended with future dates when needed.
Date dimension
The date (or calendar) dimension is the most common dimension used for analysis. It stores one
row per date, and it supports the common requirement to filter or group by specific periods of
dates, like years, quarters, or months.
) Important
A date dimension shouldn't include a grain that extends to time of day. If time of day analysis
is required, you should have both a date dimension and a time dimension (described next).
Fact tables that store time of day facts should have two foreign keys, one to each of these
dimensions.
The natural key of the date dimension should use the date data type. The surrogate key should
store the date by using YYYYMMDD format and the int data type. This accepted practice should be
the only exception (alongside the time dimension) when the surrogate key value has meaning and
is human readable. Storing YYYYMMDD as an int data type is not only efficient and sorted
numerically, but it also conforms to the unambiguous International Standards Organization (ISO)
8601 date format.
the start/end of the calendar year and the fiscal year are different.
FiscalQuarterNumberInYear , FiscalMonthNumberInYear – which might be required to sort text
labels.
WeekOfYear – there are multiple ways to label the week of year, including an ISO standard
maintain multiple sets of holiday lists that each geography observes as a separate dimension
or naturalized in multiple attributes in the date dimension. Adding a HolidayText attribute
could help identify holidays for reporting.
IsWeekday – similarly, in some geographies, the standard work week isn't Monday to Friday.
For example, the work week is Sunday to Thursday in many Middle Eastern regions, while
other regions employ a four-day or six-day work week.
LastDayOfMonth
which might be required to support relative date filtering (for example, previous month).
Current periods use an offset of zero (0); previous periods store offsets of -1, -2, -3...; future
periods store offsets of 1, 2, 3....
As with any dimension, what's important is that it contains attributes that support the known
filtering, grouping, and hierarchy requirements. There might also be attributes that store
translations of labels into other languages.
When the dimension is used to relate to higher-grain facts, the fact table can use the first date of
the date period. For example, a sales target fact table that stores quarterly salespeople targets
would store the first date of the quarter in the date dimension. An alternative approach is to
create key columns in the date table. For example, a quarter key could store the quarter key by
using YYYYQ format and the smallint data type.
The dimension should be populated with the known range of dates used by all fact tables. It
should also include future dates when the data warehouse stores facts about targets, budgets, or
forecasts. As with other dimensions, you might include rows that represent missing, unknown,
N/A, or error situations.
Tip
Search the internet for "date dimension generator" to find scripts and spreadsheets that
generate date data.
Typically, at the beginning of the next year, the ETL process should extend the date dimension
rows to a specific number of years ahead. When the dimension includes relative offset attributes,
the ETL process must be run daily to update offset attribute values based on the current date
(today).
Time dimension
Sometimes, facts need to be stored at a point in time (as in time of day). In this case, create a time
(or clock) dimension. It could have a grain of minutes (24 x 60 = 1,440 rows) or even seconds (24 x
60 x 60 = 86,400 rows). Other possible grains include half hour or hour.
The natural key of a time dimension should use the time data type. The surrogate key could use
an appropriate format and store values that have meaning and are human readable, for example,
by using the HHMM or HHMMSS format.
Conformed dimensions
Some dimensions might be conformed dimensions. Conformed dimensions relate to many fact
tables, and so they're shared by multiple stars in a dimensional model. They deliver consistency
and can help you to reduce ongoing development and maintenance.
For example, it's typical that fact tables store at least one date dimension key (because activity is
almost always recorded by date and/or time). For that reason, a date dimension is a common
conformed dimension. You should therefore ensure that your date dimension includes attributes
relevant for the analysis of all fact tables.
The following diagram shows the Sales fact table and the Inventory fact table. Each fact table
relates to the Date dimension and Product dimension, which are conformed dimensions.
As another example, your employee and users could be the same set of people. In this case, it
might make sense to combine the attributes of each entity to produce one conformed dimension.
Role-playing dimensions
When a dimension is referenced multiple times in a fact table, it's known as a role-playing
dimension.
For example, when a sales fact table has order date, ship date, and delivery date dimension keys,
the date dimension relates in three ways. Each way represents a distinct role, yet there's only one
physical date dimension.
The following diagram depicts a Flight fact table. The Airport dimension is a role-playing
dimension because it's related twice to the fact table as the Departure Airport dimension and the
Arrival Airport dimension.
Junk dimensions
A junk dimension is useful when there are many independent dimensions, especially when they
comprise a few attributes (perhaps one), and when these attributes have low cardinality (few
values). The objective of a junk dimension is to consolidate many small dimensions into a single
dimension. This design approach can reduce the number of dimensions, and decrease the number
of fact table keys and thus fact table storage size. They also help to reduce Data pane clutter
because they present fewer tables to users.
A junk dimension table typically stores the Cartesian product of all dimension attribute values,
with a surrogate key attribute.
Good candidates include flags and indicators, order status, and customer demographic states
(gender, age group, and others).
The following diagram depicts a junk dimension named Sales Status that combines order status
values and delivery status values.
Degenerate dimensions
A degenerate dimension can occur when the dimension is at the same grain as the related facts. A
common example of a degenerate dimension is a sales order number dimension that relates to a
sales fact table. Typically, the invoice number is a single, non-hierarchical attribute in the fact
table. So, it's an accepted practice not to copy this data to create a separate dimension table.
The following diagram depicts a Sales Order dimension that's a degenerate dimension based on
the SalesOrderNumber column in a sales fact table. This dimension is implemented as a view that
retrieves the distinct sales order number values.
Tip
It's possible to create a view in a Fabric Warehouse that presents the degenerate dimension
as a dimension for querying purposes.
Outrigger dimensions
When a dimension table relates to other dimension tables, it's known as an outrigger dimension.
An outrigger dimension can help to conform and reuse definitions in the dimensional model.
For example, you could create a geography dimension that stores geographic locations for every
postal code. That dimension could then be referenced by your customer dimension and
salesperson dimension, which would store the surrogate key of the geography dimension. That
way, customers and salespeople could then be analyzed by using consistent geographic locations.
The following diagram depicts a Geography dimension that's an outrigger dimension. It doesn't
relate directly to the Sales fact table. Instead, it's related indirectly via the Customer dimension
and the Salesperson dimension.
Consider that the date dimension can be used as an outrigger dimension when other dimension
table attributes store dates. For example, the birth date in a customer dimension could be stored
by using the surrogate key of the date dimension table.
Multivalued dimensions
When a dimension attribute must store multiple values, you need to design a multivalued
dimension. You implement a multivalued dimension by creating a bridge table (sometimes called a
join table). A bridge table stores a many-to-many relationship between entities.
For example, consider there's a salesperson dimension, and that each salesperson is assigned to
one or possibly more sales regions. In this case, it makes sense to create a sales region dimension.
That dimension stores each sales region only once. A separate table, known as the bridge table,
stores a row for each salesperson and sales region relationship. Physically, there's a one-to-many
relationship from the salesperson dimension to the bridge table, and another one-to-many
relationship from the sales region dimension to the bridge table. Logically, there's a many-to-
many relationship between salespeople and sales regions.
In the following diagram, the Account dimension table relates to the Transaction fact table.
Because customers can have multiple accounts and accounts can have multiple customers, the
Customer dimension table is related via the Customer Account bridge table.
Related content
In the next article in this series, learn about guidance and design best practices for fact tables.