Dimensional Data Modeling Introduction
Agenda
Basic Terminology
Dimensional Model Schemas
Types of Dimensions
Types of Facts
Dimensional Modeling Process
2 2005-02-23
Dimensional data modeling techniques organize the content of the data warehouse. They structure the data according to the way users ask business questions.
Fact Table
Sales Fact
Definition
- The performance measures of the business
- Usually stores numerical and additive measures
- The "what I want to know"
Characteristics
- Basis for analysis
- Continuously valued
- Can be derived or calculated
- Column headers in query results
Examples
- Revenue
- Quantity
- Cost
- Gross Margin
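A minimal sketch of a fact row, assuming a hypothetical SalesFact record whose key columns point at dimension tables. The field names and sample values are illustrative only; the measures mirror the examples above, with gross margin shown as a derived fact.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SalesFact:
    # Keys into hypothetical dimension tables (assumed names).
    time_key: int
    product_key: int
    store_key: int
    # Numerical, additive measures.
    revenue: float
    quantity: int
    cost: float

    @property
    def gross_margin(self) -> float:
        # Derived fact: calculated from other measures rather than stored.
        return self.revenue - self.cost


rows = [
    SalesFact(20050223, 1, 10, revenue=120.0, quantity=3, cost=75.0),
    SalesFact(20050223, 2, 10, revenue=50.0, quantity=1, cost=20.0),
]

# Additive measures can simply be summed across rows.
total_quantity = sum(r.quantity for r in rows)
total_margin = sum(r.gross_margin for r in rows)
print(total_quantity, total_margin)
```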
Dimension Tables
[Diagram: Sales Fact (Revenue, Qty, Cost, Gross_margin) surrounded by the Time, Product, Customer, and Store dimensions]
Definition
- Descriptions of the business
- The which, who, how, where, or when that describes or explains the fact
Characteristics
- Constant
- Enables slicing and dicing the facts by different variables
Examples
- Time
- Customer
- Product
- Store
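Slicing a fact by a dimension can be sketched as a lookup followed by a grouped sum. The store_dim and sales_fact structures below are hypothetical stand-ins for real tables, and the store names and revenue figures are made up.

```python
from collections import defaultdict

# Hypothetical Store dimension: key -> descriptive attributes.
store_dim = {
    10: {"store_name": "Store A", "store_city": "Chicago"},
    11: {"store_name": "Store B", "store_city": "Miami"},
    12: {"store_name": "Store C", "store_city": "Chicago"},
}

# Hypothetical fact rows: (store_key, revenue).
sales_fact = [(10, 100.0), (11, 250.0), (12, 50.0), (10, 25.0)]

# "Slice" revenue by the store_city attribute of the dimension.
revenue_by_city = defaultdict(float)
for store_key, revenue in sales_fact:
    city = store_dim[store_key]["store_city"]
    revenue_by_city[city] += revenue

print(dict(revenue_by_city))
```

Swapping "store_city" for another attribute changes the slice without touching the fact rows.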
Attributes
Definition
- Fields within the dimension table
- Describe each item associated with a dimension
Characteristics
- Act as a source of query constraints
- The DW is only as good as the attributes in the dimension tables
Examples
- month, quarter, year, season, holiday, name, street address, city, brand, category, description, region, type, manager
[Diagram: Time Dimension; Store Dimension with attributes Store id, Store name, Store address, Store city, Store address code, Store country]
Grain and Hierarchy
Grain
- Level of detail contained in a fact or dimension table
- Meaning of a single fact table record
Hierarchy
- Represents levels or roll-ups of detailed data
[Diagram: Time Dimension; Store Dimension with attributes Store id, Store name, Store address, Store city, Store address code, Store country]
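As a rough illustration of grain and hierarchy together, the sketch below assumes a fact whose grain is one row per store per day and a Store dimension whose attributes form a store, city, country hierarchy. The roll_up helper and all sample values are hypothetical.

```python
from collections import Counter

# Hypothetical Store dimension carrying the hierarchy levels as attributes.
store_dim = {
    1: {"store_city": "Chicago", "store_country": "U.S."},
    2: {"store_city": "Chicago", "store_country": "U.S."},
    3: {"store_city": "Miami", "store_country": "U.S."},
}

# Grain: one row per store per day -> (store_id, day, quantity).
fact = [
    (1, "2005-02-23", 5),
    (2, "2005-02-23", 7),
    (3, "2005-02-23", 4),
    (1, "2005-02-24", 6),
]


def roll_up(level):
    """Aggregate quantity to the requested level of the store hierarchy."""
    totals = Counter()
    for store_id, _day, quantity in fact:
        totals[store_dim[store_id][level]] += quantity
    return dict(totals)


by_city = roll_up("store_city")
by_country = roll_up("store_country")
print(by_city, by_country)
```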
Terminology
Atomic Layer - Dimensions and facts at the lowest level of detail (think ODS).
Summary Layer - Dimensions and facts aggregated to intermediate values.
Presentation Layer - Dimensions, facts, and other tables altered specifically for presentation tool limitations.
Reporting Layer - Dimensions, facts, and other tables created or altered to improve reporting capabilities and performance.
More terminology
- Facts
- Dimensions
- Attributes
- Grain
- Hierarchies
- Keys
- Referential Integrity
- Sparsity
- Numeric Fields as attributes, not facts
- Slowly Changing Dimensions
- Calculated Facts
- Status Indicators/Flags/Events
- Ranges
- Levels
- Counts/Occurrences
- Conformed Tables
- History
- Roll-Off
- Causal Dimensions
- Huge Dimensions and Mini-Dimensions
- Star Schema/Snowflake Schema
- Heterogeneous Products
- Factless facts
- Additive, semi-, and non-additive facts
- Degenerate dimensions
Star Schema
STAR SCHEMA
A database design that stores a central fact table surrounded by multiple dimension tables.
Star schema represents a compromise between the fully normalized model and the denormalized model. Descriptive dimension information is maintained in a set of denormalized dimension tables.
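A small star schema can be sketched with Python's built-in sqlite3 module. The table and column names (product_dim, store_dim, sales_fact) and the sample rows are assumptions chosen for illustration, not taken from any particular warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Denormalized dimension tables (hypothetical names and columns).
CREATE TABLE product_dim (
    product_key INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
CREATE TABLE store_dim (
    store_key INTEGER PRIMARY KEY, store_name TEXT, store_city TEXT);
-- Central fact table: one foreign key per dimension plus the measures.
CREATE TABLE sales_fact (
    product_key INTEGER REFERENCES product_dim(product_key),
    store_key INTEGER REFERENCES store_dim(store_key),
    revenue REAL,
    quantity INTEGER);
INSERT INTO product_dim VALUES (1, 'Radio', 'Electronics'), (2, 'Washer', 'Durables');
INSERT INTO store_dim VALUES (10, 'Store A', 'Chicago'), (11, 'Store B', 'Miami');
INSERT INTO sales_fact VALUES (1, 10, 100.0, 2), (2, 10, 400.0, 1), (1, 11, 50.0, 1);
""")

# A typical star query: join the fact to its dimensions, then group by attributes.
star_result = conn.execute("""
    SELECT p.category, s.store_city, SUM(f.revenue)
    FROM sales_fact AS f
    JOIN product_dim AS p ON p.product_key = f.product_key
    JOIN store_dim AS s ON s.store_key = f.store_key
    GROUP BY p.category, s.store_city
    ORDER BY p.category, s.store_city
""").fetchall()
print(star_result)
```

Each dimension is one join away from the fact table, which is what gives the design its star shape.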
Snowflake Schema
SNOWFLAKE SCHEMA
A database design that stores a central fact table surrounded by multiple dimension tables decomposed or normalized into one or more hierarchies.
Snowflake schemas are most often used when dealing with large hierarchies that are static. Snowflaked tables (look-up tables) may increase the speed of queries, depending on the presentation tool (e.g., MicroStrategy).
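For comparison with a star design, the sketch below normalizes a product hierarchy into a separate look-up table, again using sqlite3. The category_lookup table and all names and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The category level is split out of the product dimension (snowflaked).
CREATE TABLE category_lookup (
    category_key INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE product_dim (
    product_key INTEGER PRIMARY KEY,
    product_name TEXT,
    category_key INTEGER REFERENCES category_lookup(category_key));
CREATE TABLE sales_fact (
    product_key INTEGER REFERENCES product_dim(product_key),
    revenue REAL);
INSERT INTO category_lookup VALUES (1, 'Electronics'), (2, 'Durables');
INSERT INTO product_dim VALUES (1, 'Radio', 1), (2, 'Television', 1), (3, 'Washer', 2);
INSERT INTO sales_fact VALUES (1, 100.0), (2, 300.0), (3, 400.0);
""")

# Reaching the category now takes two joins instead of one.
snowflake_result = conn.execute("""
    SELECT c.category_name, SUM(f.revenue)
    FROM sales_fact AS f
    JOIN product_dim AS p ON p.product_key = f.product_key
    JOIN category_lookup AS c ON c.category_key = p.category_key
    GROUP BY c.category_name
    ORDER BY c.category_name
""").fetchall()
print(snowflake_result)
```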
Multi-Dimensional Schemas
MULTI-DIMENSIONAL SCHEMA
A hierarchical database that consists of only one structure, a multi-dimensional array, which contains all the summarized data at the higher levels of the array.
- Also known as MOLAP databases
- Stores and aggregates data at multiple levels in a hierarchy
- Uses drill-up and drill-down to move around the hierarchy
Multi-Dimensional Schemas
- Provide users with a cross-dimensional perspective, allowing analysis across dimensions
- Require a specialized programmer to create the database
- Data explosion becomes an issue because each additional dimension multiplies the number of dimension intersections (cells), so the cell count grows exponentially with the number of dimensions
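The data explosion point can be checked with simple arithmetic: the number of cells is the product of the dimension cardinalities. The cardinalities below (365 days, 1,000 products, 50 stores, 10,000 customers) are made-up figures for illustration.

```python
from math import prod


def cell_count(cardinalities):
    """Number of dimension intersections (cells) in a full multi-dimensional array."""
    return prod(cardinalities)


# Hypothetical cube: 365 days x 1,000 products x 50 stores.
three_dims = cell_count([365, 1000, 50])

# Adding a 10,000-member customer dimension multiplies the cell count by 10,000.
four_dims = cell_count([365, 1000, 50, 10000])

print(three_dims, four_dims)
```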
Sparsity
[Diagram: Geography hierarchy (Country: U.S.; Region: North, South, West; City: New York, Chicago, Miami, Dallas, Los Angeles, San Francisco) crossed with a Product hierarchy (Class: Durables, Electronics)]
Sparsity refers to the unpopulated cells in a table. It arises because not every combination of attributes has a value or an entry associated with it. Sparsity may be reduced if users are satisfied with summarized rather than atomic-level data. Addressing the issue may be tool-dependent.
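Sparsity can be expressed as the share of possible cells that hold no value. The sketch below uses the city and product class names from the diagram; which combinations are populated, and the values in them, are invented for the example.

```python
cities = ["New York", "Chicago", "Miami", "Dallas", "Los Angeles", "San Francisco"]
classes = ["Durables", "Electronics"]

# Hypothetical populated cells: (city, class) -> sales amount.
populated = {
    ("New York", "Electronics"): 120.0,
    ("Chicago", "Durables"): 80.0,
    ("Miami", "Electronics"): 45.0,
}

total_cells = len(cities) * len(classes)
empty_cells = total_cells - len(populated)
sparsity = empty_cells / total_cells

print(f"{empty_cells} of {total_cells} cells are empty ({sparsity:.0%})")
```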
Agenda
Types of Dimensions
Types of Facts
Dimensional Modeling Process
Types of Dimensions
- Slowly-Changing Dimensions
- Rapidly Changing or Volatile Dimensions
- Huge Dimensions and Mini-Dimensions
- Causal Dimensions
- Dirty Dimensions
- Degenerate Dimensions
Slowly-Changing Dimensions
Most dimensions change over time.
- Products change offered coverage or limits and deductibles.
- Employees are promoted, fired, or change departments.
- Customers change names and addresses.
What are our choices for tracking these changes over time?
Slowly-Changing Dimensions
There are three types of slowly changing dimensions:
- Type 1: Overwrites the old data for a record with new data. This eliminates the ability to track history over time.
- Type 2: Creates a new record with the new data at the time of the change. Accurately tracks history, but requires a generalized key.
- Type 3: Tracks new and original values in separate fields at the time of the change. Intermediate values are lost.
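The three types can be sketched on a toy customer dimension held as a list of dicts. The function names, column names (customer_key, start_date, end_date, original_city), and sample data are all assumptions made for illustration; a real implementation would live in the ETL layer.

```python
from datetime import date


def apply_type1(dim, customer_id, new_city):
    """Type 1: overwrite in place; the old value is no longer recoverable."""
    for row in dim:
        if row["customer_id"] == customer_id:
            row["city"] = new_city


def apply_type2(dim, customer_id, new_city, change_date):
    """Type 2: close the current row and add a new row under a new generalized key."""
    current = next(r for r in dim
                   if r["customer_id"] == customer_id and r["end_date"] is None)
    current["end_date"] = change_date
    new_key = max(r["customer_key"] for r in dim) + 1
    dim.append({**current, "customer_key": new_key, "city": new_city,
                "start_date": change_date, "end_date": None})


def apply_type3(dim, customer_id, new_city):
    """Type 3: keep the original value in its own field next to the current one."""
    for row in dim:
        if row["customer_id"] == customer_id:
            row.setdefault("original_city", row["city"])  # set on first change only
            row["city"] = new_city


type1_dim = [{"customer_key": 1, "customer_id": "C1", "city": "Dallas"}]
apply_type1(type1_dim, "C1", "Miami")

type2_dim = [{"customer_key": 1, "customer_id": "C1", "city": "Dallas",
              "start_date": date(2004, 1, 1), "end_date": None}]
apply_type2(type2_dim, "C1", "Miami", date(2005, 2, 23))

type3_dim = [{"customer_key": 1, "customer_id": "C1", "city": "Dallas"}]
apply_type3(type3_dim, "C1", "Miami")
apply_type3(type3_dim, "C1", "Chicago")  # the intermediate value "Miami" is lost
```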
Volatile Dimensions
What if a dimension's values change frequently? Price would naturally be an attribute of product, but it changes fairly often: few products have prices that remain constant over many months or years. To capture these changes over time, we can store the price in the fact table rather than treating it as a slowly changing dimension attribute.
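One way to picture this: the product dimension keeps only stable attributes, while each fact row records the price in effect at the time of sale. The structures and values below are hypothetical, and prices are held in integer cents to avoid floating-point rounding.

```python
# Stable attributes stay in the (hypothetical) product dimension.
product_dim = {1: {"product_name": "Radio", "brand": "Acme"}}

# The volatile value, price, is captured on each fact row instead.
sales_fact = [
    {"date": "2005-01-15", "product_key": 1, "quantity": 2, "unit_price_cents": 1999},
    {"date": "2005-02-23", "product_key": 1, "quantity": 1, "unit_price_cents": 1749},
]

# Price history falls out of the fact table without any dimension versioning.
price_history = [(r["date"], r["unit_price_cents"])
                 for r in sales_fact if r["product_key"] == 1]

revenue_cents = sum(r["quantity"] * r["unit_price_cents"] for r in sales_fact)
print(price_history, revenue_cents)
```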
A General Rule...
Fact tables contain counts, amounts, and other numerical information. Dimensions describe the business with textual fields and dates. As a general rule, one should question numerical information that occurs in the dimension tables, as well as textual and date fields that occur in the fact table.
Product and Customer dimensions with millions or tens of millions of entries are not unusual for retailers, telecommunications companies, insurance companies, or financial service institutions. These dimensions can have hundreds of attributes and multiple complex hierarchies that exist simultaneously.
Huge Dimensions
HUGE DIMENSIONS
Dimensions with millions or tens of millions of entries, such as customer, that take too long to browse among relationships due to volume.
The customer dimension in financial institutions, telecommunications companies, and catalog retailers holds data for customers on an individual basis. Over time, it can grow to tens of millions of rows.
The heavily used fields in the Customer dimension consist of demographic information: age, sex, number of children, income level, education level, and other purchasing behavior information. These fields are also combined to select an interesting subset of the market base for analysis.
The most effective technique for handling this situation is to separate one or more sets of these attributes into demographic mini-dimensions. If five or six of the demographic variables are isolated into a separate table, we need only to store the distinct combinations of information that actually occur.
Mini-Dimensions
MINI-DIMENSIONS
Groups of related attributes split out of a parent dimension into their own dimensions, yielding significant gains in performance and reduced volatility in the parent dimension.
Typically, demographic information changes at a different rate than other customer information. Marketing can analyze different segments of the customer base for purchasing habits and other information.
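A demographic mini-dimension can be sketched by storing each distinct combination of demographic values once and pointing customers at it by key. The attribute names (age_band, income_band, education), the demographics_key column, and the sample customers are assumptions for illustration.

```python
# Hypothetical source rows mixing identity and demographic attributes.
customers = [
    {"customer_id": 1, "name": "Ann", "age_band": "25-34", "income_band": "Medium", "education": "College"},
    {"customer_id": 2, "name": "Bob", "age_band": "25-34", "income_band": "Medium", "education": "College"},
    {"customer_id": 3, "name": "Cho", "age_band": "45-54", "income_band": "High", "education": "Graduate"},
    {"customer_id": 4, "name": "Dee", "age_band": "25-34", "income_band": "Medium", "education": "College"},
]

DEMO_ATTRS = ("age_band", "income_band", "education")

demographics_dim = {}  # distinct combination -> mini-dimension key
customer_dim = []      # slimmer parent dimension

for c in customers:
    combo = tuple(c[a] for a in DEMO_ATTRS)
    # Only combinations that actually occur get a row in the mini-dimension.
    demo_key = demographics_dim.setdefault(combo, len(demographics_dim) + 1)
    customer_dim.append({"customer_id": c["customer_id"], "name": c["name"],
                         "demographics_key": demo_key})

print(len(customer_dim), "customers share", len(demographics_dim), "demographic rows")
```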
Causal Dimensions
CAUSAL DIMENSIONS
Causal dimensions describe factors that are thought to cause a change in the performance of a measure or fact, such as advertising or promotion.
Causal Dimensions track conditions that may influence sales, counts, or revenue. Promotions, holidays, and weather conditions may influence the behavior of fact data.
Causal Dimensions
Causal dimensional attributes can be placed in a single dimension table or separated into different tables by subject. A Promotion dimension could include price reduction type, ad type, display type, and promo start and end dates. The trade-offs include efficient browsing vs. more understandable tables for the user community.
Dirty Dimensions
DIRTY DIMENSIONS
Dimensional information that may contain duplicate or extraneous entries due to inconsistent legacy data.
Financial institutions might have poor account-to-account correlation of individuals' names. Insurance companies may not make a serious attempt to identify previous instances of an insured party or other policies.
Dirty Dimensions
- Some cleaning can be done in the ETL process.
- Dirty dimensions will influence fact data accuracy.
- All tools that access the data will need to take the possible inaccuracy of the data into account.
- Some tools are designed to alleviate part of the problem, to 80% accuracy.
- The level of inaccuracy may influence the design of dimensions and facts so that it can be minimized.
Degenerate Dimensions
DEGENERATE DIMENSIONS
Dimensions that are so small, with no attributes of their own, that they are stored in the fact table.
Certain attributes are tracked that don't necessarily belong in their own dimension (orphan attributes). This may occur when fact tables are designed to reflect the actual working document.
Degenerate Dimensions
Examples include order_number, bill_of_lading_num, and invoice_number. While these fields seem very transaction oriented, they are helpful in grouping things such as all line items on an invoice. Including these fields on the fact table amounts to denormalizing the attribute due to the granularity of the fact table being the document itself or a line item of the document.
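The grouping use case can be sketched as follows, assuming a hypothetical line-item fact whose invoice_number column has no dimension table behind it. The invoice numbers and amounts are invented.

```python
from collections import defaultdict

# Grain: one row per invoice line item. invoice_number is a degenerate
# dimension: it lives on the fact row and joins to nothing.
line_item_fact = [
    {"invoice_number": "INV-1001", "product_key": 1, "quantity": 2, "amount": 40.0},
    {"invoice_number": "INV-1001", "product_key": 2, "quantity": 1, "amount": 15.0},
    {"invoice_number": "INV-1002", "product_key": 1, "quantity": 5, "amount": 100.0},
]

# Group all line items that belong to the same invoice.
invoice_totals = defaultdict(float)
lines_per_invoice = defaultdict(int)
for row in line_item_fact:
    invoice_totals[row["invoice_number"]] += row["amount"]
    lines_per_invoice[row["invoice_number"]] += 1

print(dict(invoice_totals), dict(lines_per_invoice))
```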
Agenda
Types of Facts
Dimensional Modeling Process
Types of Facts
FACT
A measurement, generally additive in nature, of the organization.
We use facts to measure performance based on business questions. This data is numeric in nature and is contained in our fact tables by subject and granularity.
Types of Facts
Understanding which facts can be added across which dimensions is an important data design issue.
Three Types of Facts:
- Additive
- Non-Additive
- Semi-Additive
Additive Facts
ADDITIVE FACTS
Measurements in a fact table that can be added across all dimensions.
Since aggregation is a key element in the usefulness of the dimensional model, it is best utilized for facts that are additive, numeric values. We can add revenue, cost, and quantity sold for all products, all stores, and any time period.
Semi-Additive Facts
SEMI-ADDITIVE FACTS
Measurements in a fact table that can be added across some dimensions but not others.
We cannot add risk exposure at the coverage level to get the number of policy level exposures. We can add coverage level exposures across the customer dimension to determine exposure by gender or age range.
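One possible reading of this example, sketched under the assumption that each coverage row carries an exposure count of 1 and that a policy with several coverages still counts as a single policy-level exposure. The coverage_fact rows and column names are hypothetical.

```python
from collections import Counter

# Hypothetical coverage-level fact: one row per coverage on a policy.
coverage_fact = [
    {"policy_id": "P1", "coverage": "Collision", "gender": "F", "exposure": 1},
    {"policy_id": "P1", "coverage": "Liability", "gender": "F", "exposure": 1},
    {"policy_id": "P2", "coverage": "Liability", "gender": "M", "exposure": 1},
]

# Valid: coverage-level exposure adds up across the customer dimension.
exposure_by_gender = Counter()
for row in coverage_fact:
    exposure_by_gender[row["gender"]] += row["exposure"]

# Not valid: summing coverage rows overstates policy-level exposure.
naive_policy_exposures = sum(row["exposure"] for row in coverage_fact)

# Under the stated assumption, policy-level exposure is a distinct count.
policy_exposures = len({row["policy_id"] for row in coverage_fact})

print(dict(exposure_by_gender), naive_policy_exposures, policy_exposures)
```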
Non-Additive Facts
NON-ADDITIVE FACTS
Measurements in a fact table that cannot be added across any dimensions, like ratios.
A new value will need to be calculated at each level or for each set of data. It should be determined at which levels, if any, the fact should be stored. Some values may need to be pre-calculated.
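A margin percentage illustrates why a ratio has to be recalculated at each level from its additive components. The margin_pct helper and the store figures below are made up for the example.

```python
rows = [
    {"store": "A", "revenue": 100.0, "cost": 50.0},
    {"store": "B", "revenue": 300.0, "cost": 240.0},
]


def margin_pct(revenue, cost):
    """Gross margin as a fraction of revenue (a non-additive ratio)."""
    return (revenue - cost) / revenue


# Store level: one ratio per store.
per_store = {r["store"]: margin_pct(r["revenue"], r["cost"]) for r in rows}

# All-stores level: sum the additive components first, then recompute the ratio.
total_revenue = sum(r["revenue"] for r in rows)
total_cost = sum(r["cost"] for r in rows)
overall = margin_pct(total_revenue, total_cost)

# Averaging (or summing) the store-level ratios gives a different, wrong answer.
naive_average = sum(per_store.values()) / len(per_store)

print(per_store, overall, naive_average)
```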
Agenda
Dimensional Modeling Process
Step 1: Choose the grain of each fact table. Granularity defines the level of detailed data. It must be determined prior to going forward in the modeling process. Typical grains are individual transactions, time-based aggregation, and/or aggregations along a commonly used dimension.
Step 2: Choose the dimension attributes. For example, what should our time dimension look like? Should it store just "January" for the month, or also "Jan" and "1"? Should we store the code and the description, just the code, or just the description? What values will our users need to filter or report on?
Step 3: Identify dimensional hierarchies. A dimension such as time may have days rolling into months and then quarters, as well as days rolling into weeks which may cross months and quarters. Sales geography may differ from physical geography. Zip codes can cross city boundaries and cities are made up of multiple zip codes.
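The two time hierarchies can be sketched with the standard datetime module. The time_attributes helper is hypothetical, and it assumes ISO weeks (Monday to Sunday); a business calendar may define weeks differently.

```python
from datetime import date, timedelta


def time_attributes(d):
    """Build one time-dimension row carrying both hierarchies for a given day."""
    _iso_year, iso_week, _weekday = d.isocalendar()
    return {
        "date": d,
        "month": d.month,                   # day -> month -> quarter -> year
        "quarter": (d.month - 1) // 3 + 1,
        "year": d.year,
        "iso_week": iso_week,               # day -> week (separate roll-up path)
    }


# Seven consecutive days starting Monday 2005-03-28.
week = [date(2005, 3, 28) + timedelta(days=i) for i in range(7)]
rows = [time_attributes(d) for d in week]

weeks = {r["iso_week"] for r in rows}
months = {r["month"] for r in rows}
quarters = {r["quarter"] for r in rows}

# One week, yet it spans two months and two quarters.
print(weeks, months, quarters)
```

Because a week can straddle a month or quarter boundary, weeks cannot be modeled as a level beneath months; they form a parallel hierarchy over the same days.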
Step 4: Choose the dimensions that apply to each fact table. Typical dimensions include time, product, policyholder, agent, and geography. Remember to evaluate granularity when applying dimensions to facts.
Step 5: Choose the measured facts, including precalculated facts. Each aggregated and derived fact will need to be evaluated for inclusion in the model or calculation in the application. Trade-offs include storage and indexing and must be weighed against the access requirements.