Advanced Dimensional Modeling
Snowflake Schema
Because of its various hierarchy levels, a dimension table in a star
schema contains duplicate or redundant values; that is, dimension tables are
typically not normalized. There is no redundancy in the fact table, only in
the dimensions.
Mohammad A. Rob
Family of Stars
Sophisticated applications may require multiple fact tables that share
dimension tables. This kind of schema can be viewed as a collection of
stars, and is often called a family of stars, galaxy schema, or fact
constellation.
Disadvantages:
Star or Snowflake
Both star and snowflake schemas are dimensional models; the difference is
in their physical implementations. Snowflake schemas support ease of
dimension maintenance because they are more normalized. Star schemas
are easier for direct user access and often support simpler and more
efficient queries.
The decision to model a dimension as a star or snowflake depends on the
nature of the dimension itself, such as how frequently it changes and which
of its elements change, and often involves evaluating tradeoffs between
ease of use and ease of maintenance.
It is often easiest to maintain a complex dimension by snowflaking it.
Pulling hierarchical levels into separate tables guarantees referential
integrity between the levels of the hierarchy. OLAP services can read
from a snowflaked dimension as well as, or better than, from a star
dimension.
However, it is important to present a simple and appealing user interface
(such as OLAP) to business users who are developing ad hoc queries on
the dimensional database. It may be better to create a star version of the
snowflaked dimension for presentation to the users.
Large Dimensions
A large dimension can be very deep, containing a very large number of
rows, or very wide, containing a large number of attributes. In either case,
populating the dimension table requires special handling. When the number
of attributes is large, we may want to split the dimension into multiple
smaller dimensions.
Large dimensions usually tend to have multiple hierarchies in their
attributes. For example, a product dimension of a grocery store may form
one hierarchy for the marketing department and another hierarchy for the
finance department. OLAP tools can be used to represent different
hierarchies of the same dimension.
[Figure: star schema diagram with the labels Product, Date, Fact, Store, and Time]
Conforming Dimensions
A dimension table may be used in multiple places if the data warehouse
contains multiple fact tables or contributes data to data marts. For example,
a product dimension may be used with a sales fact table and an inventory
fact table in the data warehouse, and also in one or more departmental
data marts.
A dimension such as customer, time, or product that is used in multiple
schemas is called a conforming dimension if all copies of the dimension are
the same. Summarization data and reports will not correspond if different
schemas use different versions of a dimension table.
Use of Conforming Dimensions in Multiple Facts: Multiple fact tables are
used in data warehouses that address multiple business functions, such as
sales, inventory, and finance. Each business function will typically have its
own schema that contains a fact table, several conforming dimension
tables, and some dimension tables unique to that business function.
If we calculate the number of rows in the fact table for a chain with 100
stores, each selling 7,000 products a day, for 365 days, we will have:
100 x 7,000 x 365 = 255,500,000 rows.
If we create an aggregate of product sales by store by week, we might
expect the number of rows in the aggregate table to be reduced by a factor
of seven, to 255,500,000 / 7 = 36,500,000. This will not be the case because
of the sparsity of the data: the stores do not all sell the same products on
the same day, so a store sells more distinct products over a week than on
any single day. With 15,000 products per store per week, the number of rows
will be 100 x 15,000 x 52 = 78,000,000, more than double the expected number.
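The arithmetic above can be checked with a short sketch. The counts are the ones used in the text; the variable names are illustrative only.

```python
# Row-count arithmetic for the daily fact table and the weekly aggregate.
stores = 100
products_per_store_per_day = 7_000
days = 365

base_rows = stores * products_per_store_per_day * days

# Naive expectation: rolling 7 days into 1 week divides the rows by 7.
naive_weekly_rows = base_rows // 7

# Sparsity: over a week a store sells more distinct products than on one day
# (15,000 is the figure used in the text's calculation).
products_per_store_per_week = 15_000
weeks = 52
actual_weekly_rows = stores * products_per_store_per_week * weeks

ratio = actual_weekly_rows / naive_weekly_rows
print(base_rows, naive_weekly_rows, actual_weekly_rows, round(ratio, 2))
```

The ratio shows how far the real aggregate is from the naive estimate.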
Measures in the Fact Tables
The values that quantify facts are usually numeric, and are often referred to
as measures. Measures are typically additive along all dimensions, such as
Quantity in a sales fact table. A sum of Quantity by customer, product, time,
or any combination of these dimensions results in a meaningful value.
Additive and Non-additive Measures: Some measures are not additive
along one or more dimensions, such as quantity-on-hand in an inventory
system or price in a sales system. Some measures can be added along
dimensions other than the time dimension. These measures are sometimes
referred to as semi-additive. For example, quantity-on-hand can be added
along the Warehouse dimension to achieve the total-quantity-on-hand.
Measures that cannot be added along any dimension are truly non-additive.
Non-additive measures can often be combined with additive measures to
create new additive measures. For example, Sale Price = Quantity * Price.
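A minimal sketch of a semi-additive measure follows. The snapshot rows, warehouse names, and function names are hypothetical. Averaging across time is shown as one common alternative to summing; it is not prescribed by the text.

```python
# Hypothetical inventory snapshot rows: (date, warehouse, quantity_on_hand).
snapshots = [
    ("2003-01-01", "East", 40),
    ("2003-01-01", "West", 60),
    ("2003-01-02", "East", 35),
    ("2003-01-02", "West", 70),
]

def total_on_hand(rows, date):
    """Sum quantity-on-hand across warehouses for a single date (meaningful)."""
    return sum(qty for d, _, qty in rows if d == date)

def average_on_hand(rows, warehouse):
    """Across time, average a warehouse's snapshots instead of summing them."""
    values = [qty for _, w, qty in rows if w == warehouse]
    return sum(values) / len(values)

total_jan_1 = total_on_hand(snapshots, "2003-01-01")
east_average = average_on_hand(snapshots, "East")
print(total_jan_1, east_average)
```

Summing the East warehouse over both dates would give 75, a number that matches no real stock level, which is why the time dimension is treated differently.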
Calculated Measures: A calculated measure is a measure that results
from applying a function to one or more measures, for example, the
computed Extended Price value resulting from multiplying Quantity times
Price. Other calculated measures may be more complex, such as profit,
contribution to margin, allocation of sales tax, and so forth.
Calculated measures may be pre-computed during the load process and
stored in the fact table, or they may be computed on the fly as they are
used. Determination of which measures should be pre-computed is a
design consideration.
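The two options can be sketched as follows, using Extended Price as the calculated measure. The class and function names are hypothetical, and the sample quantities and prices are made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class SalesFact:
    quantity: int
    price: float  # unit price; not additive on its own

    @property
    def extended_price(self) -> float:
        """Computed on the fly each time it is read."""
        return self.quantity * self.price

def load_with_precomputed(rows):
    """Load-time alternative: store the calculated value with each row."""
    return [(q, p, q * p) for q, p in rows]

facts = [SalesFact(3, 2.50), SalesFact(2, 10.00)]
total_on_the_fly = sum(f.extended_price for f in facts)

stored = load_with_precomputed([(3, 2.50), (2, 10.00)])
total_precomputed = sum(row[2] for row in stored)
print(total_on_the_fly, total_precomputed)
```

Both routes give the same total; the trade-off is storage and load time against work done at query time.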
Customer Key   Name        State
1001           Christina   Illinois
Customer Key   Name        State
1001           Christina   California
Customer Key   Name        State
1001           Christina   Illinois
1005           Christina   California
Customer Key   Name   Original State   Current State   Effective Date
After Christina moved from Illinois to California, the original information gets
updated, and we have the following table (assuming the effective date of
change is January 15, 2003):
Customer Key   Name        Original State   Current State   Effective Date
1001           Christina   Illinois         California      15-JAN-2003
Advantages: This does not increase the size of the table, since new
information is updated. This allows us to keep some part of history.
Disadvantages: Type 3 will not be able to keep all history where an
attribute is changed more than once. For example, if Christina later moves
to Texas on December 15, 2003, the California information will be lost.
When to use: A Type 3 slowly changing dimension should be used only when
the data warehouse needs to track historical changes, and when such
changes will occur only a limited number of times.
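The Type 3 behavior described above can be sketched as a single row that is updated in place. The class name, field names, and the initial row (Current State equal to Original State, no Effective Date) are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class CustomerType3:
    customer_key: int
    name: str
    original_state: str
    current_state: str
    effective_date: Optional[date] = None

    def change_state(self, new_state: str, effective: date) -> None:
        """Overwrite the current value in place; the original value is kept."""
        self.current_state = new_state
        self.effective_date = effective

# Assumed starting row before any move.
row = CustomerType3(1001, "Christina", "Illinois", "Illinois")

row.change_state("California", date(2003, 1, 15))
after_first_move = (row.original_state, row.current_state)

row.change_state("Texas", date(2003, 12, 15))
after_second_move = (row.original_state, row.current_state)
print(after_first_move, after_second_move)
```

After the second move the row holds only Illinois and Texas, so the California period can no longer be recovered, and the table still has one row.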
Approaches to Aggregation
There are three approaches to aggregation: no aggregation, selective
aggregation, or exhaustive aggregation. In some cases, the volume of data
in the fact table will be small enough that performance is acceptable
without aggregates; however this is not common in a data warehouse
database.
The opposite extreme is exhaustive aggregation. This approach will
produce optimal query performance because a query can read the minimum
number of rows required to return an answer. However, it is not normally
practical because of the processing required to produce all possible
aggregates and the storage required to hold them.
In a simple sales example where the dimensions are product, sales
geography, customer, and time, the number of possible aggregates is the
number of levels in each hierarchy of each dimension multiplied together.
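As a rough illustration of how quickly this number grows, the sketch below multiplies the level counts together. The level counts and level names are hypothetical, not taken from the source, and each dimension is assumed to have a single hierarchy.

```python
from math import prod

# Hypothetical number of levels in each dimension's hierarchy.
levels = {
    "product": 4,          # e.g. item, brand, category, all products
    "sales_geography": 4,  # e.g. store, district, region, all stores
    "customer": 3,         # e.g. customer, segment, all customers
    "time": 4,             # e.g. day, week, month, year
}

possible_aggregates = prod(levels.values())
print(possible_aggregates)
```

If the base level is counted among the levels, as here, the total includes the combination of all base levels, which is the fact table itself.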
Examples of Aggregations
Depending on user needs, the data can be aggregated in various ways.
Consider an example of a retail store consisting of three dimensions: store,
product, and time. Each dimension has several hierarchies.
Choosing Aggregates
Usage and Analysis Patterns: Two basic pieces of information are
required to select the appropriate aggregates. Probably the most
important is the expected usage pattern of the data. Based on this
information from the users, it is possible to determine the most frequently
examined levels; these are good candidates for aggregation.
Base Table Row Reduction: The second piece of information to consider
is the data volumes and distributions in the fact table. Queries can be run to
get an idea of the number of rows at various levels in the dimension
hierarchies. This will tell us where there are significant decreases in the
volume of data along a given hierarchy. Some of the best candidates for
aggregation will be those where the row counts decrease the most from
one level in a hierarchy to the next.
The decrease of rows in a dimension hierarchy is not a hard rule due to the
distribution of data along multiple dimensions. When you combine the fact
rows to create an aggregate at a higher level, the reduction in the number
of rows may not be as much as was expected. This may be due to the
sparsity of data in the fact table, as discussed before.
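A minimal sketch of measuring row reduction is shown below. It counts the distinct keys an aggregate would hold at a given combination of levels. The fact rows, the store-to-district and day-to-week mappings, and the function name are all hypothetical toy data.

```python
# Hypothetical fact rows at the base grain: (product, store, day).
fact_rows = [
    ("milk", "S1", 1), ("milk", "S1", 2), ("milk", "S2", 1),
    ("bread", "S1", 1), ("bread", "S2", 3), ("eggs", "S2", 2),
]
# Hypothetical hierarchy mappings: store -> district, day -> week.
district_of = {"S1": "D1", "S2": "D1"}
week_of = {1: "W1", 2: "W1", 3: "W1"}

def aggregate_row_count(rows, store_level, time_level):
    """Count the distinct keys an aggregate at the given levels would hold."""
    keys = set()
    for product, store, day in rows:
        s = district_of[store] if store_level == "district" else store
        t = week_of[day] if time_level == "week" else day
        keys.add((product, s, t))
    return len(keys)

base = aggregate_row_count(fact_rows, "store", "day")
by_week = aggregate_row_count(fact_rows, "store", "week")
by_district_week = aggregate_row_count(fact_rows, "district", "week")
print(base, by_week, by_district_week)
```

In this toy data, rolling three days into one week removes only one row, because most product and store combinations sold on just one day. Rolling stores up to the district as well gives the larger reduction.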
Since we are trying to reduce the number of rows a query must process,
one of the key steps is finding aggregates where the intersection of
dimensions shows a significant decrease in the number of rows. The figure
below shows the row counts for all possible aggregates of product by store
by day, using one year of data for a 200-store retail grocer.
The granularity of the fact table is product by store by day. This means the
base level in the geography dimension is the store level. All fact rows will
have as part of their key the store key from a row in this dimension. The
hierarchy in the dimension is: store, district, region, all stores. There is no