
Computer Science Faculty Information Systems Department: Data Warehousing & BI

The document discusses dimensional modeling and its benefits over entity relationship modeling. Dimensional modeling simplifies data models by organizing data into fact and dimension tables. This results in a star schema that is easier for users to understand and navigate, and improves query performance for decision support systems. The key features of dimensional modeling include a central fact table linked to smaller dimension tables through foreign keys, allowing for simplified yet high performance analysis of business data.

Computer Science Faculty

Information Systems Department

Data warehousing & BI


Abdul Rahman Safi
Rafiullah Momand
With materials taken from Dr. Marcela Charfuelan, Dr. Ahsan Abdullah, Dr. Michael Mannino, Dr. Jahangir Karimi
Dimensional Modeling
Objectives
• Why Dimensional Modelling?
• Features of Dimensional Modelling
• Dimensional Modelling Process
• Issues of Dimensional Modelling

Information Systems Department 3


The need for ER modeling?
• Problems with early data processing systems.
• Data redundancies
• From flat file to table: each entity ultimately becomes a table in the
physical schema.
• Simple O(n²) join to work with tables


Why has ER Modeling been so successful?
• Coupled with normalization, it drives all the redundancy out of the
database.
• Change (or add or delete) the data at just one point.
• Can be used with indexing for very fast access.
• Resulted in the success of OLTP systems.


Need for DM: Un-answered Qs
• Let's have a look at a typical ER data model first.
• Some observations:
• All tables look alike; as a consequence it is difficult to identify:
• Which table is more important?
• Which is the largest?
• Which tables contain numerical measurements of the business?
• Which tables contain nearly static descriptive attributes?


Need for DM: Complexity of Representation

• Many topologies for the same ER diagram, all appearing different.
• Very hard to visualize and remember.
[Figure: two different layouts of the same ER diagram, with tables numbered 1 to 12]
• A large number of possible connections between any two (or more) tables

Need for DM: The Paradox
• The Paradox: Trying to make information accessible using tables resulted in an
inability to query them!
• ER and normalization result in a large number of tables which are:
• Hard to understand by the users (DB programmers)
• Hard to navigate optimally by DBMS software
• The real value of ER is in using tables individually or in pairs
• Too complex for queries that span multiple tables with a large number of records


ER vs. DM
• ER: Constituted to optimize OLTP performance.
  DM: Constituted to optimize DSS query performance.
• ER: Models the micro relationships among data elements.
  DM: Models the macro relationships among data elements with an overall deterministic strategy.
• ER: A wild variability of the structure of ER models.
  DM: All dimensions serve as equal entry points to the fact table.
• ER: Very vulnerable to changes in the user's querying habits, because such schemas are asymmetrical.
  DM: Changes in users' querying habits can be accommodated by automatic SQL generators.


How to simplify an ER data model?
• Two general methods:
• De-Normalization
• Dimensional Modeling (DM)


What is DM?
• A simpler logical model optimized for decision support.
• Inherently dimensional in nature, with a single central fact table and a
set of smaller dimensional tables.
• Multi-part key for the fact table
• Dimensional tables with a single-part PK.
• Keys are usually system generated



What is DM?
• Results in a star-like structure, called a star schema or star join.
• All relationships are mandatory M-1.
• Single path between any two levels.
• Supports ROLAP operations.


Dimensions have Hierarchies
[Figure: hierarchy of the Items dimension]
Items
  Books
    Fiction
    Text
      Engg
      Medical
  Clothes
    Men
    Women

Analysts tend to look at the data through a dimension at a particular “level” in the hierarchy.


The two Schemas

Star
Snow-flake



“Simplified” 3NF (Retail)
[Figure: “simplified” 3NF retail schema, 14 tables linked by 1-M relationships]
• district (CITY, DISTRICT)
• division (DISTRICT, DIVISION)
• (table name not legible) (DIVISION, PROVINCE)
• zone (ZONE, CITY)
• store (STORE#, STREET, ZONE, ...)
• sale_header (RECEIPT#, STORE#, DATE, ...)
• sale_detail (RECEIPT#, ITEM#, ..., $)
• week (DATE, WEEK)
• month (WEEK, MONTH)
• quarter (MONTH, QTR)
• year (QTR, YEAR)
• item_x_cat (ITEM#, CATEGORY)
• item_x_splir (ITEM#, SUPPLIER)
• cat_x_dept (CATEGORY, DEPT)

Vastly Simplified Star Schema
[Figure: star schema; each dimension joins the fact table through a 1-M relationship]
• Fact Table: RECEIPT#, STORE#, ITEM#, DATE, ..., facts (Sale Rs.)
• Geography Dim: STORE#, ZONE, CITY, DISTRICT, DIVISION, PROVINCE
• Product Dim: ITEM#, CATEGORY, DEPT, SUPPLIER
• Time Dim: DATE, WEEK, MONTH, QUARTER, YEAR

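The retail star schema can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not the lecture's own implementation: the table names (sales_fact, geography_dim, product_dim, time_dim), the lower-case column names and every sample row are assumptions made for the example. It shows the multi-part key on the fact table, the single-part key on each dimension, and a typical query that joins the fact table to its dimensions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE geography_dim (
    store_no INTEGER PRIMARY KEY,            -- single-part key
    zone TEXT, city TEXT, district TEXT, division TEXT, province TEXT
);
CREATE TABLE product_dim (
    item_no INTEGER PRIMARY KEY,
    category TEXT, dept TEXT, supplier TEXT
);
CREATE TABLE time_dim (
    sale_date TEXT PRIMARY KEY,
    week INTEGER, month INTEGER, quarter INTEGER, year INTEGER
);
CREATE TABLE sales_fact (
    receipt_no INTEGER,
    store_no INTEGER REFERENCES geography_dim(store_no),
    item_no INTEGER REFERENCES product_dim(item_no),
    sale_date TEXT REFERENCES time_dim(sale_date),
    sale_rs REAL,                            -- the numeric fact
    PRIMARY KEY (receipt_no, store_no, item_no, sale_date)  -- multi-part key
);
""")

# Hypothetical sample data.
conn.executemany("INSERT INTO geography_dim VALUES (?, ?, ?, ?, ?, ?)", [
    (1, "Zone 1", "City A", "District A", "Division A", "Province A"),
    (2, "Zone 2", "City B", "District B", "Division B", "Province B"),
])
conn.executemany("INSERT INTO product_dim VALUES (?, ?, ?, ?)", [
    (10, "Books", "Media", "Supplier 1"),
    (11, "Clothes", "Apparel", "Supplier 2"),
])
conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?, ?, ?)", [
    ("2024-01-01", 1, 1, 1, 2024),
    ("2024-01-02", 1, 1, 1, 2024),
])
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?)", [
    (101, 1, 10, "2024-01-01", 500.0),
    (101, 1, 11, "2024-01-01", 300.0),
    (102, 2, 10, "2024-01-02", 200.0),
    (103, 1, 10, "2024-01-02", 100.0),
])

# Star join: the fact table is joined to each dimension exactly once.
rows = conn.execute("""
    SELECT g.city, p.category, SUM(f.sale_rs)
    FROM sales_fact f
    JOIN geography_dim g ON g.store_no = f.store_no
    JOIN product_dim p ON p.item_no = f.item_no
    JOIN time_dim t ON t.sale_date = f.sale_date
    WHERE t.year = 2024
    GROUP BY g.city, p.category
    ORDER BY g.city, p.category
""").fetchall()
print(rows)
```

Every dimension is one join away from the fact table, so the query needs no knowledge of how city, district and province relate to each other.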
The Benefit of Simplicity

Beauty lies in close correspondence with the business, evident even to business users.


Features of Star Schema
• Dimensional hierarchies are collapsed into a single table for each
dimension. Loss of information?
• A single fact table is created, with a single header from the detail
records, resulting in:
• A vastly simplified physical data model!
• Fewer tables.
• Fewer joins, resulting in high performance.
• Some requirement of additional space.


Quantifying space requirement
Quantifying use of additional space using star schema
There are about 10 million mobile phone users in Pakistan.
Say the top company has half of them = 5,000,000
Number of days in 1 year = 365
Number of calls recorded each day = 250,000,000 (assumed)
Maximum number of records in fact table = 365 x 250,000,000 ≈ 91 billion rows
Assuming a relatively small header size = 128 bytes
Fact table storage used ≈ 11 Tera bytes
Average length of city name = 8 characters ≈ 8 bytes
Total number of cities with telephone access = 170 (fits in a 1-byte code)
Space used for city name in fact table using Star = 8 x 0.091 = 0.728 TB
Space used for city code using snow-flake = 1 x 0.091 = 0.091 TB
Additional space used ≈ 0.637 Tera byte i.e. about 5.8%
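The storage estimate can be reproduced with a few lines of arithmetic. The sketch below starts from the slide's figure of 91 billion fact rows and assumes 1 TB = 10^12 bytes; the variable names are made up for illustration. With unrounded figures the overhead comes out near 5.5%, since it reduces to (8 - 1) / 128; the slide's 5.8% appears to follow from rounding the fact table down to 11 TB.

```python
TB = 10**12                      # assumption: decimal terabytes

rows = 91_000_000_000            # fact-table rows, as stated on the slide
header_bytes = 128               # bytes per fact row
city_name_bytes = 8              # star: city name stored in each fact row
city_code_bytes = 1              # snow-flake: 1-byte code (170 cities < 256)

fact_tb = rows * header_bytes / TB
star_tb = rows * city_name_bytes / TB
snowflake_tb = rows * city_code_bytes / TB
extra_tb = star_tb - snowflake_tb
extra_pct = extra_tb / fact_tb * 100

print(f"fact table: {fact_tb:.2f} TB")
print(f"extra space for star: {extra_tb:.3f} TB ({extra_pct:.1f}%)")
```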


Process of Dimensional
Modeling
The Process of Dimensional Modeling
Four Step Method from ER to DM

1. Choose the Business Process


2. Choose the Grain
3. Choose the Facts
4. Choose the Dimensions



Step-1: Choose the Business Process
• A business process is a major operational process in an organization.

• Typically supported by a legacy system (database) or an OLTP.


• Examples: Orders, Invoices, Inventory etc.

• Business processes are often termed data marts, which is why
many people criticize DM as being data-mart oriented.


Step-1: Separating the Process

Star-1

Snow-flake

Star-2


Step-2: Choosing the Grain
• Grain is the fundamental, atomic level of data to be represented.

• Grain is also termed the unit of analysis.

• Example grain statements

• Typical grains
• Individual Transactions
• Daily aggregates (snapshots)
• Monthly aggregates

• Relationship between grain and expressiveness.

• Grain vs. hardware trade-off.



Step-2: Relationship b/w Grain
• LOW granularity: two aggregates per week, 2 x 4 = 8 values
• Four aggregates per week, 4 x 4 = 16 values
• HIGH granularity: daily aggregates, 6 x 4 = 24 values


The case FOR data aggregation
• Works well for repetitive queries.
• Follows the known thought process.

• Justifiable if used for max number of queries.

• Provides a “big picture” or macroscopic view.

• Application dependent, usually inflexible to business changes
(remember the lack of absoluteness of conventions).


The case AGAINST data aggregation
• Aggregation is irreversible.
• Can create monthly sales data from weekly sales data, but the reverse is not
possible.

• Aggregation limits the questions that can be answered.


• What, when, why, where, what-else, what-next
• Aggregation can hide crucial facts.
• The average of 100 & 100 is the same as the average of 150 & 50


Aggregation hides crucial facts: Example

          Week-1  Week-2  Week-3  Week-4  Average
Zone-1    100     100     100     100     100
Zone-2    50      100     150     100     100
Zone-3    50      100     100     150     100
Zone-4    200     100     50      50      100
Average   100     100     100     100

Just looking at the averages, i.e. the aggregates, every zone and every week looks identical.
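The table can be checked directly. The short sketch below uses the zone figures from the slide; the spread (maximum minus minimum) is an extra measure chosen here for illustration, not one the slides use. It shows that four very different sales patterns share the same average.

```python
from statistics import mean

# Weekly sales per zone, copied from the slide's table.
sales = {
    "Zone-1": [100, 100, 100, 100],
    "Zone-2": [50, 100, 150, 100],
    "Zone-3": [50, 100, 100, 150],
    "Zone-4": [200, 100, 50, 50],
}

averages = {zone: mean(weeks) for zone, weeks in sales.items()}
# Spread is an illustrative choice of detail that the average throws away.
spread = {zone: max(weeks) - min(weeks) for zone, weeks in sales.items()}

for zone in sales:
    print(zone, "average:", averages[zone], "spread:", spread[zone])
```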


Aggregation hides crucial facts chart
[Chart: weekly sales for zones Z1 to Z4 over Week-1 to Week-4; vertical axis from 0 to 250]

• Z1: Sale is constant (need to work on it)
• Z2: Sale went up, then fell (cause for concern)
• Z3: Sale is on the rise, why?
• Z4: Sale dropped sharply, need to look deeply.
• W2: Static sale
Step 3: Choose Facts statement

“We need monthly sales volume and Rs. by week, product and Zone”
• Facts: monthly sales volume and Rs.
• Dimensions: week, product and Zone


Step 3: Choose Facts
• Choose the facts that will populate each fact table record.
• Remember that the best facts are numeric, continuously valued and additive.
• Example: Quantity Sold, Amount etc.


Step 4: Choose Dimensions

• Choose the dimensions that apply to each fact in the fact table.
• Typical dimensions: time, product, geography etc.
• Identify the descriptive attributes that explain each dimension.
• Determine hierarchies within each dimension.
Step-4: How to Identify a Dimension?
• The single-valued attributes during recording of a transaction are
dimensions.
Fact Table
• Calendar_Date (Dim)
• Time_of_Day (Dim)
• Account_No (Dim)
• ATM_Location (Dim)
• Transaction_Type (Dim)
• Transaction_Rs (fact)

Time_of_Day: Morning, Mid Morning, Lunch Break etc.
Transaction_Type: Withdrawal, Deposit, Check balance etc.


Step-4: Can Dimensions be Multi-valued?
• Are dimensions ALWAYS single-valued?
• Not really
• What are the problems, and how to handle them?
• Calendar_Date (of inspection)
• Reg_No
• Technician
• Workshop
• Maintenance_Operation
• How many maintenance operations are possible?
• Few
• Maybe more for old cars.


Step-4: Dimensions & Grain
• Several grains are possible as per business requirement.

• For some aggregations certain descriptions do not remain atomic.

• Example: Time_of_Day may change several times during daily aggregate, but
not during a transaction

• Choose the dimensions that are applicable within the selected grain.
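The Time_of_Day example can be made concrete. In this minimal sketch the transactions are invented ATM records (date, time of day, ATM, amount in Rs.), loosely following the earlier ATM fact table. Rolling them up to a daily grain shows that one daily row would have to carry several Time_of_Day values, so that attribute no longer qualifies as a dimension at this grain.

```python
from collections import defaultdict

# Hypothetical transaction-grain records: (date, time_of_day, atm, rs).
transactions = [
    ("2024-01-01", "Morning", "ATM-1", 500),
    ("2024-01-01", "Lunch Break", "ATM-1", 200),
    ("2024-01-02", "Morning", "ATM-1", 300),
]

daily_total = defaultdict(int)   # fact at the daily grain
times_seen = defaultdict(set)    # Time_of_Day values behind each daily row

for date, time_of_day, atm, rs in transactions:
    daily_total[(date, atm)] += rs
    times_seen[(date, atm)].add(time_of_day)

# Daily rows for which Time_of_Day is no longer single-valued.
multi_valued = {key for key, values in times_seen.items() if len(values) > 1}

print(dict(daily_total))
print(multi_valued)
```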





Issues of Dimensional
Modeling
Additive vs. Non-Additive facts
• Additive facts are easy to work with
• Summing the fact value gives meaningful results
• Additive facts:
• Quantity sold
• Total Rs. sales

Month    Crates of Bottles Sold
May      14
Jun.     20
Jul.     24
TOTAL    58

• Non-additive facts:
• Averages (average sales price, unit price)
• Percentages (% discount)
• Ratios (gross margin)
• Count of distinct products sold

Month    % discount
May      10
Jun.     8
Jul.     6
TOTAL    24% ← Incorrect!

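A small calculation shows why the 24% total is wrong and what can be done instead. The crate counts and discount percentages below come from the slide; the monthly list-price sales in Rs. are hypothetical values added only so that an overall discount can be computed. The usual remedy, sketched here, is to store the additive components (discount in Rs., sales in Rs.) and derive the ratio after summing.

```python
crates_sold = {"May": 14, "Jun": 20, "Jul": 24}     # additive fact (slide)
discount_pct = {"May": 10, "Jun": 8, "Jul": 6}      # non-additive fact (slide)
list_price_rs = {"May": 1400, "Jun": 2000, "Jul": 2400}  # hypothetical

total_crates = sum(crates_sold.values())            # meaningful
naive_pct_total = sum(discount_pct.values())        # the slide's incorrect 24

# Convert each percentage to an additive amount, sum, then take the ratio.
discount_rs = {m: list_price_rs[m] * discount_pct[m] / 100 for m in discount_pct}
overall_pct = sum(discount_rs.values()) / sum(list_price_rs.values()) * 100

print(total_crates, naive_pct_total, round(overall_pct, 2))
```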
Classification of Aggregation Functions
• How hard to compute aggregate from sub-aggregates?
• Three classes of aggregates:
• Distributive
• Compute aggregate directly from sub-aggregates
• Examples: MIN, MAX, COUNT, SUM

• Algebraic
• Compute aggregate from constant-sized summary of subgroup
• Examples: STDDEV, AVERAGE
• For AVERAGE, summary data for each group is SUM, COUNT

• Holistic
• Require unbounded amount of information about each subgroup
• Examples: MEDIAN, COUNT DISTINCT
• Usually impractical for a data warehouse!
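
The three classes can be illustrated on a small data set split into partitions (sub-groups). The numbers are arbitrary and the variable names are assumptions for the example. SUM is combined directly from sub-totals, AVERAGE is recovered from the constant-sized (SUM, COUNT) summary of each partition, and MEDIAN cannot be recovered from per-partition medians.

```python
from statistics import median

# Arbitrary example data, already split into sub-groups.
partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 100]]
all_values = [x for part in partitions for x in part]

# Distributive: the sum of the sub-sums is the overall sum.
total = sum(sum(part) for part in partitions)

# Algebraic: keep (SUM, COUNT) per partition, then combine.
summaries = [(sum(part), len(part)) for part in partitions]
average = sum(s for s, _ in summaries) / sum(n for _, n in summaries)

# Holistic: the median of sub-medians is generally NOT the overall median.
median_of_medians = median(median(part) for part in partitions)
true_median = median(all_values)

print(total, average, median_of_medians, true_median)
```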



Not recording Facts
• Transactional fact tables don’t have records for events that don’t
occur
• Example: No records (rows) for products that were not sold.
• This has both an advantage and a disadvantage.
• Advantage: Benefit of sparsity of data
• Significantly less data to store for “rare” events
• Disadvantage: Lack of information
• Example: What products on promotion were not sold?


A Fact-less Fact Table
• “Fact-less” fact table
• A fact table without numeric fact columns

• Captures relationships between dimensions

• Use a dummy fact column that always has value 1



Example: Fact-less Fact Tables
Examples:
• Department/Student mapping fact table

• What is the major for each student?

• Which students did not enroll in ANY course?

• Promotion coverage fact table


• Which products were on promotion in which stores for which days?

• Kind of like a periodic snapshot fact
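
A promotion coverage table can be sketched with sqlite3 to answer the earlier question of which promoted products were not sold. The table and column names (promotion_coverage, sales_fact, on_promotion) and all rows are assumptions made for this example. The coverage table has no measured fact, only the dummy column fixed at 1, and the answer is obtained by subtracting the sales rows from the coverage rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE promotion_coverage (       -- fact-less fact table
    product TEXT, store TEXT, day TEXT,
    on_promotion INTEGER DEFAULT 1,     -- dummy fact, always 1
    PRIMARY KEY (product, store, day)
);
CREATE TABLE sales_fact (product TEXT, store TEXT, day TEXT, qty INTEGER);

-- Hypothetical rows: three products on promotion, two of them sold.
INSERT INTO promotion_coverage (product, store, day) VALUES
    ('P1', 'S1', '2024-01-01'),
    ('P2', 'S1', '2024-01-01'),
    ('P3', 'S1', '2024-01-01');
INSERT INTO sales_fact VALUES
    ('P1', 'S1', '2024-01-01', 5),
    ('P3', 'S1', '2024-01-01', 2);
""")

# Summing the dummy column counts the promoted product/store/day combinations.
promoted = conn.execute(
    "SELECT SUM(on_promotion) FROM promotion_coverage").fetchone()[0]

# Promoted but not sold: coverage rows with no matching sales row.
unsold = conn.execute("""
    SELECT product, store, day FROM promotion_coverage
    EXCEPT
    SELECT product, store, day FROM sales_fact
""").fetchall()

print(promoted, unsold)
```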



Handling Multi-valued Dimensions?
• One of the following approaches is adopted:

• Drop the dimension.

• Use a primary value as a single value.

• Add multiple values in the dimension table.

• Use “Helper” tables.
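
One way to read the “Helper” table approach is sketched below for the vehicle maintenance example. The layout is an assumption rather than the lecture's own design: each inspection row carries a single hypothetical op_group key, and a helper table maps that key to the several maintenance operations performed, so the fact row itself stays single-valued.

```python
# Hypothetical inspection facts: one row per inspection.
inspection_fact = [
    {"reg_no": "CAR-1", "technician": "T1", "op_group": 1, "cost_rs": 900},
    {"reg_no": "CAR-2", "technician": "T2", "op_group": 2, "cost_rs": 400},
]

# Helper table: (op_group, maintenance_operation), many rows per group.
operation_helper = [
    (1, "Oil change"),
    (1, "Brake check"),
    (2, "Oil change"),
]

def operations_for(fact_row):
    """Return every maintenance operation linked to one inspection row."""
    return [op for group, op in operation_helper if group == fact_row["op_group"]]

# Count inspections per operation by going through the helper table.
inspections_per_operation = {}
for row in inspection_fact:
    for op in operations_for(row):
        inspections_per_operation[op] = inspections_per_operation.get(op, 0) + 1

print(operations_for(inspection_fact[0]))
print(inspections_per_operation)
```

Summing cost_rs through the helper table would count the same inspection once per operation, so additive facts need care when this approach is used.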



OLTP & Slowly Changing Dimensions
• OLTP systems are not good at tracking the past, even though history never changes.
• OLTP systems are not “static”: they are always evolving, with data changed by overwriting.
• OLTP systems cannot track history; data is purged after 90 to 180 days.
• In fact, an OLTP system does not want to keep historical data.


DWH Dilemma: Slowly Changing Dimensions
• It is the responsibility of the DWH to track the changes.
• Example: Slight change in description, but the product ID (SKU) is not changed.
• Dilemma: To track both old and new descriptions, what should be used
for the key? And where should the two values of the changed ingredient
attribute be stored?


Explanation of Slowly Changing Dimensions…
• Compared to fact tables, contents of dimension tables are relatively
stable.
• New sales transactions occur constantly.
• New products are introduced rarely.
• New stores are opened very rarely.

• The assumption does not hold in some cases:
• Certain dimensions evolve with time
• e.g. description and formulation of products change with time
• Customers get married and divorced, have children, change addresses etc.
• Land changes ownership etc.
• Changing names of sales regions.


Explanation of Slowly Changing Dimensions…

Although these dimensions change, the change is not rapid.

They are therefore called “Slowly” Changing Dimensions.


Handling Slowly Changing Dimensions
• Option-1: Overwrite History
• Example: The code for a city or product was entered incorrectly.

• Just overwrite the record changing the values of modified attributes.

• No keys are affected.

• No changes needed elsewhere in the DM.

• Cannot track history and hence not a good option in DSS.



Handling Slowly Changing Dimensions

• Option-2: Preserve History
• Example: The packaging of a part changes from glued box to stapled box, but the code
assigned (SKU) is not changed.
• Create an additional dimension record at the time of change with the new attribute
values.
• Segments history accurately between the old and new descriptions.
• Requires appending two to three version numbers to the end of the key, e.g. SKU#+1, SKU#+2
etc.


Handling Slowly Changing Dimensions

• Option-3: Create current valued field
• Example: The name and organization of the sales regions change over time,
and users want to know how sales would have looked with the old regions.
• Add a new field called current_region and rename the old one to previous_region.
• Sales record keys are not changed.
• Only the TWO most recent changes can be tracked.
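The three options can be compared on a toy dimension held as Python dictionaries. This is a simplified sketch: the function names, the row layout and the region names are invented for illustration, and the version suffix follows the slide's SKU#+1 convention.

```python
def overwrite(dim, sku, attr, value):
    """Option-1: overwrite in place; the old value is lost."""
    for row in dim:
        if row["sku"] == sku:
            row[attr] = value

def add_row(dim, sku, attr, value):
    """Option-2: append a new versioned row; older rows are preserved."""
    versions = [row for row in dim if row["sku"] == sku]
    new_row = dict(versions[-1])
    new_row[attr] = value
    new_row["key"] = f"{sku}+{len(versions)}"   # SKU#+1, SKU#+2, ...
    dim.append(new_row)

def add_current_field(row, attr, value):
    """Option-3: keep only previous_<attr> and current_<attr>."""
    old = row.pop(attr) if attr in row else row[f"current_{attr}"]
    row[f"previous_{attr}"] = old
    row[f"current_{attr}"] = value

# Option-1: history is gone after the change.
opt1 = [{"key": "SKU1", "sku": "SKU1", "packaging": "glued box"}]
overwrite(opt1, "SKU1", "packaging", "stapled box")

# Option-2: both descriptions survive under different keys.
opt2 = [{"key": "SKU1", "sku": "SKU1", "packaging": "glued box"}]
add_row(opt2, "SKU1", "packaging", "stapled box")

# Option-3: after a second change the original value ("North") is lost.
opt3 = {"key": "R1", "region": "North"}
add_current_field(opt3, "region", "North-East")
add_current_field(opt3, "region", "Upper")

print(opt1)
print(opt2)
print(opt3)
```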


Pros and Cons of Handling
• Option-1: Overwrite existing value
+ Simple to implement
- No tracking of history

• Option-2: Add a new dimension row


+ Accurate historical reporting
+ Pre-computed aggregates unaffected
- Dimension table grows over time

• Option-3: Add a new field


+ Accurate historical reporting to last TWO changes
+ Record keys are unaffected
- Dimension table size increases

Note: For detailed analysis, refer to chapter two of Dimensional Modelling (third edition).
You will find 7 options on SCD.


Summary
• Dimensional Modelling
• Dimensional Modelling Process
• Issues of Dimensional Modelling
