Data Modelling 242

The document discusses data modeling concepts for data warehouses. It defines a data warehouse as a collection of corporate information used to support business decisions. It describes characteristics of data warehouses like subject-oriented data, read-only queries, and pre-aggregated data. The document also outlines components of a data warehouse like source systems, a staging area, and data marts. It discusses the purpose and optional nature of staging areas and operational data stores. Finally, it covers data modeling concepts such as the purpose of data models, impact of analysis techniques on modeling, and levels of modeling from conceptual to logical to physical.

Uploaded by

Bhaskar Reddy

Data Modeling

Data Warehouse Defined


A data warehouse is a collection of corporate
information, derived directly from
operational systems and some external data
sources. Its specific purpose is to support
business decisions, not business operations

Characteristics of a DW
Subject-oriented Data
collects all data for a subject, from different sources

Read-only Requests
loaded during off-hours, read-only during day hours

Interactive Features, ad-hoc query


flexible design to handle spontaneous user queries

Pre-aggregated data
to improve runtime performance

Highly denormalized data structures


fat tables with redundant columns

Components of a Data Warehouse

Source Systems

Data Staging Area
Storage: Flat Files, RDBMS
Processing
No User Query Services

DWH Servers
Data Mart 1: Dimensional, Conforms to DW Bus
Data Mart 2

End User Data Access
Query Tools
Report Writers
Mining Tools

STAGING AREA - SOME CLARITY
Staging Area
Optional
Used to cleanse the source data
Accepts data from different sources
A data model is required for the staging area
Multiple data models may be required: for parking data from different sources and for transformed data to be pushed out to the warehouse

ODS - SOME CLARITY
Operational Data Store
Optional
Granular, detailed-level data
May feed the warehouse (e.g. when the warehouse is aggregated)
Usually a relational model
May keep data for a shorter time period than the warehouse

Data Modeling
WHAT IS A DATA MODEL???
A data model is an abstraction of some aspect of the real
world (system).
WHY A DATA MODEL???

Helps to visualise the business


A model is a means of communication.
Models help elicit and document requirements.
Models reduce the cost of change.
The model is the essence of the DW architecture, on which the DW implementation is based

What do we want to do with the data?
The model depends on what kind of data analysis we want to do:
Different Data Analysis Techniques
Query and reporting
Display Query Results

Multidimensional analysis
Analyse data content by looking at it in different
perspectives

Data mining
discover patterns and clustering attributes in data

Impact of Data Analysis Techniques on DM
Query and reporting
Normalized data model
Select associated data elements
Summarize and group by category
Present results
Direct table scan
An ER model, normalized or denormalized, is appropriate

Query and reporting

Requirements of a Decision Support Query Environment
To provide a method for testing hypotheses (e.g. what-if analysis)
To allow ad-hoc queries
To allow human input (a DSS makes decisions together with its users)
Expects user knowledge of the problem
To simulate the behaviour of a real-world problem

Impact of Data Analysis Techniques on DM
Multidimensional analysis
Fast and easy access to data
Any number of analysis dimensions in any combination
An ER model will mean many joins
A dimensional model is appropriate

Multidimensional Analysis

Data Mining
Discovers unusual patterns
Requires low-level, detailed data

A look at different warehouse architectures

[Figure: warehouse architecture. Labels: Operational Data, External data, Detailed Information, Summary information, Meta Data, Warehouse Manager, OLAP, plus two further "Manager" components]

Data Warehouse Architecture - 2

Data Warehouse Architecture - 3

Data Warehouse Architecture - 4

DW Architectures
Architecture Choices depend on

Current infrastructure
Business environment
Desired management and control structure
resources
commitment ..

Data Warehouse/data mart

DW Architectures
Architecture Choices determine
Where will DW reside?
Centrally / locally / distributed

Where will it be managed from?


Centrally / independently

3 choices
Global
Independent
Interconnected

or a combination of the three

DW Architectures
Global Architecture

related to scope of data access and storage


does not mean centralized
can be physically centralized or distributed
enterprise view of data
time-consuming & costly to implement

Global Architecture

DW Architectures
Independent Architecture

stand-alone
controlled by a department
minimal integration
no global view
very fast to implement

DW Architectures
Interconnected Architecture

distributed
integrated and interconnected
gives a global view of enterprise
more complexity
who manages / controls data
another tier in architecture to share common data
between multiple data marts
have a data sharing schema across data marts

Independent and Interconnected Architecture

Types of Data Warehouse


Enterprise Data Warehouse
Data Mart
Enterprise
Data Warehouse

Datamart

Datamart

Datamart

Enterprise data warehouse


Contains data drawn from multiple operational
systems
Supports time-series and trend analysis across different business areas
Can be used as a transient storage area to clean
all data and ensure consistency
Can be used to populate data marts
Can be used for everyday and strategic
decision making

Data Mart

Logical subset of enterprise data warehouse


Organized around a single business process
Based on granular data
May or may not contain aggregates
Object of analytical processing by the end user.
Less expensive and much smaller than a full
blown corporate data warehouse.

Distributed and Centralized Data Warehouses
A DW sitting on a single monolithic machine is unrealistic
Separate machines, different OS, different DB systems: the reality

Solution
Share a uniform architecture to allow them to be fused coherently

Classical Architectures
Physical data warehouse (physical)
Data warehouse --> data marts
Data marts --> data warehouse
Parallel data warehouse and data marts

Physical Data Warehouse: Data Warehouse --> Data Marts
[Figure labels: Source Data (Operational Data, External Data), Staging Area, Data Warehouse, Data Marts]

Physical Data Warehouse: Data Marts --> Data Warehouse
[Figure labels: Source Data (Operational Data, External Data), Staging Area, Data Marts, Data Warehouse]

Physical Data Warehouse: Parallel Data Warehouse & Data Marts
[Figure labels: Source Data (Operational Data, External Data), Staging Area, Data Warehouse, Data Marts]

DW Implementation Approaches

Top Down
Bottom-up
Combination of both
Choices depend on:

current infrastructure
resources
architecture
ROI
Implementation speed

Top Down Implementation

Bottom Up Implementation

DW Implementation Approaches
Top Down
More planning and design initially
Involves people from different work-groups and departments
Data marts may be built later from the Global DW
Overall data model to be decided up-front

Bottom Up
Can plan initially without waiting for global infrastructure
Built incrementally
Can be built before or in parallel with the Global DW
Less complexity in design

DW Implementation Approaches
Top Down
Consistent data definition and enforcement of business rules across the enterprise
High cost, lengthy and time-consuming process
Works well when there is a centralized IS department responsible for all H/W and resources

Bottom Up
Data redundancy and inconsistency between data marts may occur
Integration requires great planning
Less cost of H/W and other resources
Faster pay-back

DW Implementation Approaches
Combined Approach
Determine degree of planning and design for a global
approach to integrate data marts being built by bottom-up
approach
Develop base level infrastructure definition for global DW
at business level
Develop plan to handle data elements needed by multiple
data marts
Build a common data store to be used by data marts and
global DW

Levels of modeling

Business
Process
Conceptual

Logical
Model

Physical
Model

Levels of modeling
Conceptual modeling
Describe data requirements from a business
point of view without technical details

Logical modeling
Refine conceptual models
Data structure oriented, platform independent

Physical modeling
Detailed specification of what is physically
implemented using specific technology

Conceptual Model
A conceptual model shows data through
business eyes.
All entities which have business meaning.
Important relationships
Few significant attributes in the entities.
Few identifiers or candidate keys.

Sample conceptual model
[Figure: entities Products, Customer Invoices, Customers, Sales Reps, Customer Addresses, Geographic Boundaries]

Logical Model
Replaces many-to-many relationships with
associative entities.
Defines a full population of entity attributes.
May use non-physical entities for domains
and sub-types.
Establishes entity identifiers.
Has no specifics for any RDBMS or
configuration.

Sample logical model

CUSTOMER INVOICE
#INVOICE ID
#LINE ITEM SEQ
.INVOICE DATE

PRODUCT
#PRODUCT CODE
.PRODUCT DESCRIPTION

CUSTOMER ADDRESS
#CUSTOMER ID
#ADDRESS ID

CUSTOMER
#CUSTOMER ID
#SNAPSHOT DATE
.CUSTOMER NAME

SALES REP
#SALES REP ID

GEOGRAPHIC BOUNDARY
#GEO CODE

[Figure: relationship lines between these entities; labels include "the bill for", "purchased by", "sold by", "the bill sent to", "purchased at", "located within", "managed by", "the salesman for", "the sales manager for" and "the general location of"]

Physical Model
A Physical data model may include

Referential Integrity
Indexes
Views
Alternate keys and other constraints
Tablespaces and physical storage objects.

PRODUCTS
# PRODUCT_CODE
PRODUCT_DESCRIPTION
CATEGORY_CODE
CATEGORY_DESCRIPTION

SALES_REPS
#SALES_REP_ID
LAST_NAME
FIRST_NAME
oMANAGER_FIRST_NAME
oMANAGER_LAST_NAME

CUSTOMER_INVOICES
#INVOICE_ID
#LINE_ITEM_SEQ
INVOICE_DATE
CUSTOMER_ID
BILL_TO_ADDRESS_ID
SALES_REP_ID
MANAGER_REP_ID
ORGANIZATION_ID
ORG_ADDRESS_ID
PRODUCT_CODE
QUANTITY
UNIT_PRICE
AMOUNT
oPRODUCT_COST
LOAD_DATE

CUSTOMERS
#CUSTOMER_ID
#SNAPSHOT_DATE
CUSTOMER_NAME
oAGE
oMARITAL_STATUS
CREDIT_RATING

CUSTOMER_ADDRESSES
#CUSTOMER_ID
#ADDRESS_ID
ADDRESS_LINE1
oADDRESS_LINE2
oPOSTAL_CODE
SALES_REP_ID
GEO_CODE
LOAD_DATE

GEOGRAPHIC_BOUNDARIES
#GEO_CODE
CITY_NAME
STATE_NAME
COUNTRY_NAME
oCITY_ABBRV
oSTATE_ABBRV
oCOUNTRY_ABBRV

Sample Physical Model

Data Architecting
What is data architecting???
Structure and locate data according to its
characteristics
3 Basic types of data
Real time data
Derived data
Reconciled data

Data Architecting - Real time data
Represents current status of business
Used by operational systems to run business
Changes as operational transactions are processed
Very detailed, low level of granularity

Data Architecting - Real time data
To use real-time data in a DW, it must be:
Cleansed (it comes from different sources and is cleansed to ensure data consistency and quality)
Summarized (because it contains individual, detailed transactions)
Transformed into an easily understandable format for manipulation by analysts
E.g. different units of measure, currency, exchange rates
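
The transformation step can be sketched in a few lines of Python. This is only an illustration: the record layout, the source names, the exchange rates and the unit divisors are all made-up assumptions, not values from the slides.

```python
# Hypothetical exchange rates to one reporting currency (made-up values).
RATES_TO_USD = {"USD": 1.0, "EUR": 2.0}
# Hypothetical divisors that convert a weight unit to kilograms.
UNITS_PER_KG = {"kg": 1.0, "g": 1000.0}

def transform(record):
    """Return a copy of the record expressed in USD and kilograms."""
    return {
        "source": record["source"],
        "amount_usd": record["amount"] * RATES_TO_USD[record["currency"]],
        "weight_kg": record["weight"] / UNITS_PER_KG[record["unit"]],
    }

# Hypothetical rows arriving from two operational systems.
raw = [
    {"source": "orders_eu", "amount": 200.0, "currency": "EUR", "weight": 500.0, "unit": "g"},
    {"source": "orders_us", "amount": 100.0, "currency": "USD", "weight": 2.0, "unit": "kg"},
]
transformed = [transform(r) for r in raw]
```

After this step every row uses the same currency and unit, so analysts can compare or sum values across sources.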

Data Architecting - Derived data


Data created by summarizing, aggregating,
averaging real-time data through some process
represents a view of business data at a specific
time
Historical record of business over a period
Precalculate derived data elements and summarize
detailed data to improve query processing

Data Architecting - Reconciled data
Real-time data cleansed, adjusted and enhanced to provide an integrated source of data for analysis
Create and maintain historical data while reconciling
Normally not explicitly defined
Logical result of derivation operations
May be stored as temporary files used to transform operational data for consistency

Enterprise Data Model (EDM)


Consistent definition of data elements common to
a business
High-level business view
Generic logical data model
Physical data design

EDM - The Phased Enterprise Data Model

Enterprise Data Model (EDM)


Phases
Increasing order of Information required
Information Planning
Business Analysing
Logical Data Modeling
Physical Data design

Enterprise Data Model (EDM)


Information Planning
Consolidated view of the business
Identify some business concepts (20-30), called subject areas / super entities / business entities, in which the organization is interested, e.g. customer, product
Purpose
To set up scope and architecture of DW
To provide a single comprehensive point of view

Enterprise Data Model (EDM)


Business Analyzing
Define contents of primary business concepts.
Gather and arrange business requirements
Defines business terms
Purpose
To set up scope and architecture of DW
To provide a single comprehensive point of view

Enterprise Data Model (EDM)


Logical Data Modeling
Enterprise-wide in scope
consists of several entities, relationships, attributes
complete model in 3rd Normal Form.
Can be divided into 2 types:
Generic logical data model (enterprise level)
Logical application model (application level)

Enterprise Data Model (EDM)


Physical Data Design
space
performance
physical distribution of data
Purpose:
To design for the physical implementation

Enterprise Data Model (EDM)


Is it possible to draw an EDM ???
Not always!!
Phased approach OR a simple EDM

list of subject areas (<25)

define business relationships between subject areas


define contents of each subject area

Granularity
Level of summarization of data elements
Level of detail available in the data
The more detail, the lower the level of granularity
Why is it important in a DW?
Opportunity for TRADE-OFF
performance vs. volume of data stored
ability to access detailed data vs. cost of storage

Granularity

Granularity
To overcome trade-offs between data volume and
query capability :
Divide the data in the DW
Create 2 levels of granularity of data
Detailed Raw data
keep it on separate storage medium
load when required

Summarized data

Data Partitioning Model


WHY?
To understand, maintain and navigate a DW
TYPES of Partitioning
Logical and Physical

Data Partitioning Model


Logical Partitioning - WHY?
Goals:

Data Partitioning Model Logical Partitioning


Partition large volumes of data by splitting
Helps to make data easier to:
Restructure
Index
Sequentially scan
Reorganize
Recover
Monitor

Data Partitioning Model Logical Partitioning


Logical Partition - HOW??
Criteria
Time period (date, month, or quarter)
almost always chosen

Geography (location)
Product (more generically, by line of business)
Organizational unit
A combination of the above
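
As a rough sketch of partitioning by time period, the Python below groups rows by (year, month). The row layout and the sample values are hypothetical; a real warehouse would normally rely on the partitioning features of its DBMS rather than application code.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sales rows: (sale_date, region, amount)
sales = [
    (date(2023, 1, 15), "North", 120),
    (date(2023, 1, 28), "South", 80),
    (date(2023, 2, 3), "North", 200),
]

def partition_by_month(rows):
    """Group rows into logical partitions keyed by (year, month)."""
    partitions = defaultdict(list)
    for row in rows:
        sale_date = row[0]
        partitions[(sale_date.year, sale_date.month)].append(row)
    return dict(partitions)

partitions = partition_by_month(sales)
```

Each partition can then be indexed, scanned, reorganized or recovered on its own. Partitioning by geography or product would only change the key returned for each row.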

Data Partitioning Model Logical Partitioning

Data Partitioning Model - Subject Areas
Subject areas are classified by the topics of interest to the business.
5W1H rule
when, where, who, what, why, and how
e.g. "who" could be customer, employee, manager, supplier, business partner, competitor.

Get a candidate list of subject areas
Decompose, rearrange, select, redefine in more detail

Data Partitioning Model - Subject Areas
Define the business relationships among subject areas
This will determine the dimensions used
Subject areas help define criteria like:
Unit of the data model
Unit of an implementation project
Unit of management of the data
Basis for the integration of multiple implementations

The unit for analysis should be the business process

Data Modeling - Techniques

What needs to be modeled during a data warehouse project
STAGING AREA
YES! (multiple data models may be required)

ODS
YES!

DATA WAREHOUSE / DATA MART
YES!

Data Modeling - Techniques


Modeling techniques
E-R Modeling
Dimensional Modeling

Implementation and modeling styles
Modeling versus implementation
Modeling: describe what should be built to non-technical folks
Implementation: describe what is actually built to technical folks

Implementation and modeling styles (contd.)
Relational modeling
Use for implementation
Difficult to understand by non-technical folks

Dimensional modeling
Use for modeling during analysis and design
phases
Can be implemented using other modeling
styles e.g. object-oriented, relational

E-R Modeling
Produces a data model using two basic concepts: entities and the relationships between those entities.
Detailed ER models also contain attributes,
which can be properties of either the entities
or the relationships.

Conventions used in E-R modeling
[Figure: notation for entities (e.g. EMPLOYEE), attributes (e.g. EmpName, Address) and relationships or associations (e.g. Belongs To)]

Entities
Principal data objects about which information
is to be collected.
Usually recognizable concepts such as person,
things, or events.
Examples : EMPLOYEES, PROJECTS,
INVOICES.

Attributes & Relationships


Attributes describe the entity with which they are associated.
A relationship represents an association between two or more entities. Examples:
Employees are assigned to projects.
Departments manage one or more projects.

Types of Data Relationships - Cardinality
One - One 1:1
One - Many 1:m
Many - Many m:n

Recursive data relationship

Normalization

Remove data redundancy
0 NF - contains repeating groups of values
1 NF - no repeating groups
2 NF - every non-key attribute is dependent on the whole key, not just part of it
3 NF - no non-key attribute is functionally dependent on another non-key attribute (every attribute depends on the key, the whole key and nothing but the key)
Denormalization - carefully introduced redundancy to improve query performance

Normalization - 1NF
Eliminate Repeating groups

Person   Skills
A        Oracle, DB2
B        MS Access, Oracle
C        Oracle, CICS, SQL
D        DB2, CICS

Who are the ones who have DB2 skills???

Normalization - 2NF
Eliminate Redundant data
Skill ID Skill Description
S1 DB2
S2 Oracle
S3 MS Access
S4 CICS
S5 SQL

Normalization - 3NF
Eliminate Columns Not Dependent On Key

Memb ID   Skill ID
A         S1

Comp ID   Comp Name   Location
D1        Core Tech   HYD
Relational modeling
Represents business entities, data items
associated with each entity, and the
relationships of business interest among the
entities
Entities are usually broken down into
smallest possible units and combined using
relationships
Diagram looks like a spiderweb

Entity Completeness Checklist


Name
to describe the data contained
to meet naming conventions/standards

Description
to describe precisely what the entity represents
required for sharing and reuse of data model
components

Category
classifies entities sharing common characteristics

Entity Completeness Checklist (contd.)
Category Types
Fundamental entities (represent basic or core concepts)
Associative/intersecting entities (associate entities to resolve an m:m relationship)
Attributive entities (describe or categorize another entity)
Subtype entities (represent a subset of occurrences of the parent entity)

Entity Completeness Checklist


(contd.)
Abbreviations
document the abbreviation and full definition

Acronyms
avoid (not understood by all, not unique)
if used, document them

Current Number of occurrences


to estimate entity statistics for all entity
categories

Entity Completeness Checklist


(contd.)
Authority
Metadata authority(to approve change of entities,
attributes etc.)
Data authority(to change occurrences of entity)

Primary Key/Foreign Key/Non-key attribute


names
Relationships to other entities
no entity stands by itself

Homonyms
Same or similar in sound or spelling as another
BUT DIFFERENT IN MEANING!!
Create CONFUSION!

IDENTIFY AND ELIMINATE them for entities and attributes!!

Synonyms
Same meaning ...
Same logical concept ...
Assigned different names!!
Introduce redundancy in model!
IDENTIFY AND RESOLVE them - for entities
and attributes!!

Synonyms (contd.)
Compare Definition, Relationships to other entities, Key
structure, attributes, domain values

Attribute Completeness
Checklist
Name
to uniquely identify the attribute
to meet naming conventions/standards

Description
to describe precisely what the attribute represents

Type
refers to how the attribute is used in the datamodel

Completeness Checklist (contd.)
Key attributes
primary keys in the entity in which they are defined
primary / foreign keys in other entities in which they occur
implemented with a unique index

Non-Key attributes
contain the bulk of the information
need not be unique
candidate keys not selected as primary keys
secondary keys may be selected as access paths
implemented using non-unique index

Completeness Checklist (contd.)


Domain
set of permitted values for the attribute
Domain elements
General Domain
describes the manner in which data is represented(data type)
alphanumeric, real, integer, boolean, sound, digital video etc.

Specific Domain
Enumerated domain
specific set of values that are valid and allowed
static values (eg. Flat type : 2 bed, 3 bed, duplex etc)
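
An enumerated domain can be expressed directly in code. The sketch below uses Python's standard `enum` module with the flat-type values from the slide; the class and function names are hypothetical.

```python
from enum import Enum

class FlatType(Enum):
    """Enumerated domain: the only values permitted for the flat type attribute."""
    TWO_BED = "2 bed"
    THREE_BED = "3 bed"
    DUPLEX = "duplex"

def is_valid_flat_type(value):
    """Return True when the value belongs to the enumerated domain."""
    return value in {member.value for member in FlatType}
```

A load process could call `is_valid_flat_type` to reject rows whose value falls outside the documented domain.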

Completeness Checklist (contd.)


Abbreviations
document the abbreviation and full definition

Acronyms
avoid (not understood by all, not unique)
if used, document them

Key use
applies only to primary keys
will serve as primary or foreign key in child entity

Source
whether attribute is primitive or derived

Completeness Checklist (contd.)


If derived, establish the formula
document formula
formula should identify any other attributes required to
generate value for derived attribute

Traceability
why is the attribute there
refer to source (paragraph, citation of statement, physical
data structure element ...)
mapped to metadata object that is maintained as part of
system lifecycle (eg. Critical success factor, objective,
physical system element like file, table

Derived Attributes
Created by accumulating values of multiple instances of attributes, e.g. aggregation/summarization

Library Branch
Branch id
Total Titles

Branch Holding
Branch id
book id
number of copies

Total Titles = count of (Branch Holding) where (Branch Holding) Branch id = (Library Branch) Branch id
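
The Total Titles derivation can be sketched as a count of holding rows per branch. The sample rows and identifiers below are invented for illustration.

```python
from collections import Counter

# Hypothetical Branch Holding rows: (branch_id, book_id, number_of_copies)
branch_holdings = [
    ("B1", "BK1", 3),
    ("B1", "BK2", 1),
    ("B2", "BK1", 2),
]

def total_titles(holdings):
    """Total Titles per branch = count of Branch Holding rows with that branch id."""
    return Counter(branch_id for branch_id, _book_id, _copies in holdings)

titles = total_titles(branch_holdings)
```

Note that the formula counts titles (rows), not copies, so branch B1 has two titles even though it holds four copies.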

Calculated Attributes
Describe a feature of a single instance of an entity
Calculated from another single instance of a related attribute

Attribute Metadata
TASK
Task id
Task Start date
Task End Date
Task Duration

Calculation formula for task duration:
Task duration = task end date - task start date
Derivation Dependencies:
1. Task start date and Task duration
2. Task end date and Task duration
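
The task duration formula and its derivation dependencies can be written out with Python's `datetime` module. The function names and the sample dates are assumptions made for this sketch.

```python
from datetime import date, timedelta

def task_duration(start, end):
    """Task duration = task end date - task start date."""
    return end - start

def task_end(start, duration):
    """The end date can be recovered from the start date and the duration."""
    return start + duration

def task_start(end, duration):
    """The start date can be recovered from the end date and the duration."""
    return end - duration

start_date = date(2023, 3, 1)
end_date = date(2023, 3, 11)
duration = task_duration(start_date, end_date)
```

Because any two of the three values determine the third, the formula and its inputs need to be documented together.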

Calculated Attributes - contd..


Should Data model contain derived attributes??
YES !!
represent information that management actually wants
users have an opportunity to specify business rules
provide an opportunity to validate that all necessary base
data is captured
design is made easier as requirements are already
mapped
In DSS environment - ESSENTIAL
NEVER use derived attributes as PRIMARY keys

Derived attributes - An example

ORDER
Order # (PK)
Order date

PRODUCT
Product # (PK)
Product Name
Product Price

PRODUCT ORDER
Product # (PK)
Order # (PK)
Total units sold
Total sales price

TIME PERIOD
Period Start Date (PK)
Period End Date (PK)
Period Reference Name

PRODUCT PERIOD
Product # (PK)
Period Start Date (PK)
Period End Date (PK)
Total product period sales

Attribute Names

Unique name representing its business meaning


clear, concise, self-explanatory
minimize use of special characters
length > 50 gives flexibility
limitations of 32, 33 exist in some CASE tools
standard documented abbreviations made

SHOULD NOT
replace or contradict definition of attribute
contain abbreviations not approved by authority

Attribute Names
SHOULD NOT CONTAIN

possessive forms (e.g. individual's birth date)


articles (a, an, the)
conjunctions (and, but)
verbs (person owns property)
prepositions (at, by, under, for, of ..)
plural words (product names..)
names of organizations, forms, screens, reports
eg. Block 61 title (refers to a specific field on a form)

Attribute Description
Builds on and is consistent with attribute name
unambiguous, clear, economically worded
stand alone (not dependent on another attribute
definition to convey meaning. BEWARE of circular
attribute definitions)
Never MISS giving a description
AVOID:
restating the name of attribute and/or characteristics (eg.
Length, data type, domain values)
using technical jargon
limiting description to direct extract from dictionary

Some attribute descriptions

Need improvement
Location name - the name of a location
Order line total quantity - a six-digit integer total
Directional indicator - E, W, S, N, NE

Pretty Good
Safety level quantity - the calculated minimum quantity of a product SKU that must be on hand to reduce risk of out-of-stock conditions
Operating quantity - the calculated, demand-driven quantity of a material item that must be maintained and replenished for use in day-to-day operations

Primary Key Attributes


Stable (not to change in value, cannot be null)
Minimal (in number of attributes.. Large composite
keys not advisable)
Factless(should not contain intelligent groupings of
data)
Definitive(value always exists for every occurrence)

Primary Key Attributes

Candidate Keys (Possible primary keys)


One among them is chosen as Primary key
The others are alternate keys
eg. Candidate keys for a U.S. Citizen are:

driving license #
passport #
SS #
None of them are definitive
Fingerprint ID Is DEFINITIVE

Primary Key Attributes - Surrogate Keys
Use an artificial key / surrogate key / pseudokey / system-generated key to ensure uniqueness when:
no attribute possesses all PK characteristics
candidate keys are large and complex

ALWAYS USE IN DW Data Model
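
One simple way to hand out surrogate keys during a dimension load is a lookup from natural key to a system-generated integer. The class below is a hypothetical in-memory sketch; a real warehouse would typically use a database sequence or identity column instead.

```python
class SurrogateKeyGenerator:
    """Assigns a stable, meaningless integer key to each distinct natural key."""

    def __init__(self):
        self._keys = {}

    def key_for(self, natural_key):
        # Reuse the existing surrogate key, or generate the next integer.
        if natural_key not in self._keys:
            self._keys[natural_key] = len(self._keys) + 1
        return self._keys[natural_key]

customer_keys = SurrogateKeyGenerator()
# Hypothetical composite natural key: (customer id, snapshot date)
first = customer_keys.key_for(("CUST-001", "2023-01-01"))
second = customer_keys.key_for(("CUST-001", "2023-02-01"))
again = customer_keys.key_for(("CUST-001", "2023-01-01"))
```

The surrogate key is small and carries no business meaning, so it stays stable even if the natural key is large, composite or later changes format.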

Relationships - Checklist
Name & Description - Optional
Type (identifying/non-identifying)
Cardinality (Degree/Nature)
one-to-one 1:1
many-to-one m:1
one-to-many 1:m
many-to-many m:m (resolved using associative entities)

Deletion Integrity Rules (cascade/disassociate/disallow)
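
The three deletion integrity rules map roughly onto the SQL referential actions CASCADE, SET NULL and RESTRICT. The sketch below demonstrates them with Python's built-in `sqlite3` module; the table and column names are invented, and the mapping of "disassociate" to SET NULL is an interpretation rather than something the slides state.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY);
CREATE TABLE sales_reps (rep_id INTEGER PRIMARY KEY);
CREATE TABLE products (product_code TEXT PRIMARY KEY);
CREATE TABLE addresses (
    address_id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(customer_id) ON DELETE CASCADE,
    rep_id INTEGER REFERENCES sales_reps(rep_id) ON DELETE SET NULL
);
CREATE TABLE invoice_lines (
    line_id INTEGER PRIMARY KEY,
    product_code TEXT REFERENCES products(product_code) ON DELETE RESTRICT
);
INSERT INTO customers VALUES (1), (2);
INSERT INTO sales_reps VALUES (10);
INSERT INTO products VALUES ('P1');
INSERT INTO addresses VALUES (100, 1, 10), (101, 2, 10);
INSERT INTO invoice_lines VALUES (1000, 'P1');
""")

# Cascade: deleting a customer also deletes that customer's addresses.
conn.execute("DELETE FROM customers WHERE customer_id = 1")
remaining = conn.execute("SELECT address_id FROM addresses").fetchall()

# Disassociate: deleting a sales rep leaves the address but clears the link.
conn.execute("DELETE FROM sales_reps WHERE rep_id = 10")
rep_after = conn.execute(
    "SELECT rep_id FROM addresses WHERE address_id = 101"
).fetchone()[0]

# Disallow: a product that is still referenced cannot be deleted.
try:
    conn.execute("DELETE FROM products WHERE product_code = 'P1'")
    delete_blocked = False
except sqlite3.IntegrityError:
    delete_blocked = True
```

Which rule applies to each relationship is a business decision, which is why the checklist asks for it to be recorded in the model.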

Limitations of E-R Modeling


Poor Performance
Tend to be very complex and difficult to
navigate.

Dimensional Modeling
Dimensional modeling uses three basic
concepts : measures, facts, dimensions.
Is powerful in representing the requirements
of the business user in the context of
database tables.
Focuses on numeric data, such as values, counts, weights, balances and occurrences.

Dimensional modeling
Must identify

Business process to be supported


Grain (level of detail)
Dimensions
Facts

Conventions used in Dimensional modeling
Facts
Measures(Variables)
Dimensions
Dimension members
Dimension hierarchies

Facts
A fact is a collection of related data items,
consisting of measures and context data.
Each fact typically represents a business
item, a business transaction, or an event that
can be used in analyzing the business or
business process.
Facts are measured, continuously valued,
rapidly changing information. Can be
calculated and/or derived.

Fact Table
A table that is used to store business
information (measures) that can be used in
mathematical equations.
Quantities
Percentages
Prices

Dimensions
A dimension is a collection of members or
units of the same type of views.
Dimensions determine the contextual
background for the facts.
Dimensions represent the way business
people talk about the data resulting from a
business process, e.g., who, what, when,
where, why, how

Dimension Table
Table used to store qualitative data about
fact records

Who
What
When
Where
Why
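
The split between a fact table (numeric measures plus keys) and dimension tables (descriptive context) can be sketched with plain Python structures. All keys, names and values below are invented for illustration.

```python
from collections import defaultdict

# Hypothetical dimension tables: surrogate key -> descriptive attributes
product_dim = {
    1: {"description": "Widget", "category": "Hardware"},
    2: {"description": "Manual", "category": "Books"},
}
customer_dim = {
    10: {"name": "Acme", "city": "Hyderabad"},
    11: {"name": "Globex", "city": "Chennai"},
}

# Hypothetical fact table: dimension keys plus numeric measures
sales_fact = [
    {"product_key": 1, "customer_key": 10, "quantity": 2, "gross_sales": 40},
    {"product_key": 2, "customer_key": 10, "quantity": 1, "gross_sales": 15},
    {"product_key": 1, "customer_key": 11, "quantity": 3, "gross_sales": 60},
]

def gross_sales_by_category(facts, products):
    """Join each fact row to its product dimension row and sum by category."""
    totals = defaultdict(int)
    for row in facts:
        category = products[row["product_key"]]["category"]
        totals[category] += row["gross_sales"]
    return dict(totals)

by_category = gross_sales_by_category(sales_fact, product_dim)
```

The fact rows stay narrow and numeric, while every question about "what" or "who" is answered by looking up a dimension row.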

Dimension data should be

verbose, descriptive
complete
no misspellings, impossible values
indexed
equally available
documented ( metadata to explain origin,
interpretation of each attribute)

Dimensional model
visualise a dimensional model as a CUBE
(hypercube because dimensions can be more than
3 in number)
Operations for OLAP
Drill Down: move to a higher level of detail
Roll Up: move to a more summarized level of data
(The navigation path is determined by hierarchies within dimensions.)
Slice: cuts through the cube, so users can focus on specific perspectives
Dice: rotates the cube to another perspective (changes the dimension)
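
Two of these operations, roll up and slice, can be sketched on a tiny cube held in a Python dictionary. The cell values, dimension members and the month-to-quarter hierarchy are all hypothetical, and the dice operation is left out of the sketch.

```python
from collections import defaultdict

# Hypothetical cube cells: (month, region, product) -> units sold
cube = {
    ("2023-01", "North", "Widget"): 10,
    ("2023-01", "South", "Widget"): 5,
    ("2023-02", "North", "Widget"): 7,
    ("2023-02", "North", "Gadget"): 3,
}

# Hypothetical time hierarchy: month -> quarter
MONTH_TO_QUARTER = {"2023-01": "2023-Q1", "2023-02": "2023-Q1"}

def roll_up_time(cells, parent_of):
    """Roll up: replace each month by its parent level and sum the cells."""
    rolled = defaultdict(int)
    for (month, region, product), units in cells.items():
        rolled[(parent_of[month], region, product)] += units
    return dict(rolled)

def slice_region(cells, region):
    """Slice: fix one member of the region dimension and keep the rest."""
    return {(m, p): u for (m, r, p), u in cells.items() if r == region}

by_quarter = roll_up_time(cube, MONTH_TO_QUARTER)
north = slice_region(cube, "North")
```

Drill down is the reverse navigation: going from `by_quarter` back to the monthly cells, which is only possible because the detailed cube was kept.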

Drill down . Roll up

Slice and Dice

Dimensions
Collection of members or units of the same type of
views.
determine the contextual background for the facts.
the parameters over which we want to perform
OLAP (Eg. Time, Location/region, Customers)
A member is a distinct name that determines a data item's position (e.g. for Time: month, quarter)
Hierarchies arrange members into levels

Hierarchies
Allow for the rollup of data to more
summarized levels.
Time

day
month
quarter
year

Hierarchies

Aggregates
Aggregate tables are pre-stored, summarized tables created at a higher level of granularity across any or all of the dimensions.
If the existing granularity is day-wise sales, then creating a separate month-wise sales table is an example of an aggregate table.
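
The day-wise to month-wise example can be sketched as follows. The rows and product names are invented; in a real warehouse the aggregate would be a stored table rebuilt or refreshed at load time.

```python
from collections import defaultdict
from datetime import date

# Hypothetical day-wise sales rows: (day, product, amount)
daily_sales = [
    (date(2023, 1, 5), "Widget", 100),
    (date(2023, 1, 20), "Widget", 50),
    (date(2023, 2, 1), "Widget", 70),
    (date(2023, 1, 5), "Gadget", 30),
]

def build_monthly_aggregate(rows):
    """Summarize day-wise rows into one row per (month, product)."""
    monthly = defaultdict(int)
    for day, product, amount in rows:
        monthly[(day.strftime("%Y-%m"), product)] += amount
    return dict(monthly)

monthly_sales = build_monthly_aggregate(daily_sales)
```

A monthly report can then read three pre-summed rows instead of scanning every daily row.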

Aggregates
The use of such aggregates is the single
most effective tool the data warehouse
designer has to improve query performance.
Usage of Aggregates can increase the
performance of Queries by several times.

Measures
A measure is a numeric attribute of a fact, representing the performance or behaviour of the business relative to dimensions.
The actual numbers are called variables.
E.g. sales in money, sales volume, quantity supplied, supply cost, transaction amount

A measure is determined by combinations of the members of the dimensions and is located on facts.

THE CUBE

Types of Facts
Additive
Able to add the facts along all the dimensions
Discrete numerical measures eg. Retail sales in $

Semi Additive
Snapshot, taken at a point in time
Measures of Intensity
Not additive along time dimension eg. Account
balance, Inventory balance
Added and divided by the number of time periods to get a time-average

Types of Facts
Non Additive
Numeric measures that cannot be added across any
dimensions
Intensity measure averaged across all dimensions eg.
Room temperature
Textual facts - AVOID THEM
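
The difference between adding a semi-additive fact across accounts and across time can be shown with a small sketch. The account ids, snapshot dates and balances are made up.

```python
from collections import defaultdict

# Hypothetical month-end balance snapshots: (snapshot_date, account) -> balance
balances = {
    ("2023-01-31", "ACC1"): 100,
    ("2023-01-31", "ACC2"): 50,
    ("2023-02-28", "ACC1"): 200,
    ("2023-02-28", "ACC2"): 50,
}

def total_by_date(snapshots):
    """Adding across the account dimension is valid for a balance."""
    totals = defaultdict(int)
    for (snapshot_date, _account), balance in snapshots.items():
        totals[snapshot_date] += balance
    return dict(totals)

def time_average(totals):
    """Across time, add the snapshots and divide by the number of periods."""
    return sum(totals.values()) / len(totals)

per_date = total_by_date(balances)
average_balance = time_average(per_date)
```

Summing the two month-end totals would give 400, a figure with no business meaning; the time-average of 200 is the sensible summary.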

Advantages of Dimensional
Modeling
Allows complex multi-dimensional data
structure to be defined with a very simple data
model.
Reduces number of physical joins the query
has to process
Simplifies the view of data model.
Allows DWH to expand and evolve with
relatively low maintenance.

Sample business process versus dimension table
[Matrix: business processes (Product Sales, Product Manufacturing, Employee Compensation) against dimension tables (Products, Customers, Location, Sales Rep, Date)]

Sample measure versus dimension table
[Matrix: measures (Product Sales ($), Product Manufacturing (units), Sales Commission ($), Payroll (gross) ($)) against dimension tables (Products, Customers, Location, Sales Rep, Date)]

TIME PERIOD

PRODUCT
Product description
Category code
Category description

SALES REP
Last name
First name

Invoice date
Fiscal year
Quarter
Month
Week

CUSTOMER REP SALES


Customer snapshot date
Invoice date
Gross sales
Quantity
Product cost

CUSTOMERS
Customer name

ADDRESS
Address line 1
Address line 2
City name
State abbreviation
Postal code
Country name

CUSTOMER DEMOGRAPHICS
Snapshot date
Credit rating
Marital status
Age

Sample Logical Model for Dimensional Data Mart

PRODUCT_SNAPSHOTS
#PRODUCT_CODE
#SNAPSHOT_DATE
. MSRP
. UOM
. PRIMARY_SUPPLIER_NAME
. SUPPLIER_CITY_NAME
. SUPPLIER_STATE_ABBRV
. SUPPLIER_COUNTRY_NAME

PRODUCTS
#PRODUCT_CODE
. PRODUCT_DESCRIPTION
. CATEGORY_CODE
. CATEGORY_DESCRIPTION

SALES_REPS
# SALES_REP_ID
. LAST_NAME
. FIRST_NAME
o MANAGER_FIRST_NAME
o MANAGER_LAST_NAME

CUSTOMER_INVOICES
#INVOICE_ID
#LINE_ITEM_SEQ
. INVOICE_DATE
. CUSTOMER_DATE
. BILL_TO_ADDRESS_ID
. SALES_REP_ID
. MANAGER_REP_ID
. ORGANIZATION_ID
. ORG_ADDRESS_ID
. PRODUCT_CODE
. QUANTITY
. UNIT_PRICE
. AMOUNT
o PRODUCT_COST
. LOAD_DATE

CUSTOMER_ADDRESSES
#CUSTOMER_ID
#ADDRESS_ID
. ADDRESS_LINE1
oADDRESS_LINE2
oPOSTAL_CODE
. SALES_REP_ID
. GEO_CODE
. LOAD_DATE

PURCHASE_INVOICES
# INVOICE_ID
#LINE_ITEM_SEQ
. INVOICE_DATE
. SUPPLIER_ID
. ADDRESS_ID
. BUDGET_ID
. REVISION_SEQ
. BUDGET_LINE_ITEM_SEQ
. PRODUCT_CODE
. QUANTITY
. UNIT_PRICE
. AMOUNT
. LOAD_DATE

CUSTOMERS
#CUSTOMER_ID
#SNAPSHOT_DATE
. CUSTOMER_NAME
o AGE
o MARITAL_STATUS
. CREDIT_RATING

BUDGET_DETAILS
#BUDGET_ID
#REVISION_SEQ
#LINE_ITEM_SEQ
. BLI_TYPE_CODE
. BLI_TYPE_DESCRIPTION
. ORGANIZATION_ID
. ADDRESS_ID
. BUDGET_PERIOD
. LOAD_DATE
. BUDGET_AMOUNT
. EXPENDITURES
o PRODUCT_CODE

SUPPLIER_ADDRESSES
#SUPPLIER_ID
#ADDRESS_ID
. SUPPLIER_NAME
o POSTAL_CODE
. GEO_CODE
. LOAD_DATE

GEOGRAPHIC_BOUNDARIES
#GEO_CODE
. CITY_NAME
. STATE_NAME
. COUNTRY_NAME
o CITY_ABBRV
o STATE_ABBRV
o COUNTRY_ABBRV

Sample Physical Model for Data Warehouse

INTERNAL_ORG_ADDRESSES
#ORGANIZATION_ID
#ADDRESS_ID
. ORG_TYPE
. ORGANIZATION_NAME
. ADDRESS_LINE1
oADDRESS_LINE2
oPOSTAL_CODE
. GEO_CODE
oPARENT_ORG_ID
. LOAD_DATE

Common structures for datamarts:


Denormalize!
Star
Single fact table surrounded by denormalised
dimension tables
The fact table primary key is the composite of the
foreign keys (primary keys of dimension tables)
Fact table contains transaction type information.
Many star schemas in a data mart
Easily understood by end users, more disk storage
required
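
A small star schema can be sketched with Python's built-in sqlite3 module. The table names, columns and sample rows are assumptions made up for illustration; a real data mart would have more dimensions and a time key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, description TEXT, category TEXT);
CREATE TABLE store_dim   (store_key   INTEGER PRIMARY KEY, store_name TEXT, region TEXT);
CREATE TABLE sales_fact (
    product_key INTEGER REFERENCES product_dim(product_key),
    store_key   INTEGER REFERENCES store_dim(store_key),
    sales       REAL,
    PRIMARY KEY (product_key, store_key)  -- composite of the foreign keys
);
INSERT INTO product_dim VALUES (1, 'Cola 330ml', 'Drinks'), (2, 'Crisps', 'Snacks');
INSERT INTO store_dim   VALUES (1, 'Store North', 'North'), (2, 'Store South', 'South');
INSERT INTO sales_fact  VALUES (1, 1, 10.0), (1, 2, 5.0), (2, 1, 7.5);
""")

# The denormalised dimension is reached with a single join from the fact table.
sales_by_category = conn.execute("""
    SELECT p.category, SUM(f.sales)
    FROM sales_fact f JOIN product_dim p ON p.product_key = f.product_key
    GROUP BY p.category ORDER BY p.category
""").fetchall()
```

Because category sits directly in the product dimension, the query needs no second join to a separate category table, which is the trade-off against the snowflake layout.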

Example of Star schema

Common structures for datamarts:


Denormalize!
Snowflake
Single fact table surrounded by normalised dimension
tables
Normalizes dimension table to save data storage space.
Used when dimensions become very large
Less intuitive, slower performance due to joins

May want to use both approaches, especially if
supporting multiple end-user tools.

Example of Snowflake schema

Snowflake - Disadvantages
Normalization of dimension makes it
difficult for user to understand
Decreases the query performance because it
involves more joins
Dimension tables are normally smaller than
fact tables - space may not be a major issue
to warrant snowflaking

Keys ..
Primary Keys
uniquely identify a record

Foreign Keys
primary key of another table, referenced from this table

Surrogate Keys
system-generated key for dimensions
key on its own has no meaning
integer key, less space

More Keys ..
Smart Keys
primary key out of various attributes of
dimension
AVOID THEM!
Join to Fact table should be on single surrogate
key

Production Keys
DO NOT USE Production defined attributes
Business may reuse/change them - DW cannot!
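
One way to picture surrogate key assignment is a lookup from production key to a system-generated integer. The function name and the production codes below are hypothetical.

```python
from itertools import count

_next_key = count(1)          # system-generated integer sequence
key_by_production_code = {}   # production key -> surrogate key

def surrogate_key(production_code):
    """Return the surrogate key for a production code, allocating one if new."""
    if production_code not in key_by_production_code:
        key_by_production_code[production_code] = next(_next_key)
    return key_by_production_code[production_code]

k1 = surrogate_key("PRD-0042")
k2 = surrogate_key("PRD-0099")
k1_again = surrogate_key("PRD-0042")   # same code, same surrogate key
```

The integer carries no meaning of its own, so the business can change or reuse its production codes by updating only the lookup, never the fact table.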

Basic Dimensional Modeling Techniques

Slowly changing Dimensions
Rapidly changing Small Dimensions
Large Dimensions
Rapidly changing Large Dimensions
Degenerate Dimensions
Junk Dimensions

Slowly Changing Dimensions


A dimension is considered a Slowly
Changing Dimension when its attributes
remain almost constant over time, requiring
relatively minor alterations to represent the
evolved state.

Slowly Changing Dimension Options

Eg. Key does not change but description changes (product
description)

TYPE 1
Overwrite dimension record with new
values
used when old value of attribute has no
significance

Slowly Changing Dimension Options

TYPE 2
Create a new record using a new value of
surrogate key
used when history can be clearly partitioned
(query only on new value or only old value;
query on some other attributes - return all
records)

Slowly Changing Dimension Options (contd..)

TYPE 3
Create an old field in dimension to store
immediate previous value
used when change is a soft change
no perfect partition in history
may want to track for some time with both old
and new values
do not use when there are too many such soft
changes successively
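
The three options can be sketched in Python on a dimension held as a list of dictionaries. The field names, the `current` flag and the sample product are assumptions for illustration, not part of the original model.

```python
from itertools import count

_keys = count(1)  # surrogate key generator

def new_row(product_id, description):
    return {"key": next(_keys), "product_id": product_id,
            "description": description, "old_description": None, "current": True}

def apply_type1(dim, product_id, description):
    """Type 1: overwrite in place, the old value is lost."""
    for row in dim:
        if row["product_id"] == product_id:
            row["description"] = description

def apply_type2(dim, product_id, description):
    """Type 2: retire the current row and add a row with a new surrogate key."""
    for row in dim:
        if row["product_id"] == product_id and row["current"]:
            row["current"] = False
    dim.append(new_row(product_id, description))

def apply_type3(dim, product_id, description):
    """Type 3: keep only the immediately previous value in an 'old' field."""
    for row in dim:
        if row["product_id"] == product_id and row["current"]:
            row["old_description"] = row["description"]
            row["description"] = description

dim = [new_row("P1", "Widget")]
apply_type2(dim, "P1", "Widget v2")   # history partitioned: two rows, two keys
apply_type3(dim, "P1", "Widget v3")   # soft change: old and new side by side
```

Fact rows loaded before the Type 2 change keep pointing at the old surrogate key, which is what partitions history cleanly.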

Slowly Changing Dimension: An Example


Slowly Changing Dimension

Rapidly Changing Small Dimensions
Eg. Rapid changes to product dimension

Type 2 (use surrogate key and create a new record)
use effective dates
use only while the dimension table remains
small

Large Dimensions
Dimensions containing several million records!!!

HOW TO SUPPORT???
Database must support indexing technology
that supports rapid browsing
Find and suppress duplicate entries in the
dimension (eg. Name and address
matching)
Never use Type 2 to solve changing
dimensions (i.e. adding records)

Rapidly Changing Monster Dimensions
Dimensions containing > 100 million records!!!

HOW TO SUPPORT???
Break the Monster dimension into separate
dimension tables
Constant information into original table
New dimension table can have discrete
values for each attribute
Choose pre-defined set of values per
attribute

Rapidly Changing Monster Dimensions (contd..)
Build the data in this dimension with all
possible combinations of values for each
attribute
Identify each combination uniquely
Every time an event occurs and is recorded
in fact table, attach it with the unique
combination ID.
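
Building the new dimension from all combinations of banded values might look like the sketch below. The bands, attribute choice and customer key are hypothetical.

```python
from itertools import product

# Hypothetical pre-defined value sets (bands) per changing attribute.
bands = {
    "income": ["<20K", "20K-50K", ">50K"],
    "education": ["school", "graduate"],
    "marital_status": ["single", "married"],
}

# Build every combination once and give each a unique demog_key.
names = list(bands)
demog_dim = {
    key: dict(zip(names, combo))
    for key, combo in enumerate(product(*bands.values()), start=1)
}
key_of = {tuple(row.values()): key for key, row in demog_dim.items()}

# When an event is loaded, attach the matching combination ID to the fact row.
fact_row = {"customer_key": 101,
            "demog_key": key_of[(">50K", "graduate", "married")],
            "sales": 250.0}
```

The dimension size is the product of the band counts (3 x 2 x 2 here), so it stays fixed no matter how often a customer moves between bands.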

Customer Dimension (before)
Customer_Key (PK)
Name
Original_Address
date_of_birth
first_order_date
..
Income
Education
Number_children
marital_status
credit_score
purchase_score

Fact Table (before)
Any fact table containing customer_key as a
foreign key..

Becomes..

Customer Dimension
Customer_Key (PK)
Name
Original_Address
date_of_birth
first_order_date
..

Demographics Dimension
Demog_Key (PK)
Income
Education
Number_children
marital_status
credit_score
purchase_score

Fact Table
Any fact table containing customer_key and
demog_key as foreign keys ..

Customer Dimension
Customer_Key (PK)
Relatively constant attributes ..

Demographics dimension
demog_key
demographic attributes ..

Purchase-Credit Demographics dimension
purch_cred_demog_key

Fact Table
Any fact table containing customer_key, demog_key and
purch_cred_demog_key as foreign keys ..

Rapidly Changing Monster Dimensions (contd..)
Advantages
No increase in data storage every time an event occurs

Drawbacks
Forced to use ranges of discrete values for
dimensional attributes
New dimension cannot be too big (not >1M)
Data in new dimension can be accessed along with
static data only through the fact table - slower
Static and changing portions of the dimension are linked
only when an event occurs - keep a dummy event in fact

Degenerate Dimensions
Occur in line item oriented fact tables
occur when dimension table is left only
with a single key and no other fields
all other attributes have been moved into
other dimension tables
Moved to fact table - not joined to anything

Junk Dimensions
Number of miscellaneous flags and text
attributes left over after design
WHAT TO DO WITH THEM????
DO NOT
Leave them behind in the fact table
Make each flag and attribute into its own dimension
Strip off all such flags and attributes

Junk Dimensions (contd)


DO
Grouping of random flags and attributes
take away from fact and group them into junk
dimension

eg. Open ended comments fields

Conformed Dimensions
Dimension that means the same thing with every
possible fact table to which it is joined.
Dimension is identically the same dimension in each
data mart
Major responsibility of the central DW design team is to
establish, publish, maintain and enforce them
DW cannot function as an integrated whole without
strict adherence to conformed dimensions

Conformed Dimensions (Contd.)


When you don't need Conformed Dimensions
Several lines of business where the customers and
products are disjoint.
The separate business lines are not managed
together

THE TIME DIMENSION


Time_key
day_of_week
day_number_in_month
day_number_overall
week_number_in_year
month
quarter
fiscal_period
holiday_flag
weekday_flag
last_day_in_month_flag
season
event

Time Dimension
An exclusive Time dimension is required
because the SQL date semantics and
functions cannot generate several important
attributes required for analytical purposes.
Attributes like weekdays, weekends, fiscal
period, holidays, season cannot be
generated by SQL statements.

Time Dimension
Moreover SQL date stamps occupy more
space largely increasing the size of the fact
table.
Joins on such SQL generated date-stamps
are costly decreasing the query speed
significantly.

Time Dimension
The Day of week(Monday, ...) is useful to
create reports comparing for ex. Monday
sales to Friday sales.
The Day number in month is useful for
comparing measures for the same day in
each month.
The last day in month flag is useful for
performing payday analysis.

Time Dimension
The holiday flag and season attributes are
useful for holiday VS non-holiday analysis
and season business analysis.
Event attribute is needed to record special
days like strike days, etc..
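
A time dimension is typically populated once by a load program. The sketch below generates one row per day with Python's standard library; the function name and the holiday set are assumptions, and fiscal_period, season and event are left out because they depend on the business calendar.

```python
import calendar
from datetime import date, timedelta

DAY_NAMES = ("Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday")

def time_dimension(start, end, holidays=frozenset()):
    """Yield one row per calendar day from start to end inclusive."""
    day, overall = start, 1
    while day <= end:
        last_day = calendar.monthrange(day.year, day.month)[1]
        yield {
            "time_key": overall,
            "date": day,
            "day_of_week": DAY_NAMES[day.weekday()],
            "day_number_in_month": day.day,
            "day_number_overall": overall,
            "week_number_in_year": day.isocalendar()[1],
            "month": day.month,
            "quarter": (day.month - 1) // 3 + 1,
            "holiday_flag": day in holidays,
            "weekday_flag": day.weekday() < 5,
            "last_day_in_month_flag": day.day == last_day,
        }
        day += timedelta(days=1)
        overall += 1

rows = list(time_dimension(date(2024, 1, 1), date(2024, 12, 31),
                           holidays={date(2024, 1, 1)}))
```

Once the rows exist, a report such as Monday versus Friday sales becomes a plain filter on day_of_week rather than a date calculation inside every query.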

Case Study
on
Data Modeling

Store
Store Key
Store Id
Store Name
Locality
Region
.
.

Sales Fact
Time Key
Product Key
Store Key
Promotion Key
Sales (Rs.)

Product
Product Key
Product Id
Product category
..
Brand Name
SKU
..

Promotion
Fact
Time Key
Product key
Store key
Promotion key

Time
Time Key
Time Id
Date
Month
Year
.
.
Promotion
Promotion key
Promotion Id
Promotion Category
..
Promotion Name
..

A Retail chain sample dimensional model

Retail Chains Sample Dimensional model


The first sales fact table measures the sales
figures at a granularity of SKU, Day and
Individual Store and Promotion name.
Only the SKUs that actually sell on the
day make it into the sales fact table
irrespective of whether they are on
promotion or not.

Retail Chains Sample Dimensional model


The second promotion fact table is a
factless fact table. It has a granularity of
SKU, Day, Store and Promotion Name.
This promotion fact table records which
items are on promotion in which stores and
at what times.

Retail Chains Sample Dimensional model


Time, Product and Store are common
dimensions in both the fact tables.
Product and Promotion are Type 2 Slowly
changing dimensions.

Retail Chains Sample Dimensional model


The sales fact enables the sales monitoring and
analysis across Product, Stores, Time and
Promotion dimensions.
The second promotion fact table is needed to
answer the critical question: which
products were on promotion but did not sell
on a particular day?

Retail Chains Sample Dimensional model


The second fact table can be avoided if we keep
zero sales figures in the sales fact table, but that
would make our sales fact table very
large, because less than 5% of products which
were on promotion on a particular day actually
sell.
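
The "on promotion but did not sell" question is a set difference between the two fact tables. A minimal Python sketch, with invented key values:

```python
# Hypothetical keys: (time_key, product_key, store_key, promotion_key)
promotion_fact = {
    (1, 10, 1, 7),
    (1, 11, 1, 7),
    (1, 12, 1, 7),
}
# Sales fact rows keyed the same way, with the sales measure as the value.
sales_fact = {
    (1, 10, 1, 7): 450.0,
    (1, 13, 1, 0): 120.0,   # sold, but not on promotion
}

# Products on promotion that did not sell: promotion coverage minus actual sales.
promoted_not_sold = sorted(promotion_fact - set(sales_fact))
```

Only rows present in the promotion fact and absent from the sales fact survive, so no zero-sales rows have to be stored.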

Retail Chains Sample Dimensional model


Bitmap Indexes on the foreign key columns in
the fact tables.
Bitmap Indexes on low cardinality columns in
dimensional tables like Month, Product
Category, Store category, etc
B-Tree Indexes on Dimension key columns.

Retail Chains Sample Dimensional model


The sales fact is partitioned across the Month
column.
Aggregates can be created in future based on
an understanding of which frequently run,
time-consuming queries look for summarized
information.

Aggregates
Consider a schema with Product and Time
Dimensions with a granularity of individual
product Brand and day wise sales.
The Product Hierarchy:
Category-Product-Brand
The Time Hierarchy:
Year-Month-Day

Aggregates
Product Dimension
Categories : 3
Products : 30
Brands : 150
i.e. 150 rows in the Product Dimension
Time Dimension
Year : 5
Month : 60
Days : 365*5=1825
i.e. 1825 rows in the Time Dimension

Aggregates
Assuming a transaction for each of the
Brands every day, we have 1825*150 = 273,750 rows
in our sales Fact table.
A Query like "Show Category wise sales
figures for the past five years" would have to
access all 273,750 rows to get the answer.

Aggregates
Aggregated Tables
Product
Category: 3
Time
Year : 5
Month: 60
There would be 60*3=180 rows in this
aggregated fact table.
The query on this table needs to access only
180 rows to get the same set of results.
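
The row counts above, and the roll-up itself, can be sketched in Python. The function name and the tiny sample data are hypothetical.

```python
from collections import defaultdict

brands, days = 150, 365 * 5
categories, months = 3, 60
base_rows = brands * days             # rows scanned in the detailed fact table
aggregate_rows = categories * months  # rows scanned in the aggregate table

def build_aggregate(fact_rows, category_of, month_of):
    """Roll (brand, day, sales) rows up to (category, month) totals."""
    totals = defaultdict(float)
    for brand, day, sales in fact_rows:
        totals[(category_of[brand], month_of[day])] += sales
    return dict(totals)

# Tiny hypothetical sample: two brands in one category, two days in one month.
agg = build_aggregate(
    [("B1", "d1", 10.0), ("B2", "d1", 5.0), ("B1", "d2", 2.5)],
    category_of={"B1": "C1", "B2": "C1"},
    month_of={"d1": "M1", "d2": "M1"},
)
```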

Aggregates
MONTH
Time_Key
Month
Fiscal_Period
Season

CATEGORY
Category_Key
Department
Category

AGG. SALES FACT
Time_Key
Category_Key
Sales
Cost

Aggregates
Aggregates increase the complexity of the
data model.
Aggregates increase the maintenance load
on the Data warehouse. They must be
updated as the base table data gets updated.
Aggregates occupy storage space. Hence
aggregates should be created only for
frequent and time-consuming queries.

Aggregate Navigation
Aggregate Navigation features enable end users to query the data mart without
having to know about the presence of aggregates.
Without Aggregate navigation, the end user
needs to be aware of the presence of
aggregates in order to query the
aggregated table instead of the detailed table,
which increases the complexity of the user
interface.

Aggregate Navigation
An aggregate navigator intercepts the
client's SQL and, if possible, transforms
base-level SQL into aggregate-aware SQL.
Aggregate Aware function in Business
Objects 4.1 is an example of Aggregate
navigator.

Aggregate Navigation
New features in Oracle 8i like Materialized
views and Query rewrite
enable aggregate navigation to be built
within the data mart DBMS instead of front
end access tools.
This enables all front end access tools to utilize
the aggregate navigation feature.
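
The idea behind an aggregate navigator can be illustrated with a toy table chooser in Python. This is only a sketch of the principle, not how Business Objects or Oracle implement it; the catalogue, table names and row counts are assumptions.

```python
# Hypothetical catalogue: table name -> (levels it can answer, row count).
TABLES = {
    "sales_fact": ({"brand", "product", "category", "day", "month", "year"}, 273750),
    "category_month_agg": ({"category", "month", "year"}, 180),
}

def choose_table(requested_levels):
    """Pick the smallest table whose levels cover everything the query asks for."""
    candidates = [(rows, name) for name, (levels, rows) in TABLES.items()
                  if set(requested_levels) <= levels]
    return min(candidates)[1]

by_category_year = choose_table({"category", "year"})   # aggregate is enough
by_brand_month = choose_table({"brand", "month"})       # needs the detail table
```

The user always asks in terms of the base model; the choice of physical table happens behind the query.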

Factless Fact table


Factless fact tables are fact tables that do
not have any measures.
These kinds of fact tables arise when there
are no obvious measures for the business
area.
Daily attendance tracking is one such
example of a business area having no
concrete measures.

Factless fact tables


Dimensions: TIME, STUDENT, COURSE, TEACHER

Fact table columns:
Time_Key
Student_Key
Course_Key
Teacher_Key
attendance=1
The grain of this fact table is individual attendance event.
Dummy measure-attendance included to make the SQL
more readable.
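
A short Python sketch of why the dummy measure helps: summing attendance gives the same answer as counting rows. The key values are invented.

```python
from collections import Counter

# Hypothetical rows: (time_key, student_key, course_key, teacher_key, attendance)
attendance_fact = [
    (1, 501, 30, 9, 1),
    (1, 502, 30, 9, 1),
    (2, 501, 30, 9, 1),
    (2, 501, 31, 4, 1),
]

# Summing the dummy measure is the same as counting attendance events.
attendance_by_course = Counter()
for _time, _student, course, _teacher, attendance in attendance_fact:
    attendance_by_course[course] += attendance
```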

When to start data modeling???


When requirements address these questions:
Who (people, groups, organizations) is of interest to the
user?
What (functions) is the user trying to analyze?
Why does the user need the data?
When (for what point in time) does the data need to be
recorded?
Where (geographically, organizationally) do relevant
processes occur?
How do we measure the performance or state of the
functions being analyzed?

Approaches to Data Gathering


1. Source Driven
define requirements by using the source data in
production operational systems.
by analyzing an ER model of source data OR
by analyzing the actual physical record layouts and
selecting data elements deemed to be of interest.
Advantages
Know data that you can supply
Minimize user involvement in early stages of project

Disadvantages
Increased risk of producing wrong set of requirements

Approaches to Data Gathering


2. User Driven
define requirements by investigating the functions the
users perform
done through a series of meetings and/or interviews
with users.
Advantages
Focus on what is needed rather than what is available

Disadvantages
Expectations to be closely managed.

Combine both: Identify Subject areas (Source driven) and
define specific requirements in a Subject area (User driven)

Data Modeling for Data Warehouse - Steps
12 Steps to Data modeling for Data Warehouse
1. Study ER
2. Evaluate and Analyse
3. Review Dimension
4. Add Time Dimension
5. Identify Facts
6. Granularity
7. Merge Facts
8. Review Facts
9. Name Facts
10. Size the model
11. Record Metadata
12. Validate model

Case Study
CelDial Case Study

Case study (contd..)


1. Study the ER
Step 1: Remove all entities that act as
associative entities and all subtype
entities.
(e.g. Product Component, Inventory,
Order Line, Order, Retail Store, and
Corporate Sales Office)
Note: Be careful to create all the
many-to-many relationships that
replace these entities

Case study (contd..)


Step 2: Roll up the entities at the end of each of
the many-to-many relationships into single
entities.
For each new entity, consider which attributes
in the original entities would be useful
constraints on the new dimension.
Note : Remember to consider attributes of any
subtype entities removed in the first step.
Logical Model is a logical representation:
remove individual keys and replace with
generic key for each dimension.

Case study (contd..)


Note:
Rolling the salesperson up into the sales
dimension
implies (correctly) that the relationships
among outlet, salesperson and customer
roll up into the sales to customer
relationship.

The many-to-many relationship
between customer and sales prevents
the erroneous rollup of customer into
sales person and ultimately into sales.

Case study (contd..)


2. Evaluate and Analyze business of the
organization
Requirements that are collected must represent
these :
what is being analyzed (Dimensions)
evaluation criteria for what is being analyzed
(Measures)
IDENTIFY the measures and dimensions
Analyze the questions, define measures and
dimensions to meet requirements.

Case study (contd..)


Advantage of the approach:
Used all information available
Corporate Dimensions (from ER)
From requirements gathered

Disadvantage of using only requirements:
More time consuming
Miss some dimensions altogether
Eg. Customer and Component dimensions and the Number of
Cash Registers and Floor Space, attributes of the Sales
dimension

Case study (contd..)


3. Review the dimensions
Do we have all data to answer all the questions?
a. Sales and Manufacturing??? Yes
b. Product
Q2, Q3 can they be answered? NO!
What's MISSING??
Unit cost of model at any point in time is required.
History of unit cost required. Add begin and end
date in product dimension.
Unit cost Derivation rule?? Given (Defining Cost and
Revenue)

Case study (contd..)


4. Add Time Dimensions
Lowest level of Time - DAY
Reporting requirements ???
By day, week and month
Final Dimension List

Case study (contd..)


5. Identify Facts
One set of dimensions and its associated measures
make up what is called a fact.
Organizing the dimensions and measures into facts is
the process of grouping dimensions and measures
together in a manner that can address the specified
requirements. HOW?

First create an initial fact for each of the queries in
the case study.
Note: For any measures that describe exactly the same set of
dimensions, create only one fact

Case study (contd..)


Note:
Q6, Q8, Q9 do not have any measures
If we did not:
merge Q6 with Q5, Q7 in Fact 4
merge Q8 and Q9 with Q2 in Fact 2

we would be left with Factless Facts (facts with no measures),
which record only that
the sale of a product at a point in time (facts 2 and 3), at
a specific location (fact 2 only), has occurred. No other
measurement is required.

Case study (contd..)


6. Determine Granularity
Level of detail at which fact is recorded
Try to keep at most detailed level (summarize if required)

Additivity: ability of a measure to be summarized
fully additive (additive across all dimensions - advised)
non-additive (adding % of 2 facts - not possible)
semi-additive (adding balances of same account at 2
different points in time. Additive only across some
dimensions)

Case study (contd..)


Fact 1 :Average quantity on hand (monthly)
Total cost and total revenue (daily)
Solution: a. Split into 2 facts
b. Make the time dimension consistent
Set time to the lowest level - DAY
Average quantity on hand - non-additive
Solution: store actual quantity on hand and let the
query calculate average.

Case study (contd..)


Fact 2:
Two different levels of granularity
Q2 (daily)
Q8, Q9 (month)
Solution: Since measures are fully additive, set the
grain of time to a day. A query can handle any
summarization to the monthly level.

Case study (contd..)


Fact 3,4:
Two different grains of time. Neither can roll up to the other.
Options:
a. Change grain. But measures are non-additive
b. Split into multiple facts. But both facts have same
measures with only time grain different
Solution: Change time grain to DAY
Change non-additive measure to additive by storing atomic
elements of %.

Case study (contd..)


Solution
Replace % (Fact 3) with quantity of models sold
through: - retail outlet,
- corporate sales office
- salesperson.
Total quantity sold is already present. % can be calculated

Replace % (Fact 4) with:
- number of models eligible for discount,
- quantity of models eligible for discount actually sold
- quantity of models sold at a discount.
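
A small Python sketch of why the percentage is replaced by its atomic quantities. The daily figures are invented for illustration.

```python
# Hypothetical daily rows: (day, quantity_sold_via_retail, total_quantity_sold)
daily = [("d1", 30, 100), ("d2", 10, 20)]

# Misleading: percentages are non-additive, so averaging them misweights the days.
avg_of_daily_pct = sum(100.0 * r / t for _d, r, t in daily) / len(daily)

# Better: sum the additive quantities, then derive the percentage at query time.
retail = sum(r for _d, r, _t in daily)
total = sum(t for _d, _r, t in daily)
pct_via_retail = 100.0 * retail / total
```

The first figure treats a 20-unit day the same as a 100-unit day; the second is the true share over the whole period.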

Case study (contd..)


7. Merge Facts
Consolidate facts where possible WHY?
Easier for a user to find the data needed to satisfy a query if
there are fewer places to look.
Expand the analysis potential because you can relate more
measures to more dimensions at a higher level of
granularity.
Fewer facts - less administration

HOW??
Determine for each measure which additional dimensions can
be added to increase its granularity

Case study (contd..)


Fact 1: No finer breakdown for quantity on hand or reorder
level
Fact 2: Already has all the dimensions in Fact 3 and 4
Fact 3 : Add Sales dimension to break up
Total Cost, Total Revenue, Total quantity sold
The sales dimension contains both outlet type and
salesperson data. Using this structure we can
classify the total quantity sold, negating the need
to store the three individual totals.
Solution: Merge Fact 3 into Fact 2

Case study (contd..)


Fact 4: Add Product dimension.
Number of models eligible for discount can be calculated
directly from the product dimension. Not needed in
consolidated Fact
Product dimension tells whether an individual model is
eligible for discount
Use the total quantity sold (consolidated from fact 3) to
represent the quantity of models eligible for discount
actually sold.

Case study (contd..)


Fact 4: Add Product dimension.
Quantity of models sold at a discount - Retain

OR
record the discount amount and generate the
quantity sold at a discount by adding up the
quantity sold where the discount amount is not
zero.
Solution: Merge Fact 2, 3, 4

Case study (contd..)


8. Review the facts for opportunities to add other
dimensions, increasing the potential for valuable analysis.
Fact 1 : Can it be broken down further ?? NO
Fact 2 : YES! manufacturing and customer dimensions can
be applied ; Component Dimension cannot be applied
dimensions are: sales, product, manufacturing, customer,
time.
All can be identified at the time an order is placed.
ADD order as a dimension (to increase analysis potential)
Order has no attributes. Add order key to fact as a
degenerate dimension

Case study (contd..)


9. Name the facts
Fact 1 - Inventory Fact
Fact 2 - Sales Fact
10. Size the model
calculate the size of the data in a table
number of rows * length of each row
To calculate row length:
4 bytes for each numeric or date attribute
number of characters for character attribute
number of digits in a decimal attribute / 2 and rounded up.
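
The sizing rules can be written as a small Python helper. The attribute kinds, function names and the sample row layout are hypothetical.

```python
import math

def attribute_bytes(kind, length=0):
    """Estimated bytes for one attribute under the sizing rules above."""
    if kind in ("numeric", "date"):
        return 4
    if kind == "char":
        return length                 # number of characters
    if kind == "decimal":
        return math.ceil(length / 2)  # digits / 2, rounded up
    raise ValueError(f"unknown attribute kind: {kind}")

def table_bytes(row_count, attributes):
    """Number of rows * length of each row."""
    return row_count * sum(attribute_bytes(k, n) for k, n in attributes)

# Hypothetical fact layout: five numeric keys and two 9-digit decimal measures.
sales_row = [("numeric", 0)] * 5 + [("decimal", 9)] * 2
row_length = sum(attribute_bytes(k, n) for k, n in sales_row)
estimated_size = table_bytes(1000, sales_row)
```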

Case study (contd..)


To calculate Number of rows:
No history maintained. Use from operational system.
Seller = 48 ( 3 corp+ 15retail + 30 salesmen)
Customer = 3000
Manufacturing plants = 7
How long should we keep data? 3 complete yrs (sized here as 4 years, assuming the current year is included)
Time : 1 row per day = 1461 (4 * 365 + 1 day for leap year)
No. of models of products = 300
No. of models experiencing changes = 10 per week =
10 * 52 * 4 = 2080
No. of product rows = 300 + 2080 = 2,380

Case study (contd..)


Size of Inventory Fact =
7 plants x 300 models x 1,461 days = 3,068,100 rows

Size of Sales Fact
Corporate Sales
500 sales x 10 models x 5 days x 52 weeks x 4 years =
5,200,000 rows
Retail Sales
1000 sales x 2 models x 7 days x 52 weeks x 4 years =
2,912,000 rows

Size of Sales Fact = 8,112,000 rows
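
The row-count arithmetic for the case study can be checked with a few lines of Python. The variable names are illustrative; the figures are the ones given in the sizing steps.

```python
years = 4
time_rows = years * 365 + 1              # one row per day, one leap day
product_rows = 300 + 10 * 52 * years     # models + weekly model changes
inventory_rows = 7 * 300 * time_rows     # plants x models x days
corporate_rows = 500 * 10 * 5 * 52 * years
retail_rows = 1000 * 2 * 7 * 52 * years
sales_rows = corporate_rows + retail_rows
```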

Case Study (contd..)


11. Record Metadata
Model (Name, Definition, Purpose, Contact Person, List of
Facts, dimensions and measures)

Fact (Name, Definition, Alias, Load Frequency, Measure,
Grain of time, dimensions, contact person)

Dimension (Name, Definition, Alias, hierarchy, change rule,
load frequency, attribute, fact, measure, contact person)

Attribute (Name, Definition, Alias, change rule, data type,
domain, derivation rule)

Measure (Name, Definition, Alias, data type, domain,
derivation rule, fact, dimension)

Case Study (contd..)


12. Validate the Model with user
Confirms that model meets user requirements
Confirms that user understands the model.
Validated portion goes through design
Remaining goes back in iterative development of model
