DataModel Session ELTP
Operational Systems
Traditional Applications designed to run the day-to-day business of the Enterprise
External Systems ***
Data used within an Enterprise that is obtained from outside sources
Staging Areas ***
Created to aid in the collection and transformation of data that is targeted for a Data
Warehouse
Operational Data Store ***
W. H. Inmon and Claudia Imhoff definition: "A subject-oriented, integrated, volatile, current-valued
data store containing only corporate detailed data".
Data Warehouse (DW)
W. H. Inmon definition: "A subject-oriented, integrated, non-volatile, time-variant collection of data
organized to support management needs".
Student is an entity.
Attributes of Student: First Name, Address, City
A specific entity will have a value for each of its attributes.
E.g.: A specific employee entity may have Name='John Smith', SSN='123456789', …
Each attribute has a value set (or data type) associated with it.
E.g.: integer, string, date, enumerated type, …
Simple Attributes
•Each entity has a single atomic value for the attribute.
E.g.: SSN or Sex
Composite Attributes
•The attribute may be composed of several components.
E.g.: Address (Apt#, House#, Street, City, State, Zip_Code, Country) or Name (First_Name, Middle_Name, Last_Name).
•Composition may form a hierarchy where some components are themselves composite.
Multi-valued Attributes
•An entity may have multiple values for the attribute.
E.g.: Color of a CAR or Previous Degrees of a STUDENT.
Nested Attributes
In general, composite and multi-valued attributes may be nested arbitrarily to any number of levels, although this is rare.
E.g.: Previous Degrees of a STUDENT is a composite multi-valued attribute denoted by {Previous Degrees (College, Year, Degree, Field)}.
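The attribute kinds above can be sketched in code. The following is only an illustration, assuming Python dataclasses as the modelling tool; the class names, field names and sample values are hypothetical and not part of the course material.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Name:
    # Composite attribute: Name (First_Name, Middle_Name, Last_Name)
    first_name: str
    middle_name: str
    last_name: str


@dataclass
class PreviousDegree:
    # One value of the composite multi-valued attribute
    # {Previous Degrees (College, Year, Degree, Field)}
    college: str
    year: int
    degree: str
    field_of_study: str


@dataclass
class Student:
    ssn: str      # simple attribute: a single atomic value
    name: Name    # composite attribute
    # multi-valued attribute: zero or more values per entity
    previous_degrees: List[PreviousDegree] = field(default_factory=list)


# Illustrative values only (assumed, not from the source).
student = Student(ssn="123456789", name=Name("John", "Q", "Smith"))
student.previous_degrees.append(PreviousDegree("Example College", 2001, "BSc", "Physics"))
student.previous_degrees.append(PreviousDegree("Example College", 2003, "MSc", "Physics"))
```

In a relational design, a multi-valued attribute such as Previous Degrees would typically move to its own table keyed by the owning entity.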
[Diagram: BUILDING related 1:N to APARTMENT]
Classified by their
Degree
Connectivity
Cardinality
Direction
Existence.
One entity related to another of the same entity type
Entities of two different types related to each other
Entities of three different types related to each other
•Maximum Cardinality
The maximum number
One-to-one (1:1)
One-to-many (1:N) or Many-to-one (N:1)
Many-to-many
Next step is to build the ER Diagram from the entities and data
items identified in the requirements.
An entity that does not relate to any other entity may end up
as a “stand alone” table with no defined relationships.
The XYZ Company wants Satyam to design and develop a database system for
its regular operations.
There are managers who manage and monitor the work done by the
employees. When an employee is assigned to a project, the hours are
calculated from the number of hours the employee is scheduled to work on
that project.
Although most employees have managers, senior staff. The date on which a
manager started managing the department could be stored as an attribute of
department.
A department may be spread over many locations. The department name and
number are unique for each department. An employee may have a number of
dependants.
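[ER diagram for the XYZ Company]
Employee: Name (Fname, Mname, Lname), SSNO, Address, Salary, Sex, Bdate
Department: Dname, Dnumber, Dlocation, Number of employees
Project: Pname, Pnumber, Plocation
Dependant
WORKS_FOR: Employee (N) to Department (1)
MANAGES: Employee (1) to Department (1), with attribute Startdate
CONTROLS: Department (1) to Project (N)
WORKS_ON: Employee (M) to Project (N), with attribute Hours
SUPERVISION: Employee as supervisor (1) to Employee as supervisee (N)
DEPENDANTS_OF: Employee (1) to Dependant (N)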
Employee (Manager) 1 MANAGES 1 Department
Department 1 WORKFOR N Employees
Dependants N DEPENDANT_OF 1 Employees
[Diagram: Employee "Works On" Project; Project "Have" Employee]
These 2 entities have 2 relationships - 1 to many in
each direction - resulting in a many-many
relationship.
Employees are optionally assigned to one or more
Projects, as appropriate. A Project must have at
least 1 employee.
What kind of table design does this suggest?
2 Tables plus a table with a column for each entity.
(Employee, Project, Employee_Project)
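The three-table design can be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 module; the column names (ssno, pnumber, hours) follow the ER diagram labels, while the column types and sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Employee (ssno TEXT PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE Project  (pnumber INTEGER PRIMARY KEY, pname TEXT NOT NULL);
-- Junction table: one column per entity, plus the relationship attribute.
CREATE TABLE Employee_Project (
    ssno    TEXT    NOT NULL REFERENCES Employee(ssno),
    pnumber INTEGER NOT NULL REFERENCES Project(pnumber),
    hours   REAL,
    PRIMARY KEY (ssno, pnumber)
);
""")

# Sample rows are made up for illustration.
conn.executemany("INSERT INTO Employee VALUES (?, ?)",
                 [("111", "Asha"), ("222", "Ravi")])
conn.executemany("INSERT INTO Project VALUES (?, ?)",
                 [(1, "Billing"), (2, "Payroll")])
conn.executemany("INSERT INTO Employee_Project VALUES (?, ?, ?)",
                 [("111", 1, 10.0), ("111", 2, 5.0), ("222", 1, 20.0)])

# One employee, many projects.
projects_of_111 = [row[0] for row in conn.execute(
    "SELECT p.pname FROM Employee_Project ep "
    "JOIN Project p ON p.pnumber = ep.pnumber "
    "WHERE ep.ssno = ? ORDER BY p.pname", ("111",))]

# One project, many employees.
staff_on_billing = [row[0] for row in conn.execute(
    "SELECT e.name FROM Employee_Project ep "
    "JOIN Employee e ON e.ssno = ep.ssno "
    "WHERE ep.pnumber = ? ORDER BY e.name", (1,))]
```

The composite primary key on the junction table prevents the same employee from being assigned to the same project twice.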
© Mahindra Satyam 2009 38
RECURSIVE RELATIONSHIPS
[Diagram: EMPLOYEE MANAGES EMPLOYEE, with cardinalities (1,N) and (1,1)]
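A recursive relationship is commonly implemented as a foreign key that points back to the same table. The sketch below assumes Python's sqlite3 module, a hypothetical manager_ssno column, and made-up rows; allowing NULL for staff without a manager is also an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE Employee (
    ssno         TEXT PRIMARY KEY,
    name         TEXT NOT NULL,
    -- Self-reference: NULL means the employee has no manager (assumption).
    manager_ssno TEXT REFERENCES Employee(ssno)
)""")
conn.executemany("INSERT INTO Employee VALUES (?, ?, ?)", [
    ("100", "Meera", None),
    ("200", "Arun", "100"),
    ("300", "Kiran", "100"),
])

# Self-join: the same table plays the manager role (m) and the employee role (e).
manager_pairs = conn.execute("""
    SELECT m.name, e.name
    FROM Employee e
    JOIN Employee m ON e.manager_ssno = m.ssno
    ORDER BY e.name
""").fetchall()
```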
The Specialization
Generalization
Dimensional Model
Definition
Logical data model used to represent the measures and dimensions that
pertain to one or more business subject areas
Dimensional Model = Star Schema
Understandable
Systematically represents history
Reliable join paths
Enterprise scalability
[Figure: subject area E/R models (Manufacturing and Process Control; Shipping and Inventory Management; Sales Order Entry and Campaign Management; Customer Support and Relationship Management) compared with subject area dimensional models, and an enterprise scope E/R model compared with an enterprise scope dimensional model]
Measures
Metrics or indicators by which people evaluate a business process
Referred to as "Facts"
Examples
Margin
Inventory Amount
Return Rate
Coffee Maker Fulfillment Report
Brand / Product       Units Sold  Units Shipped  % Shipped
Deluxe Coffee Maker   2,073       1,658          80%
All Products          9,473       7,090          75%
Facts
Dimension tables
Store dimension values
Dimension attributes
Specify the way in which measures are viewed: rolled up, broken out or summarized
Often follow the word "by", as in "Show me Sales by Region and Quarter"
[Diagram: dimension tables, each with a key and a list of attributes]
Process measures
Start by assigning one fact table per business subject area
Fact tables store the process measures (aka Facts)
Compared to dimension tables, fact tables usually
[Diagram: Fact Table with columns fact1, fact2, fact3]
Grain
The level of detail represented by a
row in the fact table
Must be identified early
Cause of greatest confusion during design process
Example
Each row in the fact table represents
the daily item sales total
Scenario
Industry: Automobile manufacturing
Company: Millennium Motors
Value chain focus: Sales
Sample business questions:
What are the top 10 selling car models this month?
How do this month's top 10 selling models compare to the top 10 over
the last six months?
Show me dealer sales by region by model by day
What is the total number of cars sold by month by dealer by state?
List facts and dimensions
Facts
Sales revenue
Quantity sold
Dimensions
Model name
Month
Dealer name
Region
State
Date
Sales Facts
model_key
dealer_key
time_key
revenue
quantity
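To make the scenario concrete, the sketch below builds a small star schema and answers the first business question (top selling models this month). It assumes Python's sqlite3 module; the table names (model_dim, dealer_dim, time_dim), the month format, the model names and all sample rows are hypothetical, while the fact columns follow the Sales Facts list above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE model_dim  (model_key INTEGER PRIMARY KEY, model TEXT);
CREATE TABLE dealer_dim (dealer_key INTEGER PRIMARY KEY, dealer TEXT,
                         state TEXT, region TEXT);
CREATE TABLE time_dim   (time_key INTEGER PRIMARY KEY, sale_date TEXT, month TEXT);
CREATE TABLE sales_facts (
    model_key  INTEGER REFERENCES model_dim(model_key),
    dealer_key INTEGER REFERENCES dealer_dim(dealer_key),
    time_key   INTEGER REFERENCES time_dim(time_key),
    revenue    REAL,
    quantity   INTEGER
);
""")

# Made-up sample data for illustration.
conn.executemany("INSERT INTO model_dim VALUES (?, ?)",
                 [(1, "Alpha"), (2, "Bravo")])
conn.executemany("INSERT INTO dealer_dim VALUES (?, ?, ?, ?)",
                 [(1, "Honest Ted's", "Massachusetts", "Northeast"),
                  (3, "Wright Motors", "Arizona", "Southwest")])
conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?)",
                 [(1, "2001-01-30", "2001-01"),
                  (2, "2001-01-31", "2001-01"),
                  (3, "2001-02-01", "2001-02")])
conn.executemany("INSERT INTO sales_facts VALUES (?, ?, ?, ?, ?)",
                 [(1, 1, 1, 20000.0, 1),
                  (1, 3, 2, 40000.0, 2),
                  (2, 1, 2, 15000.0, 1),
                  (1, 1, 3, 20000.0, 1),
                  (2, 3, 3, 45000.0, 3)])

# "What are the top 10 selling car models this month?"
top_models = conn.execute("""
    SELECT m.model, SUM(f.quantity) AS units
    FROM sales_facts f
    JOIN model_dim m ON m.model_key = f.model_key
    JOIN time_dim  t ON t.time_key  = f.time_key
    WHERE t.month = ?
    GROUP BY m.model
    ORDER BY units DESC
    LIMIT 10
""", ("2001-01",)).fetchall()
```

Every business question in the scenario follows the same pattern: join the fact table to the dimensions named after "by", filter on dimension attributes, and sum the facts.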
Sales Facts
Fully additive
Can be summed across any and all dimensions
Stored in fact table
Examples: revenue, quantity
[Diagram: star schema. Sales Facts (model_key, dealer_key, time_key, revenue, quantity); Time (time_key, year, quarter, month, date); Model (model_key, brand, category, line, model); Dealer (dealer_key, region, state, city, dealer)]
Semi-additive
Can be summed across most dimensions but not all
Examples: Inventory quantities, account balances, or personnel counts
Anything that measures a "level"
Must be careful with ad-hoc reporting
Often aggregated across the "forbidden dimension" by averaging
[Diagram: star schema. Sales Facts (model_key, dealer_key, time_key, inventory); Time (time_key, year, quarter, month, date); Model (model_key, brand, category, line, model); Dealer (dealer_key, region, state, city, dealer)]
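A small sketch of semi-additive behaviour, assuming Python's sqlite3 module and a hypothetical inventory_facts table with made-up daily snapshot rows. Inventory can be summed across dealers for a given day, but across time (the forbidden dimension for a level measure such as inventory) it is averaged instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inventory_facts (dealer_key INTEGER, time_key INTEGER, inventory INTEGER)")
conn.executemany("INSERT INTO inventory_facts VALUES (?, ?, ?)", [
    (1, 1, 10), (2, 1, 20),   # day 1 snapshot
    (1, 2, 12), (2, 2, 18),   # day 2 snapshot
])

# Valid: sum across the dealer dimension within each day.
daily_totals = conn.execute("""
    SELECT time_key, SUM(inventory)
    FROM inventory_facts
    GROUP BY time_key
    ORDER BY time_key
""").fetchall()

# Misleading: summing across time counts the same cars once per day.
naive_sum_over_time = conn.execute(
    "SELECT SUM(inventory) FROM inventory_facts").fetchone()[0]

# Usual alternative: average the daily totals across time.
average_daily_inventory = conn.execute("""
    SELECT AVG(daily_total)
    FROM (SELECT SUM(inventory) AS daily_total
          FROM inventory_facts
          GROUP BY time_key)
""").fetchone()[0]
```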
Facts
Non-Additive
Cannot be summed across any dimension
All ratios are non-additive
Break down to fully additive components, store them in fact table
Margin_rate is non-additive
Margin_rate = margin_amt / revenue
[Diagram: star schema. Sales Facts (model_key, dealer_key, time_key, revenue, margin_amt); Time (time_key, year, quarter, month, date); Model (model_key, brand, category, line, model); Dealer (dealer_key, region, state, city, dealer)]
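The following sketch shows why the ratio is computed from the stored additive components rather than stored and summed itself. The two fact rows are made-up values; only the formula Margin_rate = margin_amt / revenue comes from the slide.

```python
# Two fact rows holding the fully additive components (illustrative values).
fact_rows = [
    {"revenue": 100.0, "margin_amt": 20.0},   # row margin rate 0.20
    {"revenue": 300.0, "margin_amt": 30.0},   # row margin rate 0.10
]

# Wrong: adding per-row ratios gives a number with no business meaning.
sum_of_rates = sum(r["margin_amt"] / r["revenue"] for r in fact_rows)

# Right: sum the additive components first, then divide once.
total_revenue = sum(r["revenue"] for r in fact_rows)
total_margin = sum(r["margin_amt"] for r in fact_rows)
overall_margin_rate = total_margin / total_revenue
```

Here the overall rate is 50 / 400 = 0.125, which differs from both the sum (about 0.30) and the simple average (about 0.15) of the row-level rates.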
Unit Amounts
Dealer Dimension
dealer_key  region     state          city       dealer
1           Northeast  Massachusetts  Boston     Honest Ted's
2           Northeast  Massachusetts  Boston     Stoller Co.
3           Southwest  Arizona        Tucson     Wright Motors
12          Southwest  California     San Diego  American
245         Central    Illinois       Chicago    Lugwig Motors
Characteristics
Hold the dimensional attributes
Usually have a large number of attributes ("wide")
Add flags and indicators that make it easy to perform specific types of reports
Have small number of rows in comparison to fact tables (most of the time)
[Figure: example dimension rows for customer 123 (Sue Jones, S, $30K and Sue Smith, M, $60K) shown alongside fact rows dated 1/31/01 and 2/01/01]
Aggregates
Pre-stored fact summaries
Along one or more dimensions
The most effective tool for improving performance
Examples
Summary of sales by region, by product, by category
Monthly sales
Aggregate rationale
Improve end user query performance
Reduce required CPU cycles
Powerful cost saving tool
Restrictions
Additive facts only
Must use dimensional design
Aggregate Guidelines
Separate Tables
Separate fact table for every aggregate
Separate dimension table for every aggregate dimension
Same number of fact records as level field tables
Advantage
Removes possibility of double counting
Schema clarity
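As an illustration of a separate aggregate fact table, the sketch below rolls a daily fact table up to month level. It assumes Python's sqlite3 module; the table name mthly_sales_facts_agg, the month_key encoding (YYYYMM) and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, month_key INTEGER);
CREATE TABLE sales_facts (time_key INTEGER, product_key INTEGER,
                          market_key INTEGER, quantity INTEGER, amount REAL);
""")
conn.executemany("INSERT INTO time_dim VALUES (?, ?)",
                 [(1, 200101), (2, 200101), (3, 200102)])
conn.executemany("INSERT INTO sales_facts VALUES (?, ?, ?, ?, ?)",
                 [(1, 1, 1, 5, 50.0), (2, 1, 1, 3, 30.0),
                  (3, 1, 1, 4, 40.0), (2, 2, 1, 1, 25.0)])

# Pre-store the summary in its own fact table, keyed by month instead of day.
conn.execute("""
CREATE TABLE mthly_sales_facts_agg AS
SELECT t.month_key, f.product_key, f.market_key,
       SUM(f.quantity) AS quantity, SUM(f.amount) AS amount
FROM sales_facts f
JOIN time_dim t ON t.time_key = f.time_key
GROUP BY t.month_key, f.product_key, f.market_key
""")

agg_rows = conn.execute("""
    SELECT month_key, product_key, market_key, quantity, amount
    FROM mthly_sales_facts_agg
    ORDER BY month_key, product_key
""").fetchall()

# Sanity check: additive facts must total the same in base and aggregate tables.
base_total = conn.execute("SELECT SUM(amount) FROM sales_facts").fetchone()[0]
agg_total = conn.execute("SELECT SUM(amount) FROM mthly_sales_facts_agg").fetchone()[0]
```

Because only additive facts are aggregated, the totals in the aggregate table reconcile exactly with the base fact table.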
[Diagram: one-way aggregate. Mthly Sales Facts Agg (month_key, product_key, market_key, Quantity, Amount) joins Month (month_key, Year, Fiscal Period, Month), Product and Market. Base schema: Sales Facts (time_key, product_key, market_key, Quantity, Amount) joins Time (time_key, Year, Fiscal Period, Month, Day, Day of Week), Product (product_key, Category, Brand, Product, Diet Indicator) and Market (market_key, Region, District, State, City)]
[Diagram: two fact tables sharing dimensions. Shipment Facts (time_key, product_key, shipper_key, market_key, Quantity, Weight) and Sales Facts (time_key, product_key, market_key, Quantity, Amount) both join Time (time_key, Year, Fiscal Period, Month, Day, Day of Week), Product (product_key, Category, Brand, Product, Diet Indicator) and Market (market_key, Region, District, State, City); Shipment Facts also joins Shipper (shipper_key, name, type, mode, address)]
Drilling down
Adding dimensional detail
Further breaks out a
[Figure: Quarterly Auto Sales Summary. Units Sold and Revenue by Region (Northeast, Southeast, Central, Northwest, Southwest), then drilled down by State (Northeast: Maine, New York, Massachusetts; Southeast: Florida, Georgia, Virginia)]
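In query terms, drilling down amounts to adding a dimension attribute to the grouping. The sketch below assumes Python's sqlite3 module and a single hypothetical sales_by_state table; the region and state names come from the report above, the unit counts are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_by_state (region TEXT, state TEXT, units INTEGER)")
conn.executemany("INSERT INTO sales_by_state VALUES (?, ?, ?)", [
    ("Northeast", "Maine", 10),
    ("Northeast", "New York", 40),
    ("Northeast", "Massachusetts", 25),
    ("Southeast", "Florida", 30),
    ("Southeast", "Georgia", 20),
])

# Summary level: one row per region.
by_region = conn.execute("""
    SELECT region, SUM(units)
    FROM sales_by_state
    GROUP BY region
    ORDER BY region
""").fetchall()

# Drill down: add the state attribute to the grouping.
by_region_state = conn.execute("""
    SELECT region, state, SUM(units)
    FROM sales_by_state
    GROUP BY region, state
    ORDER BY region, state
""").fetchall()
```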
Definition
Dimensions are conformed when they are the same
-or-
When one dimension is a strict rollup of another
Rolled up dimension
When one dimension is a strict rollup of another
Which means
Two conformed dimensions can be combined into a single
logical dimension by creating a union of the attributes
Description
Shared common dimensions
Integrates logical design
Ensures consistency between data marts
Allows incremental development
Independent of physical location
Some re-work may be required
Advantages
Enables an incremental development approach
Easier and cheaper to maintain
Drastically reduces extraction and loading complexity
Answers business questions that cross data marts
Supports both centralized and distributed architectures
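A question that crosses data marts can be answered by querying each fact table separately and merging the results on a conformed dimension attribute. The sketch below is one way to do this, assuming Python's sqlite3 module; the table names, product names and quantities are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Conformed dimension shared by both data marts.
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, product TEXT);
CREATE TABLE sales_facts    (product_key INTEGER, quantity INTEGER);
CREATE TABLE shipment_facts (product_key INTEGER, quantity INTEGER);
""")
conn.executemany("INSERT INTO product_dim VALUES (?, ?)",
                 [(1, "Cola"), (2, "Diet Cola")])
conn.executemany("INSERT INTO sales_facts VALUES (?, ?)",
                 [(1, 10), (1, 5), (2, 7)])
conn.executemany("INSERT INTO shipment_facts VALUES (?, ?)",
                 [(1, 12), (2, 7), (2, 1)])


def totals_by_product(fact_table):
    # fact_table is one of the two fixed names used below, never user input.
    sql = (f"SELECT p.product, SUM(f.quantity) FROM {fact_table} f "
           "JOIN product_dim p ON p.product_key = f.product_key "
           "GROUP BY p.product")
    return dict(conn.execute(sql).fetchall())


sold = totals_by_product("sales_facts")
shipped = totals_by_product("shipment_facts")

# Merge the two result sets on the conformed attribute: (sold, shipped) per product.
sold_vs_shipped = {
    product: (sold.get(product, 0), shipped.get(product, 0))
    for product in sorted(set(sold) | set(shipped))
}
```

Querying each fact table on its own and merging afterwards avoids joining two fact tables directly, which would multiply rows and double count.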
Creating a Model
Templates - To save time, you can also start working from a
template that you or others in your workgroup have created. When
you create a model from a template, all the objects and display
settings in the template are automatically applied to the new
model.
Subject Areas - For each new model, ERwin also automatically
creates a subject area (Main Subject Area). You can create
additional subject areas.
Stored Displays – Represent a different view of a subject area
without the need to change settings repeatedly.
Model Types – Logical, Physical, Logical/Physical or
Logical/Dimensional
Modeling Preferences - You can customize your working
environment using ERwin's many display options and model
preferences. You can also choose to create your model using
IDEF1X or IE notation.
Creating a Model
Reverse Engineering - Create a model by reverse
engineering an existing database.
ER1 - Standard ERwin file format. ERwin version 3.5.2 and later are
supported.
XML - ERwin metamodel saved as an Extensible Markup Language file.
When you open an ERwin model saved in XML format, ERwin reads the
data structure specified in the XML file and automatically reverse
engineers the database and creates a matching data model diagram.
ERS, SQL DDL (Data Definition Language) - schema script text file.
When you open a text file with this extension, ERwin reads the data
structure specified in the text file and automatically reverse engineers
the database and creates a matching data model.
DBF - A file name with this extension is a database file in dBASE
format. When you open a DBF file, ERwin automatically reverse
engineers the database and creates a matching data model.
MDB - A file name with this extension is a database file in Microsoft
Access format. When you open an *.mdb file, ERwin automatically
reverse engineers the database and creates a matching data model.
Data problems
– lack of resources, data hoarding, lack of data knowledge
System users
– not committed, not convinced, lack of time
Cost
CASE STUDY - 1
PURPOSE:-
The aim of the case study is to introduce you to the concepts
and principles involved in dimensional modeling design and
development.
The company has 3 manufacturing plant units (Pune, Lucknow and JSR)
The company has 5 cost centers. (cost centers categorized by product group)
(HCV,MCV,LCV, Tata Indica and Tata Safari)
Each plant has several store locations
The company has 5 product groups. Each product group has several models