Architecture of Three Tier Data Warehouse
[Figure: three-tier data warehouse architecture]
Top Tier (Front-end Processing): users issue SQL queries against relational views with OLAP, or issue OLAP commands.
Middle Tier (OLAP Server): OLAP implementation as ROLAP, MOLAP, or HOLAP.
Bottom Tier (Data Warehouse Server): data storage in a star schema design (a fact table with dimension tables 1 to n), loaded by data extraction from source databases 1 to m.
2008/2/4
Data Warehouse for Decision Support
A database is a collection of data organized by a database
management system.
A data warehouse is a read-only analytical database used for
decision support system operations.
A data warehouse for decision support often takes data
from various platforms, databases, and files as source data.
The use of advanced tools and specialized technologies may
be necessary in the development of decision support
systems, which affects tasks, deliverables, training, and
project timelines.
2008/1/29
Data Warehouse for end users
A data warehouse is designed to be readily understood by
analysts and end users, even those who are not
familiar with database structure.
A data warehouse is a collection of integrated, denormalized databases designed for fast query
response.
In general, data warehouse storage is planned for at
least 5 years of long-term capacity growth.
Phases of the Decision Support Life Cycle
1. Planning
2. Gathering Data Requirements and Modeling
3. Physical Database Design and Development
4. Data Mapping and Transformation
5. Data Extraction and Load
6. Automating the Data Management Process
7. Application Development: Creating the starter set
of reports
8. Data Validation and Testing
9. Training
10. Rollout
Phase 1: Planning
Planning for a data warehouse is concerned with:
Defining the project scope
Creating the project plan
Defining the necessary resources, both internal and
external
Defining the tasks and deliverables
Defining timelines
Defining the final project deliverables
Capacity Planning
Calculate the record size for each table
Estimate the number of initial records for
each table
Review the data warehouse access
requirements to predict index requirements
Determine the growth factor for each table
Identify the largest target table expected
over the selected period of time and add
approximately 25-30% overhead to the table
size to determine temporary storage size
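The capacity planning steps above can be sketched as a small calculation. This is only an illustration: the table names, record sizes, record counts, and growth rates are made-up assumptions, growth is assumed to compound yearly, and index space is not modeled.

```python
from dataclasses import dataclass


@dataclass
class TableEstimate:
    """Sizing inputs for one warehouse table (all values are illustrative)."""
    name: str
    record_bytes: int       # record size for the table
    initial_records: int    # estimated number of initial records
    yearly_growth: float    # growth factor per year, e.g. 0.20 for 20%


def projected_bytes(table, years):
    """Table size after compound growth over the planning period."""
    records = table.initial_records * (1 + table.yearly_growth) ** years
    return records * table.record_bytes


def temporary_storage_bytes(tables, years, overhead=0.30):
    """Largest projected table plus overhead (the text suggests 25-30%)."""
    largest = max(projected_bytes(t, years) for t in tables)
    return largest * (1 + overhead)


# Hypothetical tables, used only to exercise the functions.
tables = [
    TableEstimate("sales_fact", record_bytes=48, initial_records=1_000_000, yearly_growth=0.20),
    TableEstimate("product_dim", record_bytes=200, initial_records=10_000, yearly_growth=0.05),
]
total = sum(projected_bytes(t, years=5) for t in tables)
temp = temporary_storage_bytes(tables, years=5)
print(f"projected data: {total / 1e6:.1f} MB, temporary storage: {temp / 1e6:.1f} MB")
```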
Phase 2: Gathering data requirements and Modeling
Gathering Data Requirements:
How does the user do business?
How is the user's performance measured?
What attributes does the user need?
What are the business hierarchies?
What data do users use now and what would they
like to have?
What levels of detail or summary do the users need?
Data Modeling
A logical data model covering the scope of the
development project including relationships,
cardinality, attributes, and candidate keys.
or
A Dimensional Business Model that diagrams the
facts, dimensions, hierarchies, relationships and
candidate keys for the scope of the development
project
Phase 3: Physical Database
Design and Development
Designing the database, including fact
tables, relationship tables, and description
(lookup) tables.
Denormalizing the data.
Identifying keys.
Creating indexing strategies.
Creating appropriate database objects.
Phase 4: Data Mapping and
Transformation
Defining the source systems.
Determining file layouts.
Developing written transformation specifications for
sophisticated transformations.
Mapping source to target data.
Reviewing capacity plans.
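A source-to-target mapping can be written down as data before any load program exists. The sketch below is one possible shape, not a prescribed format: the source field names (PROD_NO, PROD_NAME, PACK_SIZE) and the transformations are hypothetical.

```python
# Hypothetical mapping: target column -> (source field, transformation).
MAPPING = {
    "Product_Id": ("PROD_NO", int),
    "Prod_Desc": ("PROD_NAME", lambda s: s.strip().title()),
    "Size": ("PACK_SIZE", lambda s: s.strip().upper()),
}


def transform(source_row):
    """Apply the mapping to one source record and return the target record."""
    return {target: fn(source_row[source]) for target, (source, fn) in MAPPING.items()}


# One made-up source record.
source_row = {"PROD_NO": "0042", "PROD_NAME": "  ranch dressing ", "PACK_SIZE": "16oz"}
target_row = transform(source_row)
print(target_row)
```

Keeping the mapping in one table-like structure makes it easy to review against the written transformation specifications.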
Phase 5: Populating the data
warehouse
Developing procedures to extract and move the
data.
Developing procedures to load the data into the
warehouse.
Developing programs or using data transformation
tools to transform and integrate data.
Testing extract, transformation and load
procedures
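A minimal extract, transform, and load procedure might look like the following. It is a sketch under stated assumptions: the CSV text stands in for a real source extract, the market table layout is invented for the example, and an in-memory SQLite database stands in for the warehouse.

```python
import csv
import io
import sqlite3

# Stand-in for a source extract file (made-up data).
SOURCE_CSV = "market_id,market_desc,region\n1, boston ,east\n2, denver ,west\n"


def extract(text):
    """Read source records from CSV text."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Clean and convert source records into target tuples."""
    return [(int(r["market_id"]), r["market_desc"].strip().title(), r["region"].upper())
            for r in rows]


def load(conn, rows):
    """Insert the transformed rows into the target table in one transaction."""
    with conn:
        conn.executemany("INSERT INTO market VALUES (?, ?, ?)", rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE market (market_id INTEGER PRIMARY KEY, market_desc TEXT, region TEXT)")
extracted = extract(SOURCE_CSV)
load(conn, transform(extracted))

# A simple test of the procedure: rows loaded should equal rows extracted.
loaded = conn.execute("SELECT COUNT(*) FROM market").fetchone()[0]
print(len(extracted), loaded)
```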
Phase 6: Automating Data
Management Procedures
Automating and scheduling the data load
process.
Creating backup and recovery procedures.
Conducting a full test of all of the
automated procedures.
Phase 7: Application Development Creating the Starter Set of Reports
Creating the starter set of predefined
reports.
Developing core reports.
Testing reports.
Documenting applications.
Developing navigation paths.
Phase 8: Data Validation and
Testing
Validating Data using the starter set of
reports.
Validating Data using standard processes.
Iteratively changing the data.
Phase 9: Training
To gain real business value from your warehouse
development, users of all levels will need to be
trained in:
The scope of the data in the warehouse.
The front end access tool and how it works.
The DSS application or starter set of reports - the
capabilities and navigation paths.
Ongoing training/user assistance as the system
evolves
Phase 10: Rollout
Installing the physical infrastructure for all users.
Developing the DSS application.
Creating procedures for adding new reports and
expanding the DSS application.
Setting up procedures to back up the DSS
application, not just the data warehouse.
Creating procedures for investigating and
resolving data integrity related issues.
Star Schema Database Design
The goals of a decision support database are often
achieved by a database design called a star schema.
A star schema design is a simple structure with
relatively few tables and well-defined join paths.
This database design, in contrast to the normalized
structure used for transaction-processing databases,
provides fast query response time and a simple
schema that is readily understood by the analysts
and end users.
Understanding Star Schema
Design - Facts and Dimensions
A star schema contains two types of tables, fact tables and
dimension tables. Fact tables contain the quantitative or
factual data about a business - the information being
queried. This information is often numerical measurements
and can consist of many columns and millions of rows.
Dimension tables are smaller and hold descriptive data that
reflect the dimensions of a business. SQL queries then use
predefined and user-defined join paths between fact and
dimension tables to return selected information.
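The way a query combines a fact table with a dimension table can be shown with a small example. The tables, columns, and rows here are made up for illustration, and SQLite is used only because it ships with Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (product_id INTEGER PRIMARY KEY, prod_desc TEXT, brand TEXT);
CREATE TABLE sales (product_id INTEGER REFERENCES product(product_id),
                    units INTEGER, dollars REAL);
INSERT INTO product VALUES (1, 'Ranch 16oz', 'BrandA'), (2, 'Italian 8oz', 'BrandB');
INSERT INTO sales VALUES (1, 10, 25.0), (1, 5, 12.5), (2, 7, 14.0);
""")

# The fact table supplies the measures; the dimension table supplies the
# descriptive attribute used for grouping.
rows = conn.execute("""
    SELECT p.brand, SUM(s.units) AS units, SUM(s.dollars) AS dollars
    FROM sales AS s
    JOIN product AS p ON p.product_id = s.product_id
    GROUP BY p.brand
    ORDER BY p.brand
""").fetchall()
print(rows)
```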
Identifying Facts and Dimensions
Look for the elemental transactions within the business
process. This identifies entities that are candidates to be
fact tables.
Determine the key dimensions that apply to each fact. This
identifies entities that are candidates to be dimension
tables.
Check that a candidate fact is not actually a dimension
with embedded facts.
Check that a candidate dimension is not actually a fact
table within the context of the decision support
requirement.
Step 1 Look for the elemental transactions within the business process
The first step in identifying fact tables is to
examine the business and identify the
transactions that may be of interest. They
will tend to be transactions that describe
events fundamental to the business.
Step 2 Determine the key dimensions that apply to each fact
The next step is to identify the main
dimensions for each candidate fact table.
This can be achieved by looking at the
logical model, and finding out which entities
are associated with the entity representing
the fact table. The challenge here is to focus
on the key dimension entities.
Step 3 Check that a candidate fact is not actually a
dimension table with denormalized facts
Look for denormalized dimensions within
candidate fact tables. It may be the case that the
candidate fact table is a dimension containing
repeating groups of factual attributes.
Step 4 Check that a candidate dimension is not a fact table
If the business requirement is geared toward
analysis of the entity that is currently a
candidate dimension, it is probably more
appropriate to make it a fact table.
Simple Star Schemas
Each table must have a primary key, which is a
column or group of columns whose contents
uniquely identify each row. In a simple star schema,
the primary key for the fact table is composed of
one or more foreign keys. When a database is
created, the SQL statements used to create the
tables will designate the columns that are to form
the primary and foreign keys.
A sales database with a simple star schema
Sales Table (fact table): Period_Id, Product_Id, Market_Id, Units, Dollars, Discount%
Period Table (dimension table): Period_Id, Period_Desc, Quarter, Year
Product Table (dimension table): Product_Id, Prod_Desc, Brand, Size
Market Table (dimension table): Market_Id, Market_Desc, District, Region
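The sales star schema can be written as SQL table definitions in which the fact table's primary key is the combination of its foreign keys. This is a sketch, not the author's DDL: the column types are assumptions, discount_pct stands in for Discount%, and SQLite is used only for convenience.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
CREATE TABLE period  (period_id  INTEGER PRIMARY KEY, period_desc TEXT, quarter TEXT, year INTEGER);
CREATE TABLE product (product_id INTEGER PRIMARY KEY, prod_desc TEXT, brand TEXT, size TEXT);
CREATE TABLE market  (market_id  INTEGER PRIMARY KEY, market_desc TEXT, district TEXT, region TEXT);
CREATE TABLE sales (
    period_id  INTEGER NOT NULL REFERENCES period(period_id),
    product_id INTEGER NOT NULL REFERENCES product(product_id),
    market_id  INTEGER NOT NULL REFERENCES market(market_id),
    units INTEGER, dollars REAL, discount_pct REAL,
    PRIMARY KEY (period_id, product_id, market_id)
);
INSERT INTO period  VALUES (1, 'Week 1', 'Q1', 2008);
INSERT INTO product VALUES (1, 'Ranch', 'BrandA', '16oz');
INSERT INTO market  VALUES (1, 'Boston', 'North', 'East');
INSERT INTO sales   VALUES (1, 1, 1, 10, 25.0, 0.0);
""")


def try_insert(row):
    """Return True if the fact row is accepted, False if a key constraint rejects it."""
    try:
        with conn:
            conn.execute("INSERT INTO sales VALUES (?, ?, ?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


duplicate_ok = try_insert((1, 1, 1, 3, 7.5, 0.0))  # same key combination as the existing row
orphan_ok = try_insert((1, 99, 1, 3, 7.5, 0.0))    # product 99 is not in the dimension table
print(duplicate_ok, orphan_ok)
```

Both inserts are rejected: the first by the composite primary key, the second by the foreign key to the product dimension.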
Multiple Fact Tables
A star schema can contain multiple fact tables.
Multiple fact tables exist because they contain
unrelated facts or because periodicity of the load
times differs. In other cases, multiple fact tables
exist because they improve performance. Creating
different tables for different levels of aggregation is
a common design technique for a data warehouse
database so that any single request is against a table
of reasonable size.
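Building a second fact table at a coarser level of aggregation can be sketched as follows. The schema is an assumption made for the example (in particular, group_id is placed directly on the product table), and the data are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (product_id INTEGER PRIMARY KEY, group_id INTEGER);
CREATE TABLE sales (period_id INTEGER, product_id INTEGER, units INTEGER, dollars REAL);
INSERT INTO product VALUES (1, 10), (2, 10), (3, 20);
INSERT INTO sales VALUES (1, 1, 5, 10.0), (1, 2, 3, 6.0), (1, 3, 4, 8.0), (2, 1, 2, 4.0);

-- Second fact table at a coarser level: one row per period and product group.
CREATE TABLE product_group_sales AS
SELECT s.period_id, p.group_id, SUM(s.units) AS units, SUM(s.dollars) AS dollars
FROM sales AS s JOIN product AS p ON p.product_id = s.product_id
GROUP BY s.period_id, p.group_id;
""")

base_rows = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
agg_rows = conn.execute("SELECT COUNT(*) FROM product_group_sales").fetchone()[0]
group_10_units = conn.execute(
    "SELECT units FROM product_group_sales WHERE period_id = 1 AND group_id = 10"
).fetchone()[0]
print(base_rows, agg_rows, group_10_units)
```

Group-level requests can then read the smaller aggregate table instead of scanning the base fact table.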
[Figure: a sales database with multiple fact tables]
Sales Table (fact table): Period_Id, Product_Id, Market_Id, Units, Dollars, Discount%
Product_Group Table (fact table): keys include Period_Id and Group_Id
Period Table (dimension table): Period_Id, Period_Desc, Quarter, Year
Product Table (dimension table): Product_Id, Prod_Desc, Brand, Size
Market Table (dimension table): Market_Id, Market_Desc, District, Region
Group Table: Group_Id, Group_Desc
Outboard Tables
Dimension tables can also contain a foreign
key that references the primary key in
another dimension table. The referenced
dimension tables are sometimes referred to
as outboard, outrigger, or secondary
dimension tables.
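A query that reaches an outboard table follows a two-step join path, from the fact table to a dimension table and then to the table that dimension references. The sketch below assumes a market dimension that carries a district_id foreign key; the data are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE district (district_id INTEGER PRIMARY KEY, district_desc TEXT);
CREATE TABLE market (market_id INTEGER PRIMARY KEY, market_desc TEXT,
                     district_id INTEGER REFERENCES district(district_id));
CREATE TABLE sales (market_id INTEGER REFERENCES market(market_id), dollars REAL);
INSERT INTO district VALUES (1, 'New England'), (2, 'Mountain');
INSERT INTO market VALUES (1, 'Boston', 1), (2, 'Hartford', 1), (3, 'Denver', 2);
INSERT INTO sales VALUES (1, 100.0), (2, 50.0), (3, 70.0);
""")

# Join path: fact table -> dimension table -> outboard dimension table.
rows = conn.execute("""
    SELECT d.district_desc, SUM(s.dollars)
    FROM sales AS s
    JOIN market AS m ON m.market_id = s.market_id
    JOIN district AS d ON d.district_id = m.district_id
    GROUP BY d.district_desc
    ORDER BY d.district_desc
""").fetchall()
print(rows)
```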
[Figure: a sales database with outboard tables]
Sales Table (fact table): Period_Id, Product_Id, Market_Id, Units, Dollars, Discount%
Period Table (dimension table): Period_Id, Period_Desc, Quarter, Year
Product Table (dimension table): Product_Id, Prod_Desc, Brand, Size
Market Table (dimension table): Market_Id, Market_Desc, District, Region
District Table (outboard table): District_Id, District_Desc
Region Table (outboard table): Region_Id, Region_Desc
Multi-Star Schema
In some applications the concatenated foreign keys
might not provide a unique identifier for each row
in the fact table. These applications require a multi-star schema.
In a multi-star schema, the fact table has both a set of
foreign keys, which reference dimension tables, and
a primary key, which is composed of one or more
columns that provide a unique identifier for each
row.
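The difference from a simple star schema can be seen in a table definition where the primary key is separate from the foreign keys. In this sketch the choice of (receipt_nbr, receipt_line_item) as the primary key is an assumption, the table is named sales_transaction to avoid the SQL keyword TRANSACTION, and the rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE store (store_id INTEGER PRIMARY KEY, store_name TEXT);
CREATE TABLE sku (sku_id INTEGER PRIMARY KEY, item TEXT);
CREATE TABLE sales_transaction (
    receipt_nbr INTEGER NOT NULL,
    receipt_line_item INTEGER NOT NULL,
    store_id INTEGER NOT NULL REFERENCES store(store_id),
    sku_id INTEGER NOT NULL REFERENCES sku(sku_id),
    units INTEGER, amount REAL,
    PRIMARY KEY (receipt_nbr, receipt_line_item)
);
INSERT INTO store VALUES (1, 'Main Street');
INSERT INTO sku VALUES (7, 'Olive oil');
-- Same store and SKU twice: the foreign keys repeat, the primary key does not.
INSERT INTO sales_transaction VALUES (1001, 1, 1, 7, 2, 9.0);
INSERT INTO sales_transaction VALUES (1002, 1, 1, 7, 1, 4.5);
""")

rows_for_pair = conn.execute(
    "SELECT COUNT(*) FROM sales_transaction WHERE store_id = 1 AND sku_id = 7"
).fetchone()[0]
print(rows_for_pair)
```

Two rows share the same foreign key values, which a primary key built only from foreign keys would not allow.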
Retail sales database designed as a multi-star schema with
two secondary dimension tables
Transaction Table (fact table): Date, Store_Id, SKU_Id, Receipt_Nbr, Receipt_Line_Item, Units, Price, Amount
Store Table: Store_Id, Store_Name, Region, Manager
SKU Table: SKU_Id, Item, Class_Id
Class Table: Class_Id, Class_Desc, Dept_Id
Department table (name not shown in the figure): Dept_Id, Dept_Desc
Snowflake Schema
A snowflake schema is a star schema that
stores all dimensional information in third
normal form, while keeping the fact table
structure the same.
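Normalizing one dimension can be illustrated with plain Python data. The item and supplier names follow the example figure in this section; the rows and the helper function are invented for illustration.

```python
# Denormalized (star) item dimension: supplier attributes repeat on every item row.
item_star = [
    {"item_key": 1, "item_name": "Ranch", "supplier_key": 10, "supplier_type": "wholesale"},
    {"item_key": 2, "item_name": "Italian", "supplier_key": 10, "supplier_type": "wholesale"},
    {"item_key": 3, "item_name": "Caesar", "supplier_key": 20, "supplier_type": "import"},
]

# Snowflaked form: supplier attributes move to their own table, keyed by supplier_key.
supplier = {r["supplier_key"]: {"supplier_type": r["supplier_type"]} for r in item_star}
item = [{k: r[k] for k in ("item_key", "item_name", "supplier_key")} for r in item_star]


def supplier_type_of(item_key):
    """Resolve an item's supplier type through the extra lookup a snowflake requires."""
    row = next(r for r in item if r["item_key"] == item_key)
    return supplier[row["supplier_key"]]["supplier_type"]


print(len(supplier), supplier_type_of(3))
```

The supplier type is now stored once per supplier, at the cost of one more lookup (a join, in SQL terms) per query.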
Example of Snowflake Schema
Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_key
supplier: supplier_key, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city_key
city: city_key, city, province_or_state, country
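Data Warehouse architectures
[Figure: multiple sources feed data transformation and integration, which loads the data warehouse, which serves multiple users]
Sources -> Data Transformation & Integration -> Data Warehouse -> Users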
Case study of building a data warehouse
Step 1 Planning
Capacity planning
Given time dimension:
2 years x 365 days
Product dimension:
average 5 products per transaction
Promotion dimension:
1 promotion type per transaction
Store dimension:
10 local country stores
Customer dimension:
1 customer per transaction
Number of sales transactions:
200 per day for major customers
As a result, the number of base fact records = 2 x 365 x 5 x 1 x 10 x 1 x 200 =
7.3 million records (time x product x promotion x store x customer x transactions per day)
Assume number of key fields = 5, number of fact fields = 7, which
implies total fields = 12
Thus, the base fact table size = 7.3 million x 12 x 4 bytes per field =
350 MB (the size of the dimension tables is negligible).
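The arithmetic can be checked with a few lines of Python. The variable names are illustrative, and the calculation assumes the 200 daily transactions apply to each of the 10 stores, which is the reading needed to arrive at 7.3 million records.

```python
days = 2 * 365              # time dimension: 2 years of daily data
products_per_txn = 5        # product dimension
promotions_per_txn = 1      # promotion dimension
stores = 10                 # store dimension
customers_per_txn = 1       # customer dimension
txns_per_day = 200          # assumed to be per store

base_fact_records = (days * products_per_txn * promotions_per_txn
                     * stores * customers_per_txn * txns_per_day)

fields = 5 + 7              # key fields + fact fields
bytes_per_field = 4
table_bytes = base_fact_records * fields * bytes_per_field

print(base_fact_records, table_bytes / 1e6)
```

This gives 7,300,000 records and about 350 MB, taking 1 MB as one million bytes.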
Step 2 Data Requirements and Modeling
[Figure: dimensional business model]
FACTS: Store Sales
Dimension labels shown: Time, Deal, Product, Distribution Center, Store, Promotion, Customer, Brand, Company
Step 3 Physical database design and development
Example: Design a Simple Star Schema from a relational schema
Identify measurable fields in a Fact table.
Identify selection criteria of the measurement as
keys in a Fact table.
Construct the dimension tables derived from the
keys in the Fact table.
Validate the Simple Star Schema as SR1 type
relation.
Example
Given
Relation A (a1, a2, a3)
Relation B (b1, b2, b3)
Relation C (*a1, *b1, m1, m2)
Derived Simple Star Schema
FACT TABLE: a1, b1, m1, m2
DIMENSION TABLE A: a1, a2, a3
DIMENSION TABLE B: b1, b2, b3
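The derivation in this example can be mimicked by a short function. This is a simplified sketch, not a general algorithm: it assumes exactly one relation carries foreign keys (marked with *), that its remaining attributes are the measures, and that the first attribute of every other relation is its primary key.

```python
relations = {
    "A": ["a1", "a2", "a3"],
    "B": ["b1", "b2", "b3"],
    "C": ["*a1", "*b1", "m1", "m2"],
}


def derive_star(relations):
    """Split relations into one fact table and the dimension tables it references."""
    fact, dimensions = None, {}
    for name, attrs in relations.items():
        keys = [a[1:] for a in attrs if a.startswith("*")]
        if keys:
            measures = [a for a in attrs if not a.startswith("*")]
            fact = {"name": name, "keys": keys, "measures": measures}
    for name, attrs in relations.items():
        # A relation is a dimension if its primary key is one of the fact table's keys.
        if fact and name != fact["name"] and attrs[0] in fact["keys"]:
            dimensions[name] = attrs
    return fact, dimensions


fact, dimensions = derive_star(relations)
print(fact)
print(dimensions)
```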
Step 4 Map Corporate model into a data warehouse
Data Mapping and Transformation
Step 5 Data Extraction and Load
Technical infrastructures should be in place to assist with
these middle phases of data mapping, transformation,
extracting and loading including:
1. Database administration expertise
2. Data transformation tool training / expertise
3. Update / refresh strategies
4. Load strategies
5. Operations / job scheduling
6. Quality assurance procedures
7. Capacity planning expertise
Step 6 Automating Data Management Process
A data warehouse has very bimodal usage.
Most data warehouses are online 16 to 22
hours per day in a read-only mode. The data
warehouse goes off-line for 2 to 8 hours in
the wee hours of the morning for data
loading, data indexing, data quality
assurance, and data release.
Step 7 Application Development-Creating starter set of reports
Reports for Executive Information Systems such as:
Is it worthwhile to stock so many individual sizes of certain
products?
Which items are cannibalized when I promote a particular
product like Absolut Vodka?
What are the top 10 items my competitors are selling that I
don't sell at all?
Which season sold the most Cognac last year?
Which product item was the most profitable in year 2001 in
Macau?
Which customer/outlet bought the most in terms of case sales in
year 2001?
What is the total gross profit in April this year?
Reading assignment
Data Mining: Concepts and Techniques, by
Jiawei Han and Micheline Kamber, Morgan
Kaufmann Publishers, 2nd edition, 2007,
Chapter 3 Data Warehouse and OLAP
Technology, pp.105-134
Lecture review question 4
Compare a database with a data warehouse in terms of
performance, user friendliness, capacity
planning, and data manipulation language
operations.
Tutorial Question 4
You are to design a data warehouse to track the sales of salad dressing products in
supermarkets at weekly intervals over a four-year period. It is a typical
consumer-goods marketing database. The salad dressing product category contains
14,000 items at the universal product code (UPC) level. Data are summarized for
each of 120 geographic areas (markets) in the United States, and are also
summarized for each of 208 weekly time periods spanning four years. The
tables are as follows:
Product Table (Product_id, Prod_Desc, Brand, Manufacturer, Pack, Class, Flavor, Size)
Sales Table (*Period_id, *Product_id, *Market_id, Units, Dollars, Discount, Selling_Price,
Large_Ads, Medium_Ads, Small_Ads)
Period Table (Period_id, Period_Desc, Quarter, Fiscal_Year, Calendar_Year, Agg_Level)
Market Table (Market_id, Market_Desc, District, Region)
Show a simple star schema design for the application.