Unit 1-1
Data Warehousing
Points to discuss
What is a Data Warehouse?
■ Defined in many different ways, but not rigorously.
■ A decision support database that is maintained
separately from the organization’s operational database
■ Supports information processing by providing a solid
platform of consolidated, historical data for analysis.
■ “A data warehouse is a subject-oriented, integrated, time-variant,
and nonvolatile collection of data in support of
management’s decision-making process.” (W. H. Inmon)
■ Data warehousing:
  ■ The process of constructing and using data warehouses
Data Warehouse: Subject-Oriented
Data Warehouse: Integrated
Database (OLTP) vs. Data Warehouse (OLAP)
OLTP (Database) vs. OLAP (Data Warehouse)
Why a Separate Data Warehouse?
■ High performance for both systems
  ■ DBMS: tuned for OLTP (access methods, indexing,
    concurrency control, recovery)
  ■ Warehouse: tuned for OLAP (complex OLAP queries,
    multidimensional view, consolidation)
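To make the contrast concrete, the sketch below runs one OLTP-style statement and one OLAP-style query against the same in-memory SQLite table. The `orders` table, its columns, and the sample rows are hypothetical and are not taken from the slides. A real deployment would keep the two workloads on separate systems, which is the point of the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table used only for illustration.
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, year INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(1, "East", 2023, 100.0), (2, "West", 2023, 250.0),
     (3, "East", 2024, 300.0), (4, "West", 2024, 50.0)],
)

# OLTP-style work: a short transaction that touches one row through its key.
with conn:
    conn.execute("UPDATE orders SET amount = 120.0 WHERE order_id = 1")

# OLAP-style work: a read-only query that scans and consolidates many rows.
olap_result = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()
print(olap_result)
```

The update benefits from key-based access and concurrency control, while the aggregate benefits from scan-friendly storage and precomputed summaries. This is why each system is tuned separately.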
Data Warehouse: A Multi-Tiered Architecture
[Figure: multi-tiered data warehouse architecture, left to right]
■ Data Sources: operational DBs, other sources
■ Extract, Transform, Load, Refresh (with Monitor & Integrator and Metadata)
■ Data Storage: data warehouse, data marts
■ OLAP Engine: OLAP server
■ Front-End Tools: analysis, query, reports, data mining
Building a Data Warehouse
Extraction, Transformation, and Loading (ETL)
■ Data extraction
  ■ get data from multiple, heterogeneous, and
    external sources
■ Data cleaning
  ■ detect errors in the data and rectify them when
    possible
■ Data transformation
  ■ convert data from legacy or host format to
    warehouse format
■ Load
  ■ sort, summarize, consolidate, compute views,
    check integrity, and build indices and partitions
■ Refresh
  ■ propagate the updates from the data sources to
    the warehouse
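The steps above can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not a production ETL tool: the two sources, their field names, and the cleaning rules are all hypothetical.

```python
import csv
import io

# Hypothetical source 1: CSV export from a legacy system (all values are strings).
LEGACY_CSV = "id,city,amount\n1,Vancouver,10.50\n2,,7.25\n3,Toronto,abc\n"
# Hypothetical source 2: records from another application with different field names.
APP_ROWS = [{"ID": 4, "City": "toronto", "Amt": 5.0}]


def extract():
    """Extraction: pull rows from multiple heterogeneous sources."""
    legacy = list(csv.DictReader(io.StringIO(LEGACY_CSV)))
    return legacy, APP_ROWS


def clean(rows):
    """Cleaning: drop rows with a non-numeric amount, fill in missing cities."""
    good = []
    for r in rows:
        try:
            amount = float(r["amount"])
        except ValueError:
            continue  # error detected but not rectifiable, so the row is skipped
        good.append({"id": int(r["id"]), "city": r["city"] or "UNKNOWN", "amount": amount})
    return good


def transform(app_rows):
    """Transformation: convert the application format to the warehouse format."""
    return [
        {"id": r["ID"], "city": r["City"].title(), "amount": float(r["Amt"])}
        for r in app_rows
    ]


def load(rows):
    """Load: sort the rows and summarize the amount per city."""
    rows = sorted(rows, key=lambda r: r["id"])
    summary = {}
    for r in rows:
        summary[r["city"]] = summary.get(r["city"], 0.0) + r["amount"]
    return rows, summary


legacy, app = extract()
warehouse, by_city = load(clean(legacy) + transform(app))
print(by_city)
```

A refresh step would rerun the same functions on only the rows that changed at the sources since the previous load.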
Three Data Warehouse Models
■ Enterprise warehouse
  ■ collects all of the information about subjects
    spanning the entire organization
■ Data mart
  ■ a subset of corporate-wide data that is of value to
    specific groups of users. Its scope is confined
    to specific, selected groups, such as a marketing
    data mart
  ■ Independent vs. dependent (sourced directly from the warehouse)
    data mart
■ Virtual warehouse
  ■ A set of views over operational databases
  ■ Only some of the possible summary views may be materialized
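A virtual warehouse can be illustrated with an ordinary SQL view. The sketch below uses SQLite from Python. The `sales_ops` table and the `dept_sales` view are hypothetical names chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical operational table.
conn.execute("CREATE TABLE sales_ops (sale_id INTEGER PRIMARY KEY, dept TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales_ops VALUES (?, ?, ?)",
    [(1, "marketing", 40.0), (2, "marketing", 60.0), (3, "finance", 25.0)],
)

# Virtual warehouse: a summary view defined over the operational table.
# No data is copied; the view is only a stored query.
conn.execute(
    "CREATE VIEW dept_sales AS "
    "SELECT dept, SUM(amount) AS total FROM sales_ops GROUP BY dept"
)

# A new operational row is visible through the view immediately.
conn.execute("INSERT INTO sales_ops VALUES (4, 'finance', 75.0)")
view_rows = conn.execute("SELECT dept, total FROM dept_sales ORDER BY dept").fetchall()
print(view_rows)
```

SQLite views are not materialized, so every query against `dept_sales` recomputes the summary on the operational data. This is easy to build but puts the analytical load on the operational database server.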
Metadata Repository
■ Metadata is the data defining warehouse objects. It stores:
  ■ Description of the structure of the data warehouse
    ■ schema, view, dimensions, hierarchies, derived data definitions,
      data mart locations and contents
  ■ Operational metadata
    ■ data lineage (history of migrated data and transformation
      path), currency of data (active, archived, or purged),
      monitoring information (warehouse usage statistics, error
      reports, audit trails)
  ■ The algorithms used for summarization
  ■ The mapping from operational environment to the data
    warehouse
  ■ Data related to system performance
    ■ warehouse schema, view and derived data definitions
  ■ Business data
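As a rough illustration of what a metadata entry might hold, the sketch below records structure, currency, and lineage for one warehouse table. The class names, field names, and sample values are all hypothetical. Real metadata repositories are considerably richer.

```python
from dataclasses import dataclass, field


@dataclass
class LineageStep:
    source: str          # where the data came from
    transformation: str  # what was done to it on the way in


@dataclass
class TableMetadata:
    name: str
    columns: list
    currency: str = "active"                     # active, archived, or purged
    lineage: list = field(default_factory=list)  # transformation path, oldest first


repo = {}  # hypothetical in-memory metadata repository keyed by table name

meta = TableMetadata(name="sales_fact", columns=["time_key", "item_key", "dollars_sold"])
meta.lineage.append(LineageStep("orders_db.orders", "filter cancelled orders"))
meta.lineage.append(LineageStep("staging.orders", "sum amount per day and item"))
repo[meta.name] = meta

path = [step.transformation for step in repo["sales_fact"].lineage]
print(path)
```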
Relational Database Technology for Data Warehouses
■ Random partitioning
■ Intelligent partitioning
Multidimensional Data
[Figure: dimension hierarchies; recoverable labels: Office, Day, Month]
From Tables and Spreadsheets to Data Cubes
■ A data warehouse is based on a multidimensional data
  model, which views data in the form of a data cube
■ A data cube, such as sales, allows data to be modeled and
  viewed in multiple dimensions
  ■ Dimension tables, such as item (item_name, brand,
    type), or time (day, week, month, quarter, year)
  ■ Fact table contains measures (such as dollars_sold)
    and keys to each of the related dimension tables
■ In data warehousing literature, an n-D base cube is called a
  base cuboid. The topmost 0-D cuboid, which holds the
  highest level of summarization, is called the apex cuboid.
  The lattice of cuboids forms a data cube.
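A cuboid is a group-by over some subset of the dimensions. The sketch below computes a base cuboid, one 1-D cuboid, and the apex cuboid from a handful of fact rows. The rows and the three-dimension layout are hypothetical, and `sum` over `dollars_sold` is assumed as the measure.

```python
from collections import defaultdict

# Hypothetical fact rows: (time, item, location, dollars_sold).
FACTS = [
    ("Q1", "TV", "Vancouver", 100),
    ("Q1", "PC", "Vancouver", 200),
    ("Q2", "TV", "Toronto", 150),
    ("Q2", "TV", "Vancouver", 50),
]
DIMS = ("time", "item", "location")


def cuboid(group_by):
    """Sum dollars_sold grouped by the given subset of dimensions."""
    idx = [DIMS.index(d) for d in group_by]
    totals = defaultdict(int)
    for row in FACTS:
        key = tuple(row[i] for i in idx)
        totals[key] += row[3]
    return dict(totals)


base = cuboid(DIMS)          # 3-D base cuboid: no summarization
by_item = cuboid(("item",))  # one of the 1-D cuboids
apex = cuboid(())            # 0-D apex cuboid: a single grand total
print(by_item, apex)
```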
Cube: A Lattice of Cuboids
[Figure: lattice of cuboids for the dimensions time, item, location, supplier]
0-D (apex) cuboid: all
1-D cuboids: time; item; location; supplier
2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
4-D (base) cuboid: time, item, location, supplier
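The lattice can be enumerated mechanically: every subset of the dimension set is one cuboid, so n dimensions without concept hierarchies give 2^n cuboids. A short sketch using only the dimension names from the slide:

```python
from itertools import combinations

dims = ["time", "item", "location", "supplier"]

# Group the cuboids by dimensionality, from the 0-D apex to the n-D base.
lattice = {k: list(combinations(dims, k)) for k in range(len(dims) + 1)}
total = sum(len(cuboids) for cuboids in lattice.values())

for k, cuboids in lattice.items():
    print(f"{k}-D:", [",".join(c) or "all" for c in cuboids])
print("total cuboids:", total)
```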
Conceptual Modeling of Data Warehouses
■ Modeling data warehouses: dimensions &
  measures
  ■ Star schema: A fact table in the middle
    connected to a set of dimension tables
  ■ Snowflake schema: A refinement of star
    schema where some dimensional hierarchy is
    normalized into a set of smaller dimension
    tables, forming a shape similar to a snowflake
  ■ Fact constellations: Multiple fact tables share
    dimension tables, viewed as a collection of
    stars, therefore called galaxy schema or fact
    constellation
Example of Star Schema
[Figure: star schema for sales]
■ Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
■ time: time_key, day, day_of_the_week, month, quarter, year
■ item: item_key, item_name, brand, type, supplier_type
■ branch: branch_key, branch_name, branch_type
■ location: location_key, street, city, state_or_province, country
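The star schema in the figure can be written down as SQL tables. The sketch below builds it in an in-memory SQLite database and runs one star join. Table and column names follow the figure, but a few columns are omitted for brevity, the sample rows are invented, and `avg_sales` is left out on the assumption that it can be derived from the other measures.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time     (time_key INTEGER PRIMARY KEY, day INTEGER, month INTEGER, quarter TEXT, year INTEGER);
CREATE TABLE item     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT, supplier_type TEXT);
CREATE TABLE branch   (branch_key INTEGER PRIMARY KEY, branch_name TEXT, branch_type TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT, state_or_province TEXT, country TEXT);
-- Fact table: one foreign key per dimension plus the numeric measures.
CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES time(time_key),
    item_key     INTEGER REFERENCES item(item_key),
    branch_key   INTEGER REFERENCES branch(branch_key),
    location_key INTEGER REFERENCES location(location_key),
    units_sold   INTEGER,
    dollars_sold REAL
);
INSERT INTO time VALUES (1, 5, 1, 'Q1', 2024), (2, 9, 4, 'Q2', 2024);
INSERT INTO item VALUES (1, 'TV-100', 'Acme', 'TV', 'wholesale');
INSERT INTO branch VALUES (1, 'B1', 'retail');
INSERT INTO location VALUES (1, '1 Main St', 'Vancouver', 'BC', 'Canada');
INSERT INTO sales_fact VALUES (1, 1, 1, 1, 2, 800.0), (2, 1, 1, 1, 1, 400.0), (1, 1, 1, 1, 3, 1200.0);
""")

# Star join: the fact table joins directly to each dimension it needs.
rows = conn.execute("""
    SELECT t.quarter, l.city, SUM(f.units_sold), SUM(f.dollars_sold)
    FROM sales_fact AS f
    JOIN time AS t     ON f.time_key = t.time_key
    JOIN location AS l ON f.location_key = l.location_key
    GROUP BY t.quarter, l.city
    ORDER BY t.quarter
""").fetchall()
print(rows)
```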
Example of Snowflake Schema
[Figure: snowflake schema for sales]
■ Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
■ time: time_key, day, day_of_the_week, month, quarter, year
■ item: item_key, item_name, brand, type, supplier_key
■ supplier: supplier_key, supplier_type
■ branch: branch_key, branch_name, branch_type
■ location: location_key, street, city_key
■ city: city_key, city, state_or_province, country
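In the snowflake version, the location dimension is normalized so that city details live in their own table. The sketch below shows only that part of the schema, with invented rows, to illustrate the extra join this costs at query time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- location is normalized: city attributes move to a separate city table.
CREATE TABLE city (city_key INTEGER PRIMARY KEY, city TEXT, state_or_province TEXT, country TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, street TEXT,
                       city_key INTEGER REFERENCES city(city_key));
CREATE TABLE sales_fact (location_key INTEGER REFERENCES location(location_key), dollars_sold REAL);
INSERT INTO city VALUES (1, 'Vancouver', 'BC', 'Canada'), (2, 'Chicago', 'IL', 'USA');
INSERT INTO location VALUES (10, '1 Main St', 1), (11, '2 Oak Ave', 1), (12, '3 Lake Dr', 2);
INSERT INTO sales_fact VALUES (10, 100.0), (11, 50.0), (12, 70.0);
""")

# Reaching country now takes two joins (fact -> location -> city) instead of one.
by_country = conn.execute("""
    SELECT c.country, SUM(f.dollars_sold)
    FROM sales_fact AS f
    JOIN location AS l ON f.location_key = l.location_key
    JOIN city AS c     ON l.city_key = c.city_key
    GROUP BY c.country
    ORDER BY c.country
""").fetchall()
print(by_country)
```

The normalized form stores each city's province and country once, at the price of a longer join path.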
Example of Fact Constellation
[Figure: fact constellation for sales and shipping]
■ Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
■ Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location; measures: dollars_cost, units_shipped
■ time: time_key, day, day_of_the_week, month, quarter, year
■ item: item_key, item_name, brand, type, supplier_type
■ branch: branch_key, branch_name, branch_type
■ location: location_key, street, city, province_or_state, country
■ shipper: shipper_key, shipper_name, location_key, shipper_type
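Because both fact tables reference the same item dimension, their measures can be lined up side by side. The sketch below does this with plain Python structures. The rows are invented and only the `item_key` link from the figure is modeled.

```python
from collections import defaultdict

# Shared dimension table (hypothetical rows): item_key -> item_name.
ITEM = {1: "TV", 2: "PC"}
# Two fact tables that both reference item_key.
SALES = [(1, 5, 2000.0), (2, 1, 900.0), (1, 2, 800.0)]  # (item_key, units_sold, dollars_sold)
SHIPPING = [(1, 6, 120.0), (2, 1, 30.0)]                # (item_key, units_shipped, dollars_cost)


def total_by_item(fact_rows, measure_index):
    """Sum one measure of a fact table, grouped by the shared item dimension."""
    totals = defaultdict(float)
    for row in fact_rows:
        totals[ITEM[row[0]]] += row[measure_index]
    return dict(totals)


sold = total_by_item(SALES, 1)
shipped = total_by_item(SHIPPING, 1)

# Combine measures from both fact tables on the shared dimension.
report = {name: (sold.get(name, 0.0), shipped.get(name, 0.0)) for name in ITEM.values()}
print(report)
```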
A Concept Hierarchy: Dimension (location)
[Figure: concept hierarchy for the location dimension, with "all" at the top level]
Data Cube Measures: Three Categories
[Figure: sample data cube; recoverable labels: Product (TV, PC, VCR), Country (U.S.A., Canada, Mexico), sum]
Cuboids Corresponding to the Cube
[Figure: lattice of cuboids for the dimensions product, date, country]
0-D (apex) cuboid: all
1-D cuboids: product; date; country
2-D cuboids: product,date; product,country; date,country
3-D (base) cuboid: product, date, country
Typical OLAP Operations
■ Roll up (drill-up): summarize data
  ■ by climbing up a hierarchy or by dimension
    reduction
■ Drill down (roll down): reverse of roll-up
  ■ from higher-level summary to lower-level
    summary or detailed data, or introducing new
    dimensions
■ Slice and dice: project and select
■ Pivot (rotate):
  ■ reorient the cube for visualization, e.g., 3D to a series of 2D
    planes
■ Other operations
  ■ drill across: involving (across) more than one fact
    table
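These operations can be sketched on a tiny cube held in a dictionary. The cube contents, the city-to-country hierarchy, and the function names below are all hypothetical, and `sum` is assumed as the aggregate.

```python
from collections import defaultdict

# Hypothetical base cuboid: (quarter, item, city) -> dollars_sold.
CUBE = {
    ("Q1", "TV", "Vancouver"): 100, ("Q1", "PC", "Vancouver"): 200,
    ("Q1", "TV", "Toronto"): 80,    ("Q2", "TV", "Vancouver"): 50,
    ("Q2", "PC", "Chicago"): 70,
}
CITY_TO_COUNTRY = {"Vancouver": "Canada", "Toronto": "Canada", "Chicago": "USA"}


def roll_up_location(cube):
    """Roll up by climbing the location hierarchy from city to country."""
    out = defaultdict(int)
    for (quarter, item, city), value in cube.items():
        out[(quarter, item, CITY_TO_COUNTRY[city])] += value
    return dict(out)


def drop_dimension(cube, axis):
    """Roll up by dimension reduction: aggregate one dimension away."""
    out = defaultdict(int)
    for key, value in cube.items():
        out[key[:axis] + key[axis + 1:]] += value
    return dict(out)


def slice_(cube, axis, value):
    """Slice: select a single value on one dimension."""
    return {k: v for k, v in cube.items() if k[axis] == value}


def dice(cube, allowed):
    """Dice: select on several dimensions; allowed maps axis -> set of values."""
    return {k: v for k, v in cube.items()
            if all(k[axis] in values for axis, values in allowed.items())}


def pivot(cube, order):
    """Pivot: reorder the dimensions without changing any values."""
    return {tuple(k[i] for i in order): v for k, v in cube.items()}


by_country = roll_up_location(CUBE)
q1 = slice_(CUBE, 0, "Q1")
tv_canada = dice(by_country, {1: {"TV"}, 2: {"Canada"}})
no_location = drop_dimension(CUBE, 2)
rotated = pivot(CUBE, (2, 1, 0))
print(by_country)
```

Drill down is the reverse of `roll_up_location`: it cannot be computed from `by_country` alone and requires going back to the more detailed `CUBE`.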
Data Warehouse Design
Design of Data Warehouse: A Business Analysis Framework
■ Four views regarding the design of a data
  warehouse
  ■ Top-down view
    ■ allows selection of the relevant information necessary
      for the data warehouse
  ■ Data source view
    ■ exposes the information being captured, stored, and
      managed by operational systems
  ■ Data warehouse view
    ■ consists of fact tables and dimension tables
  ■ Business query view
    ■ sees the perspectives of data in the warehouse from
      the view of the end user
Data Warehouse Design Process
■ Top-down, bottom-up approaches or a combination of
  both
  ■ Top-down: starts with overall design and planning
    (mature)
  ■ Bottom-up: starts with experiments and prototypes (rapid)
■ From a software engineering point of view
  ■ Waterfall: structured and systematic analysis at each step
    before proceeding to the next
  ■ Spiral: rapid generation of increasingly functional
    systems, with short turnaround time
■ Typical data warehouse design process
  ■ Choose a business process to model, e.g., orders, invoices,
    etc.
  ■ Choose the grain (atomic level of data) of the business
    process
Data Warehouse Development: A Recommended Approach
[Figure: recommended development approach; recoverable labels: Multi-Tier Data Warehouse, Distributed Data Marts]
Data Warehouse Usage
■ Three kinds of data warehouse applications
  ■ Information processing
    ■ supports querying, basic statistical analysis, and
      reporting using crosstabs, tables, charts and graphs
  ■ Analytical processing
    ■ multidimensional analysis of data warehouse data
    ■ supports basic OLAP operations: slice-dice, drilling,
      pivoting
  ■ Data mining
    ■ knowledge discovery from hidden patterns
    ■ supports associations, constructing analytical models,
      performing classification and prediction, and
      presenting the mining results using visualization tools
Types of OLAP Servers
OLAP Server Architectures