Unit 2: Data Warehouse
— Chapter 3 —
Data Warehousing and OLAP Technology:
An Overview
(Han and Kamber : 2nd Edition)
What is a Data Warehouse?
Defined in many different ways, but not rigorously.
A decision support database that is maintained separately from
the organization’s operational database
Supports information processing by providing a solid platform of
consolidated, historical data for analysis.
“A data warehouse is a subject-oriented, integrated, time-variant,
and nonvolatile collection of data in support of management’s
decision-making process.”—W. H. Inmon
Data warehousing:
The process of constructing and using data warehouses
Data Warehouse—Subject-Oriented
Data Warehouse—Integrated
records
Data cleaning and data integration techniques are applied.
Ensure consistency in naming conventions, attribute
Data Warehouse—Time Variant
Data Warehouse—Nonvolatile
Other names for Data warehouse
Write 7 differences between operational database systems
and data warehouses
OLTP OLAP
OLTP vs OLAP
2. Data Contents
3. Database Design
4. View
5. Access Patterns
OLTP vs. OLAP
                    OLTP                            OLAP
users               clerk, IT professional          knowledge worker
function            day-to-day operations           decision support
DB design           application-oriented            subject-oriented
data                current, up-to-date,            historical,
                    detailed, flat relational,      summarized, multidimensional,
                    isolated                        integrated, consolidated
usage               repetitive                      ad hoc
access              read/write,                     lots of scans
                    index/hash on primary key
unit of work        short, simple transaction       complex query
# records accessed  tens                            millions
# users             thousands                       hundreds
DB size             100 MB to GB                    100 GB to TB
process             online transactional system     online analysis and data retrieving process
Why a Separate Data Warehouse?
High performance for both systems
DBMS: tuned for OLTP (access methods, indexing, concurrency
control, recovery)
Warehouse: tuned for OLAP (complex OLAP queries,
multidimensional view, consolidation)
Different functions and different data:
missing data: decision support requires historical data, which
operational DBs do not typically maintain
data consolidation: decision support requires consolidation (aggregation,
summarization) of data from heterogeneous sources
data quality: different sources typically use inconsistent data
representations, codes, and formats, which have to be reconciled
Data Warehouse: A Multi-Tiered Architecture
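[Figure: multi-tiered architecture. Operational DBs and other sources are extracted, transformed, loaded, and refreshed into the data warehouse and data marts (with metadata and a monitor & integrator); an OLAP server serves the analysis, query, reports, and data mining tools.]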
Data transformation
convert data from legacy or host format to warehouse format
Load
sort, summarize, consolidate, compute views, check integrity,
and build indices and partitions
Refresh
propagate the updates from the data sources to the warehouse
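The three back-end steps above (transformation, load, refresh) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the legacy row layout, the field names, and the in-memory "warehouse" dictionary are all hypothetical and do not come from the slides.

```python
from collections import defaultdict

# Hypothetical legacy rows: dates as DD/MM/YYYY strings, amounts as text.
legacy_rows = [
    {"sale_date": "03/01/2024", "item": "TV", "amount": "499.50"},
    {"sale_date": "03/01/2024", "item": "TV", "amount": "500.50"},
    {"sale_date": "04/01/2024", "item": "Radio", "amount": "40.00"},
]

def transform(row):
    """Convert one legacy row to the (assumed) warehouse format."""
    day, month, year = row["sale_date"].split("/")
    return {
        "date": f"{year}-{month}-{day}",  # ISO date string
        "item": row["item"].lower(),
        "amount": float(row["amount"]),
    }

def load(warehouse, rows):
    """Sort rows, check integrity, store detail, and maintain a daily summary."""
    for r in sorted(rows, key=lambda r: (r["date"], r["item"])):
        if r["amount"] < 0:  # simple integrity check
            raise ValueError(f"negative amount: {r}")
        warehouse["detail"].append(r)
        warehouse["daily_total"][r["date"]] += r["amount"]

def refresh(warehouse, new_legacy_rows):
    """Propagate new source rows into the warehouse."""
    load(warehouse, [transform(r) for r in new_legacy_rows])

warehouse = {"detail": [], "daily_total": defaultdict(float)}
load(warehouse, [transform(r) for r in legacy_rows])
refresh(warehouse, [{"sale_date": "04/01/2024", "item": "TV", "amount": "10.00"}])
print(dict(warehouse["daily_total"]))
```

A real warehouse loader would also build indices and partitions; those steps are left out here to keep the sketch short.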
Metadata Repository
Metadata is the data defining warehouse objects. It stores:
Three types of data warehouse
(Architectural point of view)
Enterprise warehouse:
Information about subjects spanning the entire organization.
Data mart:
A subset of corporate-wide data
Virtual warehouse:
Set of views over operational databases.
Three types of data warehouse
(Architectural point of view)
Enterprise warehouse: An enterprise warehouse collects all of the
information about subjects spanning the entire organization. It provides
corporate-wide data integration. Typically contains detailed data as well
as summarized data, and can range in size from a few gigabytes to
hundreds of gigabytes, terabytes, or beyond. It requires extensive
business modeling and may take years to design and build.
Data mart: A data mart contains a subset of corporate-wide data that
is of value to a specific group of users. The scope is confined to specific
selected subjects. For example, a marketing data mart may confine its
subjects to customer, item, and sales. The data contained in data marts
tends to be summarized. Data marts are usually implemented on low-cost
departmental servers.
Virtual warehouse: A virtual warehouse is a set of views over
operational databases. For efficient query processing, only some of the
possible summary views may be materialized. A virtual warehouse is
easy to build but requires excess capacity on operational database
servers.
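A virtual warehouse can be sketched with Python's built-in sqlite3 module: a summary view over an operational table, plus one view whose result is materialized into a table. The table, view, and column names (orders, sales_by_city, sales_by_city_mat) are hypothetical and chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical operational table; names are illustrative only.
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, item TEXT, city TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (item, city, amount) VALUES (?, ?, ?)",
    [("TV", "Vancouver", 500.0), ("TV", "Toronto", 450.0), ("Radio", "Vancouver", 40.0)],
)

# The "virtual warehouse": a summary view, recomputed from operational data on each query.
conn.execute("""
    CREATE VIEW sales_by_city AS
    SELECT city, SUM(amount) AS dollars_sold, COUNT(*) AS units_sold
    FROM orders GROUP BY city
""")

# Materializing the view: store its current result as a table.
conn.execute("CREATE TABLE sales_by_city_mat AS SELECT * FROM sales_by_city")

# A new operational row is visible through the view but not in the materialized copy.
conn.execute("INSERT INTO orders (item, city, amount) VALUES ('Radio', 'Toronto', 50.0)")
live = conn.execute("SELECT city, dollars_sold FROM sales_by_city ORDER BY city").fetchall()
stale = conn.execute("SELECT city, dollars_sold FROM sales_by_city_mat ORDER BY city").fetchall()
print(live)
print(stale)
```

The view always runs against the operational table, which is why a virtual warehouse needs spare capacity on the operational servers; the materialized copy answers queries without that load but must be refreshed.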
Types of OLAP Servers
The three types of OLAP servers are:
1. Relational OLAP (ROLAP)
2. Multidimensional OLAP (MOLAP)
3. Hybrid OLAP (HOLAP)
OLAP Server Architectures
[Figure: lattice of cuboids, from the 0-D (apex) cuboid "all" down to the 3-D cuboids (time, item, location), (time, item, supplier), (time, location, supplier), and (item, location, supplier).]
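The lattice of cuboids can be enumerated directly: each cuboid corresponds to one subset of the dimensions. The sketch below assumes the four dimensions named in the figure (time, item, location, supplier) and ignores concept hierarchies within a dimension; the function name cuboid_lattice is hypothetical.

```python
from itertools import combinations

dimensions = ["time", "item", "location", "supplier"]

def cuboid_lattice(dims):
    """Return {k: [cuboids with k dimensions]} for every subset of dims."""
    return {k: list(combinations(dims, k)) for k in range(len(dims) + 1)}

lattice = cuboid_lattice(dimensions)
for k, cuboids in lattice.items():
    # The empty subset is the apex cuboid, conventionally labelled "all".
    labels = [",".join(c) if c else "all" for c in cuboids]
    print(f"{k}-D: {labels}")

total = sum(len(cuboids) for cuboids in lattice.values())
print(total)  # 2**4 = 16 cuboids for 4 dimensions
```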
Conceptual Modeling of Data Warehouses
Example of Star Schema
Dimension table time: time_key, day, day_of_the_week, month, quarter, year
Dimension table item: item_key, item_name, brand, type, supplier_type
Dimension table branch: branch_key, branch_name, branch_type
Dimension table location: location_key, street, city, state_or_province, country
Sales Fact Table: time_key, item_key, branch_key, location_key
Measures: units_sold, dollars_sold, avg_sales
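A star schema like this one can be tried out with Python's built-in sqlite3 module. The sketch below is a reduced version under stated assumptions: only the time and item dimensions are created, the table names (time_dim, item_dim, sales_fact) and sample rows are invented, and avg_sales is omitted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, day INTEGER, month INTEGER,
                       quarter TEXT, year INTEGER);
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT);
-- Fact table: one foreign key per dimension, plus the measures.
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim(time_key),
    item_key INTEGER REFERENCES item_dim(item_key),
    units_sold INTEGER,
    dollars_sold REAL
);
INSERT INTO time_dim VALUES (1, 15, 1, 'Q1', 2024), (2, 20, 4, 'Q2', 2024);
INSERT INTO item_dim VALUES (10, 'TV 40in', 'Acme', 'television'), (11, 'Radio', 'Acme', 'audio');
INSERT INTO sales_fact VALUES (1, 10, 2, 1000.0), (1, 11, 5, 200.0), (2, 10, 1, 500.0);
""")

# Star join: the fact table joins directly to each dimension table it needs.
rows = conn.execute("""
    SELECT t.quarter, i.type, SUM(f.units_sold), SUM(f.dollars_sold)
    FROM sales_fact f
    JOIN time_dim t ON t.time_key = f.time_key
    JOIN item_dim i ON i.item_key = f.item_key
    GROUP BY t.quarter, i.type
    ORDER BY t.quarter, i.type
""").fetchall()
for row in rows:
    print(row)
```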
Example of Snowflake Schema
Dimension table time: time_key, day, day_of_the_week, month, quarter, year
Dimension table item: item_key, item_name, brand, type, supplier_key
Dimension table supplier: supplier_key, supplier_type
Dimension table branch: branch_key, branch_name, branch_type
Dimension table location: location_key, street, city_key
Dimension table city: city_key, city, state_or_province, country
Sales Fact Table: time_key, item_key, branch_key, location_key
Measures: units_sold, dollars_sold, avg_sales
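The step from a star to a snowflake schema is a normalization of dimension tables. The sketch below splits the city attributes out of a denormalized location table, mirroring the location and city tables listed here; the sample rows and the helper names (snowflake, country_of) are hypothetical.

```python
# Star form: one denormalized location table keyed by location_key.
location_star = {
    1: {"street": "1 Main St", "city": "Vancouver", "state_or_province": "BC", "country": "Canada"},
    2: {"street": "9 Oak Ave", "city": "Vancouver", "state_or_province": "BC", "country": "Canada"},
}

CITY_FIELDS = ("city", "state_or_province", "country")

def snowflake(location_table):
    """Split city attributes into their own table, linked by city_key."""
    city_keys = {}       # city attribute tuple -> city_key
    city_table = {}      # city_key -> city attributes
    normalized = {}      # location_key -> street + city_key
    for loc_key, row in location_table.items():
        attrs = tuple(row[f] for f in CITY_FIELDS)
        if attrs not in city_keys:
            city_keys[attrs] = len(city_keys) + 1
            city_table[city_keys[attrs]] = dict(zip(CITY_FIELDS, attrs))
        normalized[loc_key] = {"street": row["street"], "city_key": city_keys[attrs]}
    return normalized, city_table

location_sf, city = snowflake(location_star)

def country_of(location_key):
    """Snowflake lookup: location -> city -> country (one extra hop)."""
    return city[location_sf[location_key]["city_key"]]["country"]

print(location_sf)
print(city)
print(country_of(2))
```

The redundancy in the city attributes disappears (one city row instead of two copies), at the cost of an extra lookup or join at query time.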
Example of Fact Constellation
Dimension table time: time_key, day, day_of_the_week, month, quarter, year
Dimension table item: item_key, item_name, brand, type, supplier_type
Sales Fact Table: time_key, item_key, branch_key, ...
Shipping Fact Table: time_key, item_key, shipper_key, from_location, ...
Snowflake Schema- DMQL
define cube sales_snowflake [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type,
    supplier(supplier_key, supplier_type))
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street,
    city(city_key, city, province_or_state, country))
Fact Constellation- DMQL
define cube shipping [time, item, shipper, from_location, to_location]:
    dollars_cost = sum(cost_in_dollars), units_shipped = count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper_key, shipper_name,
    location as location in cube sales, shipper_type)
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales
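The phrase "as item in cube sales" means the shipping cube reuses a dimension already defined for the sales cube. A minimal Python sketch of that sharing follows; the sample rows and the helper total_by_type are hypothetical, and only the item dimension is shown.

```python
# One shared dimension table, referenced by both fact tables.
item_dim = {
    10: {"item_name": "TV 40in", "type": "television"},
    11: {"item_name": "Radio", "type": "audio"},
}

# Two fact tables (hypothetical rows), each with its own measures.
sales_fact = [
    {"item_key": 10, "dollars_sold": 1000.0},
    {"item_key": 11, "dollars_sold": 200.0},
    {"item_key": 10, "dollars_sold": 500.0},
]
shipping_fact = [
    {"item_key": 10, "dollars_cost": 50.0},
    {"item_key": 11, "dollars_cost": 5.0},
]

def total_by_type(fact_rows, measure):
    """Sum one measure per item type, resolving type via the shared dimension."""
    totals = {}
    for row in fact_rows:
        item_type = item_dim[row["item_key"]]["type"]
        totals[item_type] = totals.get(item_type, 0.0) + row[measure]
    return totals

sold = total_by_type(sales_fact, "dollars_sold")
shipped = total_by_type(shipping_fact, "dollars_cost")
print(sold)
print(shipped)
```

Because both fact tables resolve item_key against the same table, results from the two cubes can be compared by item type without reconciling two dimension definitions.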
Data Cube Measures: Three Categories
The cube created in the above query is the base cuboid of the
sales_star data cube.
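The three categories are usually named distributive, algebraic, and holistic (this naming follows Han and Kamber; the slide text here does not list them, so treat it as an assumption). A small sketch with invented partition data shows the difference: sum and count can be combined from per-partition results, average can be derived from those two, and median cannot be derived from per-partition medians.

```python
from statistics import median

# Hypothetical sales values, stored in two partitions.
partitions = [[10.0, 20.0, 30.0], [40.0, 100.0]]

# Distributive: combine per-partition results with the same function.
total_sum = sum(sum(p) for p in partitions)
total_count = sum(len(p) for p in partitions)

# Algebraic: computed from a fixed number of distributive results.
avg_sales = total_sum / total_count

# Holistic: needs all the data; combining partial medians gives a different answer.
all_values = [v for p in partitions for v in p]
true_median = median(all_values)
median_of_medians = median(median(p) for p in partitions)

print(total_sum, total_count, avg_sales)
print(true_median, median_of_medians)
```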
A Concept Hierarchy: Dimension (location)
Hierarchy and a lattice for the time dimension
Alternatively, the attributes of a dimension may be organized in a
partial order, forming a lattice.
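A partial order over the attributes of time can be written as a small graph. The ordering used below (day rolls up to week and to month, month to quarter, week and quarter to year) is the common textbook example and is an assumption here, since the figure itself is not reproduced; the names rolls_up_to and can_roll_up are hypothetical.

```python
# Each level maps to the levels it rolls up to directly (assumed ordering).
rolls_up_to = {
    "day": ["week", "month"],
    "week": ["year"],
    "month": ["quarter"],
    "quarter": ["year"],
    "year": [],
}

def can_roll_up(fine, coarse):
    """True if `fine` can be aggregated to `coarse` along some path."""
    stack = list(rolls_up_to[fine])
    while stack:
        level = stack.pop()
        if level == coarse:
            return True
        stack.extend(rolls_up_to[level])
    return False

print(can_roll_up("day", "year"))
print(can_roll_up("week", "month"))
```

Week and month are not comparable in this order (weeks can straddle month boundaries), which is what makes the structure a lattice rather than a single chain.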
Set-Grouping hierarchy
Concept hierarchies may also be defined by discretizing or grouping
values for a given dimension or attribute, resulting in a set-grouping
hierarchy.
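A set-grouping hierarchy can be sketched by discretizing a numeric attribute into ranges. The attribute (price), the bucket width of 100, and the half-open ranges [low, low + width) are assumptions made for this illustration only.

```python
from collections import Counter

def price_group(price, width=100):
    """Map a price to the label of its range [low, low + width)."""
    low = int(price // width) * width
    return f"${low}-${low + width}"

prices = [35.0, 99.99, 100.0, 250.0, 275.5]  # hypothetical values
groups = Counter(price_group(p) for p in prices)

print(price_group(250.0))
print(dict(groups))
```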
OLAP Operations in DBMS
Fig. 3.10 Typical OLAP
Operations
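Two operations commonly listed among the typical OLAP operations, roll-up and slice, can be sketched on a tiny in-memory cube. The figure is not reproduced in this text, so the choice of operations, the cube cells, and the function names below are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical base cells: (quarter, city, item_type) -> dollars_sold
cube = {
    ("Q1", "Vancouver", "television"): 1000.0,
    ("Q1", "Toronto", "television"): 800.0,
    ("Q2", "Vancouver", "audio"): 200.0,
    ("Q2", "Toronto", "television"): 500.0,
}
city_to_country = {"Vancouver": "Canada", "Toronto": "Canada"}

def roll_up_location(cells):
    """Roll up the location dimension from city to country by summing."""
    out = defaultdict(float)
    for (quarter, city, item_type), value in cells.items():
        out[(quarter, city_to_country[city], item_type)] += value
    return dict(out)

def slice_quarter(cells, quarter):
    """Slice: fix one value of the time dimension, leaving a 2-D sub-cube."""
    return {
        (city, item_type): value
        for (q, city, item_type), value in cells.items()
        if q == quarter
    }

rolled = roll_up_location(cube)
q1 = slice_quarter(cube, "Q1")
print(rolled)
print(q1)
```

Drill-down is the reverse of roll-up and needs the finer-grained cells to be available, which is why the sketch starts from the city-level cube.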