
Unit2 Datawarehouse

This document provides an overview of data warehousing and OLAP technology. It defines a data warehouse as a subject-oriented, integrated, time-variant and non-volatile collection of data used to support management decision making. The document discusses key characteristics of data warehouses including their multi-tiered architecture, use of ETL processes, metadata repositories and different types of OLAP servers. It also compares operational databases with data warehouses.

Uploaded by

oggy wilson

DWDM

Unit 2

— Chapter 3 —
Data Warehousing and OLAP Technology:
An Overview
(Han and Kamber : 2nd Edition)

1
What is a Data Warehouse?
• Defined in many different ways, but not rigorously.
  ◦ A decision support database that is maintained separately from the organization's operational database.
  ◦ Supports information processing by providing a solid platform of consolidated, historical data for analysis.
• "A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." (W. H. Inmon)
• Data warehousing:
  ◦ The process of constructing and using data warehouses.

2
Data Warehouse: Subject-Oriented

• Organized around major subjects, such as customer, product, sales.
• Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.
• Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.

3
Data Warehouse: Integrated

• Constructed by integrating multiple, heterogeneous data sources
  ◦ relational databases, flat files, on-line transaction records
• Data cleaning and data integration techniques are applied.
  ◦ Ensure consistency in naming conventions, attribute measures, etc. among different data sources
  ◦ E.g., hotel price: currency, tax, breakfast covered, etc.
  ◦ When data is moved to the warehouse, it is converted.

4
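The hotel-price example on the slide above can be sketched in Python. This is only an illustration under stated assumptions: the two source layouts, the field names, the `HotelPrice` record and the exchange rate are all hypothetical and do not come from the slides.

```python
from dataclasses import dataclass

# Hypothetical exchange rates into the warehouse currency (USD); illustrative values only.
RATES_TO_USD = {"USD": 1.0, "EUR": 1.10}

@dataclass
class HotelPrice:
    hotel: str
    price_usd: float            # nightly price in USD, tax included
    breakfast_included: bool

def from_source_a(rec: dict) -> HotelPrice:
    # Source A (assumed layout): price in EUR, tax included, breakfast flag "Y"/"N".
    price = round(rec["price"] * RATES_TO_USD["EUR"], 2)
    return HotelPrice(rec["name"], price, rec["bf"] == "Y")

def from_source_b(rec: dict) -> HotelPrice:
    # Source B (assumed layout): price in USD before tax, separate tax rate, boolean flag.
    price = round(rec["net_price"] * (1 + rec["tax_rate"]), 2)
    return HotelPrice(rec["hotel_name"], price, rec["breakfast"])

a = from_source_a({"name": "Alpha", "price": 100.0, "bf": "Y"})
b = from_source_b({"hotel_name": "Beta", "net_price": 200.0, "tax_rate": 0.25, "breakfast": False})
```

After conversion both records share one naming convention, one currency and one tax treatment, which is the consistency the slide asks for.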
Data Warehouse: Time Variant

• The time horizon for the data warehouse is significantly longer than that of operational systems
  ◦ Operational database: current value data
  ◦ Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
• Every key structure in the data warehouse contains an element of time

5
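A minimal Python sketch of the time-variant idea: the operational store keeps only the current value, while the warehouse key carries a time element so history accumulates. The store names, the `record_balance` helper and the sample values are assumptions made for illustration.

```python
from datetime import date

# Operational store: one current value per key; each update overwrites the last.
operational = {}
# Warehouse store: the key includes a time element, so earlier values are kept.
warehouse = {}

def record_balance(customer, day, balance):
    operational[customer] = balance
    warehouse[(customer, day)] = balance

record_balance("c1", date(2023, 1, 31), 100)
record_balance("c1", date(2023, 2, 28), 250)

# Historical perspective for one customer, ordered by date.
history = sorted((d, b) for (c, d), b in warehouse.items() if c == "c1")
```

Here `operational` ends up holding only the latest balance, while `history` lists both month-end values.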
Data Warehouse: Nonvolatile

• A physically separate store of data transformed from the operational environment
• Operational update of data does not occur in the data warehouse environment
  ◦ Does not require transaction processing, recovery, and concurrency control mechanisms
  ◦ Requires only two operations in data accessing: initial loading of data and access of data

6
Other names for Data warehouse
Write 7 differences between operational database systems (OLTP) and data warehouses (OLAP).

OLTP vs OLAP

1. Users and System Orientation

2. Data Contents

3. Database Design

4. View

5. Access Patterns
OLTP vs. OLAP

                     OLTP                              OLAP
users                clerk, IT professional            knowledge worker
function             day to day operations             decision support
DB design            application-oriented              subject-oriented
data                 current, up-to-date;              historical;
                     detailed, flat relational;        summarized, multidimensional;
                     isolated                          integrated, consolidated
usage                repetitive                        ad-hoc
access               read/write;                       lots of scans
                     index/hash on prim. key
unit of work         short, simple transaction         complex query
# records accessed   tens                              millions
# users              thousands                         hundreds
DB size              100MB-GB                          100GB-TB
process              online transactional system       online analysis and data retrieving process

10
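The "access" and "unit of work" rows of the comparison can be illustrated with Python's built-in sqlite3 module. This is a sketch, not a benchmark: the `orders` table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "east", 10.0), (2, "west", 20.0), (3, "east", 30.0)])

# OLTP-style unit of work: a short read/write transaction on one row via the primary key.
with conn:
    conn.execute("UPDATE orders SET amount = amount + 5 WHERE order_id = ?", (2,))
oltp_row = conn.execute("SELECT amount FROM orders WHERE order_id = ?", (2,)).fetchone()

# OLAP-style unit of work: a read-only scan over all rows, summarized by a dimension.
olap_rows = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()
```

The first statement touches a single record; the second reads every record to produce a summary.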
Why a Separate Data Warehouse?
• High performance for both systems
  ◦ DBMS: tuned for OLTP (access methods, indexing, concurrency control, recovery)
  ◦ Warehouse: tuned for OLAP (complex OLAP queries, multidimensional view, consolidation)
• Different functions and different data:
  ◦ missing data: decision support requires historical data which operational DBs do not typically maintain
  ◦ data consolidation: decision support requires consolidation (aggregation, summarization) of data from heterogeneous sources
  ◦ data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled

11
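Data Warehouse: A Multi-Tiered Architecture

[Figure: multi-tiered architecture, shown left to right.
 Data Sources: operational DBs and other sources, passed through Extract / Transform / Load / Refresh.
 Data Storage: data warehouse and data marts, with a metadata repository and a monitor & integrator.
 OLAP Engine: OLAP server, which serves the front end.
 Front-End Tools: analysis, query, reports, data mining.]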


12
Extraction, Transformation, and Loading (ETL)
• Data extraction
  ◦ get data from multiple, heterogeneous, and external sources
• Data cleaning
  ◦ detect errors in the data and rectify them when possible
• Data transformation
  ◦ convert data from legacy or host format to warehouse format
• Load
  ◦ sort, summarize, consolidate, compute views, check integrity, and build indices and partitions
• Refresh
  ◦ propagate the updates from the data sources to the warehouse
13
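The ETL steps listed above can be sketched as a small Python pipeline. It is a toy illustration under stated assumptions: the in-memory sources, the record fields and the cleaning rules are hypothetical, and the refresh step is not shown.

```python
def extract(sources):
    # Extraction: gather raw records from several (here in-memory) sources.
    return [rec for src in sources for rec in src]

def clean(records):
    # Cleaning: drop records with a missing amount, strip stray whitespace from names.
    return [{**r, "item": r["item"].strip()} for r in records if r.get("amount") is not None]

def transform(records):
    # Transformation: convert to the warehouse format (lower-case names, float amounts).
    return [{"item": r["item"].lower(), "amount": float(r["amount"])} for r in records]

def load(records):
    # Load: sort and summarize (total amount per item).
    totals = {}
    for r in sorted(records, key=lambda r: r["item"]):
        totals[r["item"]] = totals.get(r["item"], 0.0) + r["amount"]
    return totals

source_a = [{"item": " TV ", "amount": "100"}, {"item": "Radio", "amount": None}]
source_b = [{"item": "tv", "amount": 50}, {"item": "radio", "amount": 20.5}]
warehouse_totals = load(transform(clean(extract([source_a, source_b]))))
```

The record with a missing amount is discarded during cleaning, and the two differently formatted "tv" records are consolidated during load.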
Metadata Repository
• Metadata is the data defining warehouse objects. It stores:
  ◦ Description of the structure of the data warehouse
  ◦ Operational metadata
  ◦ The algorithms used for summarization
  ◦ The mapping from the operational environment to the data warehouse
  ◦ Business data

14
Three types of data warehouse
(Architectural point of view)

• Enterprise warehouse: information about subjects spanning the entire organization.
• Data mart: a subset of corporate-wide data.
• Virtual warehouse: a set of views over operational databases.

15
Three types of data warehouse
(Architectural point of view)
• Enterprise warehouse: An enterprise warehouse collects all of the information about subjects spanning the entire organization. It provides corporate-wide data integration. It typically contains detailed data as well as summarized data, and can range in size from a few gigabytes to hundreds of gigabytes, terabytes, or beyond. It requires extensive business modeling and may take years to design and build.
• Data mart: A data mart contains a subset of corporate-wide data that is of value to a specific group of users. The scope is confined to specific selected subjects. For example, a marketing data mart may confine its subjects to customer, item, and sales. The data contained in data marts tend to be summarized. Data marts are usually implemented on low-cost departmental servers.
• Virtual warehouse: A virtual warehouse is a set of views over operational databases. For efficient query processing, only some of the possible summary views may be materialized. A virtual warehouse is easy to build but requires excess capacity on operational database servers.
16
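The virtual warehouse idea (views over operational databases, with only some summaries materialized) can be sketched with sqlite3. The table and view names are hypothetical. SQLite has no materialized views, so a table copy stands in for one here; that substitution is an assumption of this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE op_sales (sale_id INTEGER PRIMARY KEY, item TEXT, amount REAL);
INSERT INTO op_sales VALUES (1, 'tv', 100.0), (2, 'tv', 50.0), (3, 'radio', 20.0);

-- Virtual warehouse: a summary view defined directly over the operational table.
CREATE VIEW sales_by_item AS
    SELECT item, SUM(amount) AS total FROM op_sales GROUP BY item;

-- Stand-in for a materialized summary view: a stored copy of the view's result.
CREATE TABLE sales_by_item_mat AS SELECT * FROM sales_by_item;
""")

# A later operational insert is visible through the view but not in the stored copy.
conn.execute("INSERT INTO op_sales VALUES (4, 'radio', 5.0)")
live = dict(conn.execute("SELECT item, total FROM sales_by_item"))
snapshot = dict(conn.execute("SELECT item, total FROM sales_by_item_mat"))
```

Every query against the view runs on the operational server, which is why the slide notes that a virtual warehouse needs excess capacity there.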
Types of OLAP Servers
The three types of OLAP servers are:
1. Relational OLAP (ROLAP)
2. Multidimensional OLAP (MOLAP)
3. Hybrid OLAP (HOLAP)
OLAP Server Architectures

• Relational OLAP (ROLAP)
  ◦ Uses a relational or extended-relational DBMS to store and manage warehouse data, plus OLAP middleware
  ◦ Includes optimization of the DBMS backend, implementation of aggregation navigation logic, and additional tools and services
  ◦ Greater scalability
• Multidimensional OLAP (MOLAP)
  ◦ Sparse array-based multidimensional storage engine
  ◦ Fast indexing to pre-computed summarized data
• Hybrid OLAP (HOLAP) (e.g., Microsoft SQL Server)
  ◦ Flexibility, e.g., low level: relational, high level: array
• Specialized SQL servers (e.g., Red Brick)
  ◦ Specialized support for SQL queries over star/snowflake schemas
18
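The contrast between ROLAP (aggregate by scanning relational rows) and MOLAP (look up pre-computed summaries in a sparse array) can be sketched in plain Python. This is a conceptual toy, not how any real OLAP server is implemented; the sample rows and the "*" marker for "all values" are assumptions.

```python
from collections import defaultdict

# ROLAP-style: facts kept as relational rows; a summary is computed by scanning them.
rows = [("Q1", "tv", 100), ("Q1", "radio", 20), ("Q2", "tv", 150)]

def rolap_total(quarter):
    return sum(amount for q, _, amount in rows if q == quarter)

# MOLAP-style: sparse cube keyed by coordinates, with summaries pre-computed at load time.
# "*" is a hypothetical marker meaning "all values of this dimension".
cube = defaultdict(int)
for q, item, amount in rows:
    for key in ((q, item), (q, "*"), ("*", item), ("*", "*")):
        cube[key] += amount

def molap_total(quarter):
    return cube[(quarter, "*")]    # direct lookup, no scan
```

Only cells that actually hold data are stored, so the empty combination (Q2, radio) takes no space in `cube`.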
From Tables and Spreadsheets to Data Cubes
• A data warehouse is based on a multidimensional data model which views data in the form of a data cube
• A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions
  ◦ Dimension tables, such as item (item_name, brand, type), or time (day, week, month, quarter, year)
  ◦ Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables
• In data warehousing literature, an n-D base cube is called a base cuboid. The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.
19
Cube: A Lattice of Cuboids

0-D (apex) cuboid:  all
1-D cuboids:        time; item; location; supplier
2-D cuboids:        time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
3-D cuboids:        time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
4-D (base) cuboid:  time, item, location, supplier

20
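The lattice for the four dimensions above can be enumerated with a short Python sketch: each cuboid corresponds to one subset of the dimensions, so n dimensions (ignoring concept hierarchies) give 2^n cuboids. The function name `cuboids` is an assumption for illustration.

```python
from itertools import combinations

dimensions = ["time", "item", "location", "supplier"]

def cuboids(dims):
    # Yield every subset of dims, ordered from the 0-D (apex) to the n-D (base) cuboid.
    for k in range(len(dims) + 1):
        for combo in combinations(dims, k):
            yield combo

lattice = list(cuboids(dimensions))
two_d = [c for c in lattice if len(c) == 2]
```

The empty tuple plays the role of the apex cuboid ("all"), and the full tuple is the base cuboid.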
Conceptual Modeling of Data Warehouses

• Modeling data warehouses: dimensions & measures
  ◦ Star schema: a fact table in the middle connected to a set of dimension tables
  ◦ Snowflake schema: a refinement of the star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
  ◦ Fact constellations: multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation

21
Example of Star Schema

Sales Fact Table
  keys:     time_key, item_key, branch_key, location_key
  measures: units_sold, dollars_sold, avg_sales

Dimension tables
  time:     time_key, day, day_of_the_week, month, quarter, year
  item:     item_key, item_name, brand, type, supplier_type
  branch:   branch_key, branch_name, branch_type
  location: location_key, street, city, state_or_province, country

22
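One possible relational rendering of the star schema above, sketched with sqlite3. Column types and sample rows are assumptions, the time dimension is named `time_dim` here purely as a naming choice of this sketch, and the avg_sales measure is left out for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, day INTEGER, day_of_the_week TEXT,
                       month INTEGER, quarter TEXT, year INTEGER);
CREATE TABLE item     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT,
                       type TEXT, supplier_type TEXT);
CREATE TABLE branch   (branch_key INTEGER PRIMARY KEY, branch_name TEXT, branch_type TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT,
                       state_or_province TEXT, country TEXT);
-- Fact table in the middle: one foreign key per dimension, plus the measures.
CREATE TABLE sales (
    time_key     INTEGER REFERENCES time_dim(time_key),
    item_key     INTEGER REFERENCES item(item_key),
    branch_key   INTEGER REFERENCES branch(branch_key),
    location_key INTEGER REFERENCES location(location_key),
    units_sold   INTEGER,
    dollars_sold REAL
);
INSERT INTO time_dim VALUES (1, 5, 'Fri', 1, 'Q1', 2024), (2, 9, 'Tue', 4, 'Q2', 2024);
INSERT INTO item     VALUES (1, 'tv-42', 'Acme', 'tv', 'wholesale');
INSERT INTO branch   VALUES (1, 'b1', 'store');
INSERT INTO location VALUES (1, '1 Main St', 'Pune', 'MH', 'India');
INSERT INTO sales    VALUES (1, 1, 1, 1, 2, 800.0), (2, 1, 1, 1, 1, 400.0);
""")

# A typical star join: the fact table joined to one dimension, grouped by its attribute.
by_quarter = conn.execute("""
    SELECT t.quarter, SUM(s.dollars_sold)
    FROM sales s JOIN time_dim t ON s.time_key = t.time_key
    GROUP BY t.quarter ORDER BY t.quarter
""").fetchall()
```

Each dimension is a single denormalized table, so any attribute such as quarter or brand is one join away from the fact table.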
Example of Snowflake Schema

Sales Fact Table
  keys:     time_key, item_key, branch_key, location_key
  measures: units_sold, dollars_sold, avg_sales

Dimension tables
  time:     time_key, day, day_of_the_week, month, quarter, year
  item:     item_key, item_name, brand, type, supplier_key
  supplier: supplier_key, supplier_type
  branch:   branch_key, branch_name, branch_type
  location: location_key, street, city_key
  city:     city_key, city, state_or_province, country

23
Example of Fact Constellation

Sales Fact Table
  keys:     time_key, item_key, branch_key, location_key
  measures: units_sold, dollars_sold, avg_sales

Shipping Fact Table
  keys:     time_key, item_key, shipper_key, from_location, to_location
  measures: dollars_cost, units_shipped

Dimension tables
  time:     time_key, day, day_of_the_week, month, quarter, year
  item:     item_key, item_name, brand, type, supplier_type
  branch:   branch_key, branch_name, branch_type
  location: location_key, street, city, province_or_state, country
  shipper:  shipper_key, shipper_name, location_key, shipper_type

24
Q1. An operational system is which of the following?
A. A system that is used to run the business in real time and is based on historical data.
B. A system that is used to run the business in real time and is based on current data.
C. A system that is used to support decision making and is based on current data.
D. A system that is used to support decision making and is based on historical data.

Q2. A data warehouse is which of the following?
A. Can be updated by end users.
B. Contains numerous naming conventions and formats.
C. Organized around important subject areas.
D. Contains only current data.

1. Data about data is called ___.
2. ___ and ___ are the key to emerging Business Intelligence technologies.
3. Online Analytical Processing (OLAP) is a technology that is used to create ___ software.
4. OLAP supports ___ user access and multiple queries.
5. A data warehouse refers to a database that is maintained separately from an organization's operational databases. (True/False)
6. A data warehouse is usually constructed by integrating multiple heterogeneous sources. (True/False)
Star Schema in DMQL
Syntax:
• define cube <cube name> [<dimension list>]: <measure list>
• define dimension <dimension name> as (<attribute or dimension list>)

Example:
define cube sales_star [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)

27
Snowflake Schema in DMQL
define cube sales_snowflake [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier(supplier_key, supplier_type))
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city(city_key, city, province_or_state, country))

28
Fact Constellation in DMQL
define cube shipping [time, item, shipper, from_location, to_location]:
    dollars_cost = sum(cost_in_dollars), units_shipped = count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type)
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales

29
Data Cube Measures: Three Categories

• Distributive: if the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning
  ◦ E.g., count(), sum(), min(), max()
• Algebraic: if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function
  ◦ E.g., avg()
• Holistic: if there is no constant bound on the storage size needed to describe a subaggregate
  ◦ E.g., median(), mode()
30
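The three categories can be checked on a small example in Python. The data values and the way they are partitioned are arbitrary choices for this sketch.

```python
from statistics import median

data = [4, 8, 15, 16, 23, 42, 7]
partitions = [data[:3], data[3:5], data[5:]]

# Distributive: the sum of the per-partition sums equals the sum over all the data.
total = sum(sum(p) for p in partitions)

# Algebraic: avg() is computed from two distributive sub-aggregates, sum() and count().
sub = [(sum(p), len(p)) for p in partitions]
average = sum(s for s, _ in sub) / sum(n for _, n in sub)

# Holistic: combining per-partition medians does not, in general, give the true median.
median_of_medians = median(median(p) for p in partitions)
true_median = median(data)
```

For this data the per-partition medians are 8, 19.5 and 24.5, whose median is 19.5, while the true median of the whole data set is 15.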
Interpreting measures for data cubes

• Many measures of a data cube can be computed by relational aggregation operations.
• Ex. The star schema defined in DMQL aggregates its measures with the following SQL query:

    select s.time_key, s.item_key, s.branch_key, s.location_key,
           sum(s.number_of_units_sold * s.price), sum(s.number_of_units_sold)
    from time t, item i, branch b, location l, sales s
    where s.time_key = t.time_key and s.item_key = i.item_key
      and s.branch_key = b.branch_key and s.location_key = l.location_key
    group by s.time_key, s.item_key, s.branch_key, s.location_key

• The cube created in the above query is the base cuboid of the sales_star data cube.

31
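The base-cuboid query can be tried out with sqlite3. This is a sketch with hypothetical sample rows; the dimension tables are reduced to their keys, and the time table is called `time_dim` here as a naming choice of this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY);
CREATE TABLE item     (item_key INTEGER PRIMARY KEY);
CREATE TABLE branch   (branch_key INTEGER PRIMARY KEY);
CREATE TABLE location (location_key INTEGER PRIMARY KEY);
CREATE TABLE sales (time_key INTEGER, item_key INTEGER, branch_key INTEGER,
                    location_key INTEGER, number_of_units_sold INTEGER, price REAL);
INSERT INTO time_dim VALUES (1), (2);
INSERT INTO item     VALUES (1);
INSERT INTO branch   VALUES (1);
INSERT INTO location VALUES (1);
INSERT INTO sales VALUES (1, 1, 1, 1, 2, 10.0), (1, 1, 1, 1, 3, 10.0), (2, 1, 1, 1, 1, 10.0);
""")

# One output row per combination of all four dimension keys: the base cuboid.
base_cuboid = conn.execute("""
    SELECT s.time_key, s.item_key, s.branch_key, s.location_key,
           SUM(s.number_of_units_sold * s.price), SUM(s.number_of_units_sold)
    FROM time_dim t, item i, branch b, location l, sales s
    WHERE s.time_key = t.time_key AND s.item_key = i.item_key
      AND s.branch_key = b.branch_key AND s.location_key = l.location_key
    GROUP BY s.time_key, s.item_key, s.branch_key, s.location_key
    ORDER BY s.time_key, s.item_key, s.branch_key, s.location_key
""").fetchall()
```

Dropping columns from the GROUP BY list would produce the higher-level cuboids of the same cube.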
A Concept Hierarchy: Dimension (location)
Hierarchy and a Lattice of time
Alternatively, the attributes of a dimension may be organized in a partial order, forming a lattice.

33
Set-Grouping Hierarchy
Concept hierarchies may also be defined by discretizing or grouping values for a given dimension or attribute, resulting in a set-grouping hierarchy.

For example, a user may prefer to organize price by defining ranges for inexpensive, moderately priced, and expensive.

34
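A set-grouping hierarchy for price can be sketched as a simple discretization function in Python. The thresholds of 100 and 500 are hypothetical; the slides name only the three groups.

```python
def price_group(price, cheap_below=100.0, expensive_from=500.0):
    # Map a raw price to one of three user-defined ranges.
    # The default thresholds are illustrative assumptions, not values from the slides.
    if price < cheap_below:
        return "inexpensive"
    if price < expensive_from:
        return "moderately priced"
    return "expensive"

groups = [price_group(p) for p in (25.0, 100.0, 499.99, 500.0)]
```

Rolling up along this hierarchy replaces each individual price with its group label before aggregating.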
OLAP Operations in DBMS

OLAP stands for Online Analytical Processing. It is a software technology that allows users to analyze information from multiple database systems at the same time. It is based on a multidimensional data model and allows the user to query multidimensional data.
Typical OLAP Operations
• Roll up (drill-up): summarize data
  ◦ by climbing up a hierarchy or by dimension reduction
• Drill down (roll down): reverse of roll-up
  ◦ from higher level summary to lower level summary or detailed data, or introducing new dimensions
• Slice and dice: project and select
• Pivot (rotate):
  ◦ reorient the cube, visualization, 3D to series of 2D planes

36
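The operations above can be sketched on a tiny cube held in a Python dictionary. The cell values, the city-to-country mapping and the function names are assumptions for illustration; real OLAP servers implement these operations very differently.

```python
from collections import defaultdict

# Base cells: (quarter, city, item) -> dollars_sold. Values are illustrative.
cells = {
    ("Q1", "Vancouver", "tv"): 100, ("Q1", "Toronto", "tv"): 80,
    ("Q2", "Vancouver", "tv"): 120, ("Q2", "Vancouver", "phone"): 60,
}
CITY_TO_COUNTRY = {"Vancouver": "Canada", "Toronto": "Canada"}

def roll_up_location(cube):
    # Roll up by climbing the location hierarchy: city -> country.
    out = defaultdict(int)
    for (q, city, item), v in cube.items():
        out[(q, CITY_TO_COUNTRY[city], item)] += v
    return dict(out)

def roll_up_drop_item(cube):
    # Roll up by dimension reduction: aggregate the item dimension away.
    out = defaultdict(int)
    for (q, city, _), v in cube.items():
        out[(q, city)] += v
    return dict(out)

def slice_(cube, quarter):
    # Slice: select on one dimension (time = quarter), leaving a 2-D plane.
    return {(c, i): v for (q, c, i), v in cube.items() if q == quarter}

def dice(cube, quarters, cities):
    # Dice: select on two or more dimensions, leaving a sub-cube.
    return {k: v for k, v in cube.items() if k[0] in quarters and k[1] in cities}

def pivot(plane):
    # Pivot: rotate a 2-D plane by swapping its axes.
    return {(b, a): v for (a, b), v in plane.items()}
```

Drill-down is the reverse of the two roll-up functions: it needs the more detailed cells (here `cells` itself), which cannot be recovered from the summarized result alone.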
Fig. 3.10 Typical OLAP Operations

37
THANK YOU
