Unit2 Datawarehouse

This document provides an overview of data warehousing and OLAP technology. It defines a data warehouse as a subject-oriented, integrated, time-variant and non-volatile collection of data used to support management decision making. The document discusses key characteristics of data warehouses including their multi-tiered architecture, use of ETL processes, metadata repositories and different types of OLAP servers. It also compares operational databases with data warehouses.

Uploaded by

oggy wilson

DWDM

Unit 2

Chapter 3
Data Warehousing and OLAP Technology: An Overview
(Han and Kamber, 2nd Edition)

What is a Data Warehouse?
• Defined in many different ways, but not rigorously.
  - A decision support database that is maintained separately from the organization's operational database
  - Supports information processing by providing a solid platform of consolidated, historical data for analysis.
• "A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." (W. H. Inmon)
• Data warehousing:
  - The process of constructing and using data warehouses

Data Warehouse: Subject-Oriented
• Organized around major subjects, such as customer, product, sales
• Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing
• Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process

Data Warehouse: Integrated
• Constructed by integrating multiple, heterogeneous data sources
  - relational databases, flat files, on-line transaction records
• Data cleaning and data integration techniques are applied.
  - Ensure consistency in naming conventions, attribute measures, etc. among different data sources
    E.g., hotel price: currency, tax, breakfast covered, etc.
  - When data is moved to the warehouse, it is converted.

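The hotel-price example can be sketched in code. This is a minimal illustration only: the two source formats, the field names, the exchange rate, and the tax rate are all hypothetical and do not come from the slides. It shows records from two differently encoded sources being converted to one assumed warehouse convention (USD, tax included, breakfast as a boolean).

```python
from dataclasses import dataclass

# Hypothetical conversion rates and tax rate, for illustration only.
USD_PER_UNIT = {"USD": 1.0, "EUR": 1.10}
TAX_RATE = 0.10


@dataclass
class HotelPrice:
    """Assumed warehouse convention: USD, tax included, breakfast as a flag."""
    hotel: str
    price_usd_incl_tax: float
    breakfast_included: bool


def from_source_a(rec: dict) -> HotelPrice:
    # Source A (hypothetical): EUR, tax already included, "bfast" is "Y"/"N".
    return HotelPrice(
        hotel=rec["name"],
        price_usd_incl_tax=round(rec["price"] * USD_PER_UNIT["EUR"], 2),
        breakfast_included=rec["bfast"] == "Y",
    )


def from_source_b(rec: dict) -> HotelPrice:
    # Source B (hypothetical): USD, tax excluded, breakfast as a boolean.
    return HotelPrice(
        hotel=rec["hotel_name"],
        price_usd_incl_tax=round(rec["net_price"] * (1 + TAX_RATE), 2),
        breakfast_included=rec["breakfast"],
    )


a = from_source_a({"name": "Alpha", "price": 100.0, "bfast": "Y"})
b = from_source_b({"hotel_name": "Beta", "net_price": 100.0, "breakfast": False})
```

After conversion, both records share one schema and one unit, so they can be compared and aggregated directly.
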
Data Warehouse: Time-Variant
• The time horizon for the data warehouse is significantly longer than that of operational systems
  - Operational database: current value data
  - Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
• Every key structure in the data warehouse
  - Contains an element of time

Data Warehouse: Nonvolatile
• A physically separate store of data transformed from the operational environment
• Operational update of data does not occur in the data warehouse environment
  - Does not require transaction processing, recovery, and concurrency control mechanisms
  - Requires only two operations in data accessing: initial loading of data and access of data

Other names for data warehouse
Write 7 differences between operational database systems and data warehouses

OLTP vs. OLAP

1. Users and System Orientation

2. Data Contents

3. Database Design

4. View

5. Access Patterns
OLTP vs. OLAP

Feature            | OLTP                                 | OLAP
-------------------|--------------------------------------|----------------------------------------------
users              | clerk, IT professional               | knowledge worker
function           | day-to-day operations                | decision support
DB design          | application-oriented                 | subject-oriented
data               | current, up-to-date; detailed,       | historical; summarized, multidimensional;
                   | flat relational; isolated            | integrated, consolidated
usage              | repetitive                           | ad-hoc
access             | read/write; index/hash on prim. key  | lots of scans
unit of work       | short, simple transaction            | complex query
# records accessed | tens                                 | millions
# users            | thousands                            | hundreds
DB size            | 100MB-GB                             | 100GB-TB
process            | online transactional system          | online analysis and data retrieving process

Why a Separate Data Warehouse?
• High performance for both systems
  - DBMS, tuned for OLTP: access methods, indexing, concurrency control, recovery
  - Warehouse, tuned for OLAP: complex OLAP queries, multidimensional view, consolidation
• Different functions and different data:
  - missing data: decision support (DS) requires historical data which operational DBs do not typically maintain
  - data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources
  - data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled

Data Warehouse: A Multi-Tiered Architecture

[Figure: multi-tiered architecture. Data Sources (operational DBs, other sources) feed Extract / Transform / Load / Refresh, supported by a monitor & integrator and metadata; Data Storage (data warehouse, data marts) serves the OLAP Engine (OLAP server), which serves the Front-End Tools (analysis, query, reports, data mining).]

Extraction, Transformation, and Loading (ETL)
• Data extraction
  - get data from multiple, heterogeneous, and external sources
• Data cleaning
  - detect errors in the data and rectify them when possible
• Data transformation
  - convert data from legacy or host format to warehouse format
• Load
  - sort, summarize, consolidate, compute views, check integrity, and build indices and partitions
• Refresh
  - propagate the updates from the data sources to the warehouse

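The five ETL steps above can be sketched as a small pipeline. This is a toy illustration under stated assumptions: the sources are in-memory lists, the "warehouse format" (upper-case item name, float amount) is made up, and every function name is hypothetical. Real ETL tools do far more at each step.

```python
def extract(sources):
    """Extraction: gather raw rows from several (here: in-memory) sources."""
    return [row for source in sources for row in source]


def clean(rows):
    """Cleaning: drop rows with a missing or negative amount."""
    return [r for r in rows if r.get("amount") is not None and r["amount"] >= 0]


def transform(rows):
    """Transformation: convert to the assumed warehouse format."""
    return [{"item": r["item"].strip().upper(), "amount": float(r["amount"])}
            for r in rows]


def load(rows):
    """Load: sort and summarize into total amount per item."""
    totals = {}
    for r in sorted(rows, key=lambda r: r["item"]):
        totals[r["item"]] = totals.get(r["item"], 0.0) + r["amount"]
    return totals


def refresh(warehouse, new_sources):
    """Refresh: propagate new source rows into the existing totals."""
    for item, amount in load(transform(clean(extract(new_sources)))).items():
        warehouse[item] = warehouse.get(item, 0.0) + amount
    return warehouse


source_1 = [{"item": "tv ", "amount": 300}, {"item": "radio", "amount": None}]
source_2 = [{"item": "TV", "amount": 200.5}, {"item": "phone", "amount": -1}]

warehouse = load(transform(clean(extract([source_1, source_2]))))
initial = dict(warehouse)  # snapshot after the initial load
warehouse = refresh(warehouse, [[{"item": "radio", "amount": 50}]])
```
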
Metadata Repository
• Metadata is the data defining warehouse objects. It stores:
  - Description of the structure of the data warehouse
  - Operational metadata
  - The algorithms used for summarization
  - The mapping from the operational environment to the data warehouse
  - Business data

Three types of data warehouse
(Architectural point of view)
• Enterprise warehouse: Information about subjects spanning the entire organization.
• Data mart: A subset of corporate-wide data.
• Virtual warehouse: Set of views over operational databases.

Three types of data warehouse
(Architectural point of view)
• Enterprise warehouse: An enterprise warehouse collects all of the information about subjects spanning the entire organization. It provides corporate-wide data integration. It typically contains detailed data as well as summarized data, and can range in size from a few gigabytes to hundreds of gigabytes, terabytes, or beyond. It requires extensive business modeling and may take years to design and build.
• Data mart: A data mart contains a subset of corporate-wide data that is of value to a specific group of users. The scope is confined to specific selected subjects. For example, a marketing data mart may confine its subjects to customer, item, and sales. The data contained in data marts tend to be summarized. Data marts are usually implemented on low-cost departmental servers.
• Virtual warehouse: A virtual warehouse is a set of views over operational databases. For efficient query processing, only some of the possible summary views may be materialized. A virtual warehouse is easy to build but requires excess capacity on operational database servers.

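The idea of a virtual warehouse as "a set of views over operational databases" can be illustrated with an ordinary SQL view. The sketch below uses Python's built-in sqlite3 module; the table `orders` and the view `sales_by_item` are hypothetical names chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical operational table.
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, item TEXT, amount REAL);
INSERT INTO orders VALUES (1, 'TV', 300.0), (2, 'TV', 200.0), (3, 'Phone', 150.0);

-- A summary view defined directly over the operational table.
CREATE VIEW sales_by_item AS
    SELECT item, SUM(amount) AS total_amount FROM orders GROUP BY item;
""")

rows = conn.execute(
    "SELECT item, total_amount FROM sales_by_item ORDER BY item"
).fetchall()
```

A plain view like this is not materialized: it is recomputed from the operational table on every query, which is one way to see why a virtual warehouse needs spare capacity on the operational servers.
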
Types of OLAP Servers
Three types of OLAP servers are:
1. Relational OLAP (ROLAP)
2. Multidimensional OLAP (MOLAP)
3. Hybrid OLAP (HOLAP)

OLAP Server Architectures
• Relational OLAP (ROLAP)
  - Uses a relational or extended-relational DBMS to store and manage warehouse data, plus OLAP middleware
  - Includes optimization of the DBMS backend, implementation of aggregation navigation logic, and additional tools and services
  - Greater scalability
• Multidimensional OLAP (MOLAP)
  - Sparse array-based multidimensional storage engine
  - Fast indexing to pre-computed summarized data
• Hybrid OLAP (HOLAP) (e.g., Microsoft SQLServer)
  - Flexibility, e.g., low level: relational, high level: array
• Specialized SQL servers (e.g., Redbricks)
  - Specialized support for SQL queries over star/snowflake schemas

From Tables and Spreadsheets to Data Cubes
• A data warehouse is based on a multidimensional data model which views data in the form of a data cube
• A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions
  - Dimension tables, such as item (item_name, brand, type), or time (day, week, month, quarter, year)
  - Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables
• In data warehousing literature, an n-D base cube is called a base cuboid. The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.

Cube: A Lattice of Cuboids
• 0-D (apex) cuboid: all
• 1-D cuboids: time; item; location; supplier
• 2-D cuboids: (time, item); (time, location); (time, supplier); (item, location); (item, supplier); (location, supplier)
• 3-D cuboids: (time, item, location); (time, item, supplier); (time, location, supplier); (item, location, supplier)
• 4-D (base) cuboid: (time, item, location, supplier)

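The lattice can be enumerated mechanically: each cuboid corresponds to one subset of the dimensions, so n dimensions (ignoring concept hierarchies) give 2^n cuboids. The following sketch lists them by level for the four dimensions above; the function name is illustrative.

```python
from itertools import combinations

dimensions = ["time", "item", "location", "supplier"]


def cuboids_by_level(dims):
    """Return {k: list of k-D cuboids}; each cuboid is a tuple of dimension names."""
    return {k: list(combinations(dims, k)) for k in range(len(dims) + 1)}


lattice = cuboids_by_level(dimensions)
total = sum(len(cuboids) for cuboids in lattice.values())

for level, cuboids in lattice.items():
    print(f"{level}-D: {cuboids}")
```

Level 0 holds the single empty tuple (the apex cuboid, "all"), and level 4 holds the single base cuboid.
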
Conceptual Modeling of Data Warehouses
• Modeling data warehouses: dimensions & measures
  - Star schema: A fact table in the middle connected to a set of dimension tables
  - Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
  - Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation

Example of Star Schema
• Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
• time: time_key, day, day_of_the_week, month, quarter, year
• item: item_key, item_name, brand, type, supplier_type
• branch: branch_key, branch_name, branch_type
• location: location_key, street, city, state_or_province, country

Example of Snowflake Schema
• Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
• time: time_key, day, day_of_the_week, month, quarter, year
• item: item_key, item_name, brand, type, supplier_key
• supplier: supplier_key, supplier_type
• branch: branch_key, branch_name, branch_type
• location: location_key, street, city_key
• city: city_key, city, state_or_province, country

Example of Fact Constellation
• Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
• Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location; measures: dollars_cost, units_shipped
• time: time_key, day, day_of_the_week, month, quarter, year
• item: item_key, item_name, brand, type, supplier_type
• branch: branch_key, branch_name, branch_type
• location: location_key, street, city, province_or_state, country
• shipper: shipper_key, shipper_name, location_key, shipper_type

Q1. An operational system is which of the following?
A. A system that is used to run the business in real time and is based on historical data.
B. A system that is used to run the business in real time and is based on current data.
C. A system that is used to support decision making and is based on current data.
D. A system that is used to support decision making and is based on historical data.

Q2. A data warehouse is which of the following?
A. Can be updated by end users.
B. Contains numerous naming conventions and formats.
C. Organized around important subject areas.
D. Contains only current data.

1. Data about data is called ___.
2. ___ and ___ are the key to emerging Business Intelligence technologies.
3. Online Analytical Processing (OLAP) is a technology that is used to create ___ software.
4. OLAP supports ___ user access and multiple queries.
5. A data warehouse refers to a database that is maintained separately from an organization's operational databases. (True/False)
6. A data warehouse is usually constructed by integrating multiple heterogeneous sources. (True/False)

Star Schema: DMQL
• define cube <cube name> [<dimension list>]: <measure list>
• define dimension <dimension name> as (<attribute or dimension list>)

define cube sales_star [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)

Snowflake Schema: DMQL
define cube sales_snowflake [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier (supplier_key, supplier_type))
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city (city_key, city, province_or_state, country))

Fact Constellation: DMQL
define cube shipping [time, item, shipper, from_location, to_location]:
    dollars_cost = sum(cost_in_dollars), units_shipped = count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type)
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales

Data Cube Measures: Three Categories
• Distributive: if the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning
  - E.g., count(), sum(), min(), max()
• Algebraic: if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function
  - E.g., avg()
• Holistic: if there is no constant bound on the storage size needed to describe a subaggregate.
  - E.g., median(), mode()

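A small numeric check makes the three categories concrete. The partitions below are made-up data; the sketch shows that sum() can be combined from partial sums, that avg() can be derived from two distributive results (sum and count), and that combining per-partition medians does not in general give the overall median.

```python
from statistics import median

# Made-up data, split into three partitions.
partitions = [[1, 2, 3], [4, 5], [100]]
all_values = [v for part in partitions for v in part]

# Distributive: the sum of per-partition sums equals the sum over all data.
partial_sums = [sum(p) for p in partitions]
distributive_ok = sum(partial_sums) == sum(all_values)

# Algebraic: avg is computed from two distributive results, sum and count.
partial_counts = [len(p) for p in partitions]
avg_from_parts = sum(partial_sums) / sum(partial_counts)

# Holistic: the median of per-partition medians is not the overall median here.
median_of_medians = median(median(p) for p in partitions)
true_median = median(all_values)
```

For median(), a partition cannot be summarized by a fixed number of values that always suffices to recover the exact result, which is what "no constant bound on the storage size" refers to.
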
Interpreting measures for data cubes
• Many measures of a data cube can be computed by relational aggregation operations.
• Example: the measures of the sales_star cube defined in DMQL can be aggregated by the following SQL query:

    select s.time_key, s.item_key, s.branch_key, s.location_key,
           sum(s.number_of_units_sold * s.price), sum(s.number_of_units_sold)
    from time t, item i, branch b, location l, sales s
    where s.time_key = t.time_key and s.item_key = i.item_key
      and s.branch_key = b.branch_key and s.location_key = l.location_key
    group by s.time_key, s.item_key, s.branch_key, s.location_key

• The cube created by the above query is the base cuboid of the sales_star data cube.

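The base-cuboid query can be tried on a toy star schema using Python's built-in sqlite3 module. The table contents are made up, and the dimension tables are cut down to one descriptive column each; only the table names, key names, and the query come from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE time     (time_key INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE item     (item_key INTEGER PRIMARY KEY, item_name TEXT);
CREATE TABLE branch   (branch_key INTEGER PRIMARY KEY, branch_name TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE sales (
    time_key INTEGER, item_key INTEGER, branch_key INTEGER, location_key INTEGER,
    number_of_units_sold INTEGER, price REAL
);
-- Made-up rows for illustration.
INSERT INTO time VALUES (1, 2023);
INSERT INTO item VALUES (1, 'TV'), (2, 'Phone');
INSERT INTO branch VALUES (1, 'B1');
INSERT INTO location VALUES (1, 'Delhi');
INSERT INTO sales VALUES
    (1, 1, 1, 1, 2, 100.0), (1, 1, 1, 1, 3, 100.0), (1, 2, 1, 1, 1, 50.0);
""")

# Base cuboid: group by all four dimension keys.
base_cuboid = cur.execute("""
SELECT s.time_key, s.item_key, s.branch_key, s.location_key,
       SUM(s.number_of_units_sold * s.price) AS dollars_sold,
       SUM(s.number_of_units_sold) AS units_sold
FROM time t, item i, branch b, location l, sales s
WHERE s.time_key = t.time_key AND s.item_key = i.item_key
  AND s.branch_key = b.branch_key AND s.location_key = l.location_key
GROUP BY s.time_key, s.item_key, s.branch_key, s.location_key
ORDER BY s.item_key
""").fetchall()

# Apex cuboid: no group-by at all, one total over every sale.
apex_dollars = cur.execute(
    "SELECT SUM(number_of_units_sold * price) FROM sales"
).fetchone()[0]
```

Dropping columns from the group-by clause yields the higher-level cuboids; removing it entirely gives the apex cuboid.
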
A Concept Hierarchy: Dimension (location)

Hierarchy and a Lattice for time
The attributes of a dimension may form a total order (a hierarchy). Alternatively, the attributes of a dimension may be organized in a partial order, forming a lattice.

Set-Grouping Hierarchy
Concept hierarchies may also be defined by discretizing or grouping values for a given dimension or attribute, resulting in a set-grouping hierarchy.

For example, a user may prefer to organize price by defining ranges for inexpensive, moderately priced, and expensive.

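A set-grouping hierarchy for price can be sketched as a simple bucketing function. The slides name the three groups but give no boundaries, so the thresholds below (100 and 500) are hypothetical.

```python
# Hypothetical upper bounds; the slides name the groups but not the ranges.
PRICE_GROUPS = [(100.0, "inexpensive"), (500.0, "moderately priced")]


def price_group(price: float) -> str:
    """Map a price to its group; anything at or above the last bound is expensive."""
    for upper_bound, label in PRICE_GROUPS:
        if price < upper_bound:
            return label
    return "expensive"


prices = [20.0, 150.0, 999.0]
groups = [price_group(p) for p in prices]
```
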
OLAP Operations in DBMS

OLAP stands for Online Analytical Processing. An OLAP server is a software technology that allows users to analyze information from multiple database systems at the same time. It is based on the multidimensional data model and allows the user to query multidimensional data.
Typical OLAP Operations
• Roll up (drill-up): summarize data
  - by climbing up a hierarchy or by dimension reduction
• Drill down (roll down): reverse of roll-up
  - from higher level summary to lower level summary or detailed data, or introducing new dimensions
• Slice and dice: project and select
• Pivot (rotate):
  - reorient the cube, visualization, 3D to series of 2D planes

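The operations above can be sketched on a toy cube held in a Python dictionary. The data, the city-to-country mapping, and the function names are all made up for illustration; a real OLAP server implements these operations over far larger, indexed structures.

```python
from collections import defaultdict

# Toy base cuboid: (quarter, city, item) -> dollars_sold. All values are made up.
cube = {
    ("Q1", "Toronto", "TV"): 100, ("Q1", "Toronto", "Phone"): 50,
    ("Q1", "Chicago", "TV"): 80,  ("Q2", "Toronto", "TV"): 120,
    ("Q2", "Chicago", "Phone"): 60,
}
CITY_TO_COUNTRY = {"Toronto": "Canada", "Chicago": "USA"}


def roll_up_location(cells):
    """Roll up by climbing the location hierarchy from city to country."""
    out = defaultdict(int)
    for (quarter, city, item), value in cells.items():
        out[(quarter, CITY_TO_COUNTRY[city], item)] += value
    return dict(out)


def roll_up_drop_item(cells):
    """Roll up by dimension reduction: aggregate the item dimension away."""
    out = defaultdict(int)
    for (quarter, city, _item), value in cells.items():
        out[(quarter, city)] += value
    return dict(out)


def slice_quarter(cells, quarter):
    """Slice: fix one dimension to a single value, leaving a 2-D view."""
    return {(c, i): v for (q, c, i), v in cells.items() if q == quarter}


def dice(cells, quarters, cities):
    """Dice: select on two or more dimensions, keeping a sub-cube."""
    return {k: v for k, v in cells.items() if k[0] in quarters and k[1] in cities}


def pivot(table):
    """Pivot: swap the two axes of a 2-D view."""
    return {(b, a): v for (a, b), v in table.items()}


by_country = roll_up_location(cube)
by_quarter_city = roll_up_drop_item(cube)
q1 = slice_quarter(cube, "Q1")
q1_toronto = dice(cube, {"Q1"}, {"Toronto"})
q1_rotated = pivot(q1)
```

Drill-down is the reverse of roll-up: going from `by_country` back to `cube` requires the finer-grained cells to be stored, since they cannot be recovered from the summary alone.
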
Fig. 3.10 Typical OLAP Operations

THANK YOU
