CH3 Data Warehousing

Uploaded by

Hunzila Nisar
ARIN2137

KNOWLEDGE DISCOVERY AND DATA MINING

TOPIC 3 :
Data Warehousing

1
Course Overview

• Introduction
• Data Warehousing

2
Introduction
• Data Warehousing, OLAP and data mining: what and why (now)?
• Relation to OLTP

3
A producer wants to know….

• Which are our lowest/highest margin customers?
• Who are my customers and what products are they buying?
• What is the most effective distribution channel?
• What product promotions have the biggest impact on revenue?
• Which customers are most likely to go to the competition?
• What impact will new products/services have on revenue and margins?
4
Data, Data everywhere yet ...
• I can’t find the data I need
– data is scattered over the network
– many versions, subtle differences
• I can’t get the data I need
– need an expert to get the data
• I can’t understand the data I found
– available data poorly documented
• I can’t use the data I found
– results are unexpected
– data needs to be transformed from one form to another
5
What is a Data Warehouse?

A single, complete and consistent store of data obtained from a variety of different sources made available to end users in a way they can understand and use in a business context.
[Barry Devlin]

6
What are the users saying...
• Data should be integrated
across the enterprise
• Summary data has a real
value to the organization
• Historical data holds the
key to understanding data
over time
• What-if capabilities are
required

7
What is Data Warehousing?

A process of transforming data into information and making it available to users in a timely enough manner to make a difference
[Forrester Research, April 1996]
[Figure: Data being transformed into Information]
8
Evolution
• 60’s: Batch reports
– hard to find and analyze information
– inflexible and expensive, reprogram every new request
• 70’s: Terminal-based DSS and EIS (executive information systems)
– still inflexible, not integrated with desktop tools
• 80’s: Desktop data access and analysis tools
– query tools, spreadsheets, GUIs
– easier to use, but only access operational databases
• 90’s: Data warehousing with integrated OLAP engines and tools
9
Warehouses are Very Large Databases

[Figure: bar chart of the percentage of respondents (0% to 35%) by warehouse size, comparing initial size with size projected for 2Q96. Size bins: 5GB, 5-9GB, 10-19GB, 20-49GB, 50-99GB, 100-249GB, 250-499GB, 500GB-1TB. Source: META Group, Inc.]
10
Very Large Data Bases

• Terabytes -- 10^12 bytes: Walmart -- 24 Terabytes
• Petabytes -- 10^15 bytes: Geographic Information Systems
• Exabytes -- 10^18 bytes: National Medical Records
• Zettabytes -- 10^21 bytes: Weather images
• Yottabytes -- 10^24 bytes: Intelligence Agency Videos
Data Warehousing --
It is a process
• Technique for assembling and managing data from various sources for the purpose of answering business questions, thus enabling decisions that were not previously possible
• A decision support database maintained separately from the organization’s operational database

12
Data Warehouse
• A data warehouse is a
– subject-oriented
– integrated
– time-varying
– non-volatile

collection of data that is used primarily in organizational decision making.
-- Bill Inmon, Building the Data Warehouse 1996
13
Explorers, Farmers and Tourists
Tourists: Browse information harvested by farmers
Farmers: Harvest information from known access paths
Explorers: Seek out the unknown and previously unsuspected rewards hiding in the detailed data

14
Data Warehouse for Decision Support & OLAP

• Putting information technology to work to help the knowledge worker make faster and better decisions
– Which of my customers are most likely to go to the competition?
– What product promotions have the biggest impact on revenue?
– How did the share price of software companies correlate with profits over the last 10 years?
16
Decision Support
• Used to manage and control business
• Data is historical or point-in-time
• Optimized for inquiry rather than update
• Use of the system is loosely defined and can be ad-hoc
• Used by managers and end-users to understand the business and make judgements
17
Data Mining works with Warehouse Data

• Data Warehousing provides the Enterprise with a memory
• Data Mining provides the Enterprise with intelligence
18
We want to know ...
• Given a database of 100,000 names, which persons are the
least likely to default on their credit cards?
• Which types of transactions are likely to be fraudulent given
the demographics and transactional history of a particular
customer?
• If I raise the price of my product by Rs. 2, what is the effect
on my ROI?
• If I offer only 2,500 airline miles as an incentive to purchase
rather than 5,000, how many lost responses will result?
• If I emphasize ease-of-use of the product as opposed to its
technical capabilities, what will be the net effect on my
revenues?
• Which of my customers are likely to be the most loyal?

Data Mining helps extract such information
19


Application Areas

Industry                 Application
Finance                  Credit card analysis
Insurance                Claims, fraud analysis
Telecommunication        Call record analysis
Transport                Logistics management
Consumer goods           Promotion analysis
Data service providers   Value-added data
Utilities                Power usage analysis

20
Why Separate Data Warehouse?
• Performance
– Operational databases (op dbs) are designed and tuned for known transactions and workloads.
– Complex OLAP queries would degrade performance for operational transactions.
– Special data organization, access and implementation methods are needed for multidimensional views and queries.
• Function
– Missing data: Decision support requires historical data, which op dbs do not typically maintain.
– Data consolidation: Decision support requires consolidation (aggregation, summarization) of data from many heterogeneous sources: op dbs, external sources.
– Data quality: Different sources typically use inconsistent data representations, codes, and formats, which have to be reconciled.
21
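The data-quality point can be illustrated with a minimal sketch. The two source systems, their field names, the gender codes and the date formats below are all hypothetical, chosen only to show inconsistent representations being reconciled into one form before loading.

```python
from datetime import datetime

# Hypothetical records from two operational sources that encode
# the same kinds of facts differently (gender codes, date formats).
source_a = [{"cust_id": 1, "gender": "M", "joined": "1996-04-15"}]
source_b = [{"cust_id": 2, "gender": "female", "joined": "15/04/1996"}]

# Assumed mapping of every source code onto one warehouse code.
GENDER_CODES = {"m": "M", "male": "M", "f": "F", "female": "F"}
# Assumed set of date formats used by the sources.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def parse_date(text):
    """Try each known source format and return an ISO date string."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")


def reconcile(record):
    """Return a copy of the record in the warehouse representation."""
    return {
        "cust_id": record["cust_id"],
        "gender": GENDER_CODES[record["gender"].strip().lower()],
        "joined": parse_date(record["joined"]),
    }


warehouse_rows = [reconcile(r) for r in source_a + source_b]
```

After reconciliation both rows use the same gender code set and the same ISO date format, whichever source they came from.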
What are Operational Systems?
• They are OLTP systems
• Run mission critical
applications
• Need to work with
stringent performance
requirements for
routine tasks
• Used to run a business!

22
RDBMS used for OLTP

• Database Systems have been used traditionally for OLTP
– clerical data processing tasks
– detailed, up to date data
– structured repetitive tasks
– read/update a few records
– isolation, recovery and integrity are critical

23
Operational Systems
• Run the business in real time
• Based on up-to-the-second data
• Optimized to handle large numbers
of simple read/write transactions
• Optimized for fast response to
predefined transactions
• Used by people who deal with
customers, products -- clerks,
salespeople etc.
• They are increasingly used by
customers
24
Examples of Operational Data
Data                Industry            Usage                         Technology                                               Volumes
Customer File       All                 Track customer details        Legacy application, flat files, mainframes               Small-medium
Account Balance     Finance             Control account activities    Legacy applications, hierarchical databases, mainframe   Large
Point-of-Sale data  Retail              Generate bills, manage stock  ERP, client/server, relational databases                 Very large
Call Record         Telecommunications  Billing                       Legacy application, hierarchical database, mainframe     Very large
Production Record   Manufacturing       Control production            ERP, relational databases, AS/400                        Medium
25
So, what is the difference?
Application-Orientation vs. Subject-Orientation
[Figure: the operational database is application-oriented (Loans, Credit Card, Trust, Savings), while the data warehouse is subject-oriented (Customer, Vendor, Product, Activity)]
27
OLTP vs. Data Warehouse

• OLTP systems are tuned for known transactions and workloads while workload is not known a priori in a data warehouse
• Special data organization, access methods and implementation methods are needed to support data warehouse queries (typically multidimensional queries)
– e.g., average amount spent on phone calls between 9AM-5PM in Pune during the month of December

28
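The example query on this slide (average amount spent on phone calls between 9AM-5PM in Pune during December) could be sketched as follows. The `calls` table, its columns and the sample rows are hypothetical, and an in-memory SQLite database stands in for a real warehouse engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical call-record table: city, start timestamp, amount charged.
conn.execute("CREATE TABLE calls (city TEXT, started TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO calls VALUES (?, ?, ?)",
    [
        ("Pune", "1996-12-02 10:15:00", 12.0),
        ("Pune", "1996-12-09 16:30:00", 8.0),
        ("Pune", "1996-12-09 20:00:00", 50.0),    # outside 9AM-5PM
        ("Pune", "1996-11-20 11:00:00", 30.0),    # not December
        ("Mumbai", "1996-12-05 12:00:00", 40.0),  # different city
    ],
)

# The query filters on three dimensions at once: location (city),
# time of day, and month.
(avg_amount,) = conn.execute(
    """
    SELECT AVG(amount)
    FROM calls
    WHERE city = 'Pune'
      AND strftime('%m', started) = '12'
      AND time(started) >= '09:00:00'
      AND time(started) < '17:00:00'
    """
).fetchone()
print(avg_amount)
```

Only the first two rows satisfy all three conditions, so the average is taken over those two calls.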
OLTP vs Data Warehouse
• OLTP                    • Warehouse (DSS)
– Application Oriented    – Subject Oriented
– Used to run business    – Used to analyze business
– Detailed data           – Summarized and refined
– Current up to date      – Snapshot data
– Isolated Data           – Integrated Data
– Repetitive access       – Ad-hoc access
– Clerical User           – Knowledge User (Manager)
29
OLTP vs Data Warehouse

• OLTP                                    • Data Warehouse
– Performance Sensitive                   – Performance relaxed
– Few Records accessed at a time (tens)   – Large volumes accessed at a time (millions)
– Read/Update Access                      – Mostly Read (Batch Update)
– No data redundancy                      – Redundancy present
– Database Size 100MB - 100 GB            – Database Size 100 GB - few terabytes

30
OLTP vs Data Warehouse

• OLTP                                                • Data Warehouse
– Transaction throughput is the performance metric    – Query throughput is the performance metric
– Thousands of users                                  – Hundreds of users
– Managed in entirety                                 – Managed by subsets

31
To summarize ...
• OLTP Systems are used to “run” a business
• The Data Warehouse helps to “optimize” the business

32
Wal*Mart Case Study
• Founded by Sam Walton
• One of the largest Super Market Chains in the US
• Wal*Mart: 2000+ Retail Stores
• SAM's Clubs: 100+ Wholesale Stores
• This case study is from Felipe Carino’s (NCR Teradata) presentation made at Stanford Database Seminar

33
Old Retail Paradigm
• Wal*Mart
– Inventory Management
– Merchandise Accounts Payable
– Purchasing
– Supplier Promotions: National, Region, Store Level
• Suppliers
– Accept Orders
– Promote Products
– Provide special Incentives
– Monitor and Track The Incentives
– Bill and Collect Receivables
– Estimate Retailer Demands

34
New (Just-In-Time) Retail Paradigm
• No more deals
• Shelf-Pass Through (POS Application)
– One Unit Price
• Suppliers paid once a week on ACTUAL items sold
– Wal*Mart Manager
• Daily Inventory Restock
• Suppliers (sometimes SameDay) ship to Wal*Mart

• Warehouse-Pass Through
– Stock some Large Items
• Delivery may come from supplier
– Distribution Center
• Supplier’s merchandise unloaded directly onto Wal*Mart Trucks
35
Wal*Mart System
• NCR 5100M: 96 Nodes; 24 TB Raw Disk; 700 - 1000 Pentium CPUs
• Number of Rows: > 5 Billion
• Historical Data: 65 weeks (5 Quarters)
• New Daily Volume: Current Apps: 75 Million; New Apps: 100 Million +
• Number of Users: Thousands
• Number of Queries: 60,000 per week


36
Data Warehouse Design
“From Tables and Spreadsheets to Data Cubes”

• Uses multidimensional modeling, which views data in the form of a data cube (a collection of logically related attributes).
• Supports complex queries (e.g., the sales manager wants a sales report on a location-time-product basis), which calls for a three-dimensional representation.
• The data is represented as a “cube” (e.g., a cube representation for location, time and product data).
[Figure: the 3-D data cube, with axes Location, Date and Product]
37
3 - Dimensional Cubes

• Can be viewed through an “aggregation hierarchy”
• E.g., if the chosen aggregation hierarchy is “LOCATION”: CITY, STATE, COUNTRY, CONTINENT
[Figure: data cube with axes LOCATION, DATE and PRODUCT]
38
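Viewing a cube through the LOCATION aggregation hierarchy amounts to rolling city-level values up to a coarser level. The sketch below assumes hypothetical city figures and a hypothetical city-to-state-to-country-to-continent mapping; the function name `roll_up` is also an illustration, not taken from the slides.

```python
from collections import defaultdict

# Hypothetical city-level sales units.
city_sales = {"Pune": 120, "Mumbai": 200, "Lahore": 90}

# Hypothetical LOCATION hierarchy: each city knows its ancestors.
hierarchy = {
    "Pune": {"state": "Maharashtra", "country": "India", "continent": "Asia"},
    "Mumbai": {"state": "Maharashtra", "country": "India", "continent": "Asia"},
    "Lahore": {"state": "Punjab", "country": "Pakistan", "continent": "Asia"},
}


def roll_up(sales_by_city, level):
    """Aggregate city-level units to the given hierarchy level."""
    totals = defaultdict(int)
    for city, units in sales_by_city.items():
        totals[hierarchy[city][level]] += units
    return dict(totals)


by_state = roll_up(city_sales, "state")
by_continent = roll_up(city_sales, "continent")
```

Each step up the hierarchy produces fewer, larger groups, while the grand total stays the same.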
Representing Data in Cube

Type of car  Area     Sales units
                      January  February  March  April
Civic        Central  220      230       250    360
Civic        East     140      160       100    100
Civic        South    50       80        90     90

[Figure: the same data drawn as a cube with axes Type of car (Civic, MyVI, Viva), Area (Central, East, South) and Month (Jan, Feb, March, Apr)]
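One way to hold the table above as a cube is a dictionary keyed by (type of car, area, month). This is a minimal sketch: reading the listed values as the Civic figures is an assumption, and the helper names `slice_cube` and `total` are hypothetical.

```python
MONTHS = ["January", "February", "March", "April"]

# Sales units per area, in month order (assumed to be the Civic rows).
civic_units = {
    "Central": [220, 230, 250, 360],
    "East": [140, 160, 100, 100],
    "South": [50, 80, 90, 90],
}

# Each cell of the cube is addressed by one value per dimension.
cube = {
    ("Civic", area, month): units
    for area, row in civic_units.items()
    for month, units in zip(MONTHS, row)
}


def slice_cube(cells, car=None, area=None, month=None):
    """Keep the cells that match every dimension value given."""
    return {
        key: units
        for key, units in cells.items()
        if (car is None or key[0] == car)
        and (area is None or key[1] == area)
        and (month is None or key[2] == month)
    }


def total(cells):
    """Sum the sales units over a set of cells."""
    return sum(cells.values())


east_total = total(slice_cube(cube, area="East"))
january_total = total(slice_cube(cube, month="January"))
```

Fixing one dimension (an area or a month) selects a slice of the cube; summing a slice gives the total for that area or month.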
Representing Data in Cube: Try it yourself!
Product  Location  Time    Units
Camera   BWP       July    1200
Camera   BWP       August  1500
Camera   BWP       Sept    2100

[Figure: cube with axes Location (BWP), Product (camera) and Time (July, August, Sept), holding the cells 1200, 1500 and 2100]
Conceptual Modeling of Data Warehouses

• Method to visualize the structures of multidimensional data warehouse design.
• Basic schemas: “Star” and “Snowflake” schemas (based on data complexity).
• Star Schema: the fact table (major table) stores the properties of the main table, which are divided into other related tables known as “dimension” or “minor” tables.
• Snowflake Schema: an extension of the “star schema”. Each property is separated into its own table (if it is expandable or contains a related aggregation).
41
Star Schema
SALES (fact or major table): Product Code, Day Code, Location Code, Quantity, Unit Price
PRODUCT (dimension or minor table): Product Code, Description, Type
DAY (dimension or minor table): Day Code, Month, Year
LOCATION (dimension or minor table): Location Code, Post Code, City

42
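The star schema on this slide can be sketched in SQL. The table and column names follow the slide; the column types, the sample rows and the use of an in-memory SQLite database are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension (minor) tables
    CREATE TABLE product (product_code TEXT PRIMARY KEY, description TEXT, type TEXT);
    CREATE TABLE day (day_code INTEGER PRIMARY KEY, month TEXT, year INTEGER);
    CREATE TABLE location (location_code TEXT PRIMARY KEY, post_code TEXT, city TEXT);

    -- Fact (major) table: one row per sale, pointing at each dimension
    CREATE TABLE sales (
        product_code TEXT REFERENCES product(product_code),
        day_code INTEGER REFERENCES day(day_code),
        location_code TEXT REFERENCES location(location_code),
        quantity INTEGER,
        unit_price REAL
    );

    -- Hypothetical sample rows
    INSERT INTO product VALUES ('P1', 'Compact camera', 'Camera');
    INSERT INTO day VALUES (1, 'July', 1996), (2, 'August', 1996);
    INSERT INTO location VALUES ('L1', '411001', 'Pune');
    INSERT INTO sales VALUES ('P1', 1, 'L1', 3, 100.0), ('P1', 2, 'L1', 2, 100.0);
    """
)

# Revenue by city and month: the fact table joins directly to each
# dimension table it needs, one join per dimension.
rows = conn.execute(
    """
    SELECT l.city, d.month, SUM(s.quantity * s.unit_price) AS revenue
    FROM sales s
    JOIN day d ON d.day_code = s.day_code
    JOIN location l ON l.location_code = s.location_code
    GROUP BY l.city, d.month
    ORDER BY revenue DESC
    """
).fetchall()
```

Every dimension is exactly one join away from the fact table, which is what gives the schema its star shape.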
Snowflake Schema
SALES (fact or major table): Product Code, Day Code, Location Code, Quantity, Unit Price
PRODUCT (dimension or minor table): Product Code, Description, Type
DAY (dimension or minor table): Day Code, Month, Year
LOCATION (dimension or minor table): Location Code, Post Code
POST CODE (expanded from the LOCATION minor table): State
43
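In the snowflake variant, reaching State from a sale takes one extra join through the POST CODE table. The sketch below keeps only the tables needed to show this. Giving POST CODE a `post_code` key column is an assumption (the slide lists only State), and the post codes and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- POST CODE is split out of LOCATION (assumed key: post_code)
    CREATE TABLE post_code (post_code TEXT PRIMARY KEY, state TEXT);
    CREATE TABLE location (
        location_code TEXT PRIMARY KEY,
        post_code TEXT REFERENCES post_code(post_code)
    );
    CREATE TABLE sales (
        location_code TEXT REFERENCES location(location_code),
        quantity INTEGER,
        unit_price REAL
    );

    -- Hypothetical sample rows
    INSERT INTO post_code VALUES ('411001', 'Maharashtra'), ('560001', 'Karnataka');
    INSERT INTO location VALUES ('L1', '411001'), ('L2', '411001'), ('L3', '560001');
    INSERT INTO sales VALUES ('L1', 3, 10.0), ('L2', 1, 10.0), ('L3', 5, 10.0);
    """
)

# Units sold per state: sales -> location -> post_code (two joins,
# where a star schema holding State in LOCATION would need one).
units_by_state = dict(
    conn.execute(
        """
        SELECT p.state, SUM(s.quantity)
        FROM sales s
        JOIN location l ON l.location_code = s.location_code
        JOIN post_code p ON p.post_code = l.post_code
        GROUP BY p.state
        """
    ).fetchall()
)
```

The state name is stored once per post code rather than once per location, at the cost of the additional join.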
Data Warehouse vs. Data Marts
What comes first
From the Data Warehouse to Data Marts
[Figure: levels running from Data (bottom) to Information (top): the organizationally structured Data Warehouse (more history, normalized, detailed), then departmentally structured data, then individually structured data (less history, normalized, detailed)]
45
Data Warehouse and Data Marts
[Figure: the OLAP Data Mart (lightly summarized, departmentally structured) sits on top of the Data Warehouse (organizationally structured, atomic, detailed data)]

46
Characteristics of the Departmental Data Mart

• OLAP
• Small
• Flexible
• Customized by Department
• Source is departmentally structured data warehouse

47
Techniques for Creating Departmental Data Mart

• OLAP
• Subset
• Summarized
• Superset
• Indexed
• Arrayed
[Figure: departmental data marts for Sales, Finance and Mktg.]

48
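Two of the listed techniques, subset and summarized, can be sketched in a few lines. The warehouse rows, their layout and the function names are hypothetical; a real data mart would be built with the warehouse's own tools.

```python
from collections import defaultdict

# Hypothetical detailed warehouse rows: (department, region, month, amount).
warehouse = [
    ("Sales", "East", "Jan", 100),
    ("Sales", "East", "Feb", 150),
    ("Sales", "South", "Jan", 80),
    ("Finance", "East", "Jan", 40),
]


def subset(rows, department):
    """Subset: keep only the rows that belong to one department."""
    return [row for row in rows if row[0] == department]


def summarize(rows):
    """Summarized: collapse the detail rows into totals per region."""
    totals = defaultdict(int)
    for _department, region, _month, amount in rows:
        totals[region] += amount
    return dict(totals)


# A small Sales data mart: the department's subset, lightly summarized.
sales_mart = summarize(subset(warehouse, "Sales"))
```

The resulting mart is smaller than the warehouse and shaped for one department's questions, in line with the characteristics listed on the previous slide.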
Data Mart Centric
[Figure: Data Sources feed the Data Marts, which in turn feed the Data Warehouse]

49
Problems with Data Mart Centric Solution

If you end up creating multiple warehouses, integrating them is a problem

50
True Warehouse
[Figure: Data Sources feed the Data Warehouse, which in turn feeds the Data Marts]

51
