Data Warehousing
Failures
3 May 1999
Patricia L. Carbone
1 MITRE
Overview
O Transaction databases
O Data warehouses
O Data marts
O Comparison of the three
O Examples of data warehousing failures
O OLTP versus OLAP
O Examples of relational and multidimensional modeling
O Summary: Classic mistakes in data warehousing
Transaction Databases
Data Warehouse Architecture
[Figure: Getting Data In. Labels: IT Users (Data Analyst, Database Admin, Operations Manager, Network Admin, Applications Developer); Operational Users; Transactional Databases (ECPN at Columbus, Ogden, Slidell; CCR); Data Transformation (Accessing, Capturing, Extracting, Filtering, Scrubbing, Reconciling, Conditioning, Condensing, Householding, Loading).]
Transaction Database Definition
A transaction-oriented, non-integrated, time-invariant, and volatile collection of data in support of business operations.
Transaction-oriented
• Focus on a business transaction, not the users
• Stable design because business processes more stable than user needs
Non-integrated
• No data correlation across systems
• May have duplicates
• Quality assurance not fundamental
Time-invariant
• A snapshot of a moment in time, rather than a history of data over time
Volatile
• Repetitions of the same query can give different results
• Constantly updated as transactions occur
In support of business operations
• Focus on day-to-day operations, not long-term planning
Transaction Database (continued)
O Application Types: On-Line Transaction Processing
(OLTP)
O Examples
– Existing
O CCR
O ECI
– Potential
O ECI Transaction Archive
O Transaction Billing System
O Sample Usages
– Query for failed transaction
– Issue bill to government agency
Data Warehouse Architecture
[Figure: Getting Information Out. Labels: ECI; Customer Service Target and Procurement Target (Data Mart or Departmental Warehouse); Knowledge Discovery/Data Mining (Neural Nets, Clustering, Statistical, Artificial Intelligence); Information Access Tools (Multidimensional, Data Visualization, EIS/DSS, Spreadsheets, Development); Business Users.]
Data Warehouse Definition
A subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management’s decision-making process.
Subject-oriented
• Focus on a subject as defined by users
• Contains all data needed by the users to understand the subject
• Users change requirements rapidly
Integrated
• Data combined across systems and transactions
• No duplicates
• Quality assured
Time-variant
• A history of the subject over time, not a single moment in time
Non-volatile
• Doesn’t change while a query is running
In support of management’s decision-making process
• Focus on planning for the future, not on day-to-day operations
Data Warehouse (continued)
O Application Types
– On-Line Analytical Processing (OLAP) e.g. purchasing
analyzer
– Data Mining e.g. AT&T Worldnet’s problem predictor
O Examples
– Existing
O MCI Network Usage Data Warehouse
O Wal-Mart
O Citibank
– Potential
O DOD Procurement Data Warehouse
Data Warehouse (continued)
O Sample Usage
– Alert for channel nearing capacity
– Identify causes of network problems
– Allocate budget underrun
– Alert for budget overrun
Process Flow
Data Mart (continued)
O Scope
– Very limited
– Frequently defined using time-box techniques. (Time boxing is a
project management technique that focuses on three key project dimensions
(time, resources, and scope), only one of which can vary for a project.
For data marts, time and resources are fixed, while scope can
vary.)
O Examples
– Customer Service Data Mart
– Quotes Data Mart
O Application Types
– OLAP
– Data Mining
O Can be precursor or successor to data warehouse
Transaction Database, Data Warehouse & Data Mart
             Transaction Database      Data Warehouse            Data Mart
Objective    Pull data in for          Push data out to          Push data out to
             transaction processing    decision makers           decision makers
Focus        Transactions              Subjects of interest      Subjects of interest
                                       to an enterprise          to a department
Ownership    Fiefdom                   Enterprise                Enterprise
Consistency  Microscopic               Global                    Global
             (transaction level)       (enterprise level)        (department level)
Users        “Turn the wheels of       “Watch the wheels of      “Watch the wheels of
             the organization”         the organization”         the organization”
             e.g. sysops               e.g. upper management     e.g. middle management
Transaction Database, Data Warehouse
& Data Mart (continued)
Definition of Warehouse “Failure” (or
“Disappointment”)
O Version 1
– Warehouse project begun
– Became apparent that project would take much more time than
originally planned
– Hardware was not able to handle the volume of data
– Software could not handle the data; vendor dropped support
for the software
– Upper management became disillusioned and halted the project
O Version 2
– Now focusing on a single subject-area data mart
– Plans to add further subject areas until the enterprise-wide
warehouse is complete
Example 1: Large Retailer (2 of 2)
Example 2: Government Research
Laboratory (1 of 2)
O Description
– 15 laboratories each have finance department reporting to
national office
– All data stored via COBOL
– If reports differed from standard, would need IS support to
generate new report
O Solution 1
– Construct data warehouse oriented to finance department
– Assigned 2 people full time to build warehouse in 4 months
– In timeframe, passed summary data to warehouse - access via
PowerBuilder
– Simultaneously, mainframe system was drastically modified -
not in alignment with data warehouse project
– Data warehouse became end goal - no resources were allocated for
modifications and extensions after the initial version
– No solution to original problems
Example 2: Government Research
Laboratory (2 of 2)
O Solution 2
– Began 3 years after first attempt
– Project manager lined up funding to enable solving multiple
problems
– Access to data warehouse via web-based reports
O Observations
– Warehouse initiative should have been done with the
mainframe restructuring
– Planning and resourcing needed to be projected further into the
future
– A pilot might have identified a number of technical problems
– Reasonable deadlines
– “It could have been done right ... for the right reasons”
Example 3: North American Federal
Government (1 of 2)
O Description
– Proposal put forth for data warehouse at a cost of $800,000
taking 8 months to build
– IT department assumed proposal was accepted, but did not wait
for concurrence from business unit (who was supposed to
provide $ and manpower)
– Actual time spent: 2 years
O Problems
– Business unit stretched the detailed data analysis from 1.5
months to 9 months
– Scope creep - planned users for system grew from 250 to 2500
– Acquiring correct technological tools took formal approval
process exceeding 1 year
– 3 weeks prior to delivery, IT director canceled the project
Example 3: North American Federal
Government (2 of 2)
O Problems (concluded)
– 6 weeks after cancellation, new interest in populating the
warehouse was generated - nothing delivered
– Final cost - $2.5 million
O Observations
– Lack of focus of project - business unit could not identify scope
of project
– Milestones were pushed back, implying that project was not
urgent or important
– Negative internal politics - business leader did not allow project
analysts to talk to end users; business leader reassigned IT staff
without telling IT project lead
Reasons for Data Warehousing Failures
Some Basic Questions
O Why are you building your warehouse?
O Who will use the warehouse?
– The entire enterprise
– One particular department
O What is the goal of the warehouse?
– To provide a historical perspective of the aggregated data
O What kind of a data model do you expect to use?
– Relational
– Multidimensional
O What kind of analysis do you expect your users will
need?
– OLAP
– Data mining
Techniques for Using Data
Processing OLAP data (1 of 2)
O Relational database
– Not the obvious choice to perform complex multidimensional
calculations
– Complex multi-pass SQL is necessary to achieve more than the
most trivial functionality
– Tools can have limited range of calculations in SQL, with
results being used as input by a multidimensional engine on the
client or mid-tier server
O Multidimensional server engine
– Most obvious and popular place to perform calculations
– Good performance - engine and database can be optimized to
work together
– Plenty of memory on a server enables large scale array
calculations to be performed efficiently
Processing OLAP data (2 of 2)
O Client
– Vendors aiming to take advantage of desktop PC
power to perform multidimensional calculations
– Popularity of thin clients is requiring that vendors
move most of the client-based processing to new Web
application servers
Comparison to OLTP (Transaction
Processing)
More OLAP vs OLTP
OLTP
O Real time, read/write to corporate data stores
O Many simultaneous internal and external users
O Short, repetitive, simple processing tasks
O Supports commerce and monitoring
O Integrity and guaranteed completion of tasks
O Fixed, well defined processes with few if any exceptions
OLAP
O As long as it takes, read-only access to corporate data stores
O Small number of primarily internal users
O Long, often unique, process intensive tasks
O Supports decision making and discovery
O Accuracy and completeness of information and results
O Ad hoc explorations as well as fixed reports
Relational versus 2-Dimensional: A Simple Example
Relational Representation
Cargo  Port         Weight
Hogs   Singapore    50
Hogs   New Orleans  60
Hogs   Perth        100
Cars   Singapore    40
Cars   New Orleans  70
Cars   Perth        80
Oil    Singapore    90
Oil    New Orleans  120
Oil    Perth        140
Corn   Singapore    20
Corn   New Orleans  10
Corn   Perth        30
2-Dimensional Representation
       Singapore  New Orleans  Perth
Hogs   50         60           100
Cars   40         70           80
Oil    90         120          140
Corn   20         10           30
Query 1: How much oil is shipped from Singapore?
Query 2: What is the total weight shipped from Perth?
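The two queries can be answered against either representation. The following is a minimal sketch in Python, using the figures from the slide; the variable names are illustrative only. It suggests why the 2-dimensional form is convenient: Query 1 becomes a single cell lookup and Query 2 a sum down one column.

```python
# Relational representation: one (cargo, port, weight) row per fact.
rows = [
    ("Hogs", "Singapore", 50), ("Hogs", "New Orleans", 60), ("Hogs", "Perth", 100),
    ("Cars", "Singapore", 40), ("Cars", "New Orleans", 70), ("Cars", "Perth", 80),
    ("Oil", "Singapore", 90), ("Oil", "New Orleans", 120), ("Oil", "Perth", 140),
    ("Corn", "Singapore", 20), ("Corn", "New Orleans", 10), ("Corn", "Perth", 30),
]

# 2-dimensional representation: grid[cargo][port] -> weight.
grid = {}
for cargo, port, weight in rows:
    grid.setdefault(cargo, {})[port] = weight

# Query 1: how much oil is shipped from Singapore?
# Relational form filters every row; the grid reads one cell.
oil_singapore_relational = sum(w for c, p, w in rows if c == "Oil" and p == "Singapore")
oil_singapore_grid = grid["Oil"]["Singapore"]

# Query 2: what is the total weight shipped from Perth?
# Relational form filters on port; the grid sums the Perth column.
perth_total_relational = sum(w for c, p, w in rows if p == "Perth")
perth_total_grid = sum(ports["Perth"] for ports in grid.values())

print(oil_singapore_grid, perth_total_grid)
```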
Consolidation (or Pre-Aggregation):
Relational versus 2-Dimensional
Relational Representation
Cargo  Port         Weight
Hogs   Singapore    50
Hogs   New Orleans  60
Hogs   Perth        100
Hogs   Total        210
Cars   Singapore    40
Cars   New Orleans  70
Cars   Perth        80
Cars   Total        190
Oil    Singapore    90
Oil    New Orleans  120
Oil    Perth        140
Oil    Total        350
Corn   Singapore    20
Corn   New Orleans  10
Corn   Perth        30
Corn   Total        60
Total  Singapore    200
Total  New Orleans  260
Total  Perth        350
Total  Total        810
2-Dimensional Representation
       Singapore  New Orleans  Perth  Total
Hogs   50         60           100    210
Cars   40         70           80     190
Oil    90         120          140    350
Corn   20         10           30     60
Total  200        260          350    810
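Pre-aggregation amounts to computing the "Total" row and column once, at load time, so that queries can read them instead of summing. The sketch below is one possible way to do this in Python; the function name `consolidate` is hypothetical, and the data is taken from the slide.

```python
def consolidate(grid):
    """Return a copy of grid with a 'Total' column and a 'Total' row added."""
    ports = list(next(iter(grid.values())))
    # Add a per-cargo total (the extra column).
    out = {cargo: dict(cells, Total=sum(cells.values())) for cargo, cells in grid.items()}
    # Add a per-port total (the extra row), computed from the original cells only.
    out["Total"] = {port: sum(cells[port] for cells in grid.values()) for port in ports}
    # Grand total is the sum of the per-port totals.
    out["Total"]["Total"] = sum(out["Total"].values())
    return out

grid = {
    "Hogs": {"Singapore": 50, "New Orleans": 60, "Perth": 100},
    "Cars": {"Singapore": 40, "New Orleans": 70, "Perth": 80},
    "Oil": {"Singapore": 90, "New Orleans": 120, "Perth": 140},
    "Corn": {"Singapore": 20, "New Orleans": 10, "Perth": 30},
}

consolidated = consolidate(grid)
print(consolidated["Total"])
```

With the totals stored, "total weight shipped from Perth" is a lookup of `consolidated["Total"]["Perth"]` rather than a scan.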
Moving to Multiple Dimensions with
Hierarchy
Region Total
Multidimensional with Hierarchy (and Drill Down)
[Figure: data cube with Cargo Dimension, Region Dimension, and Time Dimension axes]
Region Dimension hierarchy:
  Indochina
    Indonesia
    Singapore
    Thailand
    …
  North America
    Canada
    United States
      New Orleans
      New York
      Oakland
Can now query on cities, countries, or regions
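Querying at several levels of a hierarchy can be sketched as a roll-up: each weight recorded at the lowest level is also credited to every ancestor. The Python below is an illustration only. The parent links follow the Region dimension on the slide, while the weights and the names `PARENT` and `roll_up` are hypothetical.

```python
from collections import defaultdict

# Each member of the Region dimension maps to its parent (from the slide).
PARENT = {
    "New Orleans": "United States",
    "New York": "United States",
    "Oakland": "United States",
    "United States": "North America",
    "Canada": "North America",
    "Indonesia": "Indochina",
    "Singapore": "Indochina",
    "Thailand": "Indochina",
}

# Hypothetical weights, recorded at the lowest level available on each branch.
weights = {"New Orleans": 10, "New York": 5, "Oakland": 7, "Canada": 4, "Singapore": 20}

def roll_up(leaf_weights, parent):
    """Credit each weight to its member and to every ancestor of that member."""
    totals = defaultdict(int)
    for member, weight in leaf_weights.items():
        node = member
        while node is not None:
            totals[node] += weight
            node = parent.get(node)  # None once the top of the hierarchy is reached
    return dict(totals)

totals = roll_up(weights, PARENT)
print(totals["New Orleans"], totals["United States"], totals["North America"])
```

Drill down is the reverse direction: starting from `totals["North America"]` and reading the entries for its children.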
Standard SQL Approach
SELECT SUM(Event.Weight)
FROM Event, Port, Cargo
WHERE Event.OriginPortID = Port.PortID AND
      Port.Name = 'New Orleans' AND
      Event.CargoID = Cargo.CargoID AND
      Cargo.Name = 'corn' AND
      Event.Date LIKE '%April%'  -- assumes Date is text containing the month name
O Hard to formulate
– Who is going to write this query?
O Time consuming to compute
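To make the query concrete, here is a small runnable sketch using Python's built-in sqlite3 module. The table and column names come from the slide's query; the column types, the sample rows, and the text format of `Date` are assumptions made only so the example runs. The slide's "contains April" condition is written as `LIKE '%April%'`, which only works under that text-date assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Port  (PortID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Cargo (CargoID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Event (OriginPortID INTEGER, CargoID INTEGER, Date TEXT, Weight INTEGER);

    -- Hypothetical sample data.
    INSERT INTO Port  VALUES (1, 'New Orleans'), (2, 'Perth');
    INSERT INTO Cargo VALUES (1, 'corn'), (2, 'oil');
    INSERT INTO Event VALUES
        (1, 1, '3 April 1999', 12),
        (1, 1, '17 April 1999', 9),
        (1, 1, '2 May 1999', 30),
        (1, 2, '5 April 1999', 50),
        (2, 1, '8 April 1999', 40);
""")

# Same logic as the slide's query, written with explicit joins.
query = """
    SELECT SUM(Event.Weight)
    FROM Event
    JOIN Port  ON Event.OriginPortID = Port.PortID
    JOIN Cargo ON Event.CargoID = Cargo.CargoID
    WHERE Port.Name = 'New Orleans'
      AND Cargo.Name = 'corn'
      AND Event.Date LIKE '%April%'
"""
(total,) = conn.execute(query).fetchone()
print(total)
```

Even in this toy form, the analyst must know three tables, two join keys, and the date encoding to ask one business question, which is the slide's point about formulation effort.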
The OLAP Data Cube Approach
“How many tons of corn left New Orleans in April?” “21 tons”
[Figure: data cube. Cargo Dimension: hogs, oil, cars, corn. Port of Origin Dimension: Singapore, New Orleans, Murmansk, Bremerhafen, Perth. Time Dimension: Jan, Feb, Mar, Apr, ... The cell for corn, New Orleans, Apr holds 21.]
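In the cube view, the question names one coordinate on each dimension and the answer is the value stored in that cell. A minimal sketch in Python follows, assuming the cube is held as a dictionary keyed by (cargo, port of origin, month). Only the value 21 comes from the slide; the other cells and the helper names `cell` and `slice_total` are hypothetical.

```python
# Cube cells keyed by (cargo, port of origin, month).
cube = {
    ("corn", "New Orleans", "Apr"): 21,   # from the slide
    ("corn", "New Orleans", "Mar"): 18,   # hypothetical
    ("corn", "Perth", "Apr"): 30,         # hypothetical
    ("oil", "New Orleans", "Apr"): 120,   # hypothetical
}

def cell(cube, cargo, port, month):
    """Read one cell; empty cells count as zero."""
    return cube.get((cargo, port, month), 0)

def slice_total(cube, cargo=None, port=None, month=None):
    """Sum all cells matching the fixed coordinates; None means 'all members'."""
    return sum(
        value
        for (c, p, m), value in cube.items()
        if cargo in (None, c) and port in (None, p) and month in (None, m)
    )

corn_new_orleans_apr = cell(cube, "corn", "New Orleans", "Apr")
corn_apr_all_ports = slice_total(cube, cargo="corn", month="Apr")
print(corn_new_orleans_apr, corn_apr_all_ports)
```

A real multidimensional engine would typically store the cube as an array and pre-compute common slices; the dictionary here only illustrates the addressing scheme.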
The Star Schema: A “Multidimensional” View
[Figure: star schema. A central Event “Fact” Table (VesselID, EType, DateId, OriginID, DestID, CargoID, Weight, Value) is surrounded by dimension tables: Event Types, Date, Month, Year, Origin, Destination, Cargo, Vessel, Organization, and Organization Membership. Other column labels in the figure: Name, Descr, Category, CatDescr, DateID, Day, Month, Year, Country, Alias, Type, Registration, OrgID, SourceID, Location.]
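A star schema can be sketched as one fact table whose rows hold foreign keys into small dimension tables plus numeric measures. The Python/sqlite3 example below uses a subset of the names in the figure; the table names `EventFact` and `DateDim`, the column types, and all rows are assumptions for illustration. It shows a typical query shape: constrain on dimension attributes, then group the fact measure by a dimension level.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables (subset of the figure; types are assumed).
    CREATE TABLE DateDim (DateID INTEGER PRIMARY KEY, Day INTEGER, Month TEXT, Year INTEGER);
    CREATE TABLE Origin  (OriginID INTEGER PRIMARY KEY, Name TEXT, Country TEXT);
    CREATE TABLE Cargo   (CargoID INTEGER PRIMARY KEY, Name TEXT);

    -- Fact table: one row per event, holding dimension keys and a measure.
    CREATE TABLE EventFact (
        DateID   INTEGER REFERENCES DateDim(DateID),
        OriginID INTEGER REFERENCES Origin(OriginID),
        CargoID  INTEGER REFERENCES Cargo(CargoID),
        Weight   INTEGER
    );

    -- Hypothetical sample data.
    INSERT INTO DateDim VALUES (1, 20, 'Mar', 1999), (2, 3, 'Apr', 1999), (3, 17, 'Apr', 1999);
    INSERT INTO Origin  VALUES (1, 'New Orleans', 'United States'), (2, 'Perth', 'Australia');
    INSERT INTO Cargo   VALUES (1, 'corn'), (2, 'oil');
    INSERT INTO EventFact VALUES
        (1, 1, 1, 18), (2, 1, 1, 12), (3, 1, 1, 9),
        (2, 2, 1, 40), (3, 1, 2, 50);
""")

# Monthly corn weight leaving New Orleans, grouped through the date dimension.
monthly = conn.execute("""
    SELECT d.Month, SUM(f.Weight)
    FROM EventFact f
    JOIN DateDim d ON f.DateID = d.DateID
    JOIN Origin  o ON f.OriginID = o.OriginID
    JOIN Cargo   c ON f.CargoID = c.CargoID
    WHERE o.Name = 'New Orleans' AND c.Name = 'corn'
    GROUP BY d.Month
    ORDER BY MIN(d.DateID)
""").fetchall()
print(monthly)
```

Because month and year live in the date dimension, the same fact rows can be regrouped by year, or by origin country, without changing the fact table.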
Summary
O Up-keep of technology
O Managing multiple users with various needs
O Lack of integration/integrating data marts into data
warehouses, after the fact
O Unclear business objectives; not knowing the
information requirements
O Lack of effective project sponsorship
O Lack of data quality
O Lack of user input
O Unrealistic expectations - cost
Classic Mistakes in Data Warehousing (2 of 3)