Data Warehousing/Mining Comp 150 Data Warehousing Introduction (Not in Book)
Comp 150
Data Warehousing Introduction
(not in book)
Instructor: Dan Hebert
Data Warehousing/Mining 1
Outline of Lecture
Data Warehousing and Information Integration
Brief History of Data Warehousing
What is a Data Warehouse?
Types of Data and Their Uses
Data Warehouse Architectures
Issues in Data Warehousing
Problem: Heterogeneous Information Sources
“Heterogeneities are everywhere”
[Figure: example sources: personal databases, scientific databases, the World Wide Web, digital libraries]
Different interfaces
Different data representations
Duplicate and inconsistent information
Problem: Data Management in Large Enterprises
Vertical fragmentation of informational systems (vertical stove pipes)
Result of application (user)-driven development of operational systems
[Figure: separate stove-pipe systems: Sales Planning, Suppliers, Num. Control, Stock Mngmt, Debt Mngmt, Inventory, ...]
Goal: Unified Access to Data
[Figure: an Integration System sits above a row of Wrappers, one per Source; the sources include the World Wide Web, digital libraries, scientific databases, and personal databases]
Disadvantages of Query-Driven Approach
Delay in query processing
– Slow or unavailable information sources
– Complex filtering and integration
Inefficient and potentially expensive for frequent queries
Competes with local processing at sources
Hasn’t caught on in industry
The Warehousing Approach
Information integrated in advance
Stored in warehouse (wh) for direct querying and analysis
[Figure: Clients query the Data Warehouse; an Integration System (with Metadata) loads the warehouse from Extractor/Monitors, one per Source]
Advantages of Warehousing Approach
High query performance
– But not necessarily most current information
Doesn’t interfere with local processing at sources
– Complex queries at warehouse
– OLTP at information sources
Information copied at warehouse
– Can modify, annotate, summarize, restructure, etc.
– Can store historical information
– Security, no auditing
Has caught on in industry
Not Either-Or Decision
Data Warehouse Evolution
[Figure: timeline with marks at 1960, 1975, 1980, 1985, 1990, 1995, and 2000; eras labeled “Prehistoric Times”, “Middle Ages”, and Data Revolution; milestones labeled Relational Databases, Company DWs, “Building the DW” by Inmon (1992), Data Replication Tools, and Information-Based Management]
A Data Warehouse is...
Stored collection of diverse data
– A solution to data integration problem
– Single repository of information
Subject-oriented
– Organized by subject, not by application
– Used for analysis, data mining, etc.
Optimized differently from transaction-oriented db
User interface aimed at executives
A Data Warehouse is... (continued)
[Figure: components labeled Data Warehouse, Data Warehouse Catalog, Data Warehouse Population, Enterprise Modeling, and Operational Systems]
Warehouse is a Specialized DB
Standard DB                                  Warehouse
Mostly updates                               Mostly reads
Many small transactions                      Queries are long and complex
MB - GB of data                              GB - TB of data
Current snapshot                             History
Index/hash on p.k.                           Lots of scans
Raw data                                     Summarized, reconciled data
Thousands of users (e.g., clerical users)    Hundreds of users (e.g., decision-makers, analysts)
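The contrast between the two workloads can be sketched with Python's built-in sqlite3 module. The `sales` table, its columns, and its rows are invented for illustration; a real warehouse would hold far more data and normally be a separate system from the operational database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales ("
    "order_id INTEGER PRIMARY KEY, region TEXT, year INTEGER, amount REAL)"
)
# Hypothetical rows, made up for this sketch.
rows = [
    (1, "east", 2022, 100.0),
    (2, "east", 2023, 150.0),
    (3, "west", 2022, 200.0),
    (4, "west", 2023, 50.0),
]
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)

# Standard DB style: a small update that finds one row through the primary key.
conn.execute("UPDATE sales SET amount = amount + 10 WHERE order_id = ?", (2,))

# Warehouse style: a read-only query that scans the history and summarizes it.
summary = conn.execute(
    "SELECT region, year, SUM(amount) FROM sales "
    "GROUP BY region, year ORDER BY region, year"
).fetchall()

for region, year, total in summary:
    print(region, year, total)
```

The update touches a single row, while the summary query reads every row, which is why the two kinds of system tend to be indexed and tuned differently.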
Warehousing and Industry
Types of Data
Business Data - represents meaning
– Real-time data (ultimate source of all business data)
– Reconciled data
– Derived data
Metadata - describes meaning
– Build-time metadata
– Control metadata
– Usage metadata
Data as a product* - intrinsic meaning
– Produced and stored for its own intrinsic value
– e.g., the contents of a text-book
Data Warehouse Architectures: Conceptual View
Single-layer
– Every data element is stored once only
– Virtual warehouse
[Figure: operational systems and informational systems both work on a single layer of “real-time data”]
Three-layer Architecture: Conceptual View
Transformation of real-time data to derived data really requires two steps
[Figure: operational systems work on real-time data; reconciled data is the physical implementation of the data warehouse; derived data is the view level, serving “particular informational needs” of informational systems]
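The two steps can be illustrated with a small Python sketch: real-time data is first reconciled into one consistent schema, then derived data is computed from the reconciled layer. The two source systems, their field names, and the currency units are assumptions made up for this example.

```python
from collections import defaultdict

# Real-time data as two hypothetical operational systems might hold it.
sales_system = [
    {"cust": "ACME Corp", "amount_usd": 120.0},
    {"cust": "Globex", "amount_usd": 80.0},
]
billing_system = [{"customer_name": "acme corp", "amount_cents": 3000}]


def reconcile(sales, billing):
    """Step 1: map both sources to one schema with consistent names and units."""
    reconciled = []
    for rec in sales:
        reconciled.append(
            {"customer": rec["cust"].strip().lower(), "amount": rec["amount_usd"]}
        )
    for rec in billing:
        reconciled.append(
            {
                "customer": rec["customer_name"].strip().lower(),
                "amount": rec["amount_cents"] / 100,
            }
        )
    return reconciled


def derive(reconciled):
    """Step 2: summarize reconciled data for one particular informational need."""
    totals = defaultdict(float)
    for rec in reconciled:
        totals[rec["customer"]] += rec["amount"]
    return dict(totals)


reconciled_data = reconcile(sales_system, billing_system)
derived_data = derive(reconciled_data)
print(derived_data)
```

Keeping the reconciled layer separate means several different derived views can be built from it without going back to the operational systems.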
Data Warehousing: Two Distinct Issues
(1) How to get information into warehouse
“Data warehousing”
(2) What to do with data once it’s in warehouse
“Warehouse DBMS”
Both rich research areas
Industry has focused on (2)
Issues in Data Warehousing
Warehouse Design
Extraction
– Wrappers, monitors (change detectors)
Integration
– Cleansing & merging
Warehousing specification & Maintenance
Optimizations
Miscellaneous (e.g., evolution)
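The "monitors (change detectors)" item under Extraction can be illustrated by comparing two snapshots of a source. This is only one possible approach, chosen here because it needs no cooperation from the source; the snapshot contents and the function name `detect_changes` are hypothetical.

```python
def detect_changes(old_snapshot, new_snapshot):
    """Compare two snapshots keyed by primary key and report the differences."""
    inserted = {k: v for k, v in new_snapshot.items() if k not in old_snapshot}
    deleted = {k: v for k, v in old_snapshot.items() if k not in new_snapshot}
    updated = {
        k: (old_snapshot[k], v)
        for k, v in new_snapshot.items()
        if k in old_snapshot and old_snapshot[k] != v
    }
    return {"inserted": inserted, "deleted": deleted, "updated": updated}


# Hypothetical snapshots: primary key -> (item name, quantity in stock).
yesterday = {1: ("widget", 10), 2: ("gadget", 5), 3: ("gizmo", 7)}
today = {1: ("widget", 10), 2: ("gadget", 4), 4: ("doohickey", 1)}

changes = detect_changes(yesterday, today)
print(changes)
```

Only the detected changes need to be forwarded to the integrator, rather than the whole source contents.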
Data Extraction
Source types
– Relational, flat file, WWW, etc.
How to get data out?
– Replication tool
– Dump file
– Create report
– ODBC or third-party “wrappers”
Warehouse Architecture
[Figure: Clients use Query & Analysis tools on the Warehouse; an Integrator loads the Warehouse; Metadata is kept alongside it]
Issues (1)
Issues (2)
Wrapper
Converts data and queries from one data model to another
[Figure: Queries and Data pass between Data Model A and Data Model B]
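A minimal sketch of the idea, assuming a flat file in CSV form as the source and attribute=value conditions as the query language. The class name `FlatFileWrapper` and the sample data are hypothetical; real wrappers handle much richer query languages and data models.

```python
import csv
import io


class FlatFileWrapper:
    """Hypothetical wrapper: presents a CSV flat file as a table of records."""

    def __init__(self, text):
        self._text = text

    def query(self, **conditions):
        """Translate attribute=value conditions into a filtered scan of the file."""
        reader = csv.DictReader(io.StringIO(self._text))
        return [
            row
            for row in reader
            if all(row.get(attr) == str(val) for attr, val in conditions.items())
        ]


# Sample flat-file contents, made up for this sketch.
source_text = "id,name,city\n1,Ann,Boston\n2,Bob,Medford\n3,Cy,Boston\n"

wrapper = FlatFileWrapper(source_text)
boston = wrapper.query(city="Boston")
print([row["name"] for row in boston])
```

The caller sees records and conditions, while the source still holds plain text lines, so the wrapper converts in both directions.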
Wrapper Generation
[Figure: a Definition is fed to a Wrapper Generator, which produces a Wrapper]
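One way to picture wrapper generation is a function that takes a declarative definition of a source format and returns a wrapper for it. The sketch below assumes a fixed-width legacy record layout; the definition format, the field names, and `generate_wrapper` are all invented for illustration.

```python
def generate_wrapper(definition):
    """Build a wrapper from a definition mapping field -> (start, end, type)."""

    def wrapper(line):
        return {
            field: cast(line[start:end].strip())
            for field, (start, end, cast) in definition.items()
        }

    return wrapper


# Hypothetical definition for a fixed-width legacy record layout.
legacy_definition = {
    "part_no": (0, 5, int),
    "name": (5, 15, str),
    "qty": (15, 19, int),
}

parse_part = generate_wrapper(legacy_definition)

# Build a sample line with the same column widths as the definition.
sample_line = "00042" + "Widget".ljust(10) + "17".rjust(4)
record = parse_part(sample_line)
print(record)
```

Supporting a new source layout then means writing a new definition rather than new wrapper code.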
Wrapper Approach
Routine When...
Many tools for dealing with “standard situations”
– Standard sources with full/many capabilities
e.g., most commercial DBMSs, all ODBC-compliant sources
– Standard interactions
e.g., pass-through queries, extraction from rel. tables, replication
– Cooperative sources or sources under our control
Tools
– Replication tools, ODBC, report writers, third-party “wrappers”
Not So Routine When...
“Non-standard situations”
– Unstructured or semistructured sources with little or no explicit schema
– Uncooperative sources
– Sources with limited capabilities (e.g., legacy sources, WWW)
Few commercial tools
Mostly research
Data Transformations
Monitors
Data Integration
Receive data (changes) from multiple wrappers/monitors and integrate into warehouse
Rule-based
Actions
– Resolve inconsistencies
– Eliminate duplicates
– Integrate into warehouse (may not be empty)
– Summarize data
– Fetch more data from sources (wh updates)
– etc.
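Several of the actions above can be sketched in a few lines of Python. The rule used to resolve inconsistencies (the most recent record wins) is an assumption chosen for this example, as are the record layout and the sample data; real integrators apply richer, configurable rules.

```python
from collections import Counter


def integrate(warehouse, incoming):
    """Merge incoming records into a (possibly non-empty) warehouse keyed by id."""
    for rec in incoming:
        cleaned = dict(rec, name=rec["name"].strip().title())
        current = warehouse.get(cleaned["id"])
        # Assumed rule: on conflict, the most recently updated record wins.
        # Duplicates share a key and timestamp, so they are not stored twice.
        if current is None or cleaned["updated"] > current["updated"]:
            warehouse[cleaned["id"]] = cleaned
    return warehouse


# The warehouse already holds data before this batch arrives.
warehouse = {
    1: {"id": 1, "name": "Ann Lee", "city": "Boston", "updated": "2001-01-05"},
}
incoming = [
    # Inconsistent with the stored record for id 1, but newer.
    {"id": 1, "name": "ann lee", "city": "Medford", "updated": "2001-02-01"},
    # The same customer reported twice with different formatting.
    {"id": 2, "name": "BOB RAY ", "city": "Boston", "updated": "2001-01-20"},
    {"id": 2, "name": "Bob Ray", "city": "Boston", "updated": "2001-01-20"},
]

integrate(warehouse, incoming)

# Summarize data: number of customers per city.
customers_by_city = dict(Counter(rec["city"] for rec in warehouse.values()))
print(customers_by_city)
```

The dates are ISO-formatted strings, so plain string comparison orders them correctly.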
Data Cleansing