This document provides an overview of data warehousing and mining. It discusses the goals of data warehousing, which include providing unified access to heterogeneous data sources and storing integrated data in advance to enable direct querying and analysis. The document contrasts the traditional query-driven approach with the warehousing approach, noting advantages of the latter such as high query performance and the ability to modify and annotate data. It defines key characteristics of a data warehouse and describes different types of data and common data warehouse architectures.


Data Warehousing/Mining

Comp 150
Data Warehousing Introduction
(not in book)
Instructor: Dan Hebert

Outline of Lecture
• Data Warehousing and Information Integration
• Brief History of Data Warehousing
• What is a Data Warehouse?
• Types of Data and Their Uses
• Data Warehouse Architectures
• Issues in Data Warehousing

Problem: Heterogeneous Information Sources
“Heterogeneities are everywhere”
[Figure: Personal Databases, Scientific Databases, the World Wide Web, and Digital Libraries]
• Different interfaces
• Different data representations
• Duplicate and inconsistent information

Problem: Data Management in Large Enterprises
• Vertical fragmentation of informational systems (vertical stove pipes)
• Result of application (user)-driven development of operational systems
[Figure: one stove pipe per department]
Sales Administration: Sales Planning, Stock Mngmt, ...
Finance: Suppliers, Debt Mngmt, ...
Manufacturing: Num. Control, Inventory, ...
...

Goal: Unified Access to Data
[Figure: an Integration System on top of the World Wide Web, Digital Libraries, Scientific Databases, and Personal Databases]
• Collects and combines information
• Provides integrated view, uniform user interface
• Supports sharing

The Traditional Research Approach
• Query-driven (lazy, on-demand)
[Figure: Clients query an Integration System (with Metadata), which reaches each Source through its own Wrapper]

Disadvantages of Query-Driven Approach
• Delay in query processing
– Slow or unavailable information sources
– Complex filtering and integration
• Inefficient and potentially expensive for frequent queries
• Competes with local processing at sources
• Hasn’t caught on in industry
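The query-driven approach can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Wrapper` and `Mediator` classes, the `calls` counter, and the sample rows are hypothetical and do not come from the lecture. The point is that nothing is stored centrally, so every client query reaches every source again.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


class Wrapper:
    """Hypothetical wrapper: fetches rows from one source in a common format."""

    def __init__(self, name: str, fetch: Callable[[], List[Row]]):
        self.name = name
        self.fetch = fetch
        self.calls = 0  # counts how often the source is contacted

    def query(self, predicate: Callable[[Row], bool]) -> List[Row]:
        self.calls += 1  # every client query reaches the source again
        return [row for row in self.fetch() if predicate(row)]


class Mediator:
    """Query-driven integration: nothing is stored, sources are asked on demand."""

    def __init__(self, wrappers: List[Wrapper]):
        self.wrappers = wrappers

    def query(self, predicate: Callable[[Row], bool]) -> List[Row]:
        results: List[Row] = []
        for wrapper in self.wrappers:
            results.extend(wrapper.query(predicate))
        return results


# Illustrative data standing in for two autonomous sources.
sales = Wrapper("sales", lambda: [{"item": "pen", "qty": 3}, {"item": "ink", "qty": 1}])
stock = Wrapper("stock", lambda: [{"item": "pen", "qty": 40}])
mediator = Mediator([sales, stock])

pens = mediator.query(lambda row: row["item"] == "pen")
pens_again = mediator.query(lambda row: row["item"] == "pen")
```

Asking the same question twice costs two round trips to each source, which is the "inefficient for frequent queries" and "competes with local processing" cost listed above.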

The Warehousing Approach
• Information integrated in advance
• Stored in warehouse for direct querying and analysis
[Figure: Clients query the Data Warehouse; an Integration System (with Metadata) loads it from the Sources, each through its own Extractor/Monitor]

Advantages of Warehousing Approach
• High query performance
– But not necessarily most current information
• Doesn’t interfere with local processing at sources
– Complex queries at warehouse
– OLTP at information sources
• Information copied at warehouse
– Can modify, annotate, summarize, restructure, etc.
– Can store historical information
– Security, no auditing
• Has caught on in industry
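For contrast, here is a minimal sketch of the warehousing approach, assuming an in-memory SQLite database as the warehouse. The table name `sales`, the `SOURCE_A` and `SOURCE_B` extracts, and both helper functions are hypothetical names chosen for illustration.

```python
import sqlite3

# Hypothetical source extracts; in practice each would come from an extractor/monitor.
SOURCE_A = [("pen", 3), ("ink", 1)]
SOURCE_B = [("pen", 5)]


def load_warehouse(conn: sqlite3.Connection) -> None:
    """Integrate all source rows in advance into one warehouse table."""
    conn.execute("CREATE TABLE sales (source TEXT, item TEXT, qty INTEGER)")
    for name, rows in (("A", SOURCE_A), ("B", SOURCE_B)):
        conn.executemany(
            "INSERT INTO sales VALUES (?, ?, ?)",
            [(name, item, qty) for item, qty in rows],
        )
    conn.commit()


def total_per_item(conn: sqlite3.Connection) -> dict:
    """Answer an analytical query from the warehouse alone; no source is contacted."""
    cursor = conn.execute(
        "SELECT item, SUM(qty) FROM sales GROUP BY item ORDER BY item"
    )
    return dict(cursor.fetchall())


warehouse = sqlite3.connect(":memory:")
load_warehouse(warehouse)
totals = total_per_item(warehouse)
```

The answer reflects the sources as of the last load, which is the "not necessarily most current information" trade-off noted above.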

Not Either-Or Decision
• Query-driven approach still better for
– Rapidly changing information
– Rapidly changing information sources
– Truly vast amounts of data from large numbers of sources
– Clients with unpredictable needs

Data Warehouse Evolution
[Figure: timeline with axis marks 1960, 1975, 1980, 1985, 1990, 1995, 2000]
Eras: “Prehistoric Times”, “Middle Ages”, Data Revolution, Information-Based Management
Above the axis: Relational Databases; Company DWs; “Building the DW”, Inmon (1992); Data Replication Tools
Below the axis: PCs and Spreadsheets; End-user Interfaces; 1st DW Article; DW Confs.; Vendor DW Frameworks

What is a Data Warehouse?
A Practitioner’s Viewpoint
“A data warehouse is simply a single, complete, and consistent store of data obtained from a variety of sources and made available to end users in a way they can understand and use it in a business context.”
-- Barry Devlin, IBM Consultant

A Data Warehouse is...
• Stored collection of diverse data
– A solution to data integration problem
– Single repository of information
• Subject-oriented
– Organized by subject, not by application
– Used for analysis, data mining, etc.
• Optimized differently from transaction-oriented db
• User interface aimed at executive

A Data Warehouse is... (continued)
• Large volume of data (GB, TB)
• Non-volatile
– Historical
– Time attributes are important
• Updates infrequent
• May be append-only
• Examples
– All transactions ever at WalMart
– Complete client histories at insurance firm
– Stockbroker financial information and portfolios

Summary
[Figure: architecture components: Business Information Guide, Business Information Interface, Data Warehouse Catalog, Data Warehouse, Data Warehouse Population, Enterprise Modeling, Operational Systems]

Warehouse is a Specialized DB

Standard DB                                  Warehouse
• Mostly updates                             • Mostly reads
• Many small transactions                    • Queries are long and complex
• MB - GB of data                            • GB - TB of data
• Current snapshot                           • History
• Index/hash on primary key                  • Lots of scans
• Raw data                                   • Summarized, reconciled data
• Thousands of users (e.g., clerical users)  • Hundreds of users (e.g., decision-makers, analysts)
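The two workloads can be contrasted on a single toy table. This is a sketch assuming SQLite; the `orders` table and its rows are made up for illustration. The first statement pair is a small keyed transaction of the kind a standard DB serves, and the last query is a read-only scan with summarization of the kind a warehouse serves.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, region TEXT, year INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, "east", 1996, 10.0),
        (2, "west", 1996, 20.0),
        (3, "east", 1997, 30.0),
        (4, "west", 1997, 40.0),
    ],
)

# Operational style: a small transaction that touches one row through the primary key.
conn.execute("UPDATE orders SET amount = amount + 5 WHERE order_id = ?", (2,))
one_order = conn.execute(
    "SELECT amount FROM orders WHERE order_id = ?", (2,)
).fetchone()[0]

# Warehouse style: a read-only query that scans history and summarizes it.
by_region_year = conn.execute(
    "SELECT region, year, SUM(amount) FROM orders "
    "GROUP BY region, year ORDER BY region, year"
).fetchall()
```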

Warehousing and Industry
• Warehousing is big business
– $2 billion in 1995
– $3.5 billion in early 1997
– Predicted: $8 billion in 1998 [Metagroup]
• WalMart has largest warehouse
– 900-CPU, 2,700 disk, 23 TB Teradata system
– ~7 TB in warehouse
– 40-50 GB per day

Types of Data
• Business Data - represents meaning
– Real-time data (ultimate source of all business data)
– Reconciled data
– Derived data
• Metadata - describes meaning
– Build-time metadata
– Control metadata
– Usage metadata
• Data as a product* - intrinsic meaning
– Produced and stored for its own intrinsic value
– e.g., the contents of a textbook

Data Warehouse Architectures:
Conceptual View
• Single-layer
– Every data element is stored once only
– Virtual warehouse
[Figure: operational systems and informational systems both work on the same “Real-time data”]
• Two-layer
– Real-time + derived data
– Most commonly used approach in industry today
[Figure: operational systems work on Real-time data; informational systems work on Derived Data]

Three-layer Architecture:
Conceptual View
• Transformation of real-time data to derived data really requires two steps
[Figure: operational systems work on Real-time data; Reconciled Data sits above it as the physical implementation of the data warehouse; Derived Data at the view level serves “particular informational needs” of the informational systems]
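The two steps can be made concrete with a small sketch. Everything here is an assumption for illustration: the two feeds, their field names, and the functions `reconcile` and `derive_totals` are hypothetical. Step 1 turns real-time data into reconciled data with one schema; step 2 turns reconciled data into derived data for one informational need.

```python
from collections import defaultdict

# Real-time data as two hypothetical operational systems record it.
realtime_sales = [{"cust": "jane doe", "amount_cents": 1250}]
realtime_web = [{"customer": "Jane Doe", "amount": "7.50"}]


def reconcile(sales, web):
    """Step 1: bring both feeds into one consistent schema (reconciled data)."""
    rows = []
    for r in sales:
        rows.append({"customer": r["cust"].title(), "amount": r["amount_cents"] / 100})
    for r in web:
        rows.append({"customer": r["customer"].title(), "amount": float(r["amount"])})
    return rows


def derive_totals(reconciled):
    """Step 2: build a view for one informational need (derived data)."""
    totals = defaultdict(float)
    for row in reconciled:
        totals[row["customer"]] += row["amount"]
    return dict(totals)


reconciled = reconcile(realtime_sales, realtime_web)
derived = derive_totals(reconciled)
```

Keeping the reconciled layer separate means a second view (for example, totals per month) could be derived later without going back to the operational systems.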

Data Warehousing: Two Distinct Issues
(1) How to get information into warehouse
“Data warehousing”
(2) What to do with data once it’s in warehouse
“Warehouse DBMS”
• Both rich research areas
• Industry has focused on (2)

Issues in Data Warehousing
• Warehouse Design
• Extraction
– Wrappers, monitors (change detectors)
• Integration
– Cleansing & merging
• Warehousing specification & Maintenance
• Optimizations
• Miscellaneous (e.g., evolution)

Data Extraction
• Source types
– Relational, flat file, WWW, etc.
• How to get data out?
– Replication tool
– Dump file
– Create report
– ODBC or third-party “wrappers”
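Of the options above, the dump file is the simplest to sketch. The example assumes a relational source reachable through Python's `sqlite3` module and a CSV dump format; the `clients` table and the helpers `dump_table` and `read_dump` are hypothetical.

```python
import csv
import io
import sqlite3


def dump_table(conn: sqlite3.Connection, table: str) -> str:
    """Write every row of `table` as CSV text, header first."""
    # The table name is interpolated directly, so it must come from trusted code.
    cursor = conn.execute(f"SELECT * FROM {table}")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([column[0] for column in cursor.description])
    writer.writerows(cursor)
    return buffer.getvalue()


def read_dump(text: str) -> list:
    """Parse the dump back into dictionaries on the warehouse side."""
    return list(csv.DictReader(io.StringIO(text)))


# Hypothetical source database with one small table.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE clients (id INTEGER, name TEXT)")
source.executemany(
    "INSERT INTO clients VALUES (?, ?)", [(1, "Jane Doe"), (2, "John Roe")]
)

dump = dump_table(source, "clients")
rows = read_dump(dump)
```

Note that the dump carries no types: every value comes back as a string, so the warehouse side has to restore them during transformation.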

Warehouse Architecture
[Figure: Clients issue Query & Analysis requests to the Warehouse; an Integrator (with Metadata) populates the Warehouse from the Sources, each through its own Extractor/Monitor]

Issues (1)
• Warehouse uses relational data model or multi-dimensional data model (e.g., data cube)
• On the other hand, source types
– Relational, OO, hierarchical, legacy
– Semistructured: flat file, WWW
• How do we get the data out?

Issues (2)
• Warehouse must be kept current in light of changes to underlying sources
• How do we detect updates in sources?

Wrapper
Converts data and queries from one data model to another
[Figure: Queries and Data translated between Data Model A and Data Model B]
Extends query capabilities for sources with limited capabilities
[Figure: Queries, Wrapper, Source]
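Both roles of a wrapper can be shown with a toy example. It assumes a legacy source that can only return all of its lines; `FlatFileSource`, `RelationalWrapper`, the `|` separator, and the column names are hypothetical. The wrapper converts delimited text into attribute/value rows and adds the selection and projection that the source itself cannot perform.

```python
from typing import Dict, List, Optional


class FlatFileSource:
    """Hypothetical legacy source: can only return all of its lines."""

    def __init__(self, text: str):
        self._text = text

    def read_all(self) -> List[str]:
        return self._text.strip().splitlines()


class RelationalWrapper:
    """Presents the flat file as a relation and adds selection and projection."""

    def __init__(self, source: FlatFileSource, columns: List[str], sep: str = "|"):
        self.source = source
        self.columns = columns
        self.sep = sep

    def select(
        self,
        where: Optional[Dict[str, str]] = None,
        project: Optional[List[str]] = None,
    ) -> List[Dict[str, str]]:
        where = where or {}
        project = project or self.columns
        result = []
        for line in self.source.read_all():
            # Data conversion: delimited text line -> dictionary of attributes.
            row = dict(zip(self.columns, line.split(self.sep)))
            # Query conversion: the source cannot filter, so the wrapper does.
            if all(row[col] == value for col, value in where.items()):
                result.append({col: row[col] for col in project})
        return result


source = FlatFileSource("1|Jane Doe|Boston\n2|John Roe|Medford\n")
wrapper = RelationalWrapper(source, ["id", "name", "city"])
boston = wrapper.select(where={"city": "Boston"}, project=["name"])
```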

Wrapper Generation
• Solution 1: Hard code for each source
• Solution 2: Automatic wrapper generation
[Figure: Definition, Wrapper Generator, Wrapper]
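Solution 2 can be sketched as a function that turns a declarative definition into a wrapper. The definition format below (a separator plus a list of field names and casts) is an assumption made for this example, as are the names `DEFINITIONS` and `generate_wrapper`; real wrapper generators use much richer specifications.

```python
from typing import Callable, Dict, List

# Hypothetical declarative wrapper definitions, one per source.
DEFINITIONS = {
    "clients_file": {"sep": "|", "fields": [("id", int), ("name", str)]},
    "orders_file": {"sep": ",", "fields": [("order_id", int), ("amount", float)]},
}


def generate_wrapper(definition: dict) -> Callable[[str], List[Dict[str, object]]]:
    """Build a parsing function from a definition instead of hand-coding it."""
    sep = definition["sep"]
    fields = definition["fields"]

    def wrapper(text: str) -> List[Dict[str, object]]:
        rows = []
        for line in text.strip().splitlines():
            values = line.split(sep)
            rows.append(
                {name: cast(value) for (name, cast), value in zip(fields, values)}
            )
        return rows

    return wrapper


wrappers = {name: generate_wrapper(d) for name, d in DEFINITIONS.items()}
clients = wrappers["clients_file"]("1|Jane Doe\n2|John Roe")
orders = wrappers["orders_file"]("10,19.99")
```

Adding a new source of the same general shape then means writing one more definition entry rather than one more hand-coded wrapper.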

Wrapper Approach
• Source-specific adapter (a.k.a. wrapper, translator)
• “Thickness” of adapter depends on source
– Data model used (e.g., relational schema vs. unstructured)
– Interface (i.e., query language, API)
– Active capabilities (i.e., triggers)
– Degree of autonomy (e.g., same owner & modifiable vs. controlled by external entity & no changes possible)
– Cooperation (e.g., friendly vs. uncooperative)

Routine When...
• Many tools for dealing with “standard situations”
– Standard sources with full/many capabilities
  • e.g., most commercial DBMSs, all ODBC-compliant sources
– Standard interactions
  • e.g., pass-through queries, extraction from relational tables, replication
– Cooperative sources or sources under our control
• Tools
– Replication tools, ODBC, report writers, third-party “wrappers”

Not So Routine When...
• “Non-standard situations”
– Unstructured or semistructured sources with little or no explicit schema
– Uncooperative sources
– Sources with limited capabilities (e.g., legacy sources, WWW)
• Few commercial tools
• Mostly research

Data Transformations
• Convert data to uniform format
– Byte ordering, string termination
– Internal layout
• Remove, add & reorder attributes
– Add key
– Add data to get history
• Sort tuples
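The transformations listed above can be strung together in a short sketch. The 12-byte record layout (a big-endian 32-bit integer followed by an 8-byte NUL-padded name), the `loaded` attribute, and the function names are assumptions made for this example.

```python
import struct


def decode_record(raw: bytes) -> dict:
    """Decode a hypothetical 12-byte source record: big-endian int32, then an
    8-byte name field padded with NUL bytes."""
    (amount,) = struct.unpack(">i", raw[:4])  # byte ordering made explicit
    name = raw[4:12].split(b"\x00", 1)[0].decode("ascii")  # cut at the terminator
    return {"name": name, "amount": amount}


def transform(raw_records, load_date: str) -> list:
    rows = [decode_record(raw) for raw in raw_records]
    rows.sort(key=lambda row: row["name"])  # sort tuples
    result = []
    for key, row in enumerate(rows, start=1):
        # Add a surrogate key, add a load date to keep history, reorder attributes.
        result.append(
            {"key": key, "name": row["name"], "amount": row["amount"], "loaded": load_date}
        )
    return result


# Two illustrative records packed the way the assumed source would store them.
raw = [struct.pack(">i8s", 300, b"pen"), struct.pack(">i8s", 7, b"ink")]
rows = transform(raw, "1998-01-01")
```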

Monitors
• Goal: Detect changes of interest and propagate to integrator
• How?
– Triggers
– Replication server
– Log sniffer
– Compare query results
– Compare snapshots/dumps
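The last technique, comparing snapshots, is the easiest to illustrate because it needs no cooperation from the source. The sketch assumes each snapshot fits in memory as a dictionary keyed by primary key; the `Snapshot` alias, the change labels, and the sample data are hypothetical.

```python
from typing import Dict, List, Tuple

Snapshot = Dict[int, Tuple]  # primary key -> rest of the tuple


def diff_snapshots(old: Snapshot, new: Snapshot) -> List[Tuple[str, int, Tuple]]:
    """Return (kind, key, tuple) changes, sorted by key within each kind."""
    changes = []
    for key in sorted(new.keys() - old.keys()):
        changes.append(("insert", key, new[key]))
    for key in sorted(old.keys() - new.keys()):
        changes.append(("delete", key, old[key]))
    for key in sorted(old.keys() & new.keys()):
        if old[key] != new[key]:
            changes.append(("update", key, new[key]))
    return changes


yesterday = {1: ("Jane Doe", "Boston"), 2: ("John Roe", "Medford")}
today = {1: ("Jane Doe", "Somerville"), 3: ("Ann Poe", "Boston")}
changes = diff_snapshots(yesterday, today)
```

A snapshot diff only sees the net effect between two dumps; a row that changed twice in between shows up as a single update, which triggers or a log sniffer would not miss.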

Data Integration
• Receive data (changes) from multiple wrappers/monitors and integrate into warehouse
• Rule-based
• Actions
– Resolve inconsistencies
– Eliminate duplicates
– Integrate into warehouse (may not be empty)
– Summarize data
– Fetch more data from sources (warehouse updates)
– etc.
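A rule-based integrator can be sketched as an ordered list of (condition, action) pairs. The change format (`id`, `row`, `ts`), the "newest value wins" policy, and all function names below are assumptions for illustration; the lecture does not prescribe a rule language.

```python
from typing import Any, Callable, Dict, List, Tuple

Change = Dict[str, Any]
Warehouse = Dict[int, Change]


def is_duplicate(change: Change, wh: Warehouse) -> bool:
    stored = wh.get(change["id"])
    return stored is not None and stored["row"] == change["row"]


def is_stale(change: Change, wh: Warehouse) -> bool:
    stored = wh.get(change["id"])
    return stored is not None and change["ts"] < stored["ts"]


def skip(change: Change, wh: Warehouse) -> None:
    """Action: leave the warehouse unchanged."""


def store(change: Change, wh: Warehouse) -> None:
    """Action: insert or overwrite the warehouse row."""
    wh[change["id"]] = change


# Ordered (condition, action) rules; the first matching rule fires.
RULES: List[Tuple[Callable, Callable]] = [
    (is_duplicate, skip),  # eliminate duplicates
    (is_stale, skip),  # resolve inconsistencies: assumed policy, newest value wins
    (lambda change, wh: True, store),  # otherwise integrate into the warehouse
]


def integrate(changes: List[Change], wh: Warehouse) -> Warehouse:
    for change in changes:
        for condition, action in RULES:
            if condition(change, wh):
                action(change, wh)
                break
    return wh


# The warehouse may already hold data when changes arrive.
warehouse: Warehouse = {1: {"id": 1, "row": ("Jane Doe", "Boston"), "ts": 5}}
incoming = [
    {"id": 1, "row": ("Jane Doe", "Boston"), "ts": 6},  # duplicate of stored row
    {"id": 1, "row": ("Jane Doe", "Medford"), "ts": 3},  # older, conflicting value
    {"id": 2, "row": ("John Roe", "Medford"), "ts": 6},  # new client
]
integrate(incoming, warehouse)
```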

Data Cleansing
• Find (& remove) duplicate tuples
– e.g., Jane Doe vs. Jane Q. Doe
• Detect inconsistent, wrong data
– Attribute values that don’t match
• Patch missing, unreadable data
• Notify sources of errors found
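The "Jane Doe vs. Jane Q. Doe" case can be caught with a simple normalization key. This is a deliberately crude heuristic, assumed here for illustration only: it compares first and last name after dropping case, punctuation, and middle parts. The names `name_key` and `find_duplicates` are hypothetical.

```python
import re
from collections import defaultdict


def name_key(name: str) -> tuple:
    """Normalize a name to (first, last), dropping case, punctuation, and middle parts."""
    parts = re.sub(r"[^a-z\s]", "", name.lower()).split()
    return (parts[0], parts[-1]) if parts else ()


def find_duplicates(names):
    """Group names that share a key; groups with more than one member are suspects."""
    groups = defaultdict(list)
    for name in names:
        groups[name_key(name)].append(name)
    return [group for group in groups.values() if len(group) > 1]


dupes = find_duplicates(["Jane Doe", "Jane Q. Doe", "John Roe", "JANE DOE"])
```

Because two different people can share a first and last name, groups found this way are better treated as candidates for review than removed automatically.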
