Lecture # 1-2-Intro
Book:
Building the Data Warehouse
W. H. Inmon
Fourth Edition
John Wiley & Sons.
2005.
A producer wants to know…
– Which are our lowest/highest margin customers?
– Who are my customers and what products are they buying?
– What is the most effective distribution channel?
USER INTERFACE: how the user enters a problem and receives answers
DSS SOFTWARE SYSTEM: models, OLAP tools, data mining tools
DSS DATABASE: current data from applications or groups
DATA MINING: technology for finding relationships in large databases for prediction
[Figure: USER, USER INTERFACE, DSS SOFTWARE SYSTEM, DSS DATABASE]
Why do we use DSS?
Increasing complexity of decisions
– Technology
– Information:
“Data, data everywhere, and not the time to think!”
– Number and complexity of options
– Pace of change
Increasing availability of computerized support
– Inexpensive high-powered computing
– Better software
– More efficient software development process
Increasing usability of computers
Operational Database
Operational database management systems are used to
manage dynamic data in real-time.
Data warehouse Introduction
“A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision-making process.” – W. H. Inmon
[Figure: Data Warehouse surrounded by its four properties: Subject Oriented, Integrated, Time Variant, Non-volatile]
Data warehouse Usage
Three kinds of data warehouse applications
– Information processing
supports querying, basic statistical analysis, and reporting using
crosstabs, tables, charts and graphs
– Analytical processing
multidimensional analysis of data warehouse data
supports basic OLAP operations, slice-dice, drilling, pivoting
– Data mining
knowledge discovery from hidden patterns
supports associations, constructing analytical models, performing
classification and prediction, and presenting the mining results
using visualization tools.
Differences among the three tasks
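The basic OLAP operations named above (slice, dice, drilling, pivoting) can be sketched on a tiny in-memory data set. This is only an illustrative sketch: the `SALES` rows, the dimension names, and the helper functions are hypothetical and not part of the lecture material; a real warehouse would run these operations through an OLAP server.

```python
from collections import defaultdict

# Hypothetical fact rows: (product, region, quarter, sales amount).
SALES = [
    ("pen", "north", "Q1", 100),
    ("pen", "south", "Q1", 80),
    ("pen", "north", "Q2", 120),
    ("ink", "north", "Q1", 50),
    ("ink", "south", "Q2", 70),
]
DIMS = ("product", "region", "quarter")


def slice_(rows, dim, value):
    """Slice: fix one dimension to a single value."""
    i = DIMS.index(dim)
    return [r for r in rows if r[i] == value]


def dice(rows, **allowed):
    """Dice: keep rows whose dimensions fall in the given value sets."""
    return [
        r for r in rows
        if all(r[DIMS.index(d)] in vals for d, vals in allowed.items())
    ]


def roll_up(rows, dim):
    """Drill up (roll-up): total sales grouped by one dimension."""
    i = DIMS.index(dim)
    totals = defaultdict(int)
    for r in rows:
        totals[r[i]] += r[3]
    return dict(totals)


def pivot(rows, row_dim, col_dim):
    """Pivot: cross-tabulate sales by two dimensions."""
    ri, ci = DIMS.index(row_dim), DIMS.index(col_dim)
    table = defaultdict(lambda: defaultdict(int))
    for r in rows:
        table[r[ri]][r[ci]] += r[3]
    return {k: dict(v) for k, v in table.items()}
```

For example, `roll_up(SALES, "product")` totals sales per product, while `pivot(SALES, "region", "quarter")` produces a region-by-quarter crosstab of the same facts.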
Data warehouse: Subject Oriented
Organized around major subjects, such as customer, product,
sales.
Data warehouse: Subject Oriented
[Figure: Operational vs. Data Warehouse]
Data warehouse: Integrated
Constructed by integrating multiple, heterogeneous data
sources
– relational databases, flat files, on-line transaction records
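As a rough sketch of what integrating heterogeneous sources involves, the example below merges records from three differently shaped sources into one consistent convention. The sources, field names, and the gender encodings are hypothetical and chosen only for illustration.

```python
# Hypothetical records from three heterogeneous sources.
relational_rows = [{"cust_id": 1, "gender": "M"}]
flat_file_lines = ["2;female"]
transaction_log = [("3", 1)]  # (customer id, gender flag: 1 = male, 0 = female)

# One warehouse convention for all source encodings (assumed mapping).
GENDER_CODES = {"m": "M", "male": "M", "1": "M",
                "f": "F", "female": "F", "0": "F"}


def normalize(cust_id, gender):
    """Map every source encoding onto one warehouse convention."""
    return {"cust_id": int(cust_id), "gender": GENDER_CODES[str(gender).lower()]}


def integrate():
    """Combine all three sources into one consistently encoded list."""
    rows = [normalize(r["cust_id"], r["gender"]) for r in relational_rows]
    rows += [normalize(*line.split(";")) for line in flat_file_lines]
    rows += [normalize(cid, flag) for cid, flag in transaction_log]
    return sorted(rows, key=lambda r: r["cust_id"])
```

After `integrate()`, every customer record uses the same key type and the same gender code, regardless of which source it came from.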
Data warehouse: Time Varying
The time horizon for the data warehouse is significantly longer
than that of operational systems.
– Operational database: current value data.
– Data warehouse data: provide information from a historical
perspective (e.g., past 5-10 years)
Every key structure in the data warehouse
– Contains an element of time, explicitly or implicitly
– But the key of operational data may or may not contain
“time element”.
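The difference in key structure can be sketched as follows. The operational store is keyed by customer only and holds the current value, while the warehouse key includes an explicit time element. The customer id, cities, and dates are hypothetical illustration data.

```python
from datetime import date

operational = {}   # key: customer id -> current value only
warehouse = {}     # key: (customer id, snapshot date) -> value as of that date


def update_operational(cust_id, city):
    """Operational update: overwrites the previous value."""
    operational[cust_id] = city


def snapshot(as_of):
    """Copy the current operational values into the warehouse under a dated key."""
    for cust_id, city in operational.items():
        warehouse[(cust_id, as_of)] = city


update_operational(7, "Lahore")
snapshot(date(2004, 1, 1))
update_operational(7, "Karachi")
snapshot(date(2005, 1, 1))

# The warehouse can answer "where was customer 7 over time?"; the
# operational store only knows the current city.
history = sorted((d, c) for (cid, d), c in warehouse.items() if cid == 7)
```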
Data warehouse: Time Varying
[Figure: Operational vs. Data Warehouse]
Data warehouse: Non-Volatile
A physically separate store of data transformed from the
operational environment.
Operational update of data does not occur in the data
warehouse environment.
– Does not require transaction processing, recovery, and
concurrency control mechanisms
– Requires only two operations in data accessing:
initial loading of data and access of data.
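A minimal sketch of the "two operations only" idea: the toy class below exposes loading and read access and deliberately offers no update or delete. The class name and sample rows are hypothetical.

```python
class WarehouseStore:
    """Toy store exposing only the two operations: load and access."""

    def __init__(self):
        self._rows = []

    def load(self, rows):
        """Append a batch of rows; existing rows are never modified."""
        self._rows.extend(tuple(r) for r in rows)

    def access(self, predicate=lambda row: True):
        """Read-only access: return a new list of matching immutable rows."""
        return [r for r in self._rows if predicate(r)]


store = WarehouseStore()
store.load([("2004", "pen", 100), ("2005", "pen", 120)])
result = store.access(lambda r: r[1] == "pen")
```

Because there are no in-place updates, this toy store needs no transaction, recovery, or concurrency control logic, which mirrors the point made above.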
Data warehouse: Non-Volatile
[Figure: Operational (insert, change, delete, replace) vs. Data Warehouse (load, read-only access)]
Data, Data everywhere yet ...
• I can’t find the data I need
– data is scattered over the network
– many versions, subtle differences
• I can’t get the data I need
– need an expert to get the data
• I can’t understand the data I found
– available data poorly documented
Difference between Database and data warehouse
Characteristic
– Database: based on operational processing.
– Data warehouse: based on informational processing.
Data
– Database: mainly stores current data, which is always guaranteed to be up to date.
– Data warehouse: usually stores historical data, whose accuracy is maintained over time.
User
– Database: common users are clerks, DBAs and database professionals.
– Data warehouse: common users are knowledge workers (e.g., managers, executives, analysts).
Unit of work
– Database: short, simple transactions.
– Data warehouse: complex queries.
Summarization
– Database: the data is primitive and highly detailed.
– Data warehouse: the data is summarized and in consolidated form.
View
– Database: the view of the data is flat relational.
– Data warehouse: the view of the data is multidimensional.
Difference between Database and data warehouse (continued)
Function
– Database: used for day-to-day operations.
– Data warehouse: used for long-term informational requirements and decision support.
Access
– Database: the most frequent type of access is read/write.
– Data warehouse: mostly read access to the stored data.
Operations
– Database: the main operation is an index/hash lookup on the primary key.
– Data warehouse: most operations need a lot of scans.
Number of records accessed
– Database: a few tens of records.
– Data warehouse: millions of records.
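The contrast in unit of work, access type, and records accessed can be sketched with Python's built-in `sqlite3` module. The `orders` table and its rows are hypothetical, and both styles run against one in-memory database here only to keep the sketch short; in practice the warehouse is a separate store.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, "
    "year INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(1, "ann", 2004, 10.0), (2, "bob", 2004, 20.0),
     (3, "ann", 2005, 30.0), (4, "bob", 2005, 40.0)],
)

# OLTP-style unit of work: short read/write touching one row via the primary key.
conn.execute("UPDATE orders SET amount = 15.0 WHERE order_id = 1")
one_row = conn.execute(
    "SELECT customer, amount FROM orders WHERE order_id = 1"
).fetchone()

# Warehouse-style query: read-only scan over all rows, returning a summary.
summary = conn.execute(
    "SELECT year, SUM(amount) FROM orders GROUP BY year ORDER BY year"
).fetchall()
```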
[Figure: Source, Source, Source; Integration; Metadata; Warehouse]
Why a Warehouse?
Two Approaches:
– Query-Driven (Lazy)
– Warehouse (Eager)
The Traditional Research Approach
Query-driven (lazy, on-demand)
[Figure: Clients querying Source, Source, Source on demand]
Disadvantages of Query-Driven Approach
Delay in query processing
Slow or unavailable information sources
Complex filtering and integration
Inefficient and potentially expensive for frequent queries
Competes with local processing at sources
Hasn’t caught on in industry
The Warehousing Approach
Information integrated in advance
Stored in warehouse for direct querying and analysis
[Figure: Clients; Data Warehouse; Integration System with Metadata; Source, Source, Source]
Advantages of Warehousing Approach
High query performance
– But not necessarily most current information
Doesn’t interfere with local processing at sources
– Complex queries at warehouse
– OLTP at information sources
Information copied at warehouse
– Can modify, annotate, summarize, restructure, etc.
– Can store historical information
– Security, no auditing
Has caught on in industry
Not Either-Or Decision
Query-driven approach still better for
– Rapidly changing information
– Rapidly changing information sources
– Truly vast amounts of data from large numbers of sources
– Clients with unpredictable needs
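The trade-off between the two approaches can be sketched in a few lines. The sources and product counts below are hypothetical; the point is only that the lazy path contacts every source per query, while the eager path answers from a copy integrated in advance, which may be stale.

```python
# Hypothetical sources: each maps product -> units sold at that site.
SOURCES = [
    {"pen": 3, "ink": 1},
    {"pen": 2},
    {"ink": 4, "pad": 5},
]


def query_driven_total(product):
    """Lazy: contact every source at query time and integrate the answers."""
    return sum(src.get(product, 0) for src in SOURCES)


def build_warehouse():
    """Eager: integrate all sources in advance into one stored copy."""
    warehouse = {}
    for src in SOURCES:
        for product, units in src.items():
            warehouse[product] = warehouse.get(product, 0) + units
    return warehouse


WAREHOUSE = build_warehouse()


def warehouse_total(product):
    """Answer from the stored copy without touching the sources."""
    return WAREHOUSE.get(product, 0)


# A later change at a source is visible to the lazy query but not to the
# warehouse until the next refresh.
SOURCES[1]["pen"] = 10
```

After the final update, `query_driven_total("pen")` reflects the new value, whereas `warehouse_total("pen")` still returns the figure from the last load. This is the "not necessarily most current information" caveat, and it is why rapidly changing information favours the query-driven approach.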
Data Warehouse? A Practitioner’s Viewpoint
“A data warehouse is simply a single, complete, and
consistent store of data obtained from a variety of
sources and made available to end users in a way they
can understand and use it in a business context.”
-- Barry Devlin, IBM Consultant
Data Warehouse Architectures: Conceptual View
Two-layer
– Real-time + derived data
– Most commonly used approach in industry today
[Figure: Operational systems (Real-time data) and Informational systems (Derived Data)]
Three-layer Architecture: Conceptual View
Transformation of real-time data to derived data really requires two steps
[Figure: Operational systems and Informational systems; layers from bottom to top: Real-time data; Reconciled Data (physical implementation of the data warehouse); Derived Data (view level, “particular informational needs”)]
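The two-step transformation can be sketched as two small functions: one reconciles raw real-time records into a clean, consistent form, and one derives a view for a particular informational need. The records, field names, and cleaning rules are hypothetical.

```python
# Real-time data: hypothetical raw records from two operational systems.
real_time = [
    {"system": "billing", "cust": " Ann ", "amount": "10.50"},
    {"system": "web", "cust": "ann", "amount": "4.50"},
    {"system": "web", "cust": "Bob", "amount": "7.00"},
]


def reconcile(records):
    """Step 1: clean and unify real-time data into reconciled data."""
    return [
        {"customer": r["cust"].strip().lower(), "amount": float(r["amount"])}
        for r in records
    ]


def derive_totals(reconciled):
    """Step 2: derive a view for one particular informational need."""
    totals = {}
    for r in reconciled:
        totals[r["customer"]] = totals.get(r["customer"], 0.0) + r["amount"]
    return totals


reconciled_data = reconcile(real_time)
derived_data = derive_totals(reconciled_data)
```

Keeping the reconciled layer separate means several different derived views can be built from the same cleaned data without repeating the reconciliation step.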
Data Warehousing: Two Distinct Issues
(1) How to get information into warehouse
“Data warehousing”
(2) What to do with data once it’s in warehouse
“Warehouse DBMS”
Both rich research areas
Industry has focused on (2)
Issues in Data Warehousing
Warehouse Design
Extraction
– Wrappers, monitors (change detectors)
Integration
– Cleansing & merging
Warehousing specification & Maintenance
Optimizations
Miscellaneous (e.g., evolution)
OLTP vs. OLAP
OLTP: On Line Transaction Processing
Describes processing at operational sites
[Figure: A three-tier data warehousing architecture. Labels: middle tier; bottom tier (data warehouse server); back-end tools]
1) Bottom tier:- The bottom tier is a warehouse database server that is almost always a relational database system.
Back-end tools and utilities are used to feed data into the bottom tier from operational databases or other external sources. These tools and utilities perform data extraction, cleaning and transformation, as well as load and refresh functions to update the data warehouse.
The data are extracted using application program interfaces known as gateways.
Examples of gateways are ODBC (Open Database Connectivity) and OLE DB (Object Linking and Embedding, Database) by Microsoft, and JDBC (Java Database Connectivity).
This tier also contains a metadata repository, which stores information about the data warehouse and its contents.
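The back-end functions described above (extraction, cleaning, transformation, load) can be sketched as a tiny pipeline. This is an assumption-laden illustration: it uses Python's built-in `sqlite3` module as a stand-in for the warehouse server rather than an ODBC, OLE DB, or JDBC gateway, and the `sales` and `metadata` tables and the source rows are hypothetical.

```python
import sqlite3

# Hypothetical extract from an operational source (normally read via a gateway).
source_rows = [("1", " Pen ", "10.0"), ("2", "ink", None), ("3", "PAD", "5.5")]


def clean(rows):
    """Cleaning: drop rows with missing amounts."""
    return [r for r in rows if r[2] is not None]


def transform(rows):
    """Transformation: cast types and normalize product names."""
    return [(int(i), name.strip().lower(), float(amt)) for i, name, amt in rows]


def load(conn, rows):
    """Load: write rows into the warehouse table and record metadata."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales "
                 "(id INTEGER PRIMARY KEY, product TEXT, amount REAL)")
    conn.execute("CREATE TABLE IF NOT EXISTS metadata "
                 "(table_name TEXT, rows_loaded INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.execute("INSERT INTO metadata VALUES (?, ?)", ("sales", len(rows)))
    conn.commit()


warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(clean(source_rows)))
loaded = warehouse.execute("SELECT product, amount FROM sales ORDER BY id").fetchall()
meta = warehouse.execute("SELECT table_name, rows_loaded FROM metadata").fetchall()
```

The `metadata` table here is a very small stand-in for the metadata repository: it records what was loaded, alongside the data itself.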
2) Middle tier:- The middle tier is an OLAP server that is typically implemented using either:-
Note:-
OLAP – Online Analytical Processing:
This is the major task of a data warehousing system.
Useful for complex data analysis and decision making.
Market-oriented: used by managers, executives and data analysts.
Needs for Data Warehousing
Better business intelligence for end-users