Lecture 4
What Is a Data Warehouse?
• Defined in many different ways, but not rigorously.
– A decision support database that is maintained separately from
the organization’s operational database
– Supports information processing by providing a solid platform of
consolidated, historical data for analysis.
• “A data warehouse is a subject-oriented, integrated, time-variant,
and nonvolatile collection of data in support of management’s
decision-making process.”—W. H. Inmon
• Data warehousing:
– The process of constructing and using data warehouses
Data Warehouse: Subject-Oriented
Data Warehouse: Integrated
Data Warehouse: Time-Variant
Data Warehouse: Nonvolatile
Why a Separate Data Warehouse?
• High performance for both systems
– DBMS: tuned for OLTP (access methods, indexing, concurrency
control, recovery)
– Warehouse: tuned for OLAP (complex OLAP queries,
multidimensional view, consolidation)
• Different functions and different data:
– missing data: Decision support requires historical data which
operational DBs do not typically maintain
– data consolidation: decision support requires consolidation
(aggregation, summarization) of data from heterogeneous sources
– data quality: different sources typically use inconsistent data
representations, codes and formats which have to be reconciled
• Note: There are more and more systems which perform OLAP
analysis directly on relational databases
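The OLTP/OLAP contrast above can be sketched with a toy example. This is only an illustration, assuming an in-memory SQLite database and a hypothetical `sales` table; neither comes from the lecture. The OLTP-style query touches one row through the primary key, while the OLAP-style query scans and consolidates many rows along two dimensions.

```python
import sqlite3

# Hypothetical operational table, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales ("
    "order_id INTEGER PRIMARY KEY, region TEXT, year INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        (1, "north", 2022, 100.0),
        (2, "north", 2023, 150.0),
        (3, "south", 2022, 80.0),
        (4, "south", 2023, 120.0),
        (5, "north", 2023, 50.0),
    ],
)

# OLTP-style access: a short lookup of a single row via the primary key.
oltp_row = conn.execute(
    "SELECT amount FROM sales WHERE order_id = ?", (3,)
).fetchone()

# OLAP-style access: scan all rows and aggregate by region and year.
olap_rows = conn.execute(
    "SELECT region, year, SUM(amount) FROM sales "
    "GROUP BY region, year ORDER BY region, year"
).fetchall()

print(oltp_row)
print(olap_rows)
```

On a real system the two workloads compete for the same resources, which is one reason the slide gives for keeping them on separately tuned systems.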
Data Warehouse: A Multi-Tiered Architecture
[Figure: multi-tiered architecture. Labels: Operational DBs and other sources; Extract, Transform, Load, Refresh; Monitor & Integrator; Metadata; Data Warehouse and Data Marts; OLAP Server (Serve); Analysis, Query, Reports, Data mining.]
Data Warehouse Models
• Enterprise warehouse
– collects all of the information about subjects spanning
the entire organization
• Data Mart
– a subset of corporate-wide data that is of value to a
specific group of users. Its scope is confined to
specific, selected groups, such as a marketing data mart
– Independent vs. dependent (sourced directly from the warehouse) data marts
• Virtual warehouse
– A set of views over operational databases
– Only some of the possible summary views may be
materialized
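A minimal sketch of the virtual warehouse idea, assuming an in-memory SQLite database and a hypothetical `orders` table (both invented for illustration). SQLite has no materialized views, so a snapshot table created from the view stands in for a materialized summary here; that substitution is an assumption of this sketch, not something the lecture prescribes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY, product TEXT, qty INTEGER, unit_price REAL
    );
    INSERT INTO orders VALUES
        (1, 'widget', 2, 5.0), (2, 'gadget', 1, 20.0), (3, 'widget', 4, 5.0);

    -- Virtual summary: recomputed from the operational table on every query.
    CREATE VIEW revenue_by_product AS
        SELECT product, SUM(qty * unit_price) AS revenue
        FROM orders GROUP BY product;

    -- Stand-in for a materialized summary: a stored snapshot of the view.
    CREATE TABLE revenue_by_product_mat AS
        SELECT product, revenue FROM revenue_by_product;
    """
)

# A new operational row arrives after the snapshot was taken.
conn.execute("INSERT INTO orders VALUES (4, 'gadget', 2, 20.0)")

virtual = dict(
    conn.execute("SELECT product, revenue FROM revenue_by_product").fetchall()
)
materialized = dict(
    conn.execute("SELECT product, revenue FROM revenue_by_product_mat").fetchall()
)

print(virtual)       # reflects the new order
print(materialized)  # stale until the snapshot is rebuilt
```

The view always reflects current operational data but pays the aggregation cost on every query; the stored snapshot is cheap to read but must be refreshed, which is the trade-off behind materializing only some summary views.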
Extraction, Transformation, and Loading (ETL)
• Data extraction
– get data from multiple, heterogeneous, and external
sources
• Data cleaning
– detect errors in the data and rectify them when possible
• Data transformation
– convert data from legacy or host format to warehouse
format
• Load
– sort, summarize, consolidate, compute views, check
integrity, and build indices and partitions
• Refresh
– propagate the updates from the data sources to the
warehouse
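The five steps above can be sketched end to end in a few lines. This is a toy pipeline under stated assumptions: the two sources (`CSV_SOURCE`, `API_SOURCE`), the `COUNTRY_CODES` mapping, and the `fact_sales` table are all hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical heterogeneous sources with inconsistent representations.
CSV_SOURCE = "id,country,amount\n1,US,10.50\n2,usa,abc\n3,DE,7.25\n"
API_SOURCE = [{"id": 4, "country": "Germany", "amount": 3.0}]

# Hypothetical mapping used to reconcile country codes across sources.
COUNTRY_CODES = {"us": "US", "usa": "US", "de": "DE", "germany": "DE"}


def extract():
    """Extraction: gather rows from multiple sources."""
    return list(csv.DictReader(io.StringIO(CSV_SOURCE))) + API_SOURCE


def clean(rows):
    """Cleaning: drop rows whose amount is not a valid number."""
    good = []
    for row in rows:
        try:
            float(row["amount"])
        except (TypeError, ValueError):
            continue
        good.append(row)
    return good


def transform(rows):
    """Transformation: convert source formats to the warehouse format."""
    return [
        (int(r["id"]), COUNTRY_CODES[str(r["country"]).lower()], float(r["amount"]))
        for r in rows
    ]


def load(conn, records):
    """Load: sort, insert, and build an index."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_sales "
        "(id INTEGER PRIMARY KEY, country TEXT, amount REAL)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO fact_sales VALUES (?, ?, ?)", sorted(records)
    )
    conn.execute("CREATE INDEX IF NOT EXISTS idx_country ON fact_sales(country)")
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(clean(extract())))

# Refresh: propagate a later source update into the warehouse.
load(conn, [(5, "US", 1.0)])

row_count = conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
totals = dict(
    conn.execute(
        "SELECT country, SUM(amount) FROM fact_sales GROUP BY country"
    ).fetchall()
)
print(row_count, totals)
```

The row with amount `abc` is dropped during cleaning, and `usa` and `Germany` are reconciled to `US` and `DE` during transformation. Real ETL tooling would also log rejected rows rather than discarding them silently.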
Metadata Repository