Data Warehouse
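Liu Ying, Ph.D., Professor
School of Computer Science and Technology, University of Chinese Academy of Sciences
Data Mining and High Performance Computing Laboratory, University of Chinese Academy of Sciences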
Knowledge Discovery (KDD) Process
▪ Data mining: core of the knowledge discovery process
[Figure: KDD process flow: Data Cleaning and Integration → Data Warehouse → Selection and Transformation → Data Mining → Pattern Evaluation]
2022-03-02 3
What is Data Warehouse?
◼ “A data warehouse is a subject-oriented, integrated,
time-variant, and nonvolatile collection of data in support
of management’s decision-making process.” — W. H.
Inmon
◼ Defined in many different ways, but not rigorously
▪ A decision support database that is maintained separately from
the organization’s operational database
▪ Support information processing by providing a solid platform of
consolidated, historical data for analysis
Data Warehouse
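◼ A data warehouse integrates business data scattered across
different information islands in the enterprise network and stores
it in a single integrated relational database. This integrated
information makes access easier for users and lets decision makers
analyze historical data over a period of time to study how things
are trending. (Informix)
◼ A data warehouse is a management technology that aims to
provide effective decision support through smooth, well-organized,
and comprehensive information management. (SAS Institute)
◼ A data warehouse is a repository of integrated information
that is available for querying or analysis. (Stanford University)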
Example
◼ Customer relationship management
Data Warehouse—Subject-Oriented
◼ Organized around major subjects, such as customer,
product, sales
◼ Focus on the modeling and analysis of data for
decision makers, not on daily operations or
transaction processing
◼ Provide a simple and concise view around particular
subject issues by excluding data that are not useful
in the decision support process
Data Warehouse—Integrated
◼ Constructed by integrating multiple, heterogeneous
data sources
▪ relational databases, flat files, on-line transaction records
◼ Data cleaning and data integration techniques are
applied
▪ Ensure consistency in naming conventions, encoding
structures, attribute measures, etc. among different data
sources
▪ When data is moved to the warehouse, it is converted
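The conversion step described above can be sketched in a few lines of Python. This is only an illustration: the two source layouts, the field names (`cust_id`, `CustomerNo`, `sex`, `gender`, `amount_usd`, `amount_cents`), and the target conventions are all hypothetical and do not come from the slides.

```python
# Hypothetical records from two operational sources that describe the same
# kind of fact with different names, encodings, and units.
crm_rows = [{"cust_id": 7, "sex": "F", "amount_usd": 12.5}]
billing_rows = [{"CustomerNo": "0007", "gender": 0, "amount_cents": 990}]

# Assumed warehouse conventions: integer customer_id, gender as "F"/"M",
# monetary amounts in US dollars.
GENDER_CODES = {0: "F", 1: "M", "F": "F", "M": "M"}


def from_crm(row):
    """Convert a CRM row to the warehouse layout (naming convention only)."""
    return {
        "customer_id": int(row["cust_id"]),
        "gender": GENDER_CODES[row["sex"]],
        "amount_usd": float(row["amount_usd"]),
    }


def from_billing(row):
    """Convert a billing row: rename fields, decode gender, cents to dollars."""
    return {
        "customer_id": int(row["CustomerNo"]),
        "gender": GENDER_CODES[row["gender"]],
        "amount_usd": row["amount_cents"] / 100,
    }


# After conversion, rows from both sources share one consistent layout.
warehouse_rows = [from_crm(r) for r in crm_rows] + [from_billing(r) for r in billing_rows]
```

Each source gets its own small converter, so adding a source means adding a function rather than touching the warehouse layout.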
Data Warehouse—Time Variant
◼ The time horizon for the data warehouse is
significantly longer than that of operational systems
▪ Operational database: current value data
▪ Data warehouse data: provide information from a historical
perspective (e.g., past 5-10 years)
◼ Every key structure in the data warehouse
▪ Contains an element of time, explicitly or implicitly
▪ But the key of operational data may or may not contain
“time element”
Data Warehouse—Nonvolatile
◼ A physically separate store of data transformed
from the operational environment
◼ Operational update of data does not occur in the
data warehouse environment
▪ Does not require transaction processing, recovery,
and concurrency control mechanisms
▪ Requires only two operations in data accessing:
• initial loading of data and access of data
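The two-operation access pattern can be illustrated with a toy store that offers loading and reading but no in-place update or delete. The class name `WarehouseTable` and its methods are hypothetical, chosen only for this sketch.

```python
class WarehouseTable:
    """Toy nonvolatile store: rows can be loaded and read, never updated in place."""

    def __init__(self):
        self._rows = []

    def load(self, rows):
        """Append a batch of rows (the loading operation)."""
        self._rows.extend(dict(r) for r in rows)

    def scan(self, predicate=lambda row: True):
        """Read-only access: return copies of the rows matching the predicate."""
        return [dict(r) for r in self._rows if predicate(r)]


sales = WarehouseTable()
sales.load([
    {"year": 2020, "item": "TV", "units_sold": 3},
    {"year": 2021, "item": "TV", "units_sold": 5},
])
tv_2021 = sales.scan(lambda r: r["year"] == 2021)
```

Because `scan` hands out copies, a caller that modifies a result does not change the stored data, which mirrors the absence of operational updates in the warehouse.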
Data Warehouse vs. Operational DBMS
◼ OLTP (on-line transaction processing)
▪ Major task of traditional relational DBMS
▪ Day-to-day operations: e.g. purchasing, inventory, banking,
manufacturing, payroll, registration, accounting, etc.
◼ OLAP (on-line analytical processing)
▪ Major task of data warehouse system
▪ Data analysis and decision making
◼ Distinct features (OLTP vs. OLAP):
▪ User and system orientation: customer vs. market
▪ Data contents: current, detailed vs. historical, consolidated
▪ View: current, local vs. evolutionary, integrated
▪ Access patterns: update vs. read-only but complex queries
OLTP vs. OLAP
 | OLTP | OLAP
users | clerk, IT professional | knowledge worker
function | day-to-day operations | decision support
DB design | application-oriented | subject-oriented
data | current, up-to-date, detailed, flat relational, isolated | historical, summarized, multidimensional, integrated, consolidated
usage | repetitive | ad-hoc
access | read/write, index/hash on prim. key | lots of scans
unit of work | short, simple transaction | complex query
# records accessed | tens | millions
# users | thousands | hundreds
DB size | 100MB-GB | 100GB-TB
metric | transaction throughput | query throughput, response
From Tables and Spreadsheets to Data Cubes
Conceptual Modeling of Data Warehouses
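[Figure: snowflake schema fragment. Fact table with keys branch_key, location_key and measures units_sold, dollars_sold, avg_sales; branch dimension (branch_key, branch_name, branch_type); location dimension (location_key, street, city_key); city dimension (city_key, city, state_or_province, country)]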
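The table and column names recovered from this slide suggest a snowflake schema in which the location dimension is normalized into a separate city table. A minimal sketch of that layout using Python's built-in `sqlite3` module follows. The sample rows are invented, the fact table is called `sales` by assumption, and only `units_sold` and `dollars_sold` are stored (an average can be derived from them).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE city (
    city_key INTEGER PRIMARY KEY,
    city TEXT, state_or_province TEXT, country TEXT
);
CREATE TABLE location (
    location_key INTEGER PRIMARY KEY,
    street TEXT,
    city_key INTEGER REFERENCES city(city_key)
);
CREATE TABLE branch (
    branch_key INTEGER PRIMARY KEY,
    branch_name TEXT, branch_type TEXT
);
CREATE TABLE sales (
    branch_key INTEGER REFERENCES branch(branch_key),
    location_key INTEGER REFERENCES location(location_key),
    units_sold INTEGER, dollars_sold REAL
);
-- Invented sample data for illustration only.
INSERT INTO city VALUES (1, 'Vancouver', 'BC', 'Canada');
INSERT INTO location VALUES (10, '1 Main St', 1), (11, '2 Oak Ave', 1);
INSERT INTO branch VALUES (100, 'B1', 'retail');
INSERT INTO sales VALUES (100, 10, 2, 40.0), (100, 11, 3, 60.0);
""")

# Reaching the country level takes two joins (sales -> location -> city),
# which is the cost of normalizing the location dimension.
rows = conn.execute("""
    SELECT c.country, SUM(s.units_sold), SUM(s.dollars_sold)
    FROM sales s
    JOIN location l ON s.location_key = l.location_key
    JOIN city c ON l.city_key = c.city_key
    GROUP BY c.country
""").fetchall()
```

In a star schema the city attributes would sit directly in the location table, trading some redundancy for one fewer join.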
Example of Fact Constellation
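[Figure: fact constellation schema. Sales Fact Table (time_key, item_key, branch_key, ...) and Shipping Fact Table (time_key, item_key, shipper_key, from_location, ...) share the time dimension (time_key, day, day_of_the_week, month, quarter, year) and the item dimension (item_key, item_name, brand, type, supplier_type)]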
Defining Snowflake Schema in DMQL
How to Generate a Specified Data Cube?
◼ A DMQL specification is translated into an SQL query
define cube sales_star [time, item, branch, location]:
    dollars_sold, units_sold
↓ translator
select s.time_key, s.item_key, s.branch_key, s.location_key,
    sum(s.number_of_units_sold * s.price), sum(s.number_of_units_sold)
from time t, item i, branch b, location l, sales s
where s.time_key = t.time_key and s.item_key = i.item_key
    and s.branch_key = b.branch_key and s.location_key = l.location_key
group by s.time_key, s.item_key, s.branch_key, s.location_key
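To see what the translated query computes, it can be run against a tiny star schema. The sketch below uses Python's built-in `sqlite3` module. The dimension tables are reduced to their keys and the sample rows are invented. The `order by` clause is added only to make the result order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time (time_key INTEGER PRIMARY KEY);
CREATE TABLE item (item_key INTEGER PRIMARY KEY);
CREATE TABLE branch (branch_key INTEGER PRIMARY KEY);
CREATE TABLE location (location_key INTEGER PRIMARY KEY);
CREATE TABLE sales (
    time_key INTEGER, item_key INTEGER, branch_key INTEGER, location_key INTEGER,
    number_of_units_sold INTEGER, price REAL
);
-- Invented sample data for illustration only.
INSERT INTO time VALUES (1), (2);
INSERT INTO item VALUES (1);
INSERT INTO branch VALUES (1);
INSERT INTO location VALUES (1);
INSERT INTO sales VALUES
    (1, 1, 1, 1, 2, 10.0),
    (1, 1, 1, 1, 3, 10.0),
    (2, 1, 1, 1, 1, 20.0);
""")

# One output row per (time, item, branch, location) cell of the base cuboid,
# carrying the two measures dollars_sold and units_sold.
cells = conn.execute("""
    select s.time_key, s.item_key, s.branch_key, s.location_key,
           sum(s.number_of_units_sold * s.price), sum(s.number_of_units_sold)
    from time t, item i, branch b, location l, sales s
    where s.time_key = t.time_key and s.item_key = i.item_key
      and s.branch_key = b.branch_key and s.location_key = l.location_key
    group by s.time_key, s.item_key, s.branch_key, s.location_key
    order by s.time_key
""").fetchall()
```

The two sales rows for time_key 1 collapse into a single cell, which is the aggregation a cube definition asks for.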
A Concept Hierarchy: Dimension (location)
[Figure: concept hierarchy tree for the location dimension; root level: all]
A Concept Hierarchy: Dimension (time)
year
quarter
month
week
day
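The levels listed above can be derived from a single date, which is how a day-level fact is mapped to coarser levels before rolling up. The sketch below is one possible encoding. The label formats and the use of ISO week numbering are assumptions made for illustration.

```python
from datetime import date


def time_levels(d):
    """Map a day to the coarser levels of the time hierarchy."""
    iso_year, iso_week, _ = d.isocalendar()
    return {
        "day": d.isoformat(),
        # ISO weeks can span two months, so week does not nest inside month.
        "week": f"{iso_year}-W{iso_week:02d}",
        "month": f"{d.year}-{d.month:02d}",
        "quarter": f"{d.year}-Q{(d.month - 1) // 3 + 1}",
        "year": d.year,
    }


levels = time_levels(date(2022, 3, 2))
```

Since a week can straddle a month boundary, day rolls up to year along two separate paths: through month and quarter, or through week.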
A Concept Hierarchy for Numeric Values
$0…$1000
Multidimensional Data
◼ Sales volume as a function of product, month,
and region
▪ Dimensions: Product, Location, Time
▪ Hierarchical summarization paths
[Figure: hierarchical summarization paths, including the levels Office (Location) and Day (Time)]
Typical OLAP Operations
◼ Roll up (drill-up): summarize data
▪ by climbing up hierarchy or by dimension
reduction
◼ Drill down (roll down): reverse of roll-up
▪ from higher level summary to lower level
summary or detailed data, or introducing new
dimensions
◼ Slice and dice: project and select
◼ Pivot (rotate):
▪ reorient the cube, visualization, 3D to series
of 2D planes
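These operations can be tried on a small cube held in a Python dictionary. The sketch below is a toy model, not how an OLAP server stores data. The cube contents, the city-to-country mapping, and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical base cuboid: (quarter, item, city) -> units sold.
cube = {
    ("Q1", "TV", "Vancouver"): 10,
    ("Q1", "PC", "Vancouver"): 4,
    ("Q1", "TV", "Chicago"): 7,
    ("Q2", "TV", "Chicago"): 5,
}
CITY_TO_COUNTRY = {"Vancouver": "Canada", "Chicago": "U.S.A."}


def roll_up_location(cells):
    """Roll up by climbing the location hierarchy from city to country."""
    out = defaultdict(int)
    for (quarter, item, city), units in cells.items():
        out[(quarter, item, CITY_TO_COUNTRY[city])] += units
    return dict(out)


def drop_dimension(cells, axis):
    """Roll up by dimension reduction: aggregate one axis away."""
    out = defaultdict(int)
    for key, units in cells.items():
        out[key[:axis] + key[axis + 1:]] += units
    return dict(out)


def slice_(cells, axis, value):
    """Slice: select a single value on one dimension."""
    return {k: v for k, v in cells.items() if k[axis] == value}


def dice(cells, allowed):
    """Dice: select on several dimensions; allowed maps axis -> set of values."""
    return {k: v for k, v in cells.items()
            if all(k[a] in vals for a, vals in allowed.items())}


def pivot(cells, order):
    """Pivot: reorder the axes without changing any value."""
    return {tuple(k[a] for a in order): v for k, v in cells.items()}


by_country = roll_up_location(cube)
by_quarter_item = drop_dimension(cube, 2)
q1_only = slice_(cube, 0, "Q1")
tv_in_chicago = dice(cube, {1: {"TV"}, 2: {"Chicago"}})
item_first = pivot(cube, (1, 0, 2))
```

Drill-down has no function here because it cannot be computed from a summary: it requires going back to the more detailed cells (in this sketch, from `by_country` back to `cube`).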
A Sample Data Cube
[Figure: data cube with dimensions Time (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, PC, VCR, sum), and Country (U.S.A., Canada, Mexico, sum); the highlighted cell is the total annual sales of TV in U.S.A.]
OLAP Operations
◼ Other operations
▪ drill across: involving (across) more than one fact
table
▪ drill through: through the bottom level of the cube
to its back-end relational tables (using SQL)
▪ rank top N or bottom N items in lists
▪ Compute average, variance, deviation
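The ranking and statistical operations in the last two bullets can be sketched with Python's standard library. The measure values are invented, and "deviation" is read here as the standard deviation, which is an assumption. Population formulas are used because the cells cover every item rather than a sample.

```python
import statistics

# Hypothetical measure values: units sold per item.
units_by_item = {"TV": 17, "PC": 4, "VCR": 9, "Phone": 12}


def top_n(values, n):
    """Return the n (item, value) pairs with the largest values."""
    return sorted(values.items(), key=lambda kv: kv[1], reverse=True)[:n]


def bottom_n(values, n):
    """Return the n (item, value) pairs with the smallest values."""
    return sorted(values.items(), key=lambda kv: kv[1])[:n]


top2 = top_n(units_by_item, 2)
bottom2 = bottom_n(units_by_item, 2)

values = list(units_by_item.values())
average = statistics.mean(values)
variance = statistics.pvariance(values)   # population variance
std_dev = statistics.pstdev(values)       # population standard deviation
```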
Exercise
1. Suppose that a data warehouse consists of three
dimensions (time, doctor, and patient) and two
measures (count and charge), where charge is the fee
that a doctor charges a patient for a visit.
Data Warehouse: A Three-Layer Architecture
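[Figure: three-layer architecture. Data sources (operational DBs, other sources) feed the data warehouse and data marts through extract, transform, load, and refresh steps, coordinated by a monitor and integrator with a metadata repository; an OLAP server serves front-end analysis, query, report, and data mining tools]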
▪ Utility mining
C_id T_id A Profit(A) B Profit(B) C Profit(C) D Profit(D) …
Metadata Repository
▪ The algorithms used for summarization
▪ The mapping from operational environment to the
data warehouse
▪ Data related to system performance
• warehouse schema, view and derived data definitions
▪ Business data
• business terms and definitions, ownership of data,
charging policies
OLAP Server Architectures
Data Warehouse Usage
◼ Three kinds of data warehouse applications
▪ Information processing
• supports querying, basic statistical analysis, and reporting
using crosstabs, tables, charts and graphs
▪ Analytical processing
• supports basic OLAP operations, slice-dice, drilling, pivoting
▪ Data mining
• knowledge discovery from hidden patterns
• supports associations, constructing analytical models,
performing classification and prediction, and presenting the
mining results using visualization tools
From On-Line Analytical Processing (OLAP)
to On-Line Analytical Mining (OLAM)
◼ Why online analytical mining?
▪ High quality of data in data warehouses
• DW contains integrated, consistent, cleaned data
▪ Available information processing structure surrounding data
warehouses
• ODBC, OLEDB, Web accessing, service facilities,
reporting and OLAP tools
▪ OLAP-based exploratory data analysis
• Mining with drilling, dicing, pivoting, etc.
▪ On-line selection of data mining functions
• Integration and swapping of multiple mining functions,
algorithms, and tasks
An OLAM System Architecture
[Figure: layered OLAM architecture. Layer 4, user interface: user GUI API, taking mining queries and returning mining results. Layer 3, OLAP/OLAM: OLAM engine and OLAP engine. Layer 2: MDDB with metadata]