2025-Handouts - OLAP - Lecture 1
Source: Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign & Simon Fraser University
©

◼ Data Generalization by Attribute-Oriented Induction
◼ Summary
◼ Constructed by integrating multiple, heterogeneous data sources
  ◼ relational databases, flat files, on-line transaction records
◼ Data cleaning and data integration techniques are applied.
  ◼ Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
    ◼ E.g., Hotel price: currency, tax, breakfast covered, etc.
  ◼ When data is moved to the warehouse, it is converted.

◼ The time horizon for the data warehouse is significantly longer than that of operational systems
  ◼ Operational database: current value data
  ◼ Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
◼ Every key structure in the data warehouse
  ◼ Contains an element of time, explicitly or implicitly
  ◼ But the key of operational data may or may not contain “time element”
Data Warehouse—Nonvolatile
◼ Operational update of data does not occur in the data warehouse environment
  ◼ Does not require transaction processing, recovery, and concurrency control mechanisms
  ◼ Requires only two operations in data accessing:

OLTP vs. OLAP
                     OLTP                                   OLAP
DB design            application-oriented                   subject-oriented
data                 current, up-to-date, detailed,         historical, summarized, multidimensional,
                     flat relational, isolated              integrated, consolidated
usage                repetitive                             ad-hoc
access               read/write, index/hash on prim. key    lots of scans
unit of work         short, simple transaction              complex query
# records accessed   tens                                   millions
◼ A set of views over operational databases
◼ Only some of the possible summary views may be materialized

integrity, and build indices and partitions
◼ Refresh
  ◼ propagate the updates from the data sources to the warehouse
Metadata Repository
◼ Meta data is the data defining warehouse objects. It stores:
◼ Description of the structure of the data warehouse
  ◼ schema, view, dimensions, hierarchies, derived data definitions, data mart locations and contents
◼ Operational meta-data
  ◼ data lineage (history of migrated data and transformation path), currency of data (active, archived, or purged), monitoring information (warehouse usage statistics, error reports, audit trails)
◼ The algorithms used for summarization
◼ The mapping from operational environment to the data warehouse
◼ Data related to system performance
  ◼ warehouse schema, view and derived data definitions
◼ Business data
  ◼ business terms and definitions, ownership of data, charging policies

Chapter 4: Data Warehousing and On-line Analytical Processing
◼ Data Warehouse: Basic Concepts
◼ Data Warehouse Modeling: Data Cube and OLAP
◼ Data Warehouse Design and Usage
◼ Data Warehouse Implementation
◼ Data Generalization by Attribute-Oriented Induction
◼ Summary
Example of Snowflake Schema
Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_key
supplier: supplier_key, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city_key
city: city_key, city, state_or_province, country

Example of Fact Constellation
Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location; measures: dollars_cost, units_shipped
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city, province_or_state, country
shipper: shipper_key, shipper_name, location_key, shipper_type
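A minimal sketch of how the snowflake layout might be queried, assuming SQLite through Python's standard `sqlite3` module. Only the sales fact table and the normalized location and city dimensions are created; the column names follow the schema, while the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Fact table plus the normalized location -> city dimension chain.
conn.executescript("""
CREATE TABLE city     (city_key INTEGER PRIMARY KEY, city TEXT,
                       state_or_province TEXT, country TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, street TEXT,
                       city_key INTEGER REFERENCES city(city_key));
CREATE TABLE sales    (time_key INTEGER, item_key INTEGER, branch_key INTEGER,
                       location_key INTEGER REFERENCES location(location_key),
                       units_sold INTEGER, dollars_sold REAL);
""")

# Hypothetical sample data.
conn.executemany("INSERT INTO city VALUES (?, ?, ?, ?)", [
    (1, "Vancouver", "British Columbia", "Canada"),
    (2, "Chicago", "Illinois", "USA"),
])
conn.executemany("INSERT INTO location VALUES (?, ?, ?)", [
    (10, "1 Main St", 1), (11, "2 Oak Ave", 1), (12, "3 Lake Dr", 2),
])
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?, ?)", [
    (1, 100, 1, 10, 2, 50.0),
    (1, 101, 1, 11, 1, 30.0),
    (2, 100, 2, 12, 4, 100.0),
])

# Reaching the country level needs two joins because the location
# dimension is split into location and city tables.
rows = conn.execute("""
    SELECT c.country, SUM(s.dollars_sold)
    FROM sales s
    JOIN location l ON s.location_key = l.location_key
    JOIN city c     ON l.city_key = c.city_key
    GROUP BY c.country
    ORDER BY c.country
""").fetchall()
print(rows)
```

In a star schema the city attributes would sit directly in the location table, so the same query would need one join instead of two.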
A Concept Hierarchy: Dimension (location)
Data Cube Measures: Three Categories
A Sample Data Cube
[Figure: 3-D data cube. Dimensions: Product (TV, PC, VCR, sum), Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Country (U.S.A., Canada, Mexico, sum). Highlighted cell: total annual sales of TVs in U.S.A.]

Cuboids Corresponding to the Cube
0-D (apex) cuboid: all
1-D cuboids: product; date; country
2-D cuboids: product, date; product, country; date, country
3-D (base) cuboid: product, date, country
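The lattice of cuboids for the product, date, country cube can be enumerated mechanically. The following is a small sketch in plain Python; the base-cuboid cells and their sales figures are made up for illustration, and `cuboid` is a hypothetical helper name.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("product", "date", "country")

# Hypothetical base-cuboid cells: (product, date, country) -> sales.
base = {
    ("TV", "1Qtr", "U.S.A."): 10,
    ("TV", "2Qtr", "U.S.A."): 20,
    ("PC", "1Qtr", "Canada"): 5,
    ("VCR", "1Qtr", "Mexico"): 7,
}

def cuboid(group_by):
    """Aggregate the base cuboid by summing over the dimensions not in group_by."""
    positions = [DIMENSIONS.index(d) for d in group_by]
    cells = defaultdict(int)
    for key, value in base.items():
        cells[tuple(key[p] for p in positions)] += value
    return dict(cells)

# One cuboid per subset of the dimensions: the apex (), three 1-D,
# three 2-D, and the 3-D base cuboid.
lattice = {
    dims: cuboid(dims)
    for k in range(len(DIMENSIONS) + 1)
    for dims in combinations(DIMENSIONS, k)
}

for dims, cells in lattice.items():
    print(len(dims), dims, cells)
```

The apex cuboid `lattice[()]` holds a single cell with the grand total, and the 3-D entry reproduces the base data unchanged.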
[Figure: radial diagram of dimensions and their abstraction levels. Recoverable labels: Time (ANNUALLY, QTRLY, DAILY), Product (PRODUCT ITEM, PRODUCT GROUP, PRODUCT LINE), Location, Promotion, Organization; further labels: ORDER, TRUCK, CITY, COUNTRY, REGION, SALES PERSON, DISTRICT, DIVISION. Each circle is called a footprint.]

◼ Visualization
◼ OLAP capabilities
◼ Interactive manipulation
Design of Data Warehouse: A Business Analysis Framework
◼ Four views regarding the design of a data warehouse
  ◼ Top-down view
    ◼ allows selection of the relevant information necessary for the data warehouse
  ◼ Data source view
    ◼ exposes the information being captured, stored, and managed by operational systems
  ◼ Data warehouse view
    ◼ consists of fact tables and dimension tables
◼ supports basic OLAP operations, slice-dice, drilling, pivoting
◼ Data mining
  ◼ knowledge discovery from hidden patterns

reporting and OLAP tools
◼ OLAP-based exploratory data analysis
  ◼ Mining with drilling, dicing, pivoting, etc.
Efficient Data Cube Computation
◼ Data cube can be viewed as a lattice of cuboids
  ◼ The bottom-most cuboid is the base cuboid
  ◼ The top-most cuboid (apex) contains only one cell
  ◼ How many cuboids are there in an n-dimensional cube where dimension i has L_i levels?
    $$T = \prod_{i=1}^{n} (L_i + 1)$$
◼ Materialization of data cube
  ◼ Materialize every cuboid (full materialization), none (no materialization), or some (partial materialization)
  ◼ Selection of which cuboids to materialize
    ◼ Based on size, sharing, access frequency, etc.
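The cuboid count is the product of (L_i + 1) over all n dimensions, and it is easy to evaluate for concrete cubes. A short sketch, assuming Python 3.8+ for `math.prod`; the function name `total_cuboids` and the example level counts are illustrative, not from the slides.

```python
from math import prod

def total_cuboids(levels_per_dimension):
    """Number of cuboids: product of (L_i + 1); the +1 is the virtual top level 'all'."""
    return prod(levels + 1 for levels in levels_per_dimension)

# Three dimensions with a single level each (no concept hierarchy): 2^3 cuboids.
print(total_cuboids([1, 1, 1]))

# Ten dimensions with four levels each: 5^10 cuboids.
print(total_cuboids([4] * 10))
```

The second case already gives 9,765,625 cuboids, which is why materializing only some of them is usually the practical choice.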
◼ Join index: JI(R-id, S-id) where R (R-id, …) ⋈ S (S-id, …)
◼ Traditional indices map the values to a list of record ids
  ◼ It materializes relational join in JI file and speeds up relational join
◼ In data warehouses, join index relates the values of the dimensions of a star schema to rows in the fact table.
  ◼ E.g. fact table: Sales and two dimensions city and product
    ◼ A join index on city maintains for each distinct city a list of R-IDs of the tuples recording the Sales in the city
  ◼ Join indices can span multiple dimensions

◼ Determine which operations should be performed on the available cuboids
  ◼ Transform drill, roll, etc. into corresponding SQL and/or OLAP operations, e.g., dice = selection + projection
◼ Determine which materialized cuboid(s) should be selected for OLAP op.
  ◼ Let the query to be processed be on {brand, province_or_state} with the condition “year = 2004”, and there are 4 materialized cuboids available:
    1) {year, item_name, city}
    2) {year, brand, country}
    3) {year, brand, province_or_state}
    4) {item_name, province_or_state} where year = 2004
    Which should be selected to process the query?
◼ Explore indexing structures and compressed vs. dense array structs in MOLAP
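The join index on city described for the Sales fact table can be sketched in a few lines. This is a simplified illustration: the fact rows, R-IDs, and the helper `build_join_index` are hypothetical, and the fact rows carry the dimension values directly instead of joining through separate dimension tables.

```python
from collections import defaultdict

# Hypothetical Sales fact rows: (R-ID, city, product, dollars_sold).
sales = [
    ("R1", "Vancouver", "TV", 500.0),
    ("R2", "Chicago", "PC", 900.0),
    ("R3", "Vancouver", "PC", 700.0),
    ("R4", "Chicago", "TV", 400.0),
    ("R5", "Vancouver", "TV", 300.0),
]
COLUMNS = {"city": 1, "product": 2}

def build_join_index(fact_rows, dims):
    """Map each distinct combination of dimension values to the R-IDs of matching fact rows."""
    positions = [COLUMNS[d] for d in dims]
    index = defaultdict(list)
    for row in fact_rows:
        key = tuple(row[p] for p in positions)
        index[key].append(row[0])
    return dict(index)

# Join index on a single dimension.
city_index = build_join_index(sales, ["city"])
# Join index spanning two dimensions.
city_product_index = build_join_index(sales, ["city", "product"])

print(city_index[("Vancouver",)])
print(city_product_index[("Vancouver", "TV")])
```

A query for sales in one city then reads only the listed R-IDs instead of scanning the whole fact table.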
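One way to reason about the cuboid-selection question for the query on {brand, province_or_state} with "year = 2004" is to check, per cuboid, whether every attribute the query needs is available at the same or a finer level. The sketch below assumes the concept hierarchies item_name < brand and city < province_or_state < country, which are suggested by the schema examples but not stated with the question; `can_answer` is a hypothetical helper, and the check covers usability only, not cost.

```python
# Assumed concept hierarchies, finest level first.
HIERARCHIES = {
    "item": ["item_name", "brand"],
    "location": ["city", "province_or_state", "country"],
    "time": ["year"],
}

def level_of(attr):
    """Return (dimension, level index) of an attribute; lower index means finer."""
    for dim, levels in HIERARCHIES.items():
        if attr in levels:
            return dim, levels.index(attr)
    raise KeyError(attr)

def can_answer(cuboid_attrs, fixed, query_attrs, query_conditions):
    """True if the cuboid holds every needed attribute at the same or a finer level.

    fixed holds selections already applied when the cuboid was materialized.
    """
    available = dict(level_of(a) for a in cuboid_attrs)
    for attr in query_attrs:
        dim, lvl = level_of(attr)
        if dim not in available or available[dim] > lvl:
            return False
    for attr, value in query_conditions.items():
        if fixed.get(attr) == value:
            continue  # condition is already baked into the cuboid
        dim, lvl = level_of(attr)
        if dim not in available or available[dim] > lvl:
            return False
    return True

cuboids = {
    1: (["year", "item_name", "city"], {}),
    2: (["year", "brand", "country"], {}),
    3: (["year", "brand", "province_or_state"], {}),
    4: (["item_name", "province_or_state"], {"year": 2004}),
}
usable = [
    number for number, (attrs, fixed) in cuboids.items()
    if can_answer(attrs, fixed, ["brand", "province_or_state"], {"year": 2004})
]
print(usable)
```

Under these assumptions cuboid 2 is ruled out because country is coarser than province_or_state. Choosing among the remaining candidates would depend on their sizes and on the available indices, which this sketch does not model.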
OLAP Server Architectures
◼ Relational OLAP (ROLAP)
  ◼ Use relational or extended-relational DBMS to store and manage warehouse data and OLAP middleware
  ◼ Include optimization of DBMS backend, implementation of aggregation navigation logic, and additional tools and services
  ◼ Greater scalability
◼ Multidimensional OLAP (MOLAP)
  ◼ Sparse array-based multidimensional storage engine
  ◼ Fast indexing to pre-computed summarized data
◼ Hybrid OLAP (HOLAP) (e.g., Microsoft SQLServer)
  ◼ Flexibility, e.g., low level: relational, high-level: array
◼ Specialized SQL servers (e.g., Redbricks)
  ◼ Specialized support for SQL queries over star/snowflake schemas
Attribute-Oriented Induction: Basic Algorithm
◼ InitialRel: Query processing of task-relevant data, deriving the initial relation.

Presentation of Generalized Results
◼ Generalized relation:
  ◼ Relations where some or all attributes are generalized, with counts
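To illustrate what a generalized relation with counts looks like, the sketch below starts from a small initial relation, replaces one attribute by its higher-level concept, and merges identical tuples while accumulating a count. The relation, the city-to-country hierarchy, and the function name `generalize` are all invented for illustration; this shows only the generalize-and-merge idea, not the full algorithm.

```python
from collections import Counter

# Hypothetical concept hierarchy for the birth-place attribute.
CITY_TO_COUNTRY = {
    "Vancouver": "Canada",
    "Toronto": "Canada",
    "Chicago": "USA",
}

# Hypothetical initial relation of task-relevant data: (major, birth_city).
initial_rel = [
    ("CS", "Vancouver"),
    ("CS", "Toronto"),
    ("Physics", "Chicago"),
    ("CS", "Chicago"),
    ("CS", "Vancouver"),
]

def generalize(relation):
    """Climb city -> country and merge identical generalized tuples, keeping counts."""
    return Counter((major, CITY_TO_COUNTRY[city]) for major, city in relation)

generalized = generalize(initial_rel)
for (major, country), count in sorted(generalized.items()):
    print(major, country, count)
```

The five initial tuples collapse into three generalized tuples, and the counts record how many original tuples each one summarizes.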
Analytical Mining)
◼ Data Generalization by Attribute-Oriented Induction
◼ Implementation: Efficient computation of data cubes
  ◼ Partial vs. full vs. no materialization