04DWH & Olap
Processing
What is a Data Warehouse?
◼ Defined in many different ways, but not rigorously.
◼ A decision support database that is maintained separately from
the organization’s operational database
◼ Supports information processing by providing a solid platform of
consolidated, historical data for analysis.
◼ “A data warehouse is a subject-oriented, integrated, time-variant,
and nonvolatile collection of data in support of management’s
decision-making process.”—W. H. Inmon
◼ Data warehousing:
◼ The process of constructing and using data warehouses
Data Warehouse—Subject-Oriented
Data Warehouse—Integrated
records
◼ Data cleaning and data integration techniques are
applied.
◼ Ensure consistency in naming conventions, encoding
Data Warehouse—Time Variant
Data Warehouse—Nonvolatile
OLTP vs. OLAP
                    OLTP                                                      OLAP
users               clerk, IT professional                                    knowledge worker
function            day to day operations                                     decision support
DB design           application-oriented                                      subject-oriented
data                current, up-to-date; detailed, flat relational; isolated  historical; summarized, multidimensional; integrated, consolidated
usage               repetitive                                                ad-hoc
access              read/write; index/hash on primary key                     lots of scans
unit of work        short, simple transaction                                 complex query
# records accessed  tens                                                      millions
# users             thousands                                                 hundreds
DB size             100MB-GB                                                  100GB-TB
metric              transaction throughput                                    query throughput, response
Why a Separate Data Warehouse?
◼ High performance for both systems
◼ DBMS: tuned for OLTP (access methods, indexing, concurrency
control, recovery)
◼ Warehouse: tuned for OLAP (complex OLAP queries,
multidimensional view, consolidation)
◼ Different functions and different data:
◼ missing data: Decision support requires historical data which
operational DBs do not typically maintain
◼ data consolidation: decision support requires consolidation
(aggregation, summarization) of data from heterogeneous sources
◼ data quality: different sources typically use inconsistent data
representations, codes and formats which have to be reconciled
◼ Note: There are more and more systems which perform OLAP
analysis directly on relational databases
Data Warehouse: A Multi-Tiered Architecture
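[Figure: multi-tiered data warehouse architecture. Data sources (operational DBs, other sources) are extracted, transformed, loaded, and refreshed through a monitor and integrator, with a metadata repository, into the data warehouse and data marts; OLAP servers serve these to front-end tools for analysis, query, reports, and data mining.]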
Extraction, Transformation, and Loading (ETL)
◼ Data extraction
◼ get data from multiple, heterogeneous, and external
sources
◼ Data cleaning
◼ detect errors in the data and rectify them when possible
◼ Data transformation
◼ convert data from legacy or host format to warehouse
format
◼ Load
◼ sort, summarize, consolidate, compute views, check
integrity, and build indices and partitions
◼ Refresh
◼ propagate the updates from the data sources to the
warehouse
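The five steps above can be sketched as a tiny pipeline. This is a minimal illustration only: the source rows, the `COUNTRY_CODES` mapping, and all function names are hypothetical and chosen for the example, and refresh is shown simply as re-running the pipeline.

```python
# Hypothetical rows from two source systems with inconsistent encodings.
crm_rows = [{"id": "1", "country": "CA", "amount": "10.50"},
            {"id": "2", "country": "ca", "amount": "n/a"}]
erp_rows = [{"id": "3", "country": "Canada", "amount": "4.25"}]

# Assumed mapping used to reconcile country codes across sources.
COUNTRY_CODES = {"ca": "Canada", "canada": "Canada"}

def extract():
    """Extraction: gather rows from multiple heterogeneous sources."""
    return crm_rows + erp_rows

def clean(rows):
    """Cleaning: detect rows whose amount cannot be parsed and drop them."""
    good = []
    for row in rows:
        try:
            float(row["amount"])
        except ValueError:
            continue
        good.append(row)
    return good

def transform(rows):
    """Transformation: convert source formats to the warehouse format."""
    return [{"id": int(r["id"]),
             "country": COUNTRY_CODES[r["country"].lower()],
             "amount": float(r["amount"])} for r in rows]

def load(rows):
    """Load: sort the rows and summarize them into totals per country."""
    totals = {}
    for r in sorted(rows, key=lambda r: r["id"]):
        totals[r["country"]] = totals.get(r["country"], 0.0) + r["amount"]
    return totals

# Refresh would re-run the same pipeline after the sources change.
warehouse = load(transform(clean(extract())))
print(warehouse)
```

A real ETL tool would also log the rejected rows instead of silently dropping them, which is one source of the operational metadata a warehouse keeps.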
Metadata Repository
◼ Metadata is the data defining warehouse objects. It stores:
◼ Description of the structure of the data warehouse
◼ schema, view, dimensions, hierarchies, derived data definitions,
data mart locations and contents
◼ Operational meta-data
◼ data lineage (history of migrated data and transformation path),
currency of data (active, archived, or purged), monitoring
information (warehouse usage statistics, error reports, audit trails)
◼ The algorithms used for summarization
◼ The mapping from operational environment to the data warehouse
◼ Data related to system performance
◼ warehouse schema, view and derived data definitions
◼ Business data
◼ business terms and definitions, ownership of data, charging policies
Chapter 4: Data Warehousing and On-line
Analytical Processing
From Tables and Spreadsheets to
Data Cubes
◼ A data warehouse is based on a multidimensional data model
which views data in the form of a data cube
◼ A data cube, such as sales, allows data to be modeled and viewed in
multiple dimensions
◼ Dimension tables, such as item (item_name, brand, type), or
time (day, week, month, quarter, year)
◼ Fact table contains measures (such as dollars_sold) and keys
to each of the related dimension tables
◼ In data warehousing literature, an n-D base cube is called a base
cuboid. The topmost 0-D cuboid, which holds the highest level of
summarization, is called the apex cuboid. The lattice of cuboids
forms a data cube.
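The lattice can be enumerated directly: every subset of the dimensions is one cuboid. The sketch below assumes four dimensions (time, item, location, supplier) with no concept hierarchies, so it yields 2^4 = 16 cuboids; the variable names are illustrative.

```python
from itertools import combinations

dimensions = ["time", "item", "location", "supplier"]

# Group cuboids by dimensionality k, from the 0-D apex up to the n-D base.
lattice = {k: list(combinations(dimensions, k))
           for k in range(len(dimensions) + 1)}

for k, cuboids in lattice.items():
    print(f"{k}-D cuboids ({len(cuboids)}):", cuboids)
```

The single 0-D entry is the apex cuboid (the empty group-by, often written "all"), and the single 4-D entry is the base cuboid.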
Cube: A Lattice of Cuboids
[Figure: lattice of cuboids; surviving labels listed below]
◼ 0-D (apex) cuboid: all
◼ 3-D cuboids: (time, item, location), (time, item, supplier),
(time, location, supplier), (item, location, supplier)
Conceptual Modeling of Data Warehouses
Example of Star Schema
◼ Sales Fact Table
◼ keys: time_key, item_key, branch_key, location_key
◼ measures: units_sold, dollars_sold, avg_sales
◼ Dimension tables
◼ time (time_key, day, day_of_the_week, month, quarter, year)
◼ item (item_key, item_name, brand, type, supplier_type)
◼ branch (branch_key, branch_name, branch_type)
◼ location (location_key, street, city, state_or_province, country)
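A star schema of this shape can be tried out with Python's built-in `sqlite3` module. The sketch below keeps only two of the dimension tables and a few of their columns, and the inserted rows are invented sample data, not values from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, city TEXT, country TEXT);
-- The fact table holds one foreign key per dimension plus the measures.
CREATE TABLE sales (
    item_key INTEGER REFERENCES item(item_key),
    location_key INTEGER REFERENCES location(location_key),
    units_sold INTEGER,
    dollars_sold REAL
);
INSERT INTO item VALUES (1, 'TV', 'Acme'), (2, 'Radio', 'Acme');
INSERT INTO location VALUES (1, 'Toronto', 'Canada'), (2, 'Cancun', 'Mexico');
INSERT INTO sales VALUES (1, 1, 2, 800.0), (2, 1, 5, 150.0), (1, 2, 1, 400.0);
""")

# A typical star join: the fact table joined to each dimension it needs,
# then grouped by dimension attributes.
rows = conn.execute("""
    SELECT i.brand, l.country, SUM(s.dollars_sold)
    FROM sales s
    JOIN item i ON i.item_key = s.item_key
    JOIN location l ON l.location_key = s.location_key
    GROUP BY i.brand, l.country
    ORDER BY l.country
""").fetchall()
print(rows)
```

In a snowflake variant, `country` would move out of `location` into a separate table, which adds one more join to the same query.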
Example of Snowflake Schema
◼ Sales Fact Table
◼ keys: time_key, item_key, branch_key, location_key
◼ measures: units_sold, dollars_sold, avg_sales
◼ Dimension tables
◼ time (time_key, day, day_of_the_week, month, quarter, year)
◼ item (item_key, item_name, brand, type, supplier_key)
◼ supplier (supplier_key, supplier_type)
◼ branch (branch_key, branch_name, branch_type)
◼ location (location_key, street, city_key)
◼ city (city_key, city, state_or_province, country)
Example of Fact Constellation
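[Figure, only partly extracted: a fact constellation in which two fact tables share dimension tables]
◼ time (time_key, day, day_of_the_week, month, quarter, year)
◼ item (item_key, item_name, brand, type, supplier_type)
◼ Sales Fact Table, keys shown: time_key, item_key, branch_key
◼ Shipping Fact Table, keys shown: time_key, item_key, shipper_key,
from_location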
Data Cube Measures: Three Categories
Specification of hierarchies
◼ Schema hierarchy
day < {month < quarter; week} < year
◼ Set_grouping hierarchy
{1..10} < inexpensive
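Both kinds of hierarchy can be sketched in a few lines. The schema hierarchy is a partial order (week does not roll up to month or quarter), so it is stored as a graph of direct "rolls up to" edges. The names `PARENTS`, `rolls_up_to`, and `price_group` are illustrative, and only the one price range that appears above is encoded.

```python
# Direct "rolls up to" edges for day < {month < quarter; week} < year.
PARENTS = {
    "day": {"month", "week"},
    "month": {"quarter"},
    "quarter": {"year"},
    "week": {"year"},
    "year": set(),
}

def rolls_up_to(lower, upper):
    """Return True if `lower` is strictly below `upper` in the partial order."""
    frontier = set(PARENTS[lower])
    while frontier:
        if upper in frontier:
            return True
        # Climb one level from every node currently on the frontier.
        frontier = set().union(*(PARENTS[level] for level in frontier))
    return False

def price_group(price):
    """Set-grouping hierarchy: the range 1..10 maps to 'inexpensive'.

    Other ranges are not given in the text, so they return None here.
    """
    return "inexpensive" if 1 <= price <= 10 else None

print(rolls_up_to("day", "year"), rolls_up_to("week", "quarter"), price_group(7))
```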
Multidimensional Data
[Figure: multidimensional data; surviving axis labels: Office, Day, Month]
A Sample Data Cube
[Figure: sample data cube; surviving labels: Country (Canada, Mexico, sum), sum]
Cuboids Corresponding to the Cube
◼ 0-D (apex) cuboid: all
◼ 1-D cuboids: product, date, country
Typical OLAP Operations
◼ Roll up (drill-up): summarize data
◼ by climbing up hierarchy or by dimension reduction
◼ Drill down (roll down): reverse of roll-up
◼ from higher level summary to lower level summary or
detailed data, or introducing new dimensions
◼ Slice and dice: project and select
◼ Pivot (rotate):
◼ reorient the cube, visualization, 3D to series of 2D planes
◼ Other operations
◼ drill across: involving (across) more than one fact table
◼ drill through: through the bottom level of the cube to its
back-end relational tables (using SQL)
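The core operations can be sketched on a small in-memory cube. Everything here is an assumption made for illustration: the cell values, the `CITY_TO_COUNTRY` hierarchy, and the function names. Drill-down is omitted because it needs the detailed data, which a rolled-up result no longer contains.

```python
from collections import defaultdict

# Hypothetical base cells: (city, quarter, item) -> dollars_sold.
cells = {
    ("Toronto", "Q1", "TV"): 100,
    ("Vancouver", "Q1", "TV"): 200,
    ("Toronto", "Q2", "TV"): 150,
    ("Cancun", "Q1", "PC"): 50,
}
CITY_TO_COUNTRY = {"Toronto": "Canada", "Vancouver": "Canada", "Cancun": "Mexico"}

def roll_up_location(cube):
    """Roll up by climbing the location hierarchy from city to country."""
    out = defaultdict(int)
    for (city, quarter, item), value in cube.items():
        out[(CITY_TO_COUNTRY[city], quarter, item)] += value
    return dict(out)

def roll_up_drop_item(cube):
    """Roll up by dimension reduction: aggregate the item dimension away."""
    out = defaultdict(int)
    for (city, quarter, _item), value in cube.items():
        out[(city, quarter)] += value
    return dict(out)

def slice_quarter(cube, quarter):
    """Slice: select on a single dimension."""
    return {k: v for k, v in cube.items() if k[1] == quarter}

def dice(cube, cities, quarters):
    """Dice: select on two or more dimensions."""
    return {k: v for k, v in cube.items() if k[0] in cities and k[1] in quarters}

def pivot(cube):
    """Pivot: reorient the cube by swapping the first two axes."""
    return {(quarter, city, item): v for (city, quarter, item), v in cube.items()}

print(roll_up_location(cells))
```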
Fig. 3.10 Typical OLAP Operations
November 2, 2022 Data Mining: Concepts and Techniques 28
A Star-Net Query Model
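[Figure: star-net query model. Each dimension is a line with its abstraction levels marked along it; each circle is called a footprint.]
◼ Customer Orders: CONTRACTS, ORDER
◼ Shipping Method: AIR-EXPRESS, TRUCK
◼ Time: ANNUALLY, QTRLY, DAILY
◼ Product: PRODUCT LINE, PRODUCT ITEM, PRODUCT GROUP
◼ Location: CITY, COUNTRY, REGION
◼ Organization: SALES PERSON, DISTRICT, DIVISION
◼ Customer, Promotion: no levels shown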
Browsing a Data Cube
◼ Visualization
◼ OLAP capabilities
◼ Interactive manipulation
Design of Data Warehouse: A Business
Analysis Framework
◼ Four views regarding the design of a data warehouse
◼ Top-down view
◼ allows selection of the relevant information necessary for the
data warehouse
◼ Data source view
◼ exposes the information being captured, stored, and
managed by operational systems
◼ Data warehouse view
◼ consists of fact tables and dimension tables
◼ Business query view
◼ sees the perspectives of data in the warehouse from the view
of end-user
Data Warehouse Design Process
◼ Top-down, bottom-up approaches or a combination of both
◼ Top-down: Starts with overall design and planning (mature)
◼ Bottom-up: Starts with experiments and prototypes (rapid)
◼ From a software engineering point of view
◼ Waterfall: structured and systematic analysis at each step before
proceeding to the next
◼ Spiral: rapid generation of increasingly functional systems, with
short turnaround time
◼ Typical data warehouse design process
◼ Choose a business process to model, e.g., orders, invoices, etc.
◼ Choose the grain (atomic level of data) of the business process
◼ Choose the dimensions that will apply to each fact table record
◼ Choose the measure that will populate each fact table record
Data Warehouse Development:
A Recommended Approach
[Figure: recommended development approach; surviving labels: Multi-Tier Data Warehouse, Distributed Data Marts]
From On-Line Analytical Processing (OLAP)
to On-Line Analytical Mining (OLAM)
◼ It integrates OLAP with Data Mining to uncover
knowledge in multi-dimensional databases.
◼ Why online analytical mining?
◼ High quality of data in data warehouses
◼ DW contains integrated, consistent, cleaned data
OLAP tools
◼ OLAP-based exploratory data analysis
◼ Mining with drilling, dicing, pivoting, etc.
Efficient Data Cube Computation
◼ Data cube can be viewed as a lattice of cuboids
◼ The bottom-most cuboid is the base cuboid
◼ The top-most cuboid (apex) contains only one cell
◼ How many cuboids are there in an n-dimensional cube with L_i
levels in dimension i?
$T = \prod_{i=1}^{n} (L_i + 1)$
◼ Materialization of data cube
◼ Materialize every (cuboid) (full materialization),
none (no materialization), or some (partial
materialization)
◼ Selection of which cuboids to materialize
◼ Based on size, sharing, access frequency, etc.
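The cuboid count is a one-line product, and evaluating it shows why full materialization quickly becomes impractical. In this sketch `levels_per_dimension` holds the L_i values; the "+ 1" accounts for the virtual top level "all" of each dimension. The function name is illustrative.

```python
from math import prod

def total_cuboids(levels_per_dimension):
    """Return T, the product of (L_i + 1) over all dimensions."""
    return prod(levels + 1 for levels in levels_per_dimension)

# Three dimensions with one level each: the plain 2^3 lattice.
print(total_cuboids([1, 1, 1]))
# Ten dimensions with four levels each.
print(total_cuboids([4] * 10))
```

With ten dimensions of four levels each the count is 5^10, close to ten million cuboids, which is the motivation for partial materialization.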
The “Compute Cube” Operator
◼ Cube definition and computation in DMQL
define cube sales [item, city, year]: sum (sales_in_dollars)
compute cube sales
◼ Transform it into a SQL-like language (with a new operator cube
by, introduced by Gray et al.’96)
SELECT item, city, year, SUM (amount)
FROM SALES
CUBE BY item, city, year
[Figure: lattice of the resulting group-bys, including (), (city), (item), (year)]
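What a cube-by computes can be sketched naively in Python: one group-by per subset of the dimensions, 2^3 = 8 in total for item, city, and year. The fact rows and the name `compute_cube` are hypothetical, and this brute-force version rescans the rows for every cuboid, which efficient cube algorithms avoid by sharing work between cuboids.

```python
from collections import defaultdict
from itertools import combinations

DIMS = ("item", "city", "year")

# Hypothetical fact rows.
sales = [
    {"item": "TV", "city": "Toronto", "year": 2004, "amount": 100},
    {"item": "TV", "city": "Vancouver", "year": 2004, "amount": 200},
    {"item": "PC", "city": "Toronto", "year": 2005, "amount": 50},
]

def compute_cube(rows, dims, measure):
    """Return {group_by_tuple: {cell_key: SUM(measure)}} for every subset of dims."""
    cube = {}
    for k in range(len(dims) + 1):
        for group_by in combinations(dims, k):
            totals = defaultdict(int)
            for row in rows:
                totals[tuple(row[d] for d in group_by)] += row[measure]
            cube[group_by] = dict(totals)
    return cube

cube = compute_cube(sales, DIMS, "amount")
print(cube[()])         # apex cuboid: the grand total
print(cube[("item",)])  # one of the 1-D cuboids
```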
and product
◼ A join index on city maintains for each
Efficient Processing of OLAP Queries
◼ Determine which operations should be performed on the available cuboids
◼ Transform drill, roll, etc. into corresponding SQL and/or OLAP operations,
e.g., dice = selection + projection
◼ Determine which materialized cuboid(s) should be selected for OLAP op.
◼ Let the query to be processed be on {brand, province_or_state} with the
condition “year = 2004”, and there are 4 materialized cuboids available:
1) {year, item_name, city}
2) {year, brand, country}
3) {year, brand, province_or_state}
4) {item_name, province_or_state} where year = 2004
Which should be selected to process the query?
◼ Explore indexing structures and compressed vs. dense array structures in MOLAP
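One way to reason about the example is to check, for each cuboid, whether every queried attribute is available at the same or a finer level and whether the selection on year can still be applied. The sketch below assumes the concept hierarchies item_name < brand and city < province_or_state < country, which are not spelled out in the example, and the names `HIERARCHIES`, `level_of`, and `can_answer` are illustrative.

```python
# Assumed concept hierarchies, finest level first.
HIERARCHIES = {
    "item": ["item_name", "brand"],
    "location": ["city", "province_or_state", "country"],
    "time": ["year"],
}

def level_of(attr):
    """Return (dimension, level index) for an attribute; 0 is the finest level."""
    for dim, levels in HIERARCHIES.items():
        if attr in levels:
            return dim, levels.index(attr)
    raise KeyError(attr)

def can_answer(attrs, prefilter, query_attrs, cond_attr, cond):
    """True if a cuboid over `attrs` can be rolled up to answer the query."""
    available = dict(level_of(a) for a in attrs)
    for attr in query_attrs:
        dim, lvl = level_of(attr)
        # The cuboid must hold this dimension at the same or a finer level.
        if available.get(dim, len(HIERARCHIES[dim])) > lvl:
            return False
    if prefilter is not None:
        # An already-filtered cuboid is usable only for the same condition.
        return prefilter == cond
    cdim, clvl = level_of(cond_attr)
    return cdim in available and available[cdim] <= clvl

CUBOIDS = {
    1: (("year", "item_name", "city"), None),
    2: (("year", "brand", "country"), None),
    3: (("year", "brand", "province_or_state"), None),
    4: (("item_name", "province_or_state"), "year = 2004"),
}

usable = [n for n, (attrs, prefilter) in CUBOIDS.items()
          if can_answer(attrs, prefilter, ("brand", "province_or_state"),
                        "year", "year = 2004")]
print(usable)
```

Under these assumptions cuboid 2 is ruled out because country is coarser than province_or_state. Choosing among the remaining candidates is a cost question: cuboid 1 is the finest and likely the most expensive to aggregate, cuboid 3 already matches the query levels, and cuboid 4 is pre-filtered but still needs item_name rolled up to brand.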
OLAP Server Architectures
Summary
◼ Data warehousing: A multi-dimensional model of a data warehouse
◼ A data cube consists of dimensions & measures
◼ Star schema, snowflake schema, fact constellations
◼ OLAP operations: drilling, rolling, slicing, dicing and pivoting
◼ Data Warehouse Architecture, Design, and Usage
◼ Multi-tiered architecture
◼ Business analysis design framework
◼ Information processing, analytical processing, data mining, OLAM (Online
Analytical Mining)
◼ Implementation: Efficient computation of data cubes
◼ Partial vs. full vs. no materialization
◼ Indexing OLAP data: Bitmap index and join index
◼ OLAP query processing
◼ OLAP servers: ROLAP, MOLAP, HOLAP
References (I)
◼ S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan, and S.
Sarawagi. On the computation of multidimensional aggregates. VLDB’96
◼ D. Agrawal, A. E. Abbadi, A. Singh, and T. Yurek. Efficient view maintenance in data
warehouses. SIGMOD’97
◼ R. Agrawal, A. Gupta, and S. Sarawagi. Modeling multidimensional databases. ICDE’97
◼ S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP technology. ACM
SIGMOD Record, 26:65-74, 1997
◼ E. F. Codd, S. B. Codd, and C. T. Salley. Beyond decision support. Computer World, 27, July
1993.
◼ J. Gray, et al. Data cube: A relational aggregation operator generalizing group-by, cross-tab
and sub-totals. Data Mining and Knowledge Discovery, 1:29-54, 1997.
◼ A. Gupta and I. S. Mumick. Materialized Views: Techniques, Implementations, and
Applications. MIT Press, 1999.
◼ J. Han. Towards on-line analytical mining in large databases. ACM SIGMOD Record, 27:97-107,
1998.
◼ V. Harinarayan, A. Rajaraman, and J. D. Ullman. Implementing data cubes efficiently.
SIGMOD’96
◼ J. Hellerstein, P. Haas, and H. Wang. Online aggregation. SIGMOD'97
References (II)
◼ C. Imhoff, N. Galemmo, and J. G. Geiger. Mastering Data Warehouse Design: Relational and
Dimensional Techniques. John Wiley, 2003
◼ W. H. Inmon. Building the Data Warehouse. John Wiley, 1996
◼ R. Kimball and M. Ross. The Data Warehouse Toolkit: The Complete Guide to Dimensional
Modeling. 2nd ed. John Wiley, 2002
◼ P. O’Neil and G. Graefe. Multi-table joins through bitmapped join indices. SIGMOD Record, 24:8–
11, Sept. 1995.
◼ P. O'Neil and D. Quass. Improved query performance with variant indexes. SIGMOD'97
◼ Microsoft. OLEDB for OLAP programmer's reference version 1.0. In
http://www.microsoft.com/data/oledb/olap, 1998
◼ S. Sarawagi and M. Stonebraker. Efficient organization of large multidimensional arrays. ICDE'94
◼ A. Shoshani. OLAP and statistical databases: Similarities and differences. PODS’00.
◼ D. Srivastava, S. Dar, H. V. Jagadish, and A. V. Levy. Answering queries with aggregation using
views. VLDB'96
◼ P. Valduriez. Join indices. ACM Trans. Database Systems, 12:218-246, 1987.
◼ J. Widom. Research problems in data warehousing. CIKM’95
◼ K. Wu, E. Otoo, and A. Shoshani, Optimal Bitmap Indices with Efficient Compression, ACM Trans.
on Database Systems (TODS), 31(1): 1-38, 2006
Surplus Slides
Compression of Bitmap Indices
◼ Bitmap indexes must be compressed to reduce I/O costs
and minimize CPU usage, since the majority of the bits are 0’s
◼ Two compression schemes:
◼ Byte-aligned Bitmap Code (BBC)
◼ Word-Aligned Hybrid (WAH) code
◼ Time and space required to operate on a compressed
bitmap are proportional to the total size of the bitmap
◼ Optimal on attributes of low cardinality as well as those of
high cardinality.
◼ WAH outperforms BBC by about a factor of two
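The reason sparse bitmaps compress well can be seen with plain run-length encoding. Note that this is neither BBC nor WAH: both of those pack runs into byte-aligned or word-aligned units so that logical operations can run on the compressed form. The sketch only illustrates the shared idea of collapsing long runs of 0's, and its function names are illustrative.

```python
def rle_encode(bits):
    """Collapse a list of 0/1 bits into (bit, run_length) pairs."""
    runs = []
    for bit in bits:
        if runs and runs[-1][0] == bit:
            runs[-1][1] += 1
        else:
            runs.append([bit, 1])
    return [tuple(run) for run in runs]

def rle_decode(runs):
    """Expand (bit, run_length) pairs back into the original list of bits."""
    return [bit for bit, length in runs for _ in range(length)]

# A sparse bitmap: a single matching row among 32.
bitmap = [0] * 20 + [1] + [0] * 11
encoded = rle_encode(bitmap)
print(encoded)
```

Here 32 bits collapse into three runs, and decoding restores the bitmap exactly.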