2025-Handouts - OLAP - Lecture 1

Chapter 4 discusses the fundamental concepts and techniques of data warehousing and online analytical processing (OLAP). It defines a data warehouse as a subject-oriented, integrated, time-variant, and nonvolatile collection of data that supports decision-making processes. The chapter also covers data warehouse architecture, models, and the extraction, transformation, and loading (ETL) processes necessary for effective data management.

Data Mining: Concepts and Techniques (3rd ed.)
Chapter 4
Source: Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign & Simon Fraser University

Chapter 4: Data Warehousing and On-line Analytical Processing
◼ Data Warehouse: Basic Concepts
◼ Data Warehouse Modeling: Data Cube and OLAP
◼ Data Warehouse Design and Usage
◼ Data Warehouse Implementation
◼ Data Generalization by Attribute-Oriented Induction
◼ Summary

What is a Data Warehouse?
◼ Defined in many different ways, but not rigorously.
  ◼ A decision support database that is maintained separately from the organization’s operational database
  ◼ Support information processing by providing a solid platform of consolidated, historical data for analysis.
◼ “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.” (W. H. Inmon)
◼ Data warehousing:
  ◼ The process of constructing and using data warehouses

Data Warehouse: Subject-Oriented
◼ Organized around major subjects, such as customer, product, sales
◼ Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing
◼ Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process

Data Warehouse: Integrated
◼ Constructed by integrating multiple, heterogeneous data sources
  ◼ relational databases, flat files, on-line transaction records
◼ Data cleaning and data integration techniques are applied.
  ◼ Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
    ◼ E.g., Hotel price: currency, tax, breakfast covered, etc.
  ◼ When data is moved to the warehouse, it is converted.

Data Warehouse: Time Variant
◼ The time horizon for the data warehouse is significantly longer than that of operational systems
  ◼ Operational database: current value data
  ◼ Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
◼ Every key structure in the data warehouse
  ◼ Contains an element of time, explicitly or implicitly
  ◼ But the key of operational data may or may not contain “time element”

Data Warehouse: Nonvolatile
◼ A physically separate store of data transformed from the operational environment
◼ Operational update of data does not occur in the data warehouse environment
  ◼ Does not require transaction processing, recovery, and concurrency control mechanisms
  ◼ Requires only two operations in data accessing:
    ◼ initial loading of data and access of data

OLTP vs. OLAP

                    OLTP                                  OLAP
users               clerk, IT professional                knowledge worker
function            day to day operations                 decision support
DB design           application-oriented                  subject-oriented
data                current, up-to-date, detailed,        historical, summarized, multidimensional,
                    flat relational, isolated             integrated, consolidated
usage               repetitive                            ad-hoc
access              read/write, index/hash on prim. key   lots of scans
unit of work        short, simple transaction             complex query
# records accessed  tens                                  millions
# users             thousands                             hundreds
DB size             100MB-GB                              100GB-TB
metric              transaction throughput                query throughput, response

Why a Separate Data Warehouse?
◼ High performance for both systems
  ◼ DBMS: tuned for OLTP (access methods, indexing, concurrency control, recovery)
  ◼ Warehouse: tuned for OLAP (complex OLAP queries, multidimensional view, consolidation)
◼ Different functions and different data:
  ◼ missing data: Decision support requires historical data which operational DBs do not typically maintain
  ◼ data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources
  ◼ data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled
◼ Note: There are more and more systems which perform OLAP analysis directly on relational databases

Data Warehouse: A Multi-Tiered Architecture
[Figure: four tiers from left to right. Data Sources (operational DBs, other sources) feed Extract, Transform, Load, Refresh into Data Storage (data warehouse, data marts, metadata, with monitor & integrator); Data Storage serves the OLAP Engine (OLAP server), which serves the Front-End Tools (analysis, query, reports, data mining).]

Three Data Warehouse Models
◼ Enterprise warehouse
  ◼ collects all of the information about subjects spanning the entire organization
◼ Data Mart
  ◼ a subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific, selected groups, such as a marketing data mart
  ◼ Independent vs. dependent (directly from warehouse) data mart
◼ Virtual warehouse
  ◼ A set of views over operational databases
  ◼ Only some of the possible summary views may be materialized

Extraction, Transformation, and Loading (ETL)
◼ Data extraction
  ◼ get data from multiple, heterogeneous, and external sources
◼ Data cleaning
  ◼ detect errors in the data and rectify them when possible
◼ Data transformation
  ◼ convert data from legacy or host format to warehouse format
◼ Load
  ◼ sort, summarize, consolidate, compute views, check integrity, and build indices and partitions
◼ Refresh
  ◼ propagate the updates from the data sources to the warehouse
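
The ETL stages listed above can be sketched as a chain of small functions. This is only an illustration under stated assumptions: the record fields (`customer`, `amount`, `currency`), the source names, and the exchange rates are all hypothetical and do not come from the slides.

```python
def extract(sources):
    """Extraction: gather records from several heterogeneous sources."""
    return [dict(rec, source=name) for name, recs in sources.items() for rec in recs]

def clean(records):
    """Cleaning: drop records with a missing amount (one simple error check)."""
    return [r for r in records if r.get("amount") is not None]

def transform(records, usd_rate):
    """Transformation: convert to one currency and one naming convention."""
    return [
        {
            "customer": r["customer"].strip().title(),
            "amount_usd": round(r["amount"] * usd_rate[r["currency"]], 2),
        }
        for r in records
    ]

def load(warehouse, records):
    """Load: sort the converted records and append them to the warehouse."""
    warehouse.extend(sorted(records, key=lambda r: r["customer"]))
    return warehouse

# Hypothetical sources and rates, for illustration only.
sources = {
    "crm": [
        {"customer": " alice ", "amount": 10.0, "currency": "USD"},
        {"customer": "bob", "amount": None, "currency": "USD"},
    ],
    "shop": [{"customer": "CAROL", "amount": 20.0, "currency": "EUR"}],
}
warehouse = load([], transform(clean(extract(sources)), {"USD": 1.0, "EUR": 1.1}))
print(warehouse)
```

A refresh step would rerun the same chain on only the records that changed since the last load.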

Metadata Repository
◼ Meta data is the data defining warehouse objects. It stores:
◼ Description of the structure of the data warehouse
  ◼ schema, view, dimensions, hierarchies, derived data defn, data mart locations and contents
◼ Operational meta-data
  ◼ data lineage (history of migrated data and transformation path), currency of data (active, archived, or purged), monitoring information (warehouse usage statistics, error reports, audit trails)
◼ The algorithms used for summarization
◼ The mapping from operational environment to the data warehouse
◼ Data related to system performance
  ◼ warehouse schema, view and derived data definitions
◼ Business data
  ◼ business terms and definitions, ownership of data, charging policies

From Tables and Spreadsheets to Data Cubes
◼ A data warehouse is based on a multidimensional data model which views data in the form of a data cube
◼ A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions
  ◼ Dimension tables, such as item (item_name, brand, type), or time (day, week, month, quarter, year)
  ◼ Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables
◼ In data warehousing literature, an n-D base cube is called a base cuboid. The top most 0-D cuboid, which holds the highest-level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.

Cube: A Lattice of Cuboids
0-D (apex) cuboid: all
1-D cuboids: (time), (item), (location), (supplier)
2-D cuboids: (time, item), (time, location), (time, supplier), (item, location), (item, supplier), (location, supplier)
3-D cuboids: (time, item, location), (time, item, supplier), (time, location, supplier), (item, location, supplier)
4-D (base) cuboid: (time, item, location, supplier)
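
The lattice of cuboids for the four dimensions time, item, location, and supplier can be enumerated mechanically: every subset of the dimension set is one cuboid. The sketch below uses the standard-library `itertools.combinations`; the function name `cuboid_lattice` is made up for illustration.

```python
from itertools import combinations

def cuboid_lattice(dimensions):
    """Group every subset of the dimensions (one cuboid each) by its size k."""
    return {
        k: list(combinations(dimensions, k))
        for k in range(len(dimensions) + 1)
    }

lattice = cuboid_lattice(["time", "item", "location", "supplier"])
for k, cuboids in lattice.items():
    print(f"{k}-D cuboids:", cuboids)

# With no concept hierarchies, n dimensions give 2**n cuboids in total.
total = sum(len(cuboids) for cuboids in lattice.values())
print("total cuboids:", total)
```

The empty tuple at level 0 plays the role of the apex cuboid, and the single 4-tuple at level 4 is the base cuboid.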

Conceptual Modeling of Data Warehouses
◼ Modeling data warehouses: dimensions & measures
  ◼ Star schema: A fact table in the middle connected to a set of dimension tables
  ◼ Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflake
  ◼ Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation

Example of Star Schema
Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city, state_or_province, country
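
A cut-down version of the star schema can be tried out with Python's built-in `sqlite3` module. This is a sketch under assumptions, not the slides' own code: only two of the four dimensions are kept, the table names `time_dim`, `item_dim`, and `sales_fact` are invented for the example, and the rows are toy data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
-- The fact table holds the measures plus one foreign key per dimension.
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim(time_key),
    item_key INTEGER REFERENCES item_dim(item_key),
    units_sold INTEGER,
    dollars_sold REAL
);
INSERT INTO time_dim VALUES (1, 'Q1', 2024), (2, 'Q2', 2024);
INSERT INTO item_dim VALUES (10, 'TV', 'Acme'), (11, 'PC', 'Acme');
INSERT INTO sales_fact VALUES (1, 10, 2, 800.0), (1, 11, 1, 1200.0), (2, 10, 3, 1200.0);
""")

# Join the fact table to one dimension and aggregate a measure per quarter.
rows = conn.execute("""
    SELECT t.quarter, SUM(f.dollars_sold)
    FROM sales_fact AS f
    JOIN time_dim AS t ON f.time_key = t.time_key
    GROUP BY t.quarter
    ORDER BY t.quarter
""").fetchall()
print(rows)
```

A snowflake variant would split a dimension further, for example by moving brand-level attributes out of `item_dim` into their own table referenced by a key.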

Example of Snowflake Schema
Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_key
supplier: supplier_key, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city_key
city: city_key, city, state_or_province, country

Example of Fact Constellation
Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location; measures: dollars_cost, units_shipped
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city, province_or_state, country
shipper: shipper_key, shipper_name, location_key, shipper_type

A Concept Hierarchy: Dimension (location)
all:     all
region:  Europe ... North_America
country: Germany ... Spain; Canada ... Mexico
city:    Frankfurt ...; Vancouver ... Toronto
office:  L. Chan ... M. Wind

Data Cube Measures: Three Categories
◼ Distributive: if the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning
  ◼ E.g., count(), sum(), min(), max()
◼ Algebraic: if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function
  ◼ E.g., avg(), min_N(), standard_deviation()
◼ Holistic: if there is no constant bound on the storage size needed to describe a subaggregate.
  ◼ E.g., median(), mode(), rank()
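
The three measure categories can be checked on a small partitioned data set. The numbers below are toy values chosen for illustration; the point is that sum combines exactly from partial sums, avg can be rebuilt from two distributive aggregates (sum and count), and median cannot be rebuilt from per-partition medians.

```python
from statistics import median

partitions = [[1, 2, 9], [4, 5], [7, 8, 100]]
all_values = [v for part in partitions for v in part]

# Distributive: the sum of the partial sums equals the sum over all the data.
partial_sums = [sum(part) for part in partitions]
total = sum(partial_sums)

# Algebraic: avg is a function of two distributive aggregates, sum and count.
partial_counts = [len(part) for part in partitions]
average = sum(partial_sums) / sum(partial_counts)

# Holistic: combining per-partition medians does not give the true median.
median_of_medians = median([median(part) for part in partitions])
true_median = median(all_values)

print(total, average, median_of_medians, true_median)
```

This is why distributive and algebraic measures are cheap to roll up from lower-level cuboids, while holistic measures generally need the underlying data.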

View of Warehouses and Hierarchies
Specification of hierarchies
◼ Schema hierarchy
  day < {month < quarter; week} < year
◼ Set_grouping hierarchy
  {1..10} < inexpensive

Multidimensional Data
◼ Sales volume as a function of product, month, and region
Dimensions: Product, Location, Time
Hierarchical summarization paths:
  Product:  Industry > Category > Product
  Location: Region > Country > City > Office
  Time:     Year > Quarter > Month > Day, and Week > Day

A Sample Data Cube
[Figure: a 3-D cube with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, PC, VCR, sum), and Country (U.S.A, Canada, Mexico, sum). One highlighted cell holds the total annual sales of TVs in U.S.A.; the (All, All, All) cell holds the grand total.]

Cuboids Corresponding to the Cube
0-D (apex) cuboid: all
1-D cuboids: (product), (date), (country)
2-D cuboids: (product, date), (product, country), (date, country)
3-D (base) cuboid: (product, date, country)

Typical OLAP Operations
◼ Roll up (drill-up): summarize data
  ◼ by climbing up hierarchy or by dimension reduction
◼ Drill down (roll down): reverse of roll-up
  ◼ from higher level summary to lower level summary or detailed data, or introducing new dimensions
◼ Slice and dice: project and select
◼ Pivot (rotate):
  ◼ reorient the cube, visualization, 3D to series of 2D planes
◼ Other operations
  ◼ drill across: involving (across) more than one fact table
  ◼ drill through: through the bottom level of the cube to its back-end relational tables (using SQL)

[Figure: Fig. 3.10 Typical OLAP Operations]
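
Roll-up, slice, and dice can be illustrated on a tiny cube stored as a Python dictionary. This is a minimal sketch with made-up cells and a hypothetical city-to-country hierarchy; a real OLAP server would run these operations against materialized cuboids.

```python
from collections import defaultdict

# Base cuboid cells: (quarter, city, item) -> dollars_sold (toy data).
cube = {
    ("Q1", "Vancouver", "TV"): 100,
    ("Q1", "Toronto", "TV"): 150,
    ("Q1", "Chicago", "PC"): 200,
    ("Q2", "Vancouver", "PC"): 120,
    ("Q2", "Chicago", "TV"): 80,
}
city_to_country = {"Vancouver": "Canada", "Toronto": "Canada", "Chicago": "USA"}

def roll_up_location(cells):
    """Roll up: climb the location hierarchy from city to country."""
    out = defaultdict(int)
    for (quarter, city, item), value in cells.items():
        out[(quarter, city_to_country[city], item)] += value
    return dict(out)

def slice_cube(cells, quarter):
    """Slice: fix one value on the time dimension and drop that dimension."""
    return {(city, item): v for (q, city, item), v in cells.items() if q == quarter}

def dice(cells, quarters, items):
    """Dice: select on two dimensions at once, keeping all three dimensions."""
    return {k: v for k, v in cells.items() if k[0] in quarters and k[2] in items}

by_country = roll_up_location(cube)
q1_slice = slice_cube(cube, "Q1")
tv_cells = dice(cube, {"Q1", "Q2"}, {"TV"})
print(by_country, q1_slice, tv_cells, sep="\n")
```

Drill-down is the inverse of `roll_up_location` and needs the finer-grained cells, which is why it cannot be computed from the rolled-up result alone.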

A Star-Net Query Model
[Figure: a star-net with one radial line per dimension: Customer Orders, Shipping Method, Customer, Time, Product, Organization, Promotion, Location. Abstraction levels marked along the lines include CONTRACTS, ORDER, AIR-EXPRESS, TRUCK, ANNUALLY, QTRLY, DAILY, PRODUCT ITEM, PRODUCT GROUP, PRODUCT LINE, SALES PERSON, DISTRICT, DIVISION, CITY, COUNTRY, REGION. Each circle is called a footprint.]

Browsing a Data Cube
◼ Visualization
◼ OLAP capabilities
◼ Interactive manipulation

Design of Data Warehouse: A Business Analysis Framework
◼ Four views regarding the design of a data warehouse
  ◼ Top-down view
    ◼ allows selection of the relevant information necessary for the data warehouse
  ◼ Data source view
    ◼ exposes the information being captured, stored, and managed by operational systems
  ◼ Data warehouse view
    ◼ consists of fact tables and dimension tables
  ◼ Business query view
    ◼ sees the perspectives of data in the warehouse from the view of end-user

Data Warehouse Design Process
◼ Top-down, bottom-up approaches or a combination of both
  ◼ Top-down: Starts with overall design and planning (mature)
  ◼ Bottom-up: Starts with experiments and prototypes (rapid)
◼ From software engineering point of view
  ◼ Waterfall: structured and systematic analysis at each step before proceeding to the next
  ◼ Spiral: rapid generation of increasingly functional systems, short turnaround time
◼ Typical data warehouse design process
  ◼ Choose a business process to model, e.g., orders, invoices, etc.
  ◼ Choose the grain (atomic level of data) of the business process
  ◼ Choose the dimensions that will apply to each fact table record
  ◼ Choose the measure that will populate each fact table record

Data Warehouse Development: A Recommended Approach
[Figure: define a high-level corporate data model; through model refinement, build data marts and the enterprise data warehouse; these lead to distributed data marts and finally a multi-tier data warehouse.]

Data Warehouse Usage
◼ Three kinds of data warehouse applications
  ◼ Information processing
    ◼ supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs
  ◼ Analytical processing
    ◼ multidimensional analysis of data warehouse data
    ◼ supports basic OLAP operations, slice-dice, drilling, pivoting
  ◼ Data mining
    ◼ knowledge discovery from hidden patterns
    ◼ supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools

From On-Line Analytical Processing (OLAP) to On-Line Analytical Mining (OLAM)
◼ Why online analytical mining?
  ◼ High quality of data in data warehouses
    ◼ DW contains integrated, consistent, cleaned data
  ◼ Available information processing structure surrounding data warehouses
    ◼ ODBC, OLEDB, Web accessing, service facilities, reporting and OLAP tools
  ◼ OLAP-based exploratory data analysis
    ◼ Mining with drilling, dicing, pivoting, etc.
  ◼ On-line selection of data mining functions
    ◼ Integration and swapping of multiple mining functions, algorithms, and tasks

Efficient Data Cube Computation
◼ Data cube can be viewed as a lattice of cuboids
  ◼ The bottom-most cuboid is the base cuboid
  ◼ The top-most cuboid (apex) contains only one cell
  ◼ How many cuboids in an n-dimensional cube with L levels?
    $T = \prod_{i=1}^{n} (L_i + 1)$
    where $L_i$ is the number of levels of dimension $i$
◼ Materialization of data cube
  ◼ Materialize every (cuboid) (full materialization), none (no materialization), or some (partial materialization)
  ◼ Selection of which cuboids to materialize
    ◼ Based on size, sharing, access frequency, etc.
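
The cuboid count is the product of (L_i + 1) over the n dimensions, where L_i is the number of levels of dimension i and the extra 1 accounts for the virtual top level "all". A short sketch of that arithmetic follows; the function name `total_cuboids` is made up for illustration.

```python
from math import prod

def total_cuboids(levels_per_dimension):
    """Number of cuboids: product of (L_i + 1) over all dimensions."""
    return prod(levels + 1 for levels in levels_per_dimension)

# Four dimensions with a single level each: 2**4 cuboids, as in the lattice.
print(total_cuboids([1, 1, 1, 1]))

# Ten dimensions with four levels each: 5**10 cuboids.
print(total_cuboids([4] * 10))
```

The second figure shows why full materialization quickly becomes impractical and partial materialization is usually the realistic choice.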

The “Compute Cube” Operator
◼ Cube definition and computation in DMQL
    define cube sales [item, city, year]: sum (sales_in_dollars)
    compute cube sales
◼ Transform it into a SQL-like language (with a new operator cube by, introduced by Gray et al.’96)
    SELECT item, city, year, SUM (amount)
    FROM SALES
    CUBE BY item, city, year
◼ Need compute the following Group-Bys
    (city, item, year),
    (city, item), (city, year), (item, year),
    (city), (item), (year)
    ()

Indexing OLAP Data: Bitmap Index
◼ Index on a particular column
◼ Each value in the column has a bit vector: bit-op is fast
◼ The length of the bit vector: # of records in the base table
◼ The i-th bit is set if the i-th row of the base table has the value for the indexed column
◼ not suitable for high cardinality domains
  ◼ A recent bit compression technique, Word-Aligned Hybrid (WAH), makes it work for high cardinality domain as well [Wu, et al. TODS’06]

Base table
Cust  Region   Type
C1    Asia     Retail
C2    Europe   Dealer
C3    Asia     Dealer
C4    America  Retail
C5    Europe   Dealer

Index on Region
RecID  Asia  Europe  America
1      1     0       0
2      0     1       0
3      1     0       0
4      0     0       1
5      0     1       0

Index on Type
RecID  Retail  Dealer
1      1       0
2      0       1
3      0       1
4      1       0
5      0       1
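
The bitmap index on the five-row base table (customers C1 to C5) can be reproduced with Python integers used as bit vectors. This is a sketch, not a production index: bit i of each integer stands for row i (0-based, so RecID 1 is bit 0), and the helper name `build_bitmap_index` is invented for the example.

```python
def build_bitmap_index(rows, column):
    """Map each distinct column value to an int whose i-th bit marks row i."""
    index = {}
    for i, row in enumerate(rows):
        index[row[column]] = index.get(row[column], 0) | (1 << i)
    return index

rows = [
    {"cust": "C1", "region": "Asia", "type": "Retail"},
    {"cust": "C2", "region": "Europe", "type": "Dealer"},
    {"cust": "C3", "region": "Asia", "type": "Dealer"},
    {"cust": "C4", "region": "America", "type": "Retail"},
    {"cust": "C5", "region": "Europe", "type": "Dealer"},
]
region_index = build_bitmap_index(rows, "region")
type_index = build_bitmap_index(rows, "type")

# region = 'Europe' AND type = 'Dealer' becomes a single bitwise AND.
match_bits = region_index["Europe"] & type_index["Dealer"]
matches = [rows[i]["cust"] for i in range(len(rows)) if (match_bits >> i) & 1]
print(matches)
```

Selections, intersections, and unions reduce to bit operations, which is where the "bit-op is fast" claim comes from. With many distinct values the number of mostly-zero vectors grows, which motivates compression.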

Indexing OLAP Data: Join Indices
◼ Join index: JI(R-id, S-id) where R (R-id, …) ⋈ S (S-id, …)
◼ Traditional indices map the values to a list of record ids
  ◼ It materializes relational join in JI file and speeds up relational join
◼ In data warehouses, join index relates the values of the dimensions of a star schema to rows in the fact table.
  ◼ E.g. fact table: Sales and two dimensions city and product
    ◼ A join index on city maintains for each distinct city a list of R-IDs of the tuples recording the Sales in the city
  ◼ Join indices can span multiple dimensions

Efficient Processing OLAP Queries
◼ Determine which operations should be performed on the available cuboids
  ◼ Transform drill, roll, etc. into corresponding SQL and/or OLAP operations, e.g., dice = selection + projection
◼ Determine which materialized cuboid(s) should be selected for OLAP op.
  ◼ Let the query to be processed be on {brand, province_or_state} with the condition “year = 2004”, and there are 4 materialized cuboids available:
    1) {year, item_name, city}
    2) {year, brand, country}
    3) {year, brand, province_or_state}
    4) {item_name, province_or_state} where year = 2004
    Which should be selected to process the query?
◼ Explore indexing structures and compressed vs. dense array structs in MOLAP
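
The question of which of the four materialized cuboids can answer the query on {brand, province_or_state} with year = 2004 can be checked mechanically. The sketch below assumes the concept hierarchies item_name < brand and city < province_or_state < country, which the slides imply but do not spell out. A cuboid is usable only if it stores each required dimension at the same or a finer level. The function name `can_answer` is hypothetical.

```python
# Assumed concept hierarchies; index 0 is the finest level of each dimension.
HIERARCHIES = {
    "item": ["item_name", "brand"],
    "location": ["city", "province_or_state", "country"],
    "time": ["year"],
}
LEVEL = {
    attr: (dim, depth)
    for dim, attrs in HIERARCHIES.items()
    for depth, attr in enumerate(attrs)
}

def can_answer(cuboid_attrs, needed_attrs, prefiltered=()):
    """True if every needed attribute is available at an equal or finer level,
    or the cuboid was already restricted on it by the same condition."""
    available = {LEVEL[a][0]: LEVEL[a][1] for a in cuboid_attrs}
    for attr in needed_attrs:
        if attr in prefiltered:
            continue
        dim, depth = LEVEL[attr]
        if dim not in available or available[dim] > depth:
            return False
    return True

needed = ["brand", "province_or_state", "year"]
cuboids = {
    1: (["year", "item_name", "city"], ()),
    2: (["year", "brand", "country"], ()),
    3: (["year", "brand", "province_or_state"], ()),
    4: (["item_name", "province_or_state"], ("year",)),  # where year = 2004
}
usable = [n for n, (attrs, pre) in cuboids.items() if can_answer(attrs, needed, pre)]
print(usable)
```

Cuboid 2 is ruled out because country is coarser than province_or_state. Choosing among the remaining candidates is a cost question (cuboid size, available indices) that this check does not settle.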

OLAP Server Architectures
◼ Relational OLAP (ROLAP)
  ◼ Use relational or extended-relational DBMS to store and manage warehouse data and OLAP middle ware
  ◼ Include optimization of DBMS backend, implementation of aggregation navigation logic, and additional tools and services
  ◼ Greater scalability
◼ Multidimensional OLAP (MOLAP)
  ◼ Sparse array-based multidimensional storage engine
  ◼ Fast indexing to pre-computed summarized data
◼ Hybrid OLAP (HOLAP) (e.g., Microsoft SQLServer)
  ◼ Flexibility, e.g., low level: relational, high-level: array
◼ Specialized SQL servers (e.g., Redbricks)
  ◼ Specialized support for SQL queries over star/snowflake schemas

Attribute-Oriented Induction
◼ Proposed in 1989 (KDD ‘89 workshop)
◼ Not confined to categorical data nor particular measures
◼ How is it done?
  ◼ Collect the task-relevant data (initial relation) using a relational database query
  ◼ Perform generalization by attribute removal or attribute generalization
  ◼ Apply aggregation by merging identical, generalized tuples and accumulating their respective counts
  ◼ Interaction with users for knowledge presentation

Attribute-Oriented Induction: An Example
Example: Describe general characteristics of graduate students in the University database
◼ Step 1. Fetch relevant set of data using an SQL statement, e.g.,
    SELECT name, gender, major, birth_place, birth_date, residence, phone#, gpa
    FROM student
    WHERE student_status IN ('MSc', 'MBA', 'PhD')
◼ Step 2. Perform attribute-oriented induction
◼ Step 3. Present results in generalized relation, cross-tab, or rule forms
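
Step 2, the induction itself, amounts to generalizing each tuple and then merging identical generalized tuples while accumulating counts. The sketch below shows this on three toy student records. The generalization rules (which majors count as "Science", the 3.75 GPA cut-off) are assumptions made for illustration and are not given in the slides.

```python
from collections import Counter

students = [
    {"name": "Jim", "gender": "M", "major": "CS",
     "birth_place": "Vancouver, BC, Canada", "gpa": 3.67},
    {"name": "Scott", "gender": "M", "major": "CS",
     "birth_place": "Montreal, Que, Canada", "gpa": 3.70},
    {"name": "Laura", "gender": "F", "major": "Physics",
     "birth_place": "Seattle, WA, USA", "gpa": 3.83},
]

def generalize(row):
    """Apply a hypothetical generalization plan to one tuple."""
    country = row["birth_place"].split(", ")[-1]
    return (
        row["gender"],                                                # retained
        "Science" if row["major"] in {"CS", "Physics"} else "Other",  # generalized
        "Canada" if country == "Canada" else "Foreign",               # generalized
        "Excellent" if row["gpa"] >= 3.75 else "Very-good",           # generalized
    )  # name is not emitted at all: attribute removal

# Merging identical generalized tuples and counting them in one step.
generalized_relation = Counter(generalize(row) for row in students)
for tuple_, count in generalized_relation.items():
    print(tuple_, count)
```

Attributes such as name have many distinct values and no useful generalization, so they are removed rather than generalized.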

Class Characterization: An Example

Initial Relation
Name            Gender  Major    Birth-Place            Birth_date  Residence                 Phone #   GPA
Jim Woodman     M       CS       Vancouver, BC, Canada  8-12-76     3511 Main St., Richmond   687-4598  3.67
Scott Lachance  M       CS       Montreal, Que, Canada  28-7-75     345 1st Ave., Richmond    253-9106  3.70
Laura Lee       F       Physics  Seattle, WA, USA       25-8-70     125 Austin Ave., Burnaby  420-5232  3.83
…               …       …        …                      …           …                         …         …

Generalization plan per attribute
Name: Removed; Gender: Retained; Major: Sci, Eng, Bus; Birth-Place: Country; Birth_date: Age range; Residence: City; Phone #: Removed; GPA: Excl, VG, ..

Prime Generalized Relation
Gender  Major    Birth_region  Age_range  Residence  GPA        Count
M       Science  Canada        20-25      Richmond   Very-good  16
F       Science  Foreign       25-30      Burnaby    Excellent  22

Cross-tab: Gender by Birth_Region
Gender  Canada  Foreign  Total
M       16      14       30
F       10      22       32
Total   26      36       62

Basic Principles of Attribute-Oriented Induction
◼ Data focusing: task-relevant data, including dimensions, and the result is the initial relation
◼ Attribute-removal: remove attribute A if there is a large set of distinct values for A but (1) there is no generalization operator on A, or (2) A’s higher level concepts are expressed in terms of other attributes
◼ Attribute-generalization: If there is a large set of distinct values for A, and there exists a set of generalization operators on A, then select an operator and generalize A
◼ Attribute-threshold control: typical 2-8, specified/default
◼ Generalized relation threshold control: control the final relation/rule size

Attribute-Oriented Induction: Basic Algorithm
◼ InitialRel: Query processing of task-relevant data, deriving the initial relation.
◼ PreGen: Based on the analysis of the number of distinct values in each attribute, determine generalization plan for each attribute: removal? or how high to generalize?
◼ PrimeGen: Based on the PreGen plan, perform generalization to the right level to derive a “prime generalized relation”, accumulating the counts.
◼ Presentation: User interaction: (1) adjust levels by drilling, (2) pivoting, (3) mapping into rules, cross tabs, visualization presentations.

Presentation of Generalized Results
◼ Generalized relation:
  ◼ Relations where some or all attributes are generalized, with counts or other aggregation values accumulated.
◼ Cross tabulation:
  ◼ Mapping results into cross tabulation form (similar to contingency tables).
◼ Visualization techniques:
  ◼ Pie charts, bar charts, curves, cubes, and other visual forms.
◼ Quantitative characteristic rules:
  ◼ Mapping generalized result into characteristic rules with quantitative information associated with it, e.g.,
    $grad(x) \wedge male(x) \Rightarrow birth\_region(x) = \text{"Canada"}\ [t: 53\%] \vee birth\_region(x) = \text{"foreign"}\ [t: 47\%]$
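
The t-weights in the quantitative characteristic rule (53% and 47%) follow from the male row of the gender by birth-region cross-tab in the class characterization example (16 Canada, 14 Foreign). A minimal sketch of that arithmetic, with a made-up helper name `t_weights`:

```python
# Counts taken from the cross-tab in the class characterization example.
crosstab = {
    "M": {"Canada": 16, "Foreign": 14},
    "F": {"Canada": 10, "Foreign": 22},
}

def t_weights(counts):
    """Share of each generalized value within one class, as a whole percent."""
    total = sum(counts.values())
    return {value: round(100 * n / total) for value, n in counts.items()}

male = t_weights(crosstab["M"])
female = t_weights(crosstab["F"])
print(male, female)
```

Each t-weight describes how typical a generalized tuple is within the target class, so the weights for one class sum to roughly 100% after rounding.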

Mining Class Comparisons
◼ Comparison: Comparing two or more classes
◼ Method:
  ◼ Partition the set of relevant data into the target class and the contrasting class(es)
  ◼ Generalize both classes to the same high level concepts
  ◼ Compare tuples with the same high level descriptions
  ◼ Present for every tuple its description and two measures
    ◼ support - distribution within single class
    ◼ comparison - distribution between classes
  ◼ Highlight the tuples with strong discriminant features
◼ Relevance Analysis:
  ◼ Find attributes (features) which best distinguish different classes

Concept Description vs. Cube-Based OLAP
◼ Similarity:
  ◼ Data generalization
  ◼ Presentation of data summarization at multiple levels of abstraction
  ◼ Interactive drilling, pivoting, slicing and dicing
◼ Differences:
  ◼ OLAP has systematic preprocessing, query independent, and can drill down to rather low level
  ◼ AOI has automated desired level allocation, and may perform dimension relevance analysis/ranking when there are many relevant dimensions
  ◼ AOI works on the data which are not in relational forms

Summary
◼ Data warehousing: A multi-dimensional model of a data warehouse
  ◼ A data cube consists of dimensions & measures
  ◼ Star schema, snowflake schema, fact constellations
  ◼ OLAP operations: drilling, rolling, slicing, dicing and pivoting
◼ Data Warehouse Architecture, Design, and Usage
  ◼ Multi-tiered architecture
  ◼ Business analysis design framework
  ◼ Information processing, analytical processing, data mining, OLAM (Online Analytical Mining)
◼ Implementation: Efficient computation of data cubes
  ◼ Partial vs. full vs. no materialization
  ◼ Indexing OLAP data: Bitmap index and join index
  ◼ OLAP query processing
  ◼ OLAP servers: ROLAP, MOLAP, HOLAP
◼ Data generalization: Attribute-oriented induction

References (I)
◼ S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan, and S. Sarawagi. On the computation of multidimensional aggregates. VLDB’96
◼ D. Agrawal, A. E. Abbadi, A. Singh, and T. Yurek. Efficient view maintenance in data warehouses. SIGMOD’97
◼ R. Agrawal, A. Gupta, and S. Sarawagi. Modeling multidimensional databases. ICDE’97
◼ S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP technology. ACM SIGMOD Record, 26:65-74, 1997
◼ E. F. Codd, S. B. Codd, and C. T. Salley. Beyond decision support. Computer World, 27, July 1993.
◼ J. Gray, et al. Data cube: A relational aggregation operator generalizing group-by, cross-tab and sub-totals. Data Mining and Knowledge Discovery, 1:29-54, 1997.
◼ A. Gupta and I. S. Mumick. Materialized Views: Techniques, Implementations, and Applications. MIT Press, 1999.
◼ J. Han. Towards on-line analytical mining in large databases. ACM SIGMOD Record, 27:97-107, 1998.
◼ V. Harinarayan, A. Rajaraman, and J. D. Ullman. Implementing data cubes efficiently. SIGMOD’96
◼ J. Hellerstein, P. Haas, and H. Wang. Online aggregation. SIGMOD'97

References (II)
◼ C. Imhoff, N. Galemmo, and J. G. Geiger. Mastering Data Warehouse Design: Relational and Dimensional Techniques. John Wiley, 2003
◼ W. H. Inmon. Building the Data Warehouse. John Wiley, 1996
◼ R. Kimball and M. Ross. The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling. 2ed. John Wiley, 2002
◼ P. O’Neil and G. Graefe. Multi-table joins through bitmapped join indices. SIGMOD Record, 24:8–11, Sept. 1995.
◼ P. O'Neil and D. Quass. Improved query performance with variant indexes. SIGMOD'97
◼ Microsoft. OLEDB for OLAP programmer's reference version 1.0. In http://www.microsoft.com/data/oledb/olap, 1998
◼ S. Sarawagi and M. Stonebraker. Efficient organization of large multidimensional arrays. ICDE'94
◼ A. Shoshani. OLAP and statistical databases: Similarities and differences. PODS’00.
◼ D. Srivastava, S. Dar, H. V. Jagadish, and A. V. Levy. Answering queries with aggregation using views. VLDB'96
◼ P. Valduriez. Join indices. ACM Trans. Database Systems, 12:218-246, 1987.
◼ J. Widom. Research problems in data warehousing. CIKM’95
◼ K. Wu, E. Otoo, and A. Shoshani, Optimal Bitmap Indices with Efficient Compression, ACM Trans. on Database Systems (TODS), 31(1): 1-38, 2006

Surplus Slides

Compression of Bitmap Indices
◼ Bitmap indexes must be compressed to reduce I/O costs and minimize CPU usage, since the majority of the bits are 0’s
◼ Two compression schemes:
  ◼ Byte-aligned Bitmap Code (BBC)
  ◼ Word-Aligned Hybrid (WAH) code
◼ Time and space required to operate on compressed bitmap is proportional to the total size of the bitmap
◼ Optimal on attributes of low cardinality as well as those of high cardinality.
◼ WAH outperforms BBC by about a factor of two