Online Analytical Processing (OLAP) : An Overview

This document provides an overview of Online Analytical Processing (OLAP). It discusses the motivation for OLAP which is to enable aggregation, summarization and exploration of historical data to help management make informed decisions. It also describes how OLAP has different requirements than OLTP in terms of data size, time span, workload and performance goals. The document outlines research areas in OLAP including query language extensions, server architecture, parallel processing, index structures and materialized views. It provides details on techniques for simultaneously calculating multiple aggregates in a single pass over the data through array chunking and maintaining minimum spanning trees of aggregates.

Uploaded by

Neha Kohli
Copyright
© Attribution Non-Commercial (BY-NC)

Online Analytical Processing (OLAP)

An Overview

Kian Win Ong, Nicola Onose


Mar 3rd 2006

Overview
Motivation
Multi-Dimensional Data Model
Research Areas
Optimizations
    Materializing multiple aggregates simultaneously
    Materialization strategy

Motivation
Aggregation, summarization and exploration
of historical data
to help management make informed decisions

Different Goal

Product              Branch           Time                   Price
Coke (0.5 gallon)    Convoy Street    2006-03-01 09:00:01    $1.00
Pepsi (0.5 gallon)   UTC              2006-03-01 09:00:01    $1.03
Coke (1 gallon)      UTC              2006-03-01 09:00:02    $1.50
Altoids              Costa Verde      2006-03-01 09:01:33    $0.30
...

Find the total sales for each product and month
Find the percentage change in the total monthly sales for each product
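
The two questions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rows, the "YYYY-MM" month format and the function names `monthly_totals` and `pct_change` are hypothetical and do not come from the slides.

```python
from collections import defaultdict

# Hypothetical fact rows: (product, month as "YYYY-MM", price). Illustrative values only.
sales = [
    ("Coke", "2006-02", 1.00),
    ("Coke", "2006-02", 1.50),
    ("Coke", "2006-03", 3.00),
    ("Pepsi", "2006-02", 2.00),
    ("Pepsi", "2006-03", 1.00),
]

def monthly_totals(rows):
    """Total sales for each (product, month) pair."""
    totals = defaultdict(float)
    for product, month, price in rows:
        totals[(product, month)] += price
    return dict(totals)

def pct_change(totals):
    """Percentage change of each monthly total against the previous
    month that is present for the same product."""
    by_product = defaultdict(list)
    for (product, month), total in totals.items():
        by_product[product].append((month, total))
    changes = {}
    for product, series in by_product.items():
        series.sort()  # "YYYY-MM" strings sort chronologically
        for (_, prev), (month, cur) in zip(series, series[1:]):
            changes[(product, month)] = 100.0 * (cur - prev) / prev
    return changes

totals = monthly_totals(sales)
changes = pct_change(totals)
```

Both answers need every sale of a product in a month, which is why such queries touch far more rows than a typical transactional lookup.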

Different Requirements
OLTP: On-Line Transaction Processing
OLAP: On-Line Analytical Processing

                       OLTP                                              OLAP
Tasks                  Day to day operation                              High level decision support
Size of database       Gigabytes                                         Terabytes
Time span              Recent, up-to-date                                Spanning over months / years
Size of working set    Tens of records, accessed through primary keys    Consolidated data from multiple databases
Workload               Structured / repetitive                           Ad-hoc, exploratory queries
Performance            Transaction throughput                            Query latency

Query Language Extensions

In the real world, data is stored in relational databases (RDBs).
How can N-dimensional problems be expressed using 2-D tables?
Can we combine OLAP and SQL queries?

Jim Gray et al.: Data Cube: A Relational Aggregation Operator, 1997

Query Language Extensions

Problems with GROUP BY
1. Histograms

    SELECT sales, prod_name, population
    FROM sales_history
    GROUP BY Population(City, State) AS population

Query Language Extensions

Problems with GROUP BY
1. Histograms
2. Rollup / drilldown

Non-relational representation:

Product Category   Product Name   Month   Sales   Sales by Cat., by Name   Sales by Cat.
Drinks             Coke           Feb      30.3
                                  Mar      93.9   124.2
                   Heineken       Feb      34.8
                                  Mar     123.8   158.6                    282.8

Query Language Extensions

Problems with GROUP BY
1. Histograms
2. Rollup / drilldown

Relational, but the rollup is huge:

Product Category   Product Name   Month   Sales   Sales by Cat., by Name   Sales by Cat.
Drinks             Coke           Feb      30.3   124.2                    282.8
Drinks             Coke           Mar      93.9   124.2                    282.8
Drinks             Heineken       Feb      34.8   158.6                    282.8
Drinks             Heineken       Mar     123.8   158.6                    282.8

Query Language Extensions

Problems with GROUP BY
1. Histograms
2. Rollup / drilldown
3. Cross tabulations

2-D aggregation is more compact and more natural:

Drinks      Feb     Mar      Total
Coke        30.3    93.9     124.2
Heineken    34.8    123.8    158.6
Total       65.1    217.7    282.8

Query Language Extensions

Reducing the number of attributes

Product Category   Product Name   Month   Sales
Drinks             Coke           Feb      30.3
Drinks             Coke           Mar      93.9
Drinks             Coke           ALL     124.2
Drinks             Heineken       Feb      34.8
Drinks             Heineken       Mar     123.8
Drinks             Heineken       ALL     158.6
Drinks             ALL            ALL     282.8
Drinks             ALL            Feb      65.1
Drinks             ALL            Mar     217.7

Query Language Extensions

Reducing the number of attributes
Introduce a new value: ALL

Drinks         Feb     Mar      Total (ALL)
Coke           30.3    93.9     124.2
Heineken       34.8    123.8    158.6
Total (ALL)    65.1    217.7    282.8

ALL = the set over which we aggregate

Query Language Extensions

General approach
GROUP BY (1D)

Sales by Product Name    Feb     Mar
Coke                     30.3    93.9
Heineken                 34.8    123.8
SUM                      65.1    217.7

Query Language Extensions

General approach
GROUP BY (1D)
Cross Tab (2D)

Drinks      Feb     Mar      ALL
Coke        30.3    93.9     124.2
Heineken    34.8    123.8    158.6
ALL         65.1    217.7    282.8

The corresponding relation:

Product Category   Product Name   Month   Sales
Drinks             Coke           Feb      30.3
Drinks             Coke           Mar      93.9
Drinks             Coke           ALL     124.2
Drinks             Heineken       Feb      34.8
Drinks             Heineken       Mar     123.8
Drinks             Heineken       ALL     158.6
Drinks             ALL            Feb      65.1
Drinks             ALL            Mar     217.7
Drinks             ALL            ALL     282.8

Query Language Extensions

General approach
GROUP BY (1D)
Cross Tab (2D)
Cube (3D)

[Figure: a 3-D cube whose faces are labeled "By cat. and name (does it make sense?)", "By cat. and month" and "By month and name".]

Product Category   Product Name   Month   Sales
Drinks             Coke           Feb      30.3
Drinks             Coke           Mar      93.9
Drinks             Coke           ALL     124.2
Snacks             Doritos        Feb     123.8
Snacks             Doritos        Mar     158.6
Snacks             Doritos        ALL      65.1
...
ALL                ALL            ALL     964.0

Query Language Extensions


General approach GROUP BY (1D) Cross Tab (2D) Cube (3D)

Any hypercube can be represented as a relation!

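Query Language Extensions

General approach
A CUBE relation, with aggregation function f(.), contains tuples of the forms
$(x_1, x_2, \ldots, x_{n-1}, x_n, f(\cdot))$
$(x_1, x_2, \ldots, x_{n-1}, \mathrm{ALL}, f(\cdot))$
$(x_1, x_2, \ldots, \mathrm{ALL}, x_n, f(\cdot))$
...

After ROLLUP, this reduces to a linear number of tuple forms ($n + 1$ instead of $2^n$):
$(x_1, x_2, \ldots, x_{n-1}, x_n, f(\cdot))$
$(x_1, x_2, \ldots, x_{n-1}, \mathrm{ALL}, f(\cdot))$
$(x_1, x_2, \ldots, \mathrm{ALL}, \mathrm{ALL}, f(\cdot))$
...
$(\mathrm{ALL}, \mathrm{ALL}, \ldots, \mathrm{ALL}, \mathrm{ALL}, f(\cdot))$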
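
The difference between the two tuple families can be checked by enumerating them. The sketch below is an illustration only; the helper names `cube_grouping_sets`, `rollup_grouping_sets` and `tuple_pattern` are hypothetical and stand for no particular SQL engine's internals.

```python
from itertools import combinations

ALL = "ALL"

def cube_grouping_sets(dims):
    """CUBE keeps every subset of the dimensions: 2**n grouping sets."""
    return [set(c) for r in range(len(dims), -1, -1) for c in combinations(dims, r)]

def rollup_grouping_sets(dims):
    """ROLLUP keeps only the prefixes of the dimension list: n + 1 grouping sets."""
    return [set(dims[:r]) for r in range(len(dims), -1, -1)]

def tuple_pattern(dims, kept):
    """Shape of the result tuples: dimensions that were aggregated away show ALL."""
    return tuple(d if d in kept else ALL for d in dims)

dims = ["x1", "x2", "x3"]
cube_patterns = [tuple_pattern(dims, s) for s in cube_grouping_sets(dims)]
rollup_patterns = [tuple_pattern(dims, s) for s in rollup_grouping_sets(dims)]
```

For three dimensions this yields 8 patterns for CUBE and 4 for ROLLUP, the latter being exactly the chain that replaces trailing dimensions with ALL one at a time.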

Query Language Extensions

The new operators: CUBE, ROLLUP

    SELECT prod_category, prod_name, month, SUM(sales) AS sales
    FROM sales_history
    GROUP BY CUBE prod_category, prod_name, month

Idea: Group by the CUBE list. Union the aggregates. Introduce the ALL values.

Product Category   Product Name   Month   Sales
Drinks             Coke           Feb      30.3
Drinks             Coke           Mar      93.9
Drinks             Coke           ALL     124.2
...
Drinks             ALL            Feb      99.8
ALL                ALL            ALL     964.0
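
The three-step idea (group by the CUBE list, union the aggregates, introduce ALL) can be sketched outside SQL as well. The following is a minimal Python illustration, not how a database implements CUBE; the function name `cube` is hypothetical, and the input is limited to the four Drinks rows from the earlier slides, so the grand total here is 282.8 rather than the 964.0 of the full table.

```python
from collections import defaultdict
from itertools import combinations

ALL = "ALL"

# The four Drinks rows used earlier: (prod_category, prod_name, month, sales).
rows = [
    ("Drinks", "Coke", "Feb", 30.3),
    ("Drinks", "Coke", "Mar", 93.9),
    ("Drinks", "Heineken", "Feb", 34.8),
    ("Drinks", "Heineken", "Mar", 123.8),
]

def cube(rows, n_dims):
    """SUM the last column for every subset of the first n_dims columns.
    Columns left out of a grouping are reported as ALL."""
    result = defaultdict(float)
    subsets = [set(c) for r in range(n_dims + 1)
               for c in combinations(range(n_dims), r)]
    for row in rows:
        dims, measure = row[:n_dims], row[n_dims]
        for kept in subsets:
            key = tuple(dims[i] if i in kept else ALL for i in range(n_dims))
            result[key] += measure
    return dict(result)

result = cube(rows, 3)
```

Here `result[("Drinks", "Coke", "ALL")]` holds the Coke subtotal and `result[("ALL", "ALL", "ALL")]` the grand total, all in one relation.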

Query Language Extensions

The new operators: CUBE, ROLLUP

    SELECT prod_category, month, day, state, prod_name, SUM(sales) AS sales
    FROM sales_history
    GROUP BY prod_category
             ROLLUP month, day
             CUBE state, prod_name

Product Category   Month   Day   State   Product Name   Sales
Drinks             Feb     26    CA      Coke           12.3
                   Feb     26    CA      Heineken        5.4
                   Feb     26    CA      ALL            30.4
                   Feb     26    ALL     Coke           12.0
Snacks             Feb     26    CA      Doritos

Research Areas
SQL language extensions
Server architecture
Parallel processing
Index structures
Materialized views

Simultaneous Multi-Dimensional Aggregates

Y. Zhao, P. Deshpande, J. Naughton: An Array-Based Algorithm for Simultaneous Multidimensional Aggregates, SIGMOD 1997

Optimization to calculate multiple aggregates simultaneously
Useful for materialization of aggregate views

Multiple Aggregates

Transactional data:

Product    City           Month     Sales
Coke       San Diego      Feb 06    12
Pepsi      Los Angeles    Feb 06    13
Doritos    San Diego      Mar 06    72
Altoids    San Diego      Mar 06    65
...

Aggregate on Month / Product:

Product     Feb    Mar    Total
Altoids      36    131    167
Coke         37    138    175
Doritos      21    136    157
Heineken     44    110    154
Pepsi        31    122    153
Pringles     37    126    164
Total       206    764    970

Multiple Aggregates

Aggregate on City / Product:

Product     San Diego    Los Angeles    Total
Altoids      90           77            167
Coke         89           86            175
Doritos      74           83            157
Heineken     74           80            154
Pepsi        68           85            153
Pringles     73           90            164
Total       469          501            970

Aggregate on Month / City:

City           Feb    Mar    Total
Los Angeles    112    358    469
San Diego       95    407    501
Total          206    764    970

Multiple Aggregates

Aggregates to compute from the transactional data:
1. Sales by Product / City
2. Sales by Product / Month
3. Sales by Month / City
4. Sales by Product
5. Sales by City
6. Sales by Month
7. Sales (Total)

Is it possible to
    make a single pass over the transactional table?
    calculate multiple aggregates simultaneously?
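
In principle the answer is yes: every transaction can update all seven aggregates as it is read. The sketch below illustrates that idea with plain hash tables over the four transactions shown on the slide. It is not the array-based algorithm of the paper, it assumes all seven tables fit in memory at once, and the name `aggregate_single_pass` is hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# The four transactions shown on the slide: (product, city, month, sales).
transactions = [
    ("Coke", "San Diego", "Feb 06", 12),
    ("Pepsi", "Los Angeles", "Feb 06", 13),
    ("Doritos", "San Diego", "Mar 06", 72),
    ("Altoids", "San Diego", "Mar 06", 65),
]
DIMS = ("product", "city", "month")

def aggregate_single_pass(rows):
    """Maintain all seven group-bys while reading each row exactly once."""
    # Index tuples for the 2-D, 1-D and 0-D group-bys: 3 + 3 + 1 = 7.
    group_bys = [c for r in (2, 1, 0) for c in combinations(range(3), r)]
    aggregates = {g: defaultdict(int) for g in group_bys}
    for row in rows:  # the single pass over the transactional table
        sales = row[3]
        for g in group_bys:
            key = tuple(row[i] for i in g)
            aggregates[g][key] += sales
    # Re-key by dimension names for readability.
    return {tuple(DIMS[i] for i in g): dict(t) for g, t in aggregates.items()}

aggs = aggregate_single_pass(transactions)
```

The catch is memory: with large dimensions the seven tables may not fit, which is the problem the chunked, array-based approach addresses.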

Chunking
Partition transactional data into array chunks

[Figure: a 3-D array with axes Dimension A (Product), Dimension B (City) and Dimension C (Month), divided into array chunks numbered 1 to 64. One cell is shown as an example: Product = Coke, City = San Diego, Month = Feb 06, Sales = 12.]
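
Mapping a cell to its chunk is simple integer arithmetic. The sketch below assumes 16 cells per dimension and chunks of 4 x 4 x 4 cells, numbered 1 to 64 with Dimension A varying fastest, then B, then C; these sizes and the names `chunk_number` and `partition` are assumptions made for illustration.

```python
DIM_SIZE = 16      # assumed number of cells along each dimension
CHUNK_SIZE = 4     # assumed number of cells along each side of a chunk
CHUNKS_PER_DIM = DIM_SIZE // CHUNK_SIZE

def chunk_number(a, b, c):
    """1-based chunk number of cell (a, b, c), with 0-based cell coordinates.
    Chunks are numbered along A first, then B, then C."""
    ca, cb, cc = a // CHUNK_SIZE, b // CHUNK_SIZE, c // CHUNK_SIZE
    return 1 + ca + CHUNKS_PER_DIM * cb + CHUNKS_PER_DIM ** 2 * cc

def partition(cells):
    """Group a {(a, b, c): value} mapping of cells by chunk number."""
    chunks = {}
    for (a, b, c), value in cells.items():
        chunks.setdefault(chunk_number(a, b, c), {})[(a, b, c)] = value
    return chunks

chunks = partition({(0, 0, 0): 12, (1, 2, 3): 5, (15, 15, 15): 7})
```

Under these assumptions the first cell falls in chunk 1 and the last cell of the array in chunk 64.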

Naïve Algorithm

Pivot on AB, aggregate on all C
Pivot on AC, aggregate on all B
Pivot on BC, aggregate on all A

[Figure: the 3-D array of chunks 1 to 64, with axes Dimension A, Dimension B and Dimension C.]


Single Pass Algorithm

Make a single pass over data
Simultaneously maintain multiple aggregates
Write out completed aggregates
Only allocate memory that is necessary

[Figure: the 3-D array of chunks 1 to 64 (Dimensions A, B, C) with the AB, AC and BC aggregate planes drawn beside it.]


Single Pass Algorithm

Minimum memory spanning tree
Tree nodes and the sizes shown beside them:

Node                 Size
ABC (array chunk)    4 x 4 x 4
AB                   16 x 4 x 4
AC                   4 x 4 x 4
BC                   4 x 4
A                    4 x 4
B                    4
C                    4
all                  1
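
The sizes for the 2-D aggregates follow a simple pattern when chunks are read with A varying fastest: a dimension that comes before the dropped one in the read order must be held at full length, while a dimension that comes after it needs only one chunk's width. The sketch below computes this, assuming 16 cells per dimension and chunks of 4 cells per side; the function name `memory_for_groupby` is hypothetical.

```python
from math import prod

def memory_for_groupby(dim_sizes, chunk_sizes, dropped):
    """Cells of memory needed for the group-by that drops dimension index
    `dropped`, when chunks are read with dimension 0 varying fastest.
    Dimensions before the dropped one are kept at full size; dimensions
    after it only need one chunk's width."""
    full = prod(dim_sizes[:dropped])
    chunked = prod(chunk_sizes[dropped + 1:])
    return full * chunked

dim_sizes = [16, 16, 16]    # assumed |A|, |B|, |C|
chunk_sizes = [4, 4, 4]     # assumed chunk side lengths

mem_bc = memory_for_groupby(dim_sizes, chunk_sizes, dropped=0)  # drop A
mem_ac = memory_for_groupby(dim_sizes, chunk_sizes, dropped=1)  # drop B
mem_ab = memory_for_groupby(dim_sizes, chunk_sizes, dropped=2)  # drop C
```

With these assumed sizes the results are 16, 64 and 256 cells, which equal the products 4 x 4, 4 x 4 x 4 and 16 x 4 x 4 listed for BC, AC and AB.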

Multi Pass Algorithm

Recursively aggregate

ABCD
ABC   ABD   ACD   BCD
AB   AC   BC   AD   BD   CD
all

Implementing Data Cubes

Biggest problem for data warehouses: the size
Space / time trade-off: accelerate queries by materializing the cube
The size of the relations gets even bigger!
M(ultidimensional)OLAP: good query performance, but bad scalability
R(elational)OLAP: very scalable; query performance improved by materializing (partial) results

Implementing Data Cubes

V. Harinarayan, A. Rajaraman, J.D. Ullman: Implementing Data Cubes Efficiently, SIGMOD 1996
Presents a materialization strategy for the cells of the cube.

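Implementing Data Cubes

[Figure: schema diagram. The Sales table references Time Id, City Id and Product Id; the surrounding tables carry the attributes Day, Week, Month, Year (time), City, State (location), Name, Category (product) and Category Id, Category Name (category).]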

Implementing Data Cubes

Cast as a particular case of the "rewriting using views" problem:
which cells to materialize becomes which SQL views to materialize

Lattice of views (p = product, t = time, c = city):

ptc
pt   pc   tc
p    t    c
none

Simple idea: Q1 depends on Q2 ($Q_1 \preceq Q_2$) if Q1 can be fully answered using the results of Q2
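
For the flat lattice above (no hierarchies), the dependence test reduces to a subset check on the grouping attributes. The sketch below illustrates this; encoding each view as a string of attribute letters, with the empty string for "none", and the names `depends_on` and `ancestors` are assumptions made for the example.

```python
# Each view is encoded by its grouping attributes; "" stands for "none".
views = ["ptc", "pt", "pc", "tc", "p", "t", "c", ""]

def depends_on(q1, q2):
    """True if q1 can be answered from q2, i.e. q1 groups by a subset
    of the attributes that q2 groups by."""
    return set(q1) <= set(q2)

def ancestors(q):
    """All views in the lattice from which q can be answered (q included)."""
    return [v for v in views if depends_on(q, v)]

p_ancestors = ancestors("p")
none_ancestors = ancestors("")
```

Every view is an ancestor of "none", and the top view ptc is an ancestor of everything, so it can always serve as a fallback.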

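Implementing Data Cubes

But cube dimensions are usually hierarchical (each arrow reads "rolls up to"):

product:  product_name -> product_category -> none
time:     day -> week -> none;  day -> month -> year -> none
city:     city -> state -> none

The views form the direct-product lattice of these hierarchies, with nodes such as
ptc, pt, pcatt, pwc, pyc, tc, pc, pmc, pts, ps

p = product, t = time, c = city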

Implementing Data Cubes

Def.: cost of answering Q = number of rows in the table of ancestor(Q), the materialized view used to answer Q
It can be estimated without materializing the views
Assume that every query is identical to some view in the lattice

Implementing Data Cubes

For a set S of materialized views and a view v, the benefit of v is

$B(v, S) = \sum_{w \preceq v,\ w \notin S} \max\{\mathrm{cost}(w) - \mathrm{cost}(v),\ 0\}$

Greedy algorithm for selecting k views to materialize from the lattice:
1. S := {top view}
2. For i = 1 to k, add to S the view v for which B(v, S) is maximized

The greedy algorithm is an $(e-1)/e \approx 0.63$ approximation of the optimum.
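
A minimal sketch of the greedy selection on the flat p / t / c lattice follows. The row counts are hypothetical and chosen only for illustration, cost(w) is taken to be the row count of the cheapest view in S from which w can be answered, and the function names are assumptions rather than anything defined in the paper.

```python
# Hypothetical row counts per view; "" stands for the view "none".
rows = {"ptc": 6000, "pt": 800, "pc": 6000, "tc": 6000,
        "p": 200, "t": 10, "c": 100, "": 1}

def depends_on(w, v):
    """w can be answered from v when w groups by a subset of v's attributes."""
    return set(w) <= set(v)

def current_cost(w, S, rows):
    """Rows read to answer w from the cheapest usable view already in S."""
    return min(rows[u] for u in S if depends_on(w, u))

def benefit(v, S, rows):
    """B(v, S): total saving over all views w that v could answer."""
    return sum(
        max(current_cost(w, S, rows) - rows[v], 0)
        for w in rows
        if depends_on(w, v) and w not in S
    )

def greedy_select(rows, top, k):
    """Start from the top view, then add the highest-benefit view k times."""
    S = [top]
    for _ in range(k):
        candidates = [v for v in rows if v not in S]
        S.append(max(candidates, key=lambda v: benefit(v, S, rows)))
    return S

selected = greedy_select(rows, "ptc", 2)
```

With these numbers the first pick is pt (it makes four views cheaper at once) and the second is c; pc and tc are never worth materializing because they are as large as the top view.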

Discussion
Questions from the audience
