0% found this document useful (0 votes)
401 views55 pages

Data Warehousing (Advanced Query Processing) : Carsten Binnig Donald Kossmann

The document discusses data warehousing and online analytical processing (OLAP), including the differences between OLTP and OLAP systems, common data warehouse architectures like star schemas, and SQL extensions used for aggregation queries in data warehousing. It also provides examples of typical OLAP queries and concepts like drill-down, roll-up, and data cubes used to enable interactive exploration and analysis of aggregated data.

Uploaded by

Priya Shah
Copyright
© Attribution Non-Commercial (BY-NC)
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPT, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
401 views55 pages

Data Warehousing (Advanced Query Processing) : Carsten Binnig Donald Kossmann

The document discusses data warehousing and online analytical processing (OLAP), including the differences between OLTP and OLAP systems, common data warehouse architectures like star schemas, and SQL extensions used for aggregation queries in data warehousing. It also provides examples of typical OLAP queries and concepts like drill-down, roll-up, and data cubes used to enable interactive exploration and analysis of aggregated data.

Uploaded by

Priya Shah
Copyright
© Attribution Non-Commercial (BY-NC)
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPT, PDF, TXT or read online on Scribd
You are on page 1/ 55

Data Warehousing

(Advanced Query Processing)


Carsten Binnig
Donald Kossmann
http://www.systems.ethz.ch
Selected References
• General
– Chaudhuri, Dayal: An Overview of Data Warehousing and OLAP Technology.
SIGMOD Record 1997
– Lehner: Datenbanktechnologie für Data Warehouse Systeme. Dpunkt Verlag 2003
– (…)
• New Operators and Algorithms
– Agrawal, Srikant: Fast Algorithms for Association Rule Mining. VLDB 1994
– Barateiro, Galhardas: A Survey of Data Quality Tools. Datenbank Spektrum 2005
– Börszonyi, Kossmann, Stocker: Skyline Operator. ICDE 2001
– Carey, Kossmann: On Saying Enough Already in SQL. SIGMOD 1997
– Dalvi, Suciu: Efficient Query Evaluation on Probabilistic Databases. VLDB 2004
– Gray et al.: Data Cube... ICDE 1996
– Helmer: Evaluating different approaches for indexing fuzzy sets. Fuzzy Sets and
Systems 2003
– Olken: Database Sampling - A Survey. Technical Report LBL.
– (…)
References (ctd.)
• Projects & Systems
– Aurora, Borealis Systems (Brown, Brandeis, MIT)
– Dittrich, Kossmann, Kreutz: Bridging the Gap between OLAP and
SQL. VLDB 2005 (Btell Demo)
– Garlic (IBM) - Haas et al. VLDB 1997
– STREAM (Stanford)
– Telegraph (Berkeley)
– Trio (Stanford)
• SQL Extensions
– Jensen et al. The Consensus Glossary of Temporal Database
Concepts. Dagstuhl 1997 (also see TSQL)
– Kimball, Strehlo: Why decision support fails and how to fix it.
SIGMOD Record 1995
– Witkowski et al.: Spreadsheets in RDBMS. SIGMOD 2003
History of Databases
• Age of Transactions (70s - 00s)
– Goal: reliability - make sure no data is lost
– 60s: IMS (hierarchical data model)
– 80s: Oracle (relational data model)
• Age of Business Intelligence (95 -)
– Goal: analyze the data -> make business decisions
– Aggregate data for boss. Tolerate imprecision!
– SAP BW, Microstrategy, Cognos, … (rel. model)
• Age of „Data for the Masses“
– Goal: everybody has access to everything, M2M
– Google (text), Cloud (XML, JSON: Services)
Purpose of this Class
• Age of Transactions (70s - 00s)
– Goal: reliability - make sure no data is lost
– 60s: IMS (hierarchical data model)
– 80s: Oracle (relational data model)
• Age of Business Intelligence (95 -)
– Goal: analyze the data -> make business decisions
– Aggregate data for boss. Tolerate imprecision!
– SAP BW, Microstrategy, Cognos, … (rel. model)
• Age of „Data for the Masses“
– Goal: everybody has access to everything, M2M
– Google (text), Cloud (XML, JSON: Services)
Overview
• Motivation and Architecture
• SQL Extensions for Data Warehousing (DSS)
• Algorithms and Query Processing Techniques
• ETL, Virtual Databases (Data Integration)
• Parallel Databases, Cloud
• Data Mining
• Probabilistic Databases
• Industry Talk
OLTP vs. OLAP
• OLTP – Online Transaction Processing
– Many small transactions
(point queries: UPDATE or INSERT)
– Avoid redundancy, normalize schemas
– Access to consistent, up-to-date database
• OLTP Examples:
– Flight reservation (see IS-G)
– Order Management, Procurement, ERP
• Goal: 6000 Transactions per second
(Oracle 1995)
OLTP vs. OLAP
• OLAP – Online Analytical Processing
– Big queries (all the data, joins); no Updates
– Redundancy a necessity (Materialized Views, special-
purpose indexes, de-normalized schemas)
– Periodic refresh of data (daily or weekly)
• OLAP Examples
– Management Information (sales per employee)
– Statistisches Bundesamt (Volkszählung)
– Scientific databases, Bio-Informatics
• Goal: Response Time of seconds / few minutes
OLTP vs. OLAP (Water and Oil)
• Lock Conflicts: OLAP blocks OLTP
• Database design:
– OLTP normalized, OLAP de-normalized
• Tuning, Optimization
– OLTP: inter-query parallelism, heuristic optimization
– OLAP: intra-query parallelism, full-fledged optimization
• Freshness of Data:
– OLTP: serializability
– OLAP: reproducability
• Precision:
– OLTP: ACID
– OLAP: Sampling, Confidence Intervals
Solution: Data Warehouse
• Special Sandbox for OLAP
• Data input using OLTP systems
• Data Warehouse aggregates and replicates data
(special schema)
• New Data is periodically uploaded to Warehouse
• Old Data is deleted from Warehouse
– Archiving done by OLTP system for legal reasons
Architecture
OLTP OLAP

OLTP Applications GUI, Spreadsheets

DB1
DB2
Data Warehouse
DB3
Data Warehouses in the real World
• First industrial projects in 1995
• At beginning, 80% failure rate of projects
• Consultants like Accenture dominate market
• Why difficult: Data integration + cleaning,
poor modeling of business processes in warehous
• Data warehouses are expensive
(typically as expensive as OLTP system)
• Success Story: WalMart - 20% cost reduction
because of Data Warehouse (just in time...)
Products and Tools
• Oracle 11g, IBM DB2, Microsoft SQL Server, ...
– All data base vendors
• SAP Business Information Warehouse
– ERP vendors
• MicroStrategy, Cognos
– Specialized vendors
– „Web-based EXCEL“
• Niche Players (e.g., Btell)
– Vertical application domain
Star Schema (relational)
Dimension Table
(e.g., POS)

Dimension Table Dimension Table


(e.g. Customer) (e.g., Time)

Fact Table
(e.g., Order)

Dimension Table
Dimension Table (e.g., Product)
(e.g., Supplier)
Fact Table (Order)
No. Cust. Date ... POS Price Vol. TAX
001 Heinz 13.5. ... Mainz 500 5 7.0
002 Ute 17.6. ... Köln 500 1 14.0
003 Heinz 21.6. ... Köln 700 1 7.0
004 Heinz 4.10. ... Mainz 400 7 7.0
005 Karin 4.10. ... Mainz 800 3 0.0
006 Thea 7.10. ... Köln 300 2 14.0
007 Nobbi 13.11. ... Köln 100 5 7.0
008 Sarah 20.12 ... Köln 200 4 7.0
Fact Table
• Structure:
– key (e.g., Order Number)
– Foreign key to all dimension tables
– measures (e.g., Price, Volume, TAX, …)
• Store moving data (Bewegungsdaten)
• Very large and normalized
Dimension Table (PoS)
Name Manager City Region Country Tel.
Mainz Helga Mainz South D 1422
Köln Vera Hürth South D 3311

• De-normalized: City -> Region -> Country


• Avoid joins
• fairly small and constant size
• Dimension tables store master data (Stammdaten)
• Attributes are called Merkmale in German
Snowflake Schema
• If dimension tables get too large
– Partition the dimension table
• Trade-Off
– Less redundancy (smaller tables)
– Additional joins needed
• Exercise: Do the math!
Typical Queries
SELECT d1.x, d2.y, d3.z, sum(f.z1), avg(f.z2)
FROM Fact f, Dim1 d1, Dim2 d2, Dim3 d3
WHERE a < d1.feld < b AND d2.feld = c AND
Join predicates
GROUP BY d1.x, d2.y, d3.z;

• Select by Attributes of Dimensions


– E.g., region = „south“
• Group by Attributes of Dimensions
– E.g., region, month, quarter
• Aggregate on measures
– E.g., sum(price * volumen)
Example
SELECT f.region, z.month, sum(a.price * a.volume)
FROM Order a, Time z, PoS f
WHERE a.pos = f.name AND a.date = z.date
GROUP BY f.region, z.month

South May 2500


North June 1200
South October 5200
North October 600
Drill-Down und Roll-Up
• Add attribute to GROUP BY clause
– More detailed results (e.g., more fine-grained results)
• Remove attribute from GROUP BY clause
– More coarse-grained results (e.g., big picture)
• GUIs allow „Navigation“ through Results
– Drill-Down: more detailed results
– Roll-Up: less detailed results
• Typical operation, drill-down along hierarchy:
– E.g., use „city“ instead of „region“
Data Cube
Product
Sales by
Product and
Year

Year
all

Balls
alle
2000
Nets 1999
1998 Region
North South all
Moving Sums, ROLLUP
• Example:
GROUP BY ROLLUP(country, region, city)
Give totals for all countries and regions
• This can be done by using the ROLLUP Operator
• Attention: The order of dimensions in the GROUP
BY clause matters!!!
• Again: Spreadsheets (EXCEL) are good at this
• The result is a table! (Completeness of rel. model!)
ROLLUP alla IBM UDB

SELECT Country, Region, City, sum(price*vol)


FROM Orders a, PoS f
WHERE a.pos = f.name
GROUP BY ROLLUP(Country, Region, City)
ORDER BY Country, Region, City;

Also works for other aggregate functions; e.g., avg().


Result of ROLLUP Operator

D North Köln 1000


D North (null) 1000
D South Mainz 3000
D South München 200
D South (null) 3200
D (null) (null) 4200
Summarizability (Unit)
• Legal Query
SELECT product, customer, unit, sum(volume)
FROM Order
GROUP BY product, customer, unit;
• Legal Query (product -> unit)
SELECT product, customer, sum(volume)
FROM Order
GROUP BY product, customer;
• Illegal Query (add „kg“ to „m“)!!!
SELECT customer, sum(volume)
FROM Order
GROUP BY customer;
Summarizability (de-normalized data)
Region Customer Product Volume Populat.
South Heinz Balls 1000 3 Mio.
South Heinz Nets 500 3 Mio.
South Mary Balls 800 3 Mio.
South Mary Nets 700 3 Mio.
North Heinz Balls 1000 20 Mio.
North Heinz Nets 500 20 Mio.
North Mary Balls 800 20 Mio.
North Mary Nets 700 20 Mio.
Customer, Product -> Revenue
Region -> Population
Summarizability (de-normalized data)
• What is the result of the following query?

SELECT region, customer, product, sum(volume)


FROM Order
GROUP BY ROLLUP(region, customer, product);

• All off-the-shelf databases get this wrong!


• Problem: Total Revenue is 3000 (not 6000!)
• BI Tools get it right: keep track of functional
dependencies
• Problem arises if reports involve several unrelated
measures.
Cube Operator

• Operator that computes all „combinations“


• Result contains „(null)“ Values to encode „all“

SELECT product, year, region, sum(price * vol)


FROM Orders
GROUP BY CUBE(product, year, region);
Result of Cube Operator
Product Region Year Revenue
Netze Nord 1998 ...
Bälle Nord 1998 ...
(null) Nord 1998 ...
Netze Süd 1998 ...
Bälle Süd 1998 ...
(null) Süd 1998 ...
Netze (null) 1998 ...
Bälle (null) 1998 ...
(null) (null) 1998 ...
Visualization as Cube
Product

Year
all

Balls
all
2000
Nets 1999
1998 Region
North South all
Pivot Tables
• Define „columns“ by group by predicates
• Not a SQL standard! But common in products
•Reference: Cunningham, Graefe, Galindo-Legaria: PIVOT
and UNPIVOT: Optimization and Execution Strategies in an
RDBMS. VLDB 2004
UNPIVOT (material, factory)
PIVOT (material, factory)
Btell Demo

http://www.btell.de
Top N
• Many applications require top N queries
• Example 1 - Web databases
– find the five cheapest hotels in Madison
• Example 2 - Decision Support
– find the three best selling products
– average salary of the 10,000 best paid employees
– send the five worst batters to the minors
• Example 3 - Multimedia / Text databases
– find 10 documents about „database“ and „web“.
• Queries and updates, any N, all kinds of data
Key Observation
Top N queries cannot be expressed well in SQL

SELECT * FROM Hotels h


WHERE city = Madison AND
5 > (SELECT count(*) FROM Hotels h1
WHERE city = Madison AND h1.price < h.price);

• So what do you do?


– Implement top N functionality in your application
– Extend SQL and the database management system
Implementation of Top N in the App
• Applications use SQL to get as close as possible
• Get results ordered, consume only N objects
and/or specify predicate to limit # of results
SELECT * SELECT *
FROM Hotels FROM Hotels
WHERE city = Madison WHERE city = Madison
ORDER BY price; AND price < 70;

– either too many results, poor performance


– or not enough results, user must ask query again
– difficult for nested top N queries and updates
Extend SQL and DBMS
• STOP AFTER clause specifies number of results
SELECT *
FROM Hotels
WHERE city = Madison
ORDER BY price
STOP AFTER 5 [WITH TIES];

• Returns five hotels (plus ties)


• Challenge: extend query processor, performance
Top N Queries Revisited
• Example: The five cheapest hotels
SELECT *
FROM Hotels
ORDER BY price
STOP AFTER 5;

• What happens if you have several criteria?


Nearest Neighbor Search
• Cheap and close to the beach
SELECT *
FROM Hotels
ORDER BY distance * x + price * y
STOP AFTER 5;

• How to set x and y ?


Skyline Queries
• Hotels which are close to the beach and cheap.

distance
x x x
x x
x x
x x x x x x
x x
x x Convex Hull
x
x x
Top 5 x Skyline (Pareto Curve)
price

Literatur: Maximum Vector Problem. [Kung et al. 1975]


Syntax of Skyline Queries
• Additional SKYLINE OF clause
[Börszönyi, Kossmann, Stocker 2001]

• Cheap & close to the beach

SELECT *
FROM Hotels
WHERE city = ´Nassau´
SKYLINE OF distance MIN, price MIN;
Flight Reservation
• Book flight from Washington DC to San Jose

SELECT *
FROM Flights
WHERE depDate < ´Nov-13´
SKYLINE OF price MIN,
distance(27750, dept) MIN,
distance(94000, arr) MIN,
(`Nov-13` - depDate) MIN;
Visualisation (VR)
• Skyline of NY (visible buildings)

SELECT * FROM Buildings


WHERE city = `New York`
SKYLINE OF h MAX, x DIFF, z MIN;
Location-based Services
• Cheap Italian restaurants that are close
• Query with current location as parameter

SELECT *
FROM Restaurants
WHERE type = `Italian`
SKYLINE OF price MIN, d(addr, ?) MIN;
Skyline and Standard SQL
• Skyline can be expressed as nested Queries

SELECT *
FROM Hotels h
WHERE NOT EXISTS (
SELECT * FROM Hotels
WHERE h.price >= price AND h.d >= d
AND (h.price > price OR h.d > d))

• Such queries are quite frequent in practice


• The response time is desastrous
Online Aggregation
• Get approximate result very quickly
• Result (conf. intervals) get better over time
• Based on random sampling (difficult!)
• No product supports this yet

SELECT cust, avg(price)


Cust Avg +/- Conf
FROM Order Heinz 1375 5% 90%
GROUP BY cust Ute 2000 5% 90%
Karin - - -
Time Travel
• Give results of query AS OF a certain point in time
• Idea: Database is a sequence of states
– DB1, DB2, DB3, … DBn
– Each commit of a transaction creates a new state
– To each state, associate time stamp and version number
• Idea builds on top of „serialization“
– Time travel mostly relevant for OLTP system in order to
get reproducable results or recover old data
• Implementation (Oracle - Flashback)
– Leverage versioned data store + snapshot semantics
– Chaining of versions of records
– Specialized index structures (add time as a „parameter“)
Time Travel Syntax
• Give me avg(price) per customer as of last week
SELECT cust, avg(price)
FROM Order AS OF MAR-23-2007
GROUP BY cust

• Can use timestamp or version number


– Special built-in functions to convert timestamp <-> von
– None of this is standardized (all Oracle specific)
Notification (Oracle)
• Inform me when account drops below 1000
SELECT *
FROM accounts a
WHEN a.balance < 1000

• Based on temporal model


– Query state transitions; monitor transition: false->true
– No notification if account stays below 1000
• Some issues:
– How to model „delete“?
– How to create an RSS / XML stream of events?
DBMS for Data Warehouses
• ROLAP – Extend RDBMS
– Special Star-Join Techniques
– Bitmap Indexes
– Partition Data by Time (Bulk Delete)
– Materialized Views
• MOLAP – special multi-dimensional systems
– Implement cube as (multi-dim.) array
– Pro: potentially fast (random access in array)
– Problem: array is very sparse
• Religious war (ROLAP wins in industry)
Materialized Views
• Compute the result of a query using the
result of another query
• Principle: Subsumption
– The set of all German researchers is a subset of
the set of all researchers
– If query asks for German researchers, use set of
all researchers, rather than all people
• Subsumption works well for GROUP BY
Materialized View

SELECT product, year, region, sum(price * vol)


FROM Order
GROUP BY product, year, region;

GROUP BY Materialized
product, year View

SELECT product, year, sum(price * vol)


FROM Order
GROUP BY product, year;
Computation Graph of Cube
{}

{product} {year} {region}

{product, year} {product, region} {year, region}

{product, year, region}

You might also like