Data Warehousing (Advanced Query Processing): Carsten Binnig, Donald Kossmann
[Figure: databases DB1, DB2, DB3 and the Data Warehouse]
Data Warehouses in the Real World
• First industrial projects in 1995
• At the beginning, 80% of projects failed
• Consultants like Accenture dominate market
• Why it is difficult: data integration + cleaning,
poor modeling of business processes in the warehouse
• Data warehouses are expensive
(typically as expensive as OLTP system)
• Success Story: WalMart - 20% cost reduction
because of Data Warehouse (just in time...)
Products and Tools
• Oracle 11g, IBM DB2, Microsoft SQL Server, ...
– All database vendors
• SAP Business Information Warehouse
– ERP vendors
• MicroStrategy, Cognos
– Specialized vendors
– "Web-based EXCEL"
• Niche Players (e.g., Btell)
– Vertical application domain
Star Schema (relational)
[Figure: star schema with a central Fact Table (e.g., Order) connected to Dimension Tables (e.g., POS, Product, Supplier)]
Fact Table (Order)
No.  Cust.  Date    ...  POS    Price  Vol.  TAX
001  Heinz  13.5.   ...  Mainz  500    5     7.0
002  Ute    17.6.   ...  Köln   500    1     14.0
003  Heinz  21.6.   ...  Köln   700    1     7.0
004  Heinz  4.10.   ...  Mainz  400    7     7.0
005  Karin  4.10.   ...  Mainz  800    3     0.0
006  Thea   7.10.   ...  Köln   300    2     14.0
007  Nobbi  13.11.  ...  Köln   100    5     7.0
008  Sarah  20.12.  ...  Köln   200    4     7.0
Fact Table
• Structure:
– Key (e.g., Order Number)
– Foreign keys to all dimension tables
– Measures (e.g., Price, Volume, TAX, …)
• Stores transactional data (German: Bewegungsdaten, literally "moving data")
• Very large and normalized
Dimension Table (PoS)
Name   Manager  City   Region  Country  Tel.
Mainz  Helga    Mainz  South   D        1422
Köln   Vera     Hürth  South   D        3311
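The fact and dimension tables above can be tried out with a small sketch. It assumes Python's built-in sqlite3 module, and the table names (orders, pos) and the reduced column set are choices made for this illustration, not part of the slides. The query joins the fact table to the PoS dimension and sums the stored Price and Vol. columns per city of the point of sale.

```python
import sqlite3

# In-memory database; table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pos (
    name TEXT PRIMARY KEY, manager TEXT, city TEXT, region TEXT, country TEXT
);
CREATE TABLE orders (
    no TEXT PRIMARY KEY,
    cust TEXT,
    pos TEXT REFERENCES pos(name),  -- foreign key into the dimension table
    price INTEGER,                  -- measure
    vol INTEGER                     -- measure
);
INSERT INTO pos VALUES
    ('Mainz', 'Helga', 'Mainz', 'South', 'D'),
    ('Köln',  'Vera',  'Hürth', 'South', 'D');
INSERT INTO orders VALUES
    ('001', 'Heinz', 'Mainz', 500, 5),
    ('002', 'Ute',   'Köln',  500, 1),
    ('003', 'Heinz', 'Köln',  700, 1),
    ('004', 'Heinz', 'Mainz', 400, 7),
    ('005', 'Karin', 'Mainz', 800, 3),
    ('006', 'Thea',  'Köln',  300, 2),
    ('007', 'Nobbi', 'Köln',  100, 5),
    ('008', 'Sarah', 'Köln',  200, 4);
""")

# Typical star-schema query: join fact to dimension, group by a dimension attribute.
star_rows = conn.execute("""
    SELECT p.city, SUM(o.price) AS total_price, SUM(o.vol) AS total_vol
    FROM orders o
    JOIN pos p ON o.pos = p.name
    GROUP BY p.city
    ORDER BY p.city
""").fetchall()

for row in star_rows:
    print(row)
```

The measures live only in the fact table, while descriptive attributes such as city come from the dimension table through the foreign key.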
[Figure: data cube with the dimensions Year (1998, 1999, 2000, all), Region (North, South, all), and a third dimension with the values Balls, Nets, all]
Moving Sums, ROLLUP
• Example:
GROUP BY ROLLUP(country, region, city)
Gives subtotals per (country, region) and per country, plus a grand total
• This is done with the ROLLUP operator
• Caution: the order of the dimensions in the GROUP BY clause matters!
• Again: spreadsheets (EXCEL) are good at this
• The result is a table! (Completeness of the relational model!)
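What ROLLUP(country, region, city) computes can be sketched by spelling out its grouping levels by hand. The sketch assumes Python's sqlite3 module; since SQLite is assumed here not to offer the ROLLUP operator, the levels are emulated with UNION ALL. The sales table and its rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (country TEXT, region TEXT, city TEXT, amount INTEGER);
INSERT INTO sales VALUES
    ('D',  'South', 'Mainz',  10),
    ('D',  'South', 'Hürth',  20),
    ('D',  'North', 'Kiel',    5),
    ('CH', 'East',  'Zürich',  7);
""")

# Emulation of GROUP BY ROLLUP(country, region, city):
# one grouping per prefix of the dimension list, NULL marks "all".
rollup_rows = conn.execute("""
    SELECT country, region, city, SUM(amount) FROM sales
    GROUP BY country, region, city
    UNION ALL
    SELECT country, region, NULL, SUM(amount) FROM sales
    GROUP BY country, region
    UNION ALL
    SELECT country, NULL, NULL, SUM(amount) FROM sales
    GROUP BY country
    UNION ALL
    SELECT NULL, NULL, NULL, SUM(amount) FROM sales
""").fetchall()

for row in rollup_rows:
    print(row)
```

Only prefixes of the dimension list are grouped, which is why the order matters: ROLLUP(city, region, country) would produce totals per city and per (city, region) instead of per country.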
ROLLUP à la IBM UDB
[Figure: data cube with the dimensions Year (1998, 1999, 2000, all), Region (North, South, all), and a third dimension with the values Balls, Nets, all]
Pivot Tables
• Define "columns" by group by predicates
• Not a SQL standard! But common in products
• Reference: Cunningham, Graefe, Galindo-Legaria: PIVOT
and UNPIVOT: Optimization and Execution Strategies in an
RDBMS. VLDB 2004
UNPIVOT (material, factory)
PIVOT (material, factory)
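Where no PIVOT operator is available, the idea of defining columns by predicates can be approximated with conditional aggregation. The following is a minimal sketch under that assumption, using Python's sqlite3 module. The stock table, its qty column, and the factory names A and B are hypothetical; only the attribute names material and factory come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stock (material TEXT, factory TEXT, qty INTEGER);
INSERT INTO stock VALUES
    ('steel', 'A', 10),
    ('steel', 'B',  5),
    ('wood',  'A',  3),
    ('steel', 'A',  2);
""")

# Each output column is defined by a predicate on factory;
# rows are grouped by material.
pivot_rows = conn.execute("""
    SELECT material,
           SUM(CASE WHEN factory = 'A' THEN qty ELSE 0 END) AS factory_a,
           SUM(CASE WHEN factory = 'B' THEN qty ELSE 0 END) AS factory_b
    FROM stock
    GROUP BY material
    ORDER BY material
""").fetchall()

for row in pivot_rows:
    print(row)
```

The drawback of this workaround is that the set of output columns must be known when the query is written, which is one reason products add a dedicated operator.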
Btell Demo
http://www.btell.de
Top N
• Many applications require top N queries
• Example 1 - Web databases
– find the five cheapest hotels in Madison
• Example 2 - Decision Support
– find the three best selling products
– average salary of the 10,000 best paid employees
– send the five worst batters to the minors
• Example 3 - Multimedia / Text databases
– find 10 documents about "database" and "web"
• Queries and updates, any N, all kinds of data
Key Observation
Top N queries cannot be expressed well in SQL
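One way to see the point is to write a top N query without any dedicated construct. The sketch below assumes Python's sqlite3 module and a made-up hotels table. It finds the three cheapest hotels first with a correlated counting subquery, then with LIMIT, a shortcut that SQLite and several other systems provide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hotels (name TEXT, price INTEGER);
INSERT INTO hotels VALUES
    ('A', 120), ('B', 80), ('C', 95), ('D', 60), ('E', 150);
""")

# Without a top N construct: keep a hotel if fewer than 3 hotels are cheaper.
# The subquery is re-evaluated conceptually for every outer row.
top3_subquery = conn.execute("""
    SELECT name, price
    FROM hotels h
    WHERE (SELECT COUNT(*) FROM hotels h2 WHERE h2.price < h.price) < 3
    ORDER BY price
""").fetchall()

# With the LIMIT shortcut: sort and stop after 3 rows.
top3_limit = conn.execute("""
    SELECT name, price FROM hotels ORDER BY price LIMIT 3
""").fetchall()

print(top3_subquery)
print(top3_limit)
```

The two formulations agree here because all prices are distinct. With ties, the counting version may return more than three rows while LIMIT cuts off at exactly three.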
[Figure: hotels plotted by price (x-axis) and distance (y-axis), contrasting Top 5, Convex Hull, and Skyline (Pareto Curve)]
SELECT *
FROM Hotels
WHERE city = 'Nassau'
SKYLINE OF distance MIN, price MIN;
Flight Reservation
• Book flight from Washington DC to San Jose
SELECT *
FROM Flights
WHERE depDate < 'Nov-13'
SKYLINE OF price MIN,
distance(27750, dept) MIN,
distance(94000, arr) MIN,
('Nov-13' - depDate) MIN;
Visualisation (VR)
• Skyline of NY (visible buildings)
SELECT *
FROM Restaurants
WHERE type = 'Italian'
SKYLINE OF price MIN, d(addr, ?) MIN;
Skyline and Standard SQL
• Skyline can be expressed as a nested query
SELECT *
FROM Hotels h
WHERE NOT EXISTS (
SELECT * FROM Hotels
WHERE h.price >= price AND h.d >= d
AND (h.price > price OR h.d > d))
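The nested query can be run as is on a small example. The sketch assumes Python's sqlite3 module and a made-up hotels table with the columns price and d (distance). The alias o for the inner table is added only for readability, and the plain Python check is an independent cross-check of the dominance rule.

```python
import sqlite3

# Hypothetical hotels: (name, price, d)
hotels = [("A", 50, 8), ("B", 80, 3), ("C", 100, 1), ("D", 90, 5), ("E", 60, 9)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hotels (name TEXT, price INTEGER, d INTEGER)")
conn.executemany("INSERT INTO hotels VALUES (?, ?, ?)", hotels)

# A hotel h is in the skyline if no other hotel o is at least as good
# in both dimensions and strictly better in at least one.
skyline_sql = [row[0] for row in conn.execute("""
    SELECT name
    FROM hotels h
    WHERE NOT EXISTS (
        SELECT * FROM hotels o
        WHERE h.price >= o.price AND h.d >= o.d
          AND (h.price > o.price OR h.d > o.d))
    ORDER BY name
""")]

def dominates(o, h):
    """True if hotel o dominates hotel h (both dimensions are minimized)."""
    return o[1] <= h[1] and o[2] <= h[2] and (o[1] < h[1] or o[2] < h[2])

# Cross-check: same definition, evaluated with nested loops in Python.
skyline_py = sorted(h[0] for h in hotels
                    if not any(dominates(o, h) for o in hotels))

print(skyline_sql)
print(skyline_py)
```

Hotel D is dominated by B and hotel E by A, so neither appears in the result. Both variants compare every hotel with every other hotel, which is quadratic in the number of rows.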
[Figure: Materialized View; GROUP BY product, year]