8 Data Warehousing
Hassan Khosravi
Based on slides from Ed, George, Laks, Jennifer Widom (Stanford),
and Jiawei Han (Illinois)
Learning Goals
❖ Compare and contrast OLAP and OLTP processing (e.g., focus,
clients, amount of data, abstraction levels, concurrency, and
accuracy).
❖ Explain the ETL tasks (i.e., extract, transform, load) for data
warehouses.
❖ Explain the differences between a star schema design and a
snowflake design for a data warehouse, including potential
tradeoffs in performance.
❖ Argue for the value of a data cube in terms of: the type of data in
the cube (numeric, categorical, counts, sums) and the goals of
OLAP (e.g., summarization, abstractions).
❖ Estimate the complexity of a data cube in terms of the number of
views that a given fact table and set of dimensions could generate,
and provide some ways of managing this complexity.
Learning Goals (cont.)
❖ Given a multidimensional cube, write regular SQL queries that
perform roll-up, drill-down, slicing, dicing, and pivoting
operations on the cube.
❖ Use the SQL:1999 standards for aggregation (e.g., GROUP BY
CUBE) to efficiently generate the results for multiple views.
❖ Explain why having materialized views is important for a data
warehouse.
❖ Determine which set of views are most beneficial to materialize.
❖ Given an OLAP query and a set of materialized views, determine
which views could be used to answer the query more efficiently
than the fact table (base view).
❖ Define and contrast the various methods and policies for
materialized view maintenance.
What We Have Focused on So Far
❖ OLTP (On-Line Transaction Processing)
– class of information systems that facilitate and manage
transaction-oriented applications, typically for data entry
and retrieval transaction processing.
– the system responds immediately to user requests.
– high throughput and insert- or update-intensive database
management. These applications are used concurrently
by hundreds of users.
❖ The key goals of OLTP applications are availability,
speed, concurrency and recoverability.
source: Wikipedia
On-Line Transaction Processing
❖ OLTP systems are used to “run” a business.

                 OLTP
Typical User     Basically everyone (many concurrent users)
Type of Data     Current, operational, frequent updates
Type of Query    Short, often predictable
# of Queries     Many concurrent queries
Access           Many reads, writes, and updates
DB Design        Application oriented
A Producer Wants to Know …
❖ Who are our lowest/highest margin customers?
❖ Who are my customers, and what products are they buying?
❖ What is the most effective distribution channel?
Recognized by many
as the father of the
data warehouse
Data Warehouse—Subject-Oriented
❖ Subject-Oriented: Data that gives information about a
particular subject area such as customer, product, and
sales instead of about a company's ongoing operations.
1. Focusing on the modeling and analysis of data for
decision makers, not on daily operations or transaction
processing
2. Provides a simple and concise view around particular
subject issues by excluding data that are not useful in
the decision support process
Application-Oriented

Membership levels
ID  Type   Fee
A   Gold   $100
B   Basic  $50

Visit Level
ID  Type     Fee
YP  Pool     $15
NP  No pool  $10

Members
ID   Name  Level  StartDate
111  Joe   A      01/01/2008
222  Sue   B      01/01/2008
333  Pat   A      01/01/2008
….

Non-member Visits
ID  VID  VisitDate
3   NP   01/01/2008
….

Subject-Oriented

Revenue
R-ID  Date        By          Amount
7235  01/01/2008  Non-Member  $15
7236  01/01/2008  Member      $100
7237  01/01/2008  Member      $50
7238  01/01/2008  Member      $100
…
Data Warehouse—Integrated
❖ Constructed by integrating multiple, heterogeneous data
sources.
– relational databases, XML, flat files, on-line transaction
records
A Typical Data Integration Scenario
Part 1 of 2
❖ Consider Canada Safeway’s data sources:
– Operational data from daily purchase transactions, in each store
– Data about item placement on shelves
– Supplier data
– Data about employees, their compensation, etc.
– Sales/promotion plans
– Product categories and sub-categories; brands, types; customer
demographics; time and date of sale
❖ Each of the above is essentially an autonomous OLTP
database (or set of tables)
– Local queries; no queries cutting across multiple databases
– Data must be current at all times
– Support for concurrency, recovery, and transaction management are a
must.
A Typical Data Integration Scenario,
Part 2 of 2
❖ Consider the following use-case queries:
❖ How does the sale of hamburgers for Feb. 2015 compare with
that for Feb. 2014?
❖ What were the sales of ketchup like last week (when
hamburgers and ketchup were placed next to each other)
compared to the previous week (when they were far apart)?
❖ What was the effect of the promotion on ground beef on the
sales of hamburger buns and condiments?
❖ How has the reorganization of the store(s) impacted sales?
❖ Be specific here to try to see cause-and-effect—especially
with respect to prior periods’ sales.
❖ What was the total sales volume on all frozen food items (not
just one item or a small set of items)?
Data Warehouse Integration
Challenges
❖ When getting data from multiple sources, we must eliminate
mismatches (e.g., different currencies, units/scales, schemas)
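The "transform" step that removes such mismatches can be sketched in a few lines. This is only an illustration: the row layout, the store names, and the choice of litres as the warehouse's common unit are assumptions made for the example, not part of the course material.

```python
# Hypothetical source rows: each source system reports volume in its own unit.
US_GALLON_IN_LITRES = 3.785411784  # definition of the US liquid gallon

def to_litres(volume, unit):
    """Normalize a volume to litres (the assumed common warehouse unit)."""
    if unit == "L":
        return volume
    if unit == "gal":
        return volume * US_GALLON_IN_LITRES
    raise ValueError(f"unknown unit: {unit}")

def transform(rows):
    """Transform step of ETL: map every source row onto one target schema."""
    return [
        {"store": r["store"], "litres": round(to_litres(r["volume"], r["unit"]), 3)}
        for r in rows
    ]

source_rows = [
    {"store": "Vancouver", "volume": 100.0, "unit": "L"},
    {"store": "Seattle", "volume": 10.0, "unit": "gal"},
]
clean_rows = transform(source_rows)
```

Rejecting unknown units instead of passing them through is a deliberate choice here: a silent mismatch would corrupt every aggregate computed later.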
DW Integration Challenges (cont.)
❖ e.g., Shell may need to deal with data mismatches
throughout the multinational organization:
– Multiple currencies and dynamic exchange rates
– Gallons vs. litres; thousands of cubic feet (of gas) vs. cubic
metres vs. British Thermal Units (BTUs)
– Different suppliers, contractors, unions, and business
partner relationships
– Different legal, tax, and royalty structures
– Local, provincial, federal, and international regulations
– Different statutory holidays (when reporting holiday sales)
– Light Sweet Crude (Nigeria) vs. Western Canada Select
(Alberta, heavier crude oil)
– Joint ownership of resources (partners)
– Retail promotions in its stores; different products
DW—Time Variant
❖ Time-Variant: All data in the data warehouse is associated
with a particular time period.
❖ The time horizon for a DW is significantly longer than for
operational systems.
– Operational DB: all data is current, and subject to change
– DW: contains lots of historical data that may never change, but
may have utility to the business when determining trends,
outliers, profitability; effect of business decisions or changes to
policy; pre-compute aggregations; record monthly balances or
inventory; etc.
◆ DW data is tagged with date and time, explicitly or implicitly
DW—Non-volatile
❖ Non-volatile: Data is stable in a data warehouse. More
data is added, but data is not removed. This enables
management to gain a consistent picture of the business.
❖ Real-time updates of operational data typically do not
occur in the DW environment; instead, updates are applied in bulk later
(e.g., overnight, weekly, monthly).
– DW does not require transaction processing, recovery,
and concurrency control mechanisms.
– DW focuses on two major operations:
◆ loading of data and accessing of data
Operational DBMS vs. Data Warehouse
❖ Operational DBMS
– Day-to-day operations: purchasing, inventory, banking, payroll,
manufacturing, registration, accounting, etc.
– Used to run a business
❖ Data Warehouse
– Data analysis and decision making
– Integrated data spanning long time periods, often augmented with
summary information
– Helps to “optimize” the business
Why a Separate Data Warehouse?
❖ High performance for both systems
– DBMS— tuned for OLTP: access methods, indexing,
concurrency control, recovery
– Warehouse—tuned for complex queries, multidimensional
views, consolidation
– Typical size: MB to GB (operational DBMS) vs. GB to TB (data warehouse)
Data Warehousing
❖ The process of constructing …
[Figure: external data sources pass through EXTRACT, TRANSFORM, LOAD, and REFRESH steps into the data warehouse, which supports OLAP and data mining.]
Business Intelligence: 3 Major Areas
1. Data Warehousing
– Consolidate and integrate operational OLTP databases from many
sources into one large, well-organized repository
◆ e.g., acquire data from k different stores or branches
◆ Must handle conflicts in schemas, semantics, platforms, integrity
constraints, etc.
◆ May need to perform data cleaning
– Load data through periodic updates
◆ Synchronization and currency considerations
– Maintain an archive of potentially useful historical data, including
pre-processed aggregations and summaries.
❖ Data Warehouses are then used to perform OLAP and Data
Mining (DM).
Business Intelligence: 3 Major Areas
2. OLAP
– Perform complex SQL queries and views, including trend
analysis, drilling down for more details, and rolling up to
provide more easily understood summaries.
– Perform interactive, exploratory data analysis
– Queries are based on spreadsheet-style operations, albeit
on a “multidimensional” scale/view of data.
– Queries are normally performed by domain (business)
experts rather than database experts.
3. Data Mining
– Exploratory search for interesting trends (patterns) and
anomalies (e.g., outliers, deviations) using more
sophisticated algorithms (as opposed to queries).
That is all great, but what are the
challenges with data warehousing?
❖ Semantic Integration: Extract, Transform, Load
challenges. We already talked about the Shell example.
Data Warehousing Challenges
(Answering Queries Quickly)
❖ Pre-compute and store (materialize) some answers
– Use a Data Cube to store summarized/aggregated data to
answer queries, instead of having to go through a much
bigger table to find the same answer.
– The computation is similar in spirit to relational query
optimization which is studied in detail in CPSC 404.
OLAP Queries
❖ OLAP queries are full of groupings and
aggregations.
❖ The natural way to think about such queries is in
terms of a multidimensional model, which is an
extension of the table model in regular relational
databases.
❖ This model focuses on:
– a set of numerical measures: quantities that are
important for business analysis, like sales, etc.
– a set of dimensions: entities on which the measures
depend, like location, date, etc.
Multidimensional Data Model
❖ The main relation, which relates dimensions to a
measure via foreign keys, is called the fact table.
– Recall that a FK in one table refers to a candidate key (and most
of the time, the primary key) in another table.
– The fact table has FKs to the dimension tables.
– These mappings are essential.
❖ Each dimension can have additional attributes and an
associated dimension table.
– Attributes can be numeric, categorical, temporal, counts, sums
Running Example
❖ Star Schema – fact table references dimension tables
– Join → Filter → Group → Aggregate
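[Figure: star schema with the fact table at the centre and three dimension tables around it, with attributes State, County, City; Category, Color; Gender, Age.]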
Running Example (cont.)
Sales(storeID, itemID, custID, price)
Store(storeID, city, county, state)
Item(itemID, category, color)
Customer(custID, cname, gender, age)
Example values in the dimension tables:
– Store: state (CA, WA); county (Santa Clara, San Mateo, King);
city (Palo Alto, Mountain View, Menlo Park, Belmont, Seattle, Redmond)
– Item: category (T-shirt, Jacket); color (Red, Blue)
– Customer: gender (M, F); age (20, 21, 22, 25)
Full Star Join
❖ An example of how to find the full star join (or complete star
join) among 4 tables (i.e., fact table + all 3 of its dimensions) in
a Star Schema:
– Join on the foreign keys
SELECT *
FROM Sales F, Store S, Item I, Customer C
WHERE F.storeID = S.storeID AND
F.itemID = I.itemID AND
F.custID = C.custID;
❖ If we join fewer than all dimensions, then we have a star join.
❖ In general, OLAP queries can be answered by computing
some or all of the star join, then by filtering, and then by
aggregating.
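The join, filter, aggregate pattern can be tried on a toy instance of the running-example schema. The sketch below uses Python's built-in sqlite3 module purely for convenience; the rows and customer names are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Store(storeID INTEGER PRIMARY KEY, city TEXT, county TEXT, state TEXT);
CREATE TABLE Item(itemID INTEGER PRIMARY KEY, category TEXT, color TEXT);
CREATE TABLE Customer(custID INTEGER PRIMARY KEY, cname TEXT, gender TEXT, age INTEGER);
CREATE TABLE Sales(storeID INTEGER REFERENCES Store, itemID INTEGER REFERENCES Item,
                   custID INTEGER REFERENCES Customer, price REAL);
-- Invented sample rows.
INSERT INTO Store VALUES (1,'Palo Alto','Santa Clara','CA'), (2,'Seattle','King','WA');
INSERT INTO Item VALUES (1,'T-shirt','Red'), (2,'Jacket','Blue');
INSERT INTO Customer VALUES (1,'Ann','F',20), (2,'Bob','M',25);
INSERT INTO Sales VALUES (1,1,1,10), (1,2,2,65), (2,1,2,12), (2,1,1,8);
""")

# Full star join (all three dimensions), then filter, then group and aggregate.
rows = conn.execute("""
SELECT I.category, SUM(F.price)
FROM Sales F, Store S, Item I, Customer C
WHERE F.storeID = S.storeID AND F.itemID = I.itemID AND F.custID = C.custID
  AND C.gender = 'F'
GROUP BY I.category
ORDER BY I.category
""").fetchall()
```

Only the Customer and Item dimensions are actually needed for this particular question, so a partial star join would return the same answer with less work.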
Full Star Join Summarized
❖ Find total sales by store, item, and customer.
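[Figure: 3-D cube with axes Items (Item1 to Item4), Stores (Store1 to Store5), and Customers (Cust1 to Cust3); each cell holds the total sales for one (store, item, customer) combination, with sample cell values 10 and 65.]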
OLAP Queries – Roll-up
❖ Roll-up allows you to summarize data by:
– changing the level of granularity of a
particular dimension
– dimension reduction
Roll-up Example 1 (Hierarchy)
❖ Use Roll-up on total sales by store, item, and
customer to find total sales by item and
customer for each county.
Total sales by store, item, and customer:
SELECT storeID, itemID, custID, SUM(price)
FROM Sales F
GROUP BY storeID, itemID, custID;

Rolled up from store to county:
SELECT county, itemID, custID, SUM(price)
FROM Sales F, Store S
WHERE F.storeID = S.storeID
GROUP BY county, itemID, custID;
[Figure: the Stores axis (Store1 to Store5) of the cube is replaced by counties (Santa Clara, San Mateo, King).]
Roll-up Example 2 (Hierarchy)
❖ Use Roll-up on total sales by item, customer,
and county to find total sales by item, gender
and county.
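Total sales by item, customer, and county:
SELECT county, itemID, custID, SUM(price)
FROM Sales F, Store S
WHERE F.storeID = S.storeID
GROUP BY county, itemID, custID;

Rolled up from customer to gender:
SELECT county, itemID, gender, SUM(price)
FROM Sales F, Store S, Customer C
WHERE F.storeID = S.storeID AND
F.custID = C.custID
GROUP BY county, itemID, gender;
[Figure: the Customers axis (Cust1 to Cust3) of the cube is replaced by gender (female, male).]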
Roll-up Example 3 (Dimension)
❖ Use Roll-up on total sales by item, gender and
county to find total sales by item for each county.
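Total sales by item, gender, and county:
SELECT county, itemID, gender, SUM(price)
FROM Sales F, Store S, Customer C
WHERE F.storeID = S.storeID AND
F.custID = C.custID
GROUP BY county, itemID, gender;

Rolled up by dropping the customer (gender) dimension:
SELECT county, itemID, SUM(price)
FROM Sales F, Store S
WHERE F.storeID = S.storeID
GROUP BY county, itemID;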
Drill-down Example 1 (Hierarchy)
❖ Use Drill-down on total sales by item and
gender for each county to find total sales by
item and gender for each city.
Total sales by item and gender for each county:
SELECT county, itemID, gender, SUM(price)
FROM Sales F, Store S, Customer C
WHERE F.storeID = S.storeID AND
F.custID = C.custID
GROUP BY county, itemID, gender;

Drilled down from county to city:
SELECT city, itemID, gender, SUM(price)
FROM Sales F, Store S, Customer C
WHERE F.storeID = S.storeID AND
F.custID = C.custID
GROUP BY city, itemID, gender;
[Figure: the county axis (Santa Clara, San Mateo, King) is replaced by cities (Belmont, Redmond, Seattle, Palo Alto, Menlo Park).]
Drill-down Example 2 (Dimension)
❖ Use Drill-down on total sales by item and
county to find total sales by item and gender for
each county.
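Total sales by item and county:
SELECT county, itemID, SUM(price)
FROM Sales F, Store S
WHERE F.storeID = S.storeID
GROUP BY county, itemID;

Drilled down by adding the gender dimension:
SELECT county, itemID, gender, SUM(price)
FROM Sales F, Store S, Customer C
WHERE F.storeID = S.storeID AND
F.custID = C.custID
GROUP BY county, itemID, gender;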
Slicing Example 1
❖ Use Slicing on total sales by item and gender for each
county to find total sales by item and gender for Santa
Clara.
Total sales by item and gender for each county:
SELECT county, itemID, gender, SUM(price)
FROM Sales F, Store S, Customer C
WHERE F.storeID = S.storeID AND
F.custID = C.custID
GROUP BY county, itemID, gender;

A slice on the Item dimension (category = 'T-shirt'), giving total sales by county and gender:
SELECT county, gender, SUM(price)
FROM Sales F, Store S, Customer C, Item I
WHERE F.storeID = S.storeID AND
F.custID = C.custID AND
F.itemID = I.itemID AND
category = 'T-shirt'
GROUP BY county, gender;
[Figure: the cube reduced to a single T-shirt slice, with gender (Female, Male) by county (Santa Clara, San Mateo, King).]
OLAP Queries – Dicing
❖ The dice operation produces a sub-cube by
picking specific values for multiple dimensions.
[Figure: cube of total sales by city (Belmont, Redmond, Seattle, Palo Alto, Menlo Park), item (Item1 to Item4), and gender (Female, Male).]
SELECT city, itemID, gender, SUM(price)
FROM Sales F, Store S, Customer C
WHERE F.storeID = S.storeID AND
F.custID = C.custID
GROUP BY city, itemID, gender;
Dicing Example 1
❖ Use Dicing on total sales by gender, item, and
city to find total sales by gender, category, and
city for red items in the state of California (CA).
Total sales by gender, item, and city:
SELECT city, itemID, gender, SUM(price)
FROM Sales F, Store S, Customer C
WHERE F.storeID = S.storeID AND
F.custID = C.custID
GROUP BY city, itemID, gender;

Diced to red items in California:
SELECT category, city, gender, SUM(price)
FROM Sales F, Store S, Customer C, Item I
WHERE F.storeID = S.storeID AND
F.custID = C.custID AND
F.itemID = I.itemID AND
color = 'red' AND state = 'CA'
GROUP BY category, city, gender;
[Figure: the sub-cube keeps only the CA cities (Belmont, Palo Alto, Menlo Park) and the categories T-shirt and Jacket.]
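Slicing and dicing differ only in how many dimensions the WHERE clause constrains. The sketch below runs a Santa Clara slice and a red-items-in-CA dice against a toy database with Python's sqlite3 module. All rows are invented for illustration, and the colour is stored and compared as 'Red' because the string comparison is case-sensitive here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Store(storeID INTEGER PRIMARY KEY, city TEXT, county TEXT, state TEXT);
CREATE TABLE Item(itemID INTEGER PRIMARY KEY, category TEXT, color TEXT);
CREATE TABLE Customer(custID INTEGER PRIMARY KEY, cname TEXT, gender TEXT, age INTEGER);
CREATE TABLE Sales(storeID INTEGER, itemID INTEGER, custID INTEGER, price REAL);
-- Invented sample rows.
INSERT INTO Store VALUES (1,'Palo Alto','Santa Clara','CA'),
                         (2,'Belmont','San Mateo','CA'), (3,'Seattle','King','WA');
INSERT INTO Item VALUES (1,'T-shirt','Red'), (2,'Jacket','Blue'), (3,'Jacket','Red');
INSERT INTO Customer VALUES (1,'Ann','F',20), (2,'Bob','M',25);
INSERT INTO Sales VALUES (1,1,1,10), (1,2,2,60), (2,3,1,50), (3,1,2,12), (2,1,2,9);
""")

# Slice: fix ONE dimension value (county = 'Santa Clara').
slice_rows = conn.execute("""
SELECT itemID, gender, SUM(price)
FROM Sales F, Store S, Customer C
WHERE F.storeID = S.storeID AND F.custID = C.custID
  AND county = 'Santa Clara'
GROUP BY itemID, gender
ORDER BY itemID, gender
""").fetchall()

# Dice: restrict SEVERAL dimensions at once (red items, stores in CA).
dice_rows = conn.execute("""
SELECT category, city, gender, SUM(price)
FROM Sales F, Store S, Customer C, Item I
WHERE F.storeID = S.storeID AND F.custID = C.custID AND F.itemID = I.itemID
  AND color = 'Red' AND state = 'CA'
GROUP BY category, city, gender
ORDER BY category, city, gender
""").fetchall()
```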
Clicker Question
❖ Consider a fact table Sales(saleID, itemID, color, size, qty,
unitPrice), and the following three queries:
❖ Q1: SELECT itemID, color, size, Sum(qty*unitPrice) FROM
Sales GROUP BY itemID, color, size
❖ Q2: SELECT itemID, size, Sum(qty*unitPrice) FROM Sales
GROUP BY itemID, size
❖ Q3: SELECT itemID, size, Sum(qty*unitPrice) FROM Sales
WHERE size < 10 GROUP BY itemID, size
Pivoting Example 1
❖ From total sales by store and customer pivot to
find total sales by item and store.
[Figure: the cube face showing customers (Cust1, Cust2) by stores (Store1 to Store5) is rotated to the face showing items (Item1 to Item4) by stores.]
SELECT storeID, itemID, SUM(price)
FROM Sales
GROUP BY storeID, itemID;
Aggregating over Multiple Fact Tables
❖ Total sales by item:
SELECT itemID, SUM(price)
FROM Sales
GROUP BY itemID;
❖ Total sales by customer:
SELECT custID, SUM(price)
FROM Sales
GROUP BY custID;
❖ Total sales by store:
SELECT storeID, SUM(price)
FROM Sales
GROUP BY storeID;
❖ Grand total:
SELECT SUM(price)
FROM Sales;
[Figure: the cube (Items, Stores, Customers) projected onto each single axis and, finally, onto a single total cell.]
Data Cube
❖ A data cube is a k-dimensional object containing
both fact data and dimensions.
[Figure: 3-D cube with axes Items (Item1 to Item4), Stores (Store1 to Store5), and Customers (Cust1 to Cust3).]
❖ A cube contains pre-calculated, aggregated,
summary information to yield fast queries.
Data Cube (cont.)
❖ The small, individual blocks in the multidimensional cube
are called cells, and each cell is uniquely identified by the
members from each dimension.
[Figure: the same 3-D cube, with axes Items (Item1 to Item4), Stores (Store1 to Store5), and Customers (Cust1 to Cust3).]
Clicker Question
❖ If we have 2 stores, 5 items, and 10 customers, how
many potential "entries" are there in the data cube?
(The cube diagram is just an arbitrary example.)
❖ A: 17
❖ B: 100
❖ C: 117
❖ D: 198
❖ E: none of the above
❖ Answer: D (198).
❖ One way: 2*5*10 + 2*5 + 2*10 + 5*10 + 2 + 5 + 10 + 1 = 198
❖ Another way: (2+1) * (5+1) * (10+1) = 3 * 6 * 11 = 198
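Both counting arguments can be checked mechanically. In the short sketch below (the variable names are illustrative only), the first count adds one extra "ALL" member to every dimension, and the second sums the cell counts of every group-by in which each dimension is either kept or rolled up.

```python
from itertools import product
from math import prod

sizes = {"store": 2, "item": 5, "customer": 10}

# Way 1: every dimension gains one extra "ALL" member (the rolled-up value).
by_formula = prod(n + 1 for n in sizes.values())

# Way 2: sum the cell counts of every group-by (each dimension kept or dropped).
by_enumeration = sum(
    prod(n for n, keep in zip(sizes.values(), mask) if keep)
    for mask in product([False, True], repeat=len(sizes))
)
```

The two counts agree because expanding the product (2+1)(5+1)(10+1) yields exactly one term per kept/dropped combination.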
Clicker Question
❖ How many standard SQL queries are required
for computing all of the cells of the cube?
❖ A: 2
❖ B: 4
❖ C: 6
❖ D: 8
❖ E: 10
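With plain GROUP BY, each subset of the dimensions needs its own query, so a cube over k dimensions takes 2^k queries (8 for the three dimensions here). The sketch below generates and runs them with Python's sqlite3 module on a few invented rows. The SQL:1999 GROUP BY CUBE construct mentioned in the learning goals expresses the same set of groupings in one statement; this sketch does not rely on it.

```python
import sqlite3
from itertools import combinations

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Sales(storeID INTEGER, itemID INTEGER, custID INTEGER, price REAL)"
)
# Invented sample rows.
conn.executemany(
    "INSERT INTO Sales VALUES (?, ?, ?, ?)",
    [(1, 1, 1, 10.0), (1, 2, 1, 20.0), (2, 1, 2, 5.0)],
)

dims = ["storeID", "itemID", "custID"]
cube = {}  # maps each tuple of grouping columns to that query's result rows
for k in range(len(dims), -1, -1):
    for group in combinations(dims, k):
        cols = ", ".join(group)
        sql = f"SELECT {cols + ', ' if group else ''}SUM(price) FROM Sales"
        if group:
            sql += f" GROUP BY {cols}"
        cube[group] = conn.execute(sql).fetchall()
```

Running the finer group-bys first mirrors the usual optimization that coarser views can be computed from finer ones, although this sketch simply re-reads the fact table each time.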
Snowflake Schema
❖ The alternative organization is a snowflake schema:
– each dimension is normalized into a set of tables
– usually, one table per level of hierarchy, per dimension
❖ Example: TIMES table would be split into:
– TIMES(timeid, date)
– DWEEK(date, week)
– DMONTH(date, month)
❖ Snowflake schema features:
– Query formulation is inherently more complex (possibly many joins
per dimension).
❖ Neither schema is fully satisfactory for OLAP applications.
❖ The star schema is more popular, and is gaining interest.
Snowflake Schema example
Source: http://en.wikipedia.org/wiki/Snowflake_schema
Star vs. Snowflake
Ease of maintenance
– Star: Has redundant data and hence is less easy to maintain/change.
– Snowflake: No redundancy; schemas are easier to maintain and change.
Ease of use
– Star: Less complex query writing; easier to understand.
– Snowflake: More complex queries and hence less easy to understand.
Query performance
– Star: Fewer foreign keys and hence shorter query execution time (faster).
– Snowflake: More foreign keys and hence longer query execution time (slower).
Joins
– Star: Fewer joins.
– Snowflake: More joins.
Dimension table
– Star: A single dimension table for each dimension.
– Snowflake: May have more than one dimension table for each dimension.
When to use
– Star: The star schema is the default choice.
– Snowflake: When the dimension table is relatively big in size, or we
expect a lot of updates.
Normalization
– Star: Dimension tables are not normalized.
– Snowflake: Dimension tables are normalized.
Measures in Fact Tables
❖ Additive facts are measurements in a fact table that
can be added across all the dimensions. e.g., Price
❖ Semi-additive facts are numeric facts that can be
added along some dimensions in a fact table but
not others.
– balance amounts are common semi-additive facts
because they are additive across all dimensions except
time.
❖ Non-additive facts cannot logically be added
between rows.
– Ratios and percentages
– A good approach for non-additive facts is to store the
fully additive components and later compute the final
non-additive fact.
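The advice to store additive components can be seen with a small worked example. The numbers and the margin measure below are invented for illustration: the ratio computed from summed components differs from the average of per-row ratios.

```python
# Hypothetical fact rows: (store, revenue, cost). Revenue and cost are fully
# additive, so they are the columns that get stored in the fact table.
rows = [("s1", 100.0, 80.0), ("s2", 50.0, 10.0)]

revenue = sum(r[1] for r in rows)
cost = sum(r[2] for r in rows)

# Correct: compute the ratio once, from the summed additive components.
overall_margin = (revenue - cost) / revenue

# Misleading: averaging the per-row ratios gives a different number.
avg_of_row_margins = sum((r[1] - r[2]) / r[1] for r in rows) / len(rows)
```

Here the overall margin is 60/150 = 0.4, while the average of the row margins (0.2 and 0.8) is 0.5, because the small store is weighted as heavily as the large one.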
Factless Fact Tables
❖ Factless fact table: A fact table that has no facts but
captures certain many-to-many relationships
between the dimension keys. It’s most often used
to represent events or to provide coverage
information that does not appear in other fact
tables.
Factless Fact Table Example
MOLAP vs. ROLAP
Data compression
– MOLAP: Can require up to 50% less disk space. A special technique is used
for storing sparse cubes.
– ROLAP: Requires more disk space.
Query performance
– MOLAP: Fast query performance due to optimized storage,
multidimensional indexing, and caching.
– ROLAP: Not suitable when the model is heavy on calculations; this
doesn’t translate well into SQL.
Data latency
– MOLAP: Data loading can be quite lengthy for large data volumes. This is
usually remedied by doing only incremental processing.
– ROLAP: As data always gets fetched from a relational source, data
latency is small or none.
Handling non-aggregatable facts
– MOLAP: Tends to suffer from slow performance when dealing with
textual descriptions.
– ROLAP: Better at handling textual descriptions.
Which Storage Mode is Recommended?
❖ Almost always, choose MOLAP.
❖ Choose ROLAP if one or more of these are true:
– There is a very large number of members in a dimension
—typically hundreds of millions of members.
– The dimension data is frequently changing.
– You need real-time access to current data (as opposed to
historical data).
– You don’t want to duplicate data.
◆ Reference: [Harinath, et al., 2009]
Queries over Views
❖ How does using a view work?
CREATE VIEW TshirtSales AS
SELECT category, county, gender, price
FROM Sales F, Store S, Customer C, Item I
WHERE F.storeID = S.storeID AND F.custID = C.custID AND
F.itemID = I.itemID AND category = 'T-shirt';
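One way to see how a (non-materialized) view behaves is to query it before and after a change to the fact table. The sketch below uses Python's sqlite3 module with invented rows and spells the category 'T-shirt'. Because the view is just a stored query that is expanded whenever it is used, the second result reflects the new sale without any refresh step.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Store(storeID INTEGER PRIMARY KEY, city TEXT, county TEXT, state TEXT);
CREATE TABLE Item(itemID INTEGER PRIMARY KEY, category TEXT, color TEXT);
CREATE TABLE Customer(custID INTEGER PRIMARY KEY, cname TEXT, gender TEXT, age INTEGER);
CREATE TABLE Sales(storeID INTEGER, itemID INTEGER, custID INTEGER, price REAL);
-- Invented sample rows.
INSERT INTO Store VALUES (1,'Palo Alto','Santa Clara','CA'), (2,'Seattle','King','WA');
INSERT INTO Item VALUES (1,'T-shirt','Red'), (2,'Jacket','Blue');
INSERT INTO Customer VALUES (1,'Ann','F',20);
INSERT INTO Sales VALUES (1,1,1,10), (2,1,1,12), (1,2,1,60);

CREATE VIEW TshirtSales AS
SELECT category, county, gender, price
FROM Sales F, Store S, Customer C, Item I
WHERE F.storeID = S.storeID AND F.custID = C.custID AND
      F.itemID = I.itemID AND category = 'T-shirt';
""")

# A query over the view is written exactly like a query over a table.
QUERY = "SELECT county, SUM(price) FROM TshirtSales GROUP BY county ORDER BY county"
before = conn.execute(QUERY).fetchall()

conn.execute("INSERT INTO Sales VALUES (1, 1, 1, 5)")  # new T-shirt sale in Palo Alto
after = conn.execute(QUERY).fetchall()
```

A materialized view would instead store the rows of TshirtSales, trading this automatic freshness for faster reads.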
Benefit of Materializing a View
[Lattice of views, top to bottom, each with its size in rows]
{S, T, C} 6M
{S, T} 0.8M    {S, C} 6M    {T, C} 6M
{S} 0.01M    {T} 0.2M    {C} 0.1M
{} 1
❖ The number associated with each node represents the number of
rows in that view (M = millions).
❖ Initial state has only the topmost view materialized.
❖ Define the benefit (savings) of view v relative to S as B(v,S).
– S = set of views selected for materialization
– b ≦ a means b is a descendant of a (including itself)
– C(v) = cost of view v, which we’re approximating by its size
B(v,S) = 0
For each w ≦ v
  u = view of least cost in S such that w ≦ u
  if C(v) < C(u) then Bw = C(u) – C(v)
  else Bw = 0
  B(v,S) = B(v,S) + Bw
end
Benefit of Materializing a View
❖ Example: S = { {S, T, C} }, v = {S, T}
– B{S, T} = 6M – 0.8M = 5.2M
– B{S} = 5.2M
– B{T} = 5.2M
– B{} = 5.2M
– B(v,S) = 5.2M * 4 = 20.8M
Finding the Best k Views to Materialize
❖ A greedy algorithm for finding the best k views to materialize
(same lattice as before; initially only the topmost view is materialized):
S = {top view}
for i = 1 to k do begin
  select v ∉ S such that B(v,S) is maximized
  S = S union {v}
end
HRU Algorithm Example
View     1st choice                  2nd choice
{S, T}   (6 – 0.8)M * 4 = 20.8M
{S, C}   (6 – 6)M * 4 = 0            (6 – 6)M * 2 = 0
{T, C}   (6 – 6)M * 4 = 0            (6 – 6)M * 2 = 0
{S}      (6 – 0.01)M * 2 = 11.98M    (0.8 – 0.01)M * 2 = 1.58M
{T}      (6 – 0.2)M * 2 = 11.6M      (0.8 – 0.2)M * 2 = 1.2M
{C}      (6 – 0.1)M * 2 = 11.8M      (6 – 0.1)M + (0.8 – 0.1)M = 6.6M
{}       6M – 1                      0.8M – 1
❖ For k=2, other than {S, T, C}, {S, T} and {C} will be
materialized.
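The benefit computation and the greedy selection can be checked with a short script. This is a sketch under the slides' cost model (cost of a view = its row count), with views represented as sets of dimension letters so that "w ≦ v" becomes a subset test. The function names are illustrative, not from any library.

```python
# View sizes in rows, copied from the lattice (0.8M = 800,000 rows, and so on).
cost = {
    frozenset("STC"): 6_000_000, frozenset("ST"): 800_000,
    frozenset("SC"): 6_000_000, frozenset("TC"): 6_000_000,
    frozenset("S"): 10_000, frozenset("T"): 200_000,
    frozenset("C"): 100_000, frozenset(): 1,
}

def benefit(v, selected):
    """B(v, S): total saving over all views w that can be answered from v."""
    total = 0
    for w in cost:
        if w <= v:  # subset test: w is a descendant of v (or v itself)
            cheapest = min(cost[u] for u in selected if w <= u)
            total += max(0, cheapest - cost[v])
    return total

def greedy(k):
    """Pick k views, each time taking the one with the largest benefit."""
    selected = [frozenset("STC")]  # the top view starts out materialized
    for _ in range(k):
        best = max((v for v in cost if v not in selected),
                   key=lambda v: benefit(v, selected))
        selected.append(best)
    return selected

chosen = greedy(2)
```

The first round reproduces the 20.8M benefit of {S, T}; in the second round {C} wins with 6.6M, matching the table.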
In-class Exercise
❖ Assuming 'a' is already materialized, what are the
best 3 other views that we should materialize?
[Lattice of views, top to bottom, each with its size]
a 100
b 50    c 75
d 20    e 30    f 40
g 1    h 10
i 1
In-class Exercise
❖ Solution: benefit of each candidate view in each round
(a blank entry means the view has already been selected).
View  1st choice  2nd choice   3rd choice
b     50*6=300
c     25*6=150    25*2=50      25*2=50
d     80*3=240    30*3=90      30
e     70*4=280    20*4=80      20*2=40
f     60*3=180    60+10*2=80   60+10=70
g     99*2=198    49*2=98
h     90*2=180    40*2=80      40
i     99          49           0
❖ Taking the largest benefit in each round selects b, then g, then f.
Using the Materialized Views
❖ Once we have chosen a set of views, we need to
consider how they can be used to answer queries on
other views.
❖ What is the best way to answer queries on view ‘h’?
[Lattice of views, top to bottom, each with its size]
a 100
b 50    c 75
d 20    e 30    f 40
g 1    h 10
i 1
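A query on a view can be answered from any materialized view that is an ancestor of it (or the view itself), and the cheapest such view is the natural choice. The sketch below assumes the lattice edges listed in `children`. They are not stated in the text; they are inferred from the descendant counts used in the in-class exercise (for example, b has 6 descendants including itself). The materialized set {a, b, g, f} is the one that the largest benefits in the exercise table lead to.

```python
# ASSUMPTION: edges inferred from the exercise's descendant counts.
children = {
    "a": ["b", "c"], "b": ["d", "e"], "c": ["e", "f"], "d": ["g"],
    "e": ["g", "h"], "f": ["h"], "g": ["i"], "h": ["i"], "i": [],
}
size = {"a": 100, "b": 50, "c": 75, "d": 20, "e": 30,
        "f": 40, "g": 1, "h": 10, "i": 1}

def descendants(v):
    """All views that can be answered from v, including v itself."""
    seen, stack = set(), [v]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(children[node])
    return seen

def best_source(query_view, materialized):
    """Cheapest materialized view from which query_view can be computed."""
    usable = [m for m in materialized if query_view in descendants(m)]
    return min(usable, key=size.__getitem__)

source_for_h = best_source("h", ["a", "b", "g", "f"])
```

Under these assumptions, h can be computed from a (100), b (50), or f (40), so f is chosen; g is cheaper but is not an ancestor of h.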
The Exponential Explosion of Views
❖ Assume that we have two dimensions, each with a hierarchy
Store dimension      Calendar dimension
0 storeID            0 dateID
1 city               1 month
2 state              2 year
[Lattice of views, partially shown: {storeID, dateID} at the top; {storeID, month} and {city, dateID} below it; …; {state} and {year}; {} at the bottom.]
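With hierarchies, each dimension can appear in a view at any one of its levels or be rolled up entirely, so the number of views is the product of (levels + 1) over the dimensions. For the two three-level hierarchies above that is 4 * 4 = 16, compared with 2^2 = 4 if the dimensions had no hierarchy. A quick enumeration (using None as an illustrative marker for "rolled up entirely"):

```python
from itertools import product

# None stands for "this dimension is rolled up entirely" (the ALL level).
store_levels = ["storeID", "city", "state", None]
calendar_levels = ["dateID", "month", "year", None]

views = [
    tuple(level for level in combo if level is not None)
    for combo in product(store_levels, calendar_levels)
]
```

Each added dimension multiplies the count again, which is why only a subset of the views can realistically be materialized.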
Issues in View Materialization (2)
❖ What indexes should we build on the materialized
views?
– No index is good for all queries.
❖ Consider the ItemCustSales view, which involves a
join of Item, Customer, and Sales. Let’s assume that
we use (category, gender, price) as our index.
Query 1 (an index on the pre-computed view is a good idea):
SELECT gender, SUM(price)
FROM Sales F, Customer C, Item I
WHERE F.custID = C.custID AND
F.itemID = I.itemID AND
category = 'T-shirt'
GROUP BY gender

Query 2 (the index is less useful; must scan the entire index):
SELECT category, SUM(price)
FROM Sales F, Customer C, Item I
WHERE F.custID = C.custID AND
F.itemID = I.itemID AND
gender = 'M'
GROUP BY category
Issues in View Materialization (3)
❖ How do we maintain views incrementally without
re-computing them from scratch?
❖ Two steps:
– Identify the changes to the view when the data
changes.
– Apply only those changes to the materialized view.
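For a SUM view, the two steps are simple because each inserted or deleted base row changes exactly one group's total. The sketch below keeps a materialized "total sales per store" view as an in-memory dictionary. The function name and the row format are assumptions made for illustration.

```python
from collections import defaultdict

# Materialized view: total sales per store, i.e. the stored result of
#   SELECT storeID, SUM(price) FROM Sales GROUP BY storeID
view = defaultdict(float)

def apply_delta(inserted=(), deleted=()):
    """Apply only the changed base rows (storeID, price) to the stored sums."""
    for store_id, price in inserted:
        view[store_id] += price
    for store_id, price in deleted:
        view[store_id] -= price

apply_delta(inserted=[(1, 10.0), (1, 5.0), (2, 7.0)])  # initial load
apply_delta(inserted=[(2, 3.0)], deleted=[(1, 5.0)])    # later change
```

Not every aggregate is this easy to maintain: a MIN or MAX view may need to look at the base data again when the current extreme value is deleted.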
Issues in View Materialization (4)
❖ How should we refresh and maintain a
materialized view when an underlying table is
modified?
❖ Maintenance policy: Controls when we refresh
– Immediate: As part of the transaction that modifies the
underlying data tables
  + Materialized view is always consistent
  - Updates are slow
– Deferred: Some time later, in a separate transaction
  - View is inconsistent for a while
  + Can scale to maintain many views without slowing
  updates
Deferred Maintenance
❖ Three flavors:
– Lazy: Delay refresh until next query on view; then refresh
before answering the query.
◆ This approach slows down queries rather than
updates, in contrast to immediate maintenance.
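The lazy policy can be sketched as a view object that only marks itself stale on updates and refreshes on the next query. The class below is a hypothetical illustration; for brevity its refresh recomputes the totals from scratch rather than applying deltas.

```python
class LazyView:
    """Deferred (lazy) maintenance: updates only mark the view stale."""

    def __init__(self, base_rows):
        self.base_rows = list(base_rows)  # (storeID, price) rows of the base table
        self.totals = {}
        self.stale = True
        self.refresh_count = 0

    def insert(self, row):
        self.base_rows.append(row)  # cheap: no view work at update time
        self.stale = True

    def query(self, store_id):
        if self.stale:  # pay the refresh cost at query time
            self._refresh()
        return self.totals.get(store_id, 0.0)

    def _refresh(self):
        self.totals = {}
        for store_id, price in self.base_rows:
            self.totals[store_id] = self.totals.get(store_id, 0.0) + price
        self.stale = False
        self.refresh_count += 1

v = LazyView([(1, 10.0)])
v.insert((1, 5.0))
v.insert((2, 7.0))
first = v.query(1)   # triggers one refresh that covers both inserts
second = v.query(2)  # the view is fresh, so no further refresh happens
```

Two inserts cost a single refresh here, which is the batching effect that makes deferred maintenance attractive when updates are frequent and queries are rare.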
Top N Queries
❖ For complex queries, users like to get an approximate
answer quickly and keep refining their query.
Multidimensional Expressions (MDX)
❖ Multidimensional Expressions (MDX) is a query
language for OLAP databases, much like SQL is a
query language for relational databases.