OLAP and Data Warehousing

Slides courtesy of:


Julia Stoyanovitch
Columbia University

Surajit Chaudhuri
Microsoft Research, Redmond, WA, USA

Umeshwar Dayal
Hewlett-Packard Labs., Palo Alto, CA, USA
What is OLAP?
 On-Line Analytical Processing
 Information technology to help the knowledge
worker (executive, manager, analyst) make faster
and better decisions.
 OLAP is an element of decision support systems
(DSS).

© Surajit Chaudhuri, Umeshwar Dayal


Running Example: Car Sales
 Cars: carId, make, model, color

 Dealers: dealerId, city, state

 Time of Sale: tid, year, month, day

 Sales: carId, dealerId, tid, price

OLTP Queries: Examples
 Create a new sales record indicating that a red VW Golf was sold in Boston, MA
 See how many black and silver VW Passats were sold at dealership #123 on April 11th, 2005

OLAP Queries: Examples
 Analyze comparative sales of the different colors of VW Golf by state
 See which months are particularly favorable to the sale of different VW models and colors
 Rank VW dealerships by revenue, displaying a ranked list of dealerships and % differences in sales between each dealership and the one ranked 1 place higher

OLAP vs. OLTP
                     OLTP                         OLAP
 User                Clerk, IT professional       Knowledge worker
 Function            Day-to-day operations        Decision support
 DB design           Application-oriented         Subject-oriented
                     (E-R based)                  (star, snowflake)
 Data                Current, isolated            Historical, consolidated
 View                Detailed, flat relational    Summarized, multidimensional
 Usage               Structured, repetitive       Ad hoc
 Unit of work        Short, simple transaction    Complex query
 Access              Read/write                   Read mostly
 Operations          Index/hash on prim. key      Lots of scans
 # Records accessed  Tens                         Millions
 # Users             Thousands                    Hundreds
 Db size             100 MB - GB                  100 GB - TB
 Metric              Trans. throughput            Query throughput, response

OLAP Queries: Challenges
 Many AND, OR in the WHERE clause
 Self-join, nested sub-queries
» Last year’s sales vs this year’s sales for each product
» Show reps for whom every sale has been more than $15000
 Extensive use of aggregation, often on related datasets
 Aggregation over time periods
 Ranking
 Use of statistical functions
 Very large datasets
 Expectation of an interactive response time

OLAP Query Tools
 Goal of OLAP is to support ad-hoc querying for
the business analyst (Power user)
 Business analysts are familiar with spreadsheets
 Extend spreadsheet analysis model to work with
warehouse data
» Large data set
» Semantically enriched to understand business terms
(e.g., time, geography)
» Combined with reporting features
 Multidimensional view of data is the foundation of
OLAP.
Multidimensional Data Model
 Database is a set of facts (points) in a
multidimensional space
 A fact has a measure dimension
» quantity that is analyzed, e.g., sale amount, budget
 A set of dimensions with respect to which data is
analyzed
» e.g., store, product, date associated with a sale amount
 Dimensions form a sparsely populated coordinate
system
 Each dimension has a set of attributes
» e.g., owner, city and county of store



Attribute Hierarchies
 Attributes of a dimension may be related
 An m:1 dependency is most common
 Dependency graph may be:
» Hierarchy: e.g.,
city -> state -> country
» Lattice:
date -> month -> year
date -> week -> year
 Hierarchies are most common
 Dependencies influence choice of operations and
data representation



Multidimensional Data
 Sales volume as a function of product, time, geography
 Dimensions: Color, State, Date
 Attributes: date (year, month, day)
 [Figure: data cube with axes Color (Red, Green, Blue, White, Silver, Black), State (WI, CA, NY) and Date (1 to 7); visible cells: Red 10, Green 50, Blue 20, White 12, Silver 15, Black 10. Fact data: sales volume in $100.]
 Attribute hierarchies and lattice (levels listed from coarsest to finest, as drawn):
» Industry, Category, Product
» Country, State, City
» Year, Quarter, Month / Week, Date

ROLAP and MOLAP
 Relational OLAP (ROLAP)
» Relational and Specialized Relational DBMS to store
and manage warehouse data
» OLAP middleware to support missing pieces
– Optimize for each DBMS backend
– Aggregation Navigation Logic
– Additional tools and services
 Multidimensional OLAP (MOLAP)
» Array-based storage structures
» Direct access to array data structures



Multiple Aggregations
 Create a 2-dimensional spreadsheet that shows
sum of sales by year as well as by model of car
 Each subtotal requires a separate aggregate query

[Figure: spreadsheet with STATE across the columns and YEAR down the rows; the right margin holds the sum by year and the bottom margin holds the sum by state.]


Example: Multiple Aggregations

         WI    CA    Total
 2003    63    81    144
 2004    38    107   145
 2005    75    35    110
 Total   176   223   399

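The cross-tab above needs one aggregate query per margin. A minimal sketch, assuming Python's built-in sqlite3 module and a hypothetical flattened table `sales(year, state, amount)` that holds the six cell values of the example (the table and column names are illustrative and do not come from the slides):

```python
import sqlite3

# Hypothetical flattened fact table holding the six cells of the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (year INTEGER, state TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(2003, "WI", 63), (2003, "CA", 81),
     (2004, "WI", 38), (2004, "CA", 107),
     (2005, "WI", 75), (2005, "CA", 35)],
)

# Each margin of the spreadsheet is a separate aggregate query.
by_year = dict(conn.execute("SELECT year, SUM(amount) FROM sales GROUP BY year"))
by_state = dict(conn.execute("SELECT state, SUM(amount) FROM sales GROUP BY state"))
grand_total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]

print(by_year)
print(by_state)
print(grand_total)
```

The three results reproduce the Total column (144, 145, 110), the Total row (176, 223) and the grand total (399) of the table.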
Generalization: The Data Cube
 Base tuples
 Aggregate tuples:
» one aggregation for each subset of dimensions
(powerset)
» exponential number of subsets, but can optimize the
computation
 Example
» N = 3 dimensions
– model = {Golf, Jetta}
– color = {red, black, white}
– state = {NY, CA, WI}
» How many aggregate tuples in the data cube?
– face – 1D agg; edge – 2D agg; corner – 3D agg
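One way to answer the counting question is to enumerate the cube directly. The sketch below assumes that every combination of model, color and state occurs in the base data (a dense cube) and uses "*" to mark a dimension that has been aggregated away; with sparse base data the counts would be smaller.

```python
from itertools import product

# Domains from the example; "*" marks a dimension that has been aggregated away.
model = ["Golf", "Jetta"]
color = ["red", "black", "white"]
state = ["NY", "CA", "WI"]

# Every cube tuple picks either a concrete value or "*" in each dimension.
cube = list(product(model + ["*"], color + ["*"], state + ["*"]))

base = [t for t in cube if "*" not in t]
aggregates = [t for t in cube if "*" in t]

# Group the aggregate tuples by how many dimensions were aggregated away.
by_stars = {k: sum(1 for t in aggregates if t.count("*") == k) for k in (1, 2, 3)}

print(len(cube), len(base), len(aggregates))  # 48 18 30
print(by_stars)                               # {1: 21, 2: 8, 3: 1}
```

Under that assumption the cube has (2+1)(3+1)(3+1) = 48 tuples, of which 18 are base tuples and 30 are aggregates: 21 with one dimension aggregated, 8 with two, and 1 grand total.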

Operations on Multidimensional Data Model
 Aggregation (roll-up) of detailed data to create summary
data
 Navigation to detailed data (drill-down) from summary
 Selection (slice) defines a subcube
– Project the cube on fewer dimensions by specifying
coordinates of remaining dimensions
– e.g., sales where state = NY and month = Jan
 Calculation
– Within a dimension, e.g., (sales - expense) by state
– Across dimensions
 Ranking
– top 3% of states by average sales
 Window Queries
Roll-up and Drill-Down
 Roll-Up: Use of aggregation
» dimension reduction:
– e.g., total sales by state by color
– e.g., total sales by state
» navigating attribute hierarchy:
– e.g., sales by city -> total sales by state -> total sales by
country
– e.g., total sales by city and year -> total sales by state and year
-> total sales by country
 Drill-Down: Inverse operation of roll-up
» Provides the data set that was aggregated
– e.g., show “base” data for total sales figure for CA state



Slice and Dice
 What colors of Golf are not doing so well?

SELECT c.color, SUM(s.price)
FROM Sales s JOIN Cars c ON s.carId = c.carId
WHERE c.model = 'Golf'      -- slicing
GROUP BY c.color            -- dicing

 Keep slicing if results are uniform

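The slice-and-dice query can be tried against the Cars and Sales tables of the running example. A minimal sketch, assuming Python's built-in sqlite3 module, a join on carId to reach the model and color attributes, and a handful of made-up rows that are purely illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Cars  (carId INTEGER PRIMARY KEY, make TEXT, model TEXT, color TEXT);
CREATE TABLE Sales (carId INTEGER, dealerId INTEGER, tid INTEGER, price INTEGER);
-- Made-up rows, for illustration only.
INSERT INTO Cars VALUES (1, 'VW', 'Golf', 'red'),
                        (2, 'VW', 'Golf', 'black'),
                        (3, 'VW', 'Jetta', 'red');
INSERT INTO Sales VALUES (1, 10, 100, 20000), (1, 11, 101, 21000),
                         (2, 10, 102, 19000), (3, 10, 103, 25000);
""")

rows = conn.execute("""
    SELECT c.color, SUM(s.price)
    FROM Sales s JOIN Cars c ON s.carId = c.carId
    WHERE c.model = 'Golf'      -- slice: fix model = Golf
    GROUP BY c.color            -- dice: break the slice down by color
    ORDER BY c.color
""").fetchall()

print(rows)  # [('black', 19000), ('red', 41000)]
```

The Jetta sale is excluded by the slice, and the two red Golf sales are summed into one group.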
More Examples
Q: Given a query, which values from the CUBE do
we need to retrieve?

A: To answer a query Q use tuples T s.t.


» If Q groups by A, T must have a non-* value in its
component for A
» If Q slices by A = b, T must have the value b (not * or
any other value) in its component for A
» If Q neither groups nor slices by A, then T has to have
* in its component for A
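The three rules above translate into a per-dimension test on each cube tuple. The sketch below is one possible encoding, not the slides' own code: the function name `tuples_for_query` and its parameters are hypothetical, cube tuples are plain coordinate tuples, and "*" stands for an aggregated dimension.

```python
from itertools import product

def tuples_for_query(cube, dims, group_by=(), slices=None):
    """Return the cube tuples needed to answer a query.

    cube     -- iterable of coordinate tuples, "*" meaning "aggregated over"
    dims     -- dimension names, in tuple order
    group_by -- dimensions the query groups by
    slices   -- {dimension: value} equality selections of the query
    """
    slices = slices or {}
    needed = []
    for t in cube:
        for name, value in zip(dims, t):
            if name in slices:
                if value != slices[name]:   # sliced: must equal the slice value
                    break
            elif name in group_by:
                if value == "*":            # grouped: must be a concrete value
                    break
            elif value != "*":              # neither: must be aggregated away
                break
        else:
            needed.append(t)
    return needed

dims = ("model", "color", "state")
cube = list(product(["Golf", "Jetta", "*"],
                    ["red", "black", "white", "*"],
                    ["NY", "CA", "WI", "*"]))

# "Sales of Golf by color": slice on model, group by color, ignore state.
result = tuples_for_query(cube, dims, group_by={"color"}, slices={"model": "Golf"})
print(result)
# [('Golf', 'red', '*'), ('Golf', 'black', '*'), ('Golf', 'white', '*')]
```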

Pivot (Rotate)
 [Figure: cube with axes Product (Juice, Cola, Milk, Cream, Toothpaste, Soap), City (LA, SF, NY) and Month (1 to 7); visible cells: Juice 10, Cola 50, Milk 20, Cream 12, Toothpaste 15, Soap 10. The cube is rotated to show a different pair of dimensions.]
 Fact data: Sales volume in $100
 Result: cross tabulation

Warehouse Database Schema
 Entity-Relationship design techniques not
appropriate
 Design should reflect multidimensional view
 Typical schemas:
» Star Schema
» Snowflake Schema
» Fact Constellation Schema



Example of a Star Schema
 Fact table: OrderNo, SalespersonID, CustomerNo, ProdNo, DateKey, CityName, Quantity, TotalPrice
 Dimension tables:
» Order: OrderNo, OrderDate
» Customer: CustomerNo, CustomerName, CustomerAddress, City
» Salesperson: SalespersonID, SalespersonName, City, Quota
» Product: ProdNo, ProdName, ProdDescr, Category, CategoryDescr, UnitPrice, QOH
» Date: DateKey, Date, Month, Year
» City: CityName, State, Country

Star Schema and Variants
 A single fact table and a single table for each
dimension
 Generated keys are used for performance and
maintenance reasons
 Fact constellation: Multiple Fact tables that share
common dimension tables
» Example: ProjectedExpense and ActualExpense may
share dimensional tables
 Snowflake Schema: Represents dimensional
hierarchy by normalization



Example of a Snowflake Schema
 Fact table: OrderNo, SalespersonID, CustomerNo, DateKey, CityName, ProdNo, Quantity, TotalPrice
 Dimension tables:
» Order: OrderNo, OrderDate
» Customer: CustomerNo, CustomerName, CustomerAddress, City
» Salesperson: SalespersonID, SalespersonName, City, Quota
» Product: ProdNo, ProdName, ProdDescr, Category, UnitPrice, QOH
– Category: CategoryName, CategoryDescr
» Date: DateKey, Date, Month
– Month: Month, Year
– Year: Year
» City: CityName, State
– State: StateName, Country



Performance Considerations
 Denormalization for dimension tables
» Read-only data, so no update anomalies
» Fewer joins – better performance
 Pre-computation of summary tables
» Re-use can speed up performance
» How can we use pre-computed results effectively?
 Data is very large, dimension data often sparse
» Crucial to use indexes effectively
» Need for new indexing techniques: bitmap indexes, join
indexes

Bit Map Index
 An alternative representation of a RID-list
 Comparison, join and aggregation operations are reduced to bit arithmetic
 Especially advantageous for low-cardinality domains
» Significant reduction in space and I/O (30:1)
» Adapted for higher cardinality domains
» Compression (e.g., run-length encoding) exploited
» Upper bound of 2R words for any bitmap over R rows [Hasan & Sinha, 1997]



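Bitmap Index Example
 Customer table, with the gender bitmaps (M, F) shown to the left of each row:

 M F   custid name gender rating
 1 0   112    Joe  M      3
 1 0   115    Ram  M      5
 0 1   119    Sue  F      5
 1 0   116    Woo  M      4

 Rating bitmaps, one column per rating value and one row per customer in the same order:

 1 2 3 4 5
 0 0 1 0 0
 0 0 0 0 1
 0 0 0 0 1
 0 0 0 1 0
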
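The example can be replayed in a few lines. This is a minimal sketch, assuming each bitmap is stored as a Python integer whose bit i is set when row i holds the value; the helper name `build_bitmap_index` is hypothetical, and a real system would use compressed bit vectors instead.

```python
# Rows of the example customer table: (custid, name, gender, rating).
customers = [
    (112, "Joe", "M", 3),
    (115, "Ram", "M", 5),
    (119, "Sue", "F", 5),
    (116, "Woo", "M", 4),
]

def build_bitmap_index(rows, column):
    """Map each distinct value in `column` to an int whose bit i is set
    when row i holds that value."""
    index = {}
    for i, row in enumerate(rows):
        index[row[column]] = index.get(row[column], 0) | (1 << i)
    return index

gender_idx = build_bitmap_index(customers, 2)
rating_idx = build_bitmap_index(customers, 3)

# "gender = 'M' AND rating = 5" becomes a single bitwise AND.
hits = gender_idx["M"] & rating_idx[5]
matches = [customers[i][1] for i in range(len(customers)) if hits >> i & 1]
print(matches)  # ['Ram']

# Counting the rows with gender = 'M' is a population count on one bitmap.
male_count = bin(gender_idx["M"]).count("1")
print(male_count)  # 3
```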
Join Index
 A traditional index maps the value in a column to a list of rows with that value
 A join index maintains relationships between an attribute value of a dimension and the matching rows in the fact table
 A join index may span multiple dimensions (composite join index)
» Use join index to identify regions of the Cartesian product that are of interest
» Few people in Southern California may buy umbrellas



Algorithm Using Bitmapped Join Indexes
 [O’Neil&Graefe95]
 Maintain bit mapped join indexes between each
dimension table and the fact table
 To answer a query over multiple dimensions
» Take intersection of join indexes until the set of
candidate fact tuples is small
» Do foreign key joins with rest of the dimension tables
» Look up the fact table
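The intersection step can be sketched on a miniature star schema. Everything in the block is an illustrative assumption: the tables, the rows, and the helper `join_index` are hypothetical, bitmaps are plain Python integers, and the foreign key joins with the remaining dimension tables are left out.

```python
# Hypothetical miniature star schema: dimension tables keyed by id, plus fact rows.
cars = {1: {"model": "Golf", "color": "red"},
        2: {"model": "Jetta", "color": "black"}}
dealers = {10: {"state": "CA"}, 11: {"state": "NY"}}
sales = [  # (carId, dealerId, price)
    (1, 10, 20000),
    (2, 10, 25000),
    (1, 11, 21000),
    (1, 10, 19500),
]

def join_index(dim, attr, fk_pos):
    """Bitmapped join index: dimension attribute value -> bitmap over fact rows."""
    index = {}
    for i, fact in enumerate(sales):
        value = dim[fact[fk_pos]][attr]
        index[value] = index.get(value, 0) | (1 << i)
    return index

color_idx = join_index(cars, "color", 0)
state_idx = join_index(dealers, "state", 1)

# Intersect the join indexes first, then touch only the surviving fact rows.
candidates = color_idx["red"] & state_idx["CA"]
total = sum(price for i, (_, _, price) in enumerate(sales) if candidates >> i & 1)
print(total)  # 39500
```

Only fact rows 0 and 3 survive the intersection, so the fact table is read for two rows instead of four.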



Join Index over Star Schema
 [Figure: the star schema of the earlier example (fact table with Order, Customer, Salesperson, Product, Date and City dimension tables).]

ROLAP: Handling of Aggregate Views
 Important component for ROLAP Servers
 Choice of aggregate views to materialize
 Physical representation of Materialized Views in
the star schema
 Logic for Aggregation Navigation
» make optimum use of materialized aggregates to
answer a query



ROLAP: Choice of Aggregate Views to Materialize
 Storage can increase dramatically if precomputed
views are not chosen properly
 Must take into account queries in the workload,
their frequencies and their costs
 The decision must be taken in the broader context
of physical database design
» e.g., should take into account the choice of indexes
 Heuristic approaches adopted in products



ROLAP: Using Materialized Views Through Selection
 A query can use a view through a selection if
» each selection condition C on each dimension d in the query logically implies a condition C' on dimension d in the view
 Example: A view has sum(sales) by product and by year for products introduced after 1991
» OK to use for sum(sales) by product for products introduced after 1992
» CANNOT use for sum(sales) for products introduced after 1989



Using Materialized Views through Group By (Roll Up)
 The view V may be applicable via roll-up if for every grouping attribute g of the query Q:
» Q has Group By a1, ..., g, ..., an
» V has Group By a1, ..., h, ..., an
» Attribute g is higher than h in the attribute hierarchy
» Aggregation functions are distributive
 Example: Compute "sum(sales) by category" from the view "sum(sales) by product"

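The roll-up example above can be sketched as follows. The product names, subtotals and the product-to-category mapping are made up for illustration, and `roll_up` is a hypothetical helper; the point is only that a distributive function such as SUM can be re-aggregated from the finer-grained view without going back to the base data.

```python
from collections import defaultdict

# Hypothetical materialized view: sum(sales) by product.
view_sum_by_product = {"Golf": 120, "Jetta": 80, "Touareg": 200}

# Hypothetical m:1 hierarchy step: product -> category.
category_of = {"Golf": "compact", "Jetta": "compact", "Touareg": "SUV"}

def roll_up(view, parent_of):
    """Re-aggregate a SUM view one level up the attribute hierarchy."""
    out = defaultdict(int)
    for child, subtotal in view.items():
        out[parent_of[child]] += subtotal
    return dict(out)

sum_by_category = roll_up(view_sum_by_product, category_of)
print(sum_by_category)  # {'compact': 200, 'SUV': 200}
```

The same trick does not work for a non-distributive function such as AVG unless the view also stores the counts needed to recombine the partial results.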


Data Warehouse
 A decision support database that is maintained
separately from the organization’s operational
databases.
 A data warehouse is a
– subject-oriented,
– integrated,
– time-varying,
– non-volatile
collection of data that is used primarily in
organizational decision making.

-- W.H. Inmon, Building the Data Warehouse, 1992.


Why a Separate Data Warehouse?
 Performance
» Op dbs designed & tuned for known trans. workloads.
» Complex OLAP queries would degrade performance
for operational transactions.
» Special data organization, access & implementation
methods needed for multidimensional views & queries.
 Function
» Missing data: Decision support requires historical data,
which op dbs do not typically maintain.
» Data consolidation: Decision support requires data
consolidation (aggregation, summarization) from many
heterogeneous sources: op dbs, external sources.
» Data quality: Different sources typically use
inconsistent data representations, codes, and formats,
which have to be reconciled.



Data Warehousing Architecture
 [Figure: data flows from the data sources (operational dbs, external sources) through Extract, Transform, Transport into the Data Warehouse and Data Marts, which serve the OLAP Servers and the front-end tools (OLAP, Query/Reporting, Data Mining). A Metadata Repository and Monitoring & Administration accompany the warehouse.]



Data Warehouse vs. Data Marts
 Enterprise data warehouse: collects all information
about subjects (customers, products, sales, assets,
personnel) that span the entire organization.
» Requires extensive business modeling.
» May take years to design and build.
 Data Marts: Departmental subsets that focus on
selected subjects.
» Marketing data mart: customer, products, sales.
» Faster roll out, but complex integration in the long run.
 Virtual warehouse: views over operational dbs
» materialize some summary views for efficient query
processing
» easier to build
» requisite excess capacity on operational db servers.



Three-Tier Architecture
 Warehouse database server
» almost always a relational DBMS; rarely flat files.
 OLAP servers
» Relational OLAP (ROLAP): extended relational
DBMS that maps operations on multidimensional data
to standard relational operations.
» Multidimensional OLAP (MOLAP): special purpose
server that directly implements multidimensional data
and operations.
 Clients
» Query and reporting tools.
» Analysis tools.
» Data mining tools.



Populating & Refreshing the Warehouse
 Data extraction
 Data cleaning
 Data transformation
» Convert from legacy/host format to warehouse format
 Load
» Sort, summarize, consolidate, compute views, check
integrity, build indexes, partition
 Refresh
» Propagate updates from sources to the warehouse.



Data Cleaning
 Why?
» Data warehouse contains data that is analyzed for
business decisions
» More data and multiple sources could mean more errors
» Results in incorrect analysis
 Detecting data anomalies and rectifying them early
has huge payoffs
 Important to identify tools that work together well
 Long Term Solution
» Change business practices and data entry tools
» Repository for metadata



Load
 Issues:
» huge volumes of data to be loaded
» small time window (usually at night) when the
warehouse can be taken off-line
» when to build indexes and summary tables
» allow system administrator to monitor status, cancel, suspend, resume load, or change load rate
» restart after failure with no loss of data integrity.
 Techniques:
» batch load utility: sort input records on clustering key
and use sequential I/O; build indexes and derived tables
» sequential loads still too long (~100 days for TB)
» use parallelism and incremental techniques.
Parallel Load
Pipelined and partitioned parallelism

Source tables Scan Sort runs Merge runs Table insert Target tables

Build index record

Sort runs Merge runs Index insert Target index

[Barclay, Barnes, Gray, Sundaresan: Loading Databases Using Dataflow Parallelism]



Incremental Load
 Full load may still take too long.
» entire load is a (long) batch transaction
» replace old table with new after transaction commits
» use periodic checkpoints; after failure, restart from last
checkpoint.
 Use incremental loads during refresh to reduce data
volume
» insert only updated tuples
» now, incremental load conflicts with queries
» break into sequence of shorter transactions (every
~1000 records, every few seconds)
» coordinate this sequence of transactions: must ensure
consistency between base tables and derived tables &
indices.
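Breaking the load into short transactions can be sketched as below. This is an illustration under stated assumptions, not a production loader: it uses Python's built-in sqlite3 module, a hypothetical base table `sales(id, price)` with a hypothetical derived table `sales_total(total)`, and handles newly inserted tuples only. Each batch updates the base table and the derived table in the same transaction, so queries running between batches always see the two in agreement.

```python
import sqlite3

def incremental_load(conn, new_rows, batch_size=1000):
    """Insert new_rows in short transactions of at most batch_size rows,
    keeping the derived total consistent with the base table per batch."""
    batches = 0
    for start in range(0, len(new_rows), batch_size):
        batch = new_rows[start:start + batch_size]
        with conn:  # commits on success, rolls back the whole batch on error
            conn.executemany("INSERT INTO sales (id, price) VALUES (?, ?)", batch)
            conn.execute("UPDATE sales_total SET total = total + ?",
                         (sum(price for _, price in batch),))
        batches += 1
    return batches

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (id INTEGER PRIMARY KEY, price INTEGER);
CREATE TABLE sales_total (total INTEGER);
INSERT INTO sales_total VALUES (0);
""")

rows = [(i, 10) for i in range(1, 2501)]  # made-up delta of 2500 new tuples
n_batches = incremental_load(conn, rows)
total = conn.execute("SELECT total FROM sales_total").fetchone()[0]
print(n_batches, total)  # 3 25000
```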



Refresh
 Issues:
» when to refresh
– on every update: too expensive, only necessary if
OLAP queries need current data (e.g., up-to-the-
minute stock quotes)
– periodically (e.g., every 24 hours, every week) or
after “significant” events
– refresh policy set by administrator based on user
needs and traffic
– possibly different policies for different sources.
» how to refresh.



Refresh Techniques
 Full extract from base tables
» read entire source table or database: expensive
» may be the only choice for legacy databases or files.
 Incremental techniques (related to work on active dbs)
» detect & propagate changes on base tables: replication
servers
– snapshots & triggers (Oracle)
– transaction shipping (Sybase)
» logical correctness
– computing changes to star tables
– computing changes to derived and summary tables
– optimization: only significant changes
» transactional correctness: incremental load.

