DW - Course Information: - Teachers
Teachers:
Petia Wohed, Erik Perjons, Gudrun Jeppesen
Literature:
The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling, Ralph Kimball & Margy Ross [K&R]
Compendium with extra reading material [Comp]
Reference Literature:
Fundamentals of Database Systems, Elmasri & Navathe [EN]
Database Systems, Connolly & Begg [CB]
DW - course pedagogy
F1 DW Introduction (A3 + Extra assignment handed out)
F2 Multidimensional Modelling 1 (A1 handed out)
F3 Multidimensional Modelling 2 (A2 handed out)
F4 DW Lifecycle
S1 Multidimensional Modelling-Theory (A1 reported)
F5 DW Physical design (A4 handed out)
S2 Multidimensional Modelling-Practice (A2 reported)
F6 Data Mining
S3 Presentation of Articles (A3 reported)
A4 reported (individual time for each group has to be booked)
Written Examination
DW - reading directions
F1 DW Introduction
[Comp] article 1, [K&R] chapter 1
F2 Multidimensional Modelling 1
[K&R] chapters 2,3,4
F3 Multidimensional Modelling 2
[K&R] chapters 5,6,7,8
F4 DW Lifecycle
[K&R] chapter 16
F5 DW Physical design
[Comp] article 2
F6 Data Mining
[Comp] article 3
A4 Tool Practice
Lecture 1 - Introduction to DW
Reading Requirements
[Comp] R. Ramakrishnan and J. Gehrke, Chapter 23, Decision Support
[K&R] Kimball, Chapter 1
[EN] chapter 26
[CB] chapter 25
Keywords
DW, DSS, OLTP, OLAP, MDM, ROLAP, MOLAP, Bitmap Index, Join Index, Data Mart
[Figure: operational systems (payroll system, sales system, order system) supply customer, product and sales data to the DW. Operational systems keep data for 60-90 days; the DW keeps it for 5-10 years.]
Access
(Navathe)
Information technology to help the knowledge worker (executive, manager, analyst) make faster and better decisions.
Will a 10% discount increase sales volume sufficiently?
Which of two new medications will result in the best outcome: higher recovery rate & shorter hospital stay?
How did the share price of computer manufacturers correlate with quarterly profits over the past 10 years?
Data Warehouse
(Navathe)
A decision support database that is maintained separately from the organisation's operational databases.
A data warehouse is a subject-oriented, integrated, time-varying, non-volatile collection of data in support of management's decision-making process.
Function
Decision support requires data that may be missing from the operational DBs
Decision support usually requires consolidating data from many heterogeneous sources
OLTP vs. OLAP

OLTP                                      OLAP
holds current data                        holds historic and integrated data
stores detailed data                      stores detailed and summarised data
data is dynamic                           data is largely static
repetitive processing                     ad-hoc, unstructured and heuristic processing
high level of transaction throughput      medium or low level of transaction throughput
predictable pattern of usage              unpredictable pattern of usage
transaction driven                        analysis driven
application oriented                      subject oriented
supports day-to-day decisions             supports strategic decisions
serves large number of operational users  serves relatively low number of managerial users
DW Architecture
[Figure: DW architecture. Data sources (operational DBs and external sources) are extracted, transformed, loaded and refreshed into the data warehouse, which is accompanied by a metadata repository and by monitoring & administration tools. The data warehouse serves OLAP servers, which support analysis and query/reporting tools.]
Data Mart: departmental subsets that focus on selected subjects, e.g. a marketing data mart: customer, product, sales
A data cube:
[Figure: a data cube with the dimensions product, month and country, shown together with views over subsets of these dimensions.]
Example
Location
Rid    Key  City
rid1   1    Stockholm
rid2   2    London
rid3   3    Paris

Sales
Rid    LKey  PKey  TKey  Qnt
rid4   1     1     1     5
rid5   1     2     1     7
rid6   1     3     1     4
rid7   2     1     1     8
rid8   2     2     1     3
rid9   2     3     1     5
rid10  3     1     1     20
rid11  3     2     1     10
rid12  3     3     1     30
rid13  1     1     2     10
rid14  1     2     2     9
rid15  1     3     2     7
rid16  2     1     2     5
rid17  2     2     2     10
rid18  2     3     2     8
rid19  3     1     2     20
rid20  3     2     2     50
rid21  3     3     2     30

Time
Key
1
2
3
4
Star-Join Schema
A single fact table and a single table for each dimension
Every fact points to one tuple in each of the dimensions and has additional attributes
The fact table is highly normalised, whereas the dimension tables are not normalised
Dimensions do not capture hierarchies directly
Generated keys are used for performance and maintenance reasons
Fact constellation: multiple fact tables that share many dimension tables
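A star join over the example Location and Sales tables can be sketched as follows. This is only an illustration using Python's built-in sqlite3 module; the table and column names (location, sales, lkey, pkey, tkey, qnt) are assumptions made for the sketch, not part of the course material.

```python
import sqlite3

# In-memory database holding one dimension table and the fact table.
# All table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE location (lkey INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE sales (
        lkey INTEGER REFERENCES location(lkey),
        pkey INTEGER,
        tkey INTEGER,
        qnt  INTEGER
    );
""")
conn.executemany("INSERT INTO location VALUES (?, ?)",
                 [(1, "Stockholm"), (2, "London"), (3, "Paris")])

# Fact rows in the order of the example: TKey 1 first, then TKey 2.
qnt = [5, 7, 4, 8, 3, 5, 20, 10, 30, 10, 9, 7, 5, 10, 8, 20, 50, 30]
keys = [(l, p, t) for t in (1, 2) for l in (1, 2, 3) for p in (1, 2, 3)]
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)",
                 [(l, p, t, q) for (l, p, t), q in zip(keys, qnt)])

# Star join: every fact row points to exactly one Location tuple.
totals = conn.execute("""
    SELECT l.city, SUM(s.qnt)
    FROM sales AS s
    JOIN location AS l ON s.lkey = l.lkey
    GROUP BY l.lkey, l.city
    ORDER BY l.lkey
""").fetchall()
print(totals)  # [('Stockholm', 42), ('London', 39), ('Paris', 160)]
```

Only one join per dimension is needed, because the dimension table is kept in a single, non-normalised table.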
Snowflake Schema
Represents the dimensional hierarchy directly by normalising the dimension tables
Saves storage
Reduces the effectiveness of browsing
[Figure: snowflake schema around the fact table Telephone calls (sum ($), number of calls). Tables shown: Time (date), Month, Quarter, Year, Service group, Sales Dimension (seller name), Office, Region, Customer (customer name, address), Income group.]
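The effect of normalising a dimension can be sketched with a small snowflaked Time dimension. The sketch uses Python's built-in sqlite3 module; all table names, column names and data values are hypothetical and chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Snowflaked Time dimension: each hierarchy level is its own table.
    -- Names and data are hypothetical.
    CREATE TABLE dim_quarter (quarter_id INTEGER PRIMARY KEY, quarter TEXT);
    CREATE TABLE dim_month (
        month_id   INTEGER PRIMARY KEY,
        month      TEXT,
        quarter_id INTEGER REFERENCES dim_quarter(quarter_id)
    );
    CREATE TABLE dim_time (
        time_id  INTEGER PRIMARY KEY,
        day      TEXT,
        month_id INTEGER REFERENCES dim_month(month_id)
    );
    CREATE TABLE calls (
        time_id INTEGER REFERENCES dim_time(time_id),
        amount  INTEGER,
        n_calls INTEGER
    );
    INSERT INTO dim_quarter VALUES (1, 'Q1'), (2, 'Q2');
    INSERT INTO dim_month   VALUES (1, 'Jan', 1), (2, 'Apr', 2);
    INSERT INTO dim_time    VALUES (1, '2001-01-15', 1), (2, '2001-04-03', 2);
    INSERT INTO calls       VALUES (1, 100, 3), (1, 50, 2), (2, 70, 1);
""")

# Aggregating per quarter has to walk the hierarchy: one join per level.
per_quarter = conn.execute("""
    SELECT q.quarter, SUM(c.amount), SUM(c.n_calls)
    FROM calls AS c
    JOIN dim_time    AS t ON c.time_id    = t.time_id
    JOIN dim_month   AS m ON t.month_id   = m.month_id
    JOIN dim_quarter AS q ON m.quarter_id = q.quarter_id
    GROUP BY q.quarter_id, q.quarter
    ORDER BY q.quarter_id
""").fetchall()
print(per_quarter)  # [('Q1', 150, 5), ('Q2', 70, 1)]
```

Each quarter name is stored only once, which saves storage, but every query on a higher hierarchy level needs a chain of joins, which is one reason browsing becomes less effective.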
Slice and dice: project and select
Pivot (rotate): reorient the cube, visualization, 3D to series of 2D planes
Other operations
drill across: involving (across) more than one fact table
drill through: through the bottom level of the cube to its
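Slice, dice and pivot can be sketched on the example Sales data, viewed as a small cube with the dimensions (LKey, PKey, TKey) and the measure Qnt. This is a minimal plain-Python sketch; the variable names are assumptions made for illustration.

```python
# The example Sales facts as a cube: (LKey, PKey, TKey) -> Qnt.
qnt = [5, 7, 4, 8, 3, 5, 20, 10, 30, 10, 9, 7, 5, 10, 8, 20, 50, 30]
keys = [(l, p, t) for t in (1, 2) for l in (1, 2, 3) for p in (1, 2, 3)]
cube = dict(zip(keys, qnt))

# Slice: fix one dimension (TKey = 1), leaving a 2D plane (LKey, PKey).
slice_t1 = {(l, p): q for (l, p, t), q in cube.items() if t == 1}

# Dice: select a sub-cube by restricting several dimensions at once.
dice = {k: q for k, q in cube.items() if k[0] in (1, 2) and k[1] in (1, 2)}

# Pivot: reorient the slice from location-by-product to product-by-location.
pivoted = {}
for (l, p), q in slice_t1.items():
    pivoted.setdefault(p, {})[l] = q

print(sum(slice_t1.values()))  # 92
print(sum(dice.values()))      # 57
print(pivoted[1])              # {1: 5, 2: 8, 3: 20}
```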
Bitmap Indexing
An effective indexing technique for attributes with low-cardinality domains
There is a distinct bit vector BV for each value V of the domain
Example: the attribute sex has the values M and F. A table of 100 million people needs 2 lists of 100 million bits.
Bitmap Index

Base Table
Cust  Region  Rating
C1    N       H
C2    S       M
C3    W       L
C4    W       H
C5    S       L
C6    W       L
C7    W       H

Region Index
RowId  N  S  E  W
1      1  0  0  0
2      0  1  0  0
3      0  0  0  1
4      0  0  0  1
5      0  1  0  0
6      0  0  0  1
7      0  0  0  1

Rating Index
RowId  H  M  L
1      1  0  0
2      0  1  0
3      0  0  1
4      1  0  0
5      0  0  1
6      0  0  1
7      1  0  0
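Query: Region = W AND Rating = L
The W vector (0011011) ANDed bitwise with the L vector (0010110) gives 0010010, so rows 3 and 6 (customers C3 and C6) qualify.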
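The query above can be sketched in plain Python. The helper name build_bitmap_index is hypothetical and the bit vectors are kept as simple lists of 0/1 for readability; a real system would use packed (and often compressed) bit strings.

```python
# Base table from the example: (Cust, Region, Rating).
base = [("C1", "N", "H"), ("C2", "S", "M"), ("C3", "W", "L"), ("C4", "W", "H"),
        ("C5", "S", "L"), ("C6", "W", "L"), ("C7", "W", "H")]

def build_bitmap_index(values, domain):
    """Return one bit vector per domain value (hypothetical helper)."""
    index = {v: [0] * len(values) for v in domain}
    for row, value in enumerate(values):
        index[value][row] = 1
    return index

region_index = build_bitmap_index([r[1] for r in base], "NSEW")
rating_index = build_bitmap_index([r[2] for r in base], "HML")

# Region = W AND Rating = L: a bitwise AND of the two bit vectors.
hits = [w & l for w, l in zip(region_index["W"], rating_index["L"])]
matches = [base[row][0] for row, bit in enumerate(hits) if bit]

print(hits)     # [0, 0, 1, 0, 0, 1, 0]
print(matches)  # ['C3', 'C6']
```

The base table is only touched for the rows whose bit is set in the result.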
Join Index
Join index, roughly: $JI(C_f, \text{R-id})$, where $D(C_d, \text{R-id}, \ldots) \bowtie_{C_d = C_f} F(C_f, \text{R-id}, \ldots)$
Traditional indexes map values to a list of record ids
In a data warehouse, a join index relates the values of the dimensions of a star schema to rows in the fact table
Join indices can span multiple dimensions
Join Index - Example
The example uses the Location, Sales and Time tables from the earlier example: the Location rows are rid1-rid3 and the Sales rows are rid4-rid21.
CityJI
CityK  Rid
1      rid4
1      rid5
1      rid6
1      rid13
1      rid14
1      rid15
2      rid7
2      rid8
2      rid9
2      rid16
2      rid17
2      rid18
City-Product JI
CityK  PrdK  Rid
1      1     rid4
1      1     rid13
1      2     rid5
1      2     rid14
1      3     rid6
1      3     rid15
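Both join indices can be sketched in plain Python by grouping the record ids of the example Sales rows by their dimension keys. The dictionary-based representation and the variable names are assumptions made for illustration, not a description of how a DBMS stores a join index.

```python
from collections import defaultdict

# Sales fact rows rid4..rid21 from the example: rid -> (LKey, PKey, TKey, Qnt).
qnt = [5, 7, 4, 8, 3, 5, 20, 10, 30, 10, 9, 7, 5, 10, 8, 20, 50, 30]
keys = [(l, p, t) for t in (1, 2) for l in (1, 2, 3) for p in (1, 2, 3)]
sales = {f"rid{4 + i}": (l, p, t, q)
         for i, ((l, p, t), q) in enumerate(zip(keys, qnt))}

city_ji = defaultdict(list)          # CityK -> fact rids
city_product_ji = defaultdict(list)  # (CityK, PrdK) -> fact rids
for rid, (l, p, t, q) in sales.items():
    city_ji[l].append(rid)
    city_product_ji[(l, p)].append(rid)

print(city_ji[1])               # ['rid4', 'rid5', 'rid6', 'rid13', 'rid14', 'rid15']
print(city_product_ji[(1, 1)])  # ['rid4', 'rid13']

# Using the index: total quantity for city 1 without scanning all fact rows.
stockholm_qnt = sum(sales[rid][3] for rid in city_ji[1])
print(stockholm_qnt)            # 42
```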
Summary
Data warehouse: a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process
A multi-dimensional model of a data warehouse: star schema, snowflake schema, fact constellations
A data cube consists of dimensions & measures
OLAP operations: drilling, rolling, slicing, dicing and pivoting
OLAP servers: ROLAP, MOLAP