DW - Course Information: - Teachers

This document provides information on a data warehousing (DW) course, including: - The teachers for the course and required literature - An outline of course topics and assignments - Reading directions for each lecture - Keywords and definitions related to DW, data marts, OLAP, and star and snowflake schemas - Common OLAP operations like roll-up, drill-down, slice and dice - Approaches to OLAP servers like ROLAP and MOLAP

Uploaded by

Akshay Verma
Copyright
© Attribution Non-Commercial (BY-NC)

DW - course information

Teachers:
Petia Wohed, Erik Perjons, Gudrun Jeppesen

Literature:
The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling, Ralph Kimball & Margy Ross [K&R]
Compendium with extra reading material [Comp]

Reference Literature:
Fundamentals of Database Systems, Elmasri & Navathe [EN]
Database Systems, Connolly & Begg [CB]

DW - course pedagogy
F1 DW Introduction (A3 + Extra assignment handed out)
F2 Multidimensional Modelling 1 (A1 handed out)
F3 Multidimensional Modelling 2 (A2 handed out)
F4 DW Lifecycle
S1 Multidimensional Modelling-Theory (A1 reported)
F5 DW Physical design (A4 handed out)
S2 Multidimensional Modelling-Practice (A2 reported)
F6 Data Mining
S3 Presentation of Articles (A3 reported)
A4 reported (individual time for each group has to be booked)

Optional: Extra assignment handed in individually.

Written Examination

DW - reading directions
F1 DW Introduction
[Comp] article 1, [K&R] chapter 1

F2 Multidimensional Modelling 1
[K&R] chapters 2,3,4

F3 Multidimensional Modelling 2
[K&R] chapters 5,6,7,8

A1 Multidimensional Modelling-Theory
A2 Multidimensional Modelling-Practice


- [K&R] chapter 9

F4 DW Lifecycle
[K&R] chapter 16

F5 DW Physical design
[Comp] article 2

A3 Presentation of Article Extra assignment (optional)

F6 Data Mining
[Comp] article 3

- periodicals, e.g., ACM, IEEE
- conf. proc., e.g., VLDB, CAiSE, ER

A4 Tool Practice

We are drowning in information, but starving for knowledge.
- John Naisbitt

Lecture 1 - Introduction to DW
Reading Requirements
[Comp] R. Ramakrishnan and J. Gehrke, Chapter 23, Decision Support
[K&R] Kimball, Chapter 1
[EN] chapter 26
[CB] chapter 25

Keywords
DW, DSS, OLTP, OLAP, MDM, ROLAP, MOLAP, Bitmap Index, Join Index, Data Mart

The Data Warehouse - definition


B. Inmon:
A data warehouse is a subject-oriented, integrated, non-volatile, and time-variant collection of data in support of management's decisions. (Swedish version on the slide, translated: a data warehouse is a business-oriented, integrated, non-volatile, and time-dependent collection of data intended to support decision making at the strategic level.)

S. Chaudhuri & U. Dayal:
Data warehousing is a collection of decision support technologies, aimed at enabling the knowledge worker (executive, manager, analyst) to make better and faster decisions.

Data Warehouse - Subject-Oriented

- Organized around major subjects, such as customer, product, sales.
- Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.
- Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.

[Figure: the operational systems (Production System, Payroll System, Sales System) feed the DW, which is organised into Customer Data, Product Data, and Sales Data.]

Data Warehouse - Integrated

- Constructed by integrating multiple, heterogeneous data sources:
  - relational or other databases, flat files, external data
- Data cleaning and data integration techniques are applied.
  - Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources.
  - When data is moved to the warehouse, it is converted.

[Figure: operational systems (Marketing System, Order System, Billing System) integrated into Customer Data in the DW.]
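The conversion step described above can be sketched as follows. This is a minimal illustration, assuming two hypothetical source systems that name and encode the same customer attributes differently; the field names, codes, and units are invented for the example and do not come from the slides.

```python
# Hypothetical source encodings: the order system stores gender as "m"/"f",
# the billing system as 1/0. Both are mapped to one warehouse encoding.
GENDER_MAP = {"m": "M", "f": "F", 1: "M", 0: "F"}

def to_warehouse_row(source_row):
    """Convert one source record to the warehouse's naming and encoding conventions."""
    # Naming conventions: different attribute names are unified to 'customer_id'.
    customer_id = source_row.get("cust_no", source_row.get("customerId"))
    # Encoding structures: every source code is mapped to "M" or "F".
    gender = GENDER_MAP[source_row["gender"]]
    # Attribute measures: lengths are stored in centimetres in the warehouse.
    if "height_inch" in source_row:
        height_cm = round(source_row["height_inch"] * 2.54, 1)
    else:
        height_cm = source_row["height_cm"]
    return {"customer_id": customer_id, "gender": gender, "height_cm": height_cm}

order_system_row = {"cust_no": 17, "gender": "m", "height_inch": 70}
billing_system_row = {"customerId": 17, "gender": 1, "height_cm": 177.8}

converted = [to_warehouse_row(order_system_row), to_warehouse_row(billing_system_row)]
print(converted)
```

After conversion both records carry the same attribute names, codes, and units, so they can be compared or merged in the warehouse.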

Data Warehouse - Time Variant

- The time horizon for the data warehouse is significantly longer than that of operational systems.
  - Operational database: current value data.
  - Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years).
- Every key structure in the data warehouse contains an element of time, whereas the key of operational data may or may not contain a time element.

[Figure: Order System (operational), 60-90 days of data; Customer Data (DW), 5-10 years of data.]

Data Warehouse - Non-Volatile

- A physically separate store of data transformed from the operational environment.
- Operational update of data does not occur in the data warehouse environment.
  - Does not require transaction processing, recovery, and concurrency control mechanisms.
  - Requires only two operations: loading and access of data.

[Figure: operational systems (Order System) support Create, Update, Delete, and Insert; the DW (Customer Data) supports only Load and Access.]

Decision Support and OLAP (Navathe)

Information technology to help the knowledge worker (executive, manager, analyst) make faster and better decisions.
- Will a 10% discount increase sales volume sufficiently?
- Which of two new medications will result in the best outcome: higher recovery rate and shorter hospital stay?
- How did the share price of computer manufacturers correlate with quarterly profits over the past 10 years?

On-Line Analytical Processing (OLAP) is an element of decision support systems (DSS).

Data Warehouse (Navathe)

A decision support database that is maintained separately from the organisation's operational databases.
A data warehouse is a subject-oriented, integrated, time-varying, non-volatile collection of data that is used primarily in organisational decision making.

Why separate data warehouse?

Performance
- The operational DBs are tuned to support known OLTP workloads.
- Supporting OLAP requires special data organisations, access methods and implementation methods.

Function
- Decision support requires data that may be missing from the operational DBs.
- Decision support usually requires consolidating data from many heterogeneous sources.

OLTP vs. OLAP

OLTP                                        OLAP
holds current data                          holds historic and integrated data
stores detailed data                        stores detailed and summarised data
data is dynamic                             data is largely static
repetitive processing                       ad-hoc, unstructured and heuristic processing
high level of transaction throughput        medium or low level of transaction throughput
predictable pattern of usage                unpredictable pattern of usage
transaction driven                          analysis driven
application oriented                        subject oriented
supports day-to-day decisions               supports strategic decisions
serves large number of operational users    serves relatively low number of managerial users

DW Architecture

[Figure: data sources (operational DBs, external sources) are extracted, transformed, loaded, and refreshed into the data warehouse and data marts, supported by a metadata repository and monitoring & administration tools. The warehouse serves OLAP servers, which in turn support analysis, query/reporting, and data mining.]

Data Warehouse vs. Data Mart

Enterprise warehouse: collects all information about subjects (customers, products, sales, assets, personnel) that span the entire organisation.
- Requires extensive business modelling
- May take years to design and build

Data mart: departmental subsets that focus on selected subjects, e.g. a marketing data mart covering customer, product, sales.
- Faster roll-out
- Complex integration in the long term

To Meet the Requirements within DW

- The data is organised differently, i.e. multidimensionally
  - star-join schemas
  - snowflake schemas
- The data is viewed differently
- The data is stored differently
  - vector (array) storage
- The data is indexed differently
  - bitmap indexes
  - join indexes

From Spreadsheets to Data Cubes

[Figure: a spreadsheet with the axes month and country, next to a data cube with the axes month, country, and product; sample cell values include 2 300, 5 024, 200, and 130.]

Multidimensional view of the data

[Figure: several data cubes, each with the axes month, country, and product, arranged along the further dimensions promotion campaign and customer group.]

Example - Star-Join Schema

Location (Key, City, Country)
Product (Key, Name, Category)
Time (Key, Month, Year)
Sales Fact (LocationKey, ProductKey, TimeKey, QuantitySold)

Example

Location
rid    Key  City
rid1   1    Stockholm
rid2   2    London
rid3   3    Paris

Sales
rid    LKey  PKey  TKey  Qnt
rid4   1     1     1     5
rid5   1     2     1     7
rid6   1     3     1     4
rid7   2     1     1     8
rid8   2     2     1     3
rid9   2     3     1     5
rid10  3     1     1     20
rid11  3     2     1     10
rid12  3     3     1     30
rid13  1     1     2     10
rid14  1     2     2     9
rid15  1     3     2     7
rid16  2     1     2     5
rid17  2     2     2     10
rid18  2     3     2     8
rid19  3     1     2     20
rid20  3     2     2     50
rid21  3     3     2     30

Product
rid    Key  Name
rid22  1    #5
rid23  2    Noah
rid24  3    Opium

Time
rid    Key  Month
rid25  1    Jan
rid26  2    Feb
rid27  3    Mar
rid28  4    Apr
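A typical star-join query over the example data can be sketched with Python's built-in sqlite3 module. This is an illustration under stated assumptions, not part of the slides: the column names LocationKey and TimeKey stand in for the slides' "Key" columns, and the Product table is left out because the query does not need it.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Location (LocationKey INTEGER PRIMARY KEY, City TEXT);
CREATE TABLE Time     (TimeKey INTEGER PRIMARY KEY, Month TEXT);
CREATE TABLE Sales    (LKey INTEGER, PKey INTEGER, TKey INTEGER, Qnt INTEGER);
""")
con.executemany("INSERT INTO Location VALUES (?, ?)",
                [(1, "Stockholm"), (2, "London"), (3, "Paris")])
con.executemany("INSERT INTO Time VALUES (?, ?)",
                [(1, "Jan"), (2, "Feb"), (3, "Mar"), (4, "Apr")])

# Sales rows rid4..rid21 in slide order: TKey varies slowest, PKey fastest.
quantities = [5, 7, 4, 8, 3, 5, 20, 10, 30, 10, 9, 7, 5, 10, 8, 20, 50, 30]
keys = [(l, p, t) for t in (1, 2) for l in (1, 2, 3) for p in (1, 2, 3)]
sales = [(l, p, t, q) for (l, p, t), q in zip(keys, quantities)]
con.executemany("INSERT INTO Sales VALUES (?, ?, ?, ?)", sales)

# Star join: the fact table is joined to each dimension it needs,
# then the measure is aggregated per city and month.
rows = con.execute("""
    SELECT l.City, t.Month, SUM(s.Qnt)
    FROM Sales s
    JOIN Location l ON l.LocationKey = s.LKey
    JOIN Time t     ON t.TimeKey = s.TKey
    GROUP BY l.LocationKey, l.City, t.TimeKey, t.Month
    ORDER BY l.LocationKey, t.TimeKey
""").fetchall()

for city, month, total in rows:
    print(city, month, total)
```

Each fact row joins to exactly one row per dimension, which is why the query needs only key-equality joins from the fact table outwards.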

Star-Join Schema
- A single fact table and a single table for each dimension
- Every fact points to one tuple in each of the dimensions and has additional attributes
- The fact table is highly normalised, whereas the dimension tables are not normalised
- Dimensions do not capture hierarchies directly
- Generated keys are used for performance and maintenance reasons
- Fact constellation: multiple fact tables that share many dimension tables

Snowflake Schema
- Represents dimensional hierarchies directly by normalising the dimension tables
- Saves storage
- Reduces the effectiveness of browsing
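The trade-off can be illustrated with a small sketch that normalises a Location dimension into a city table and a country table. The country values and the fourth row are assumptions added for illustration; the slides give only the cities.

```python
# Star-style (denormalised) dimension: the country name repeats on every row.
location_star = [
    (1, "Stockholm", "Sweden"),
    (2, "London", "UK"),
    (3, "Paris", "France"),
    (4, "Uppsala", "Sweden"),   # hypothetical extra row to show the repetition
]

def snowflake(rows):
    """Split (key, city, country) rows into a city table and a country table."""
    country_keys = {}      # country name -> generated country key
    city_table = []        # (location key, city, country key)
    for key, city, country in rows:
        country_key = country_keys.setdefault(country, len(country_keys) + 1)
        city_table.append((key, city, country_key))
    country_table = [(k, name) for name, k in country_keys.items()]
    return city_table, country_table

city_table, country_table = snowflake(location_star)

def country_of(location_key):
    """Browsing now needs an extra lookup (a join) to reach the country."""
    country_key = next(ck for k, _, ck in city_table if k == location_key)
    return next(name for k, name in country_table if k == country_key)

print(city_table)
print(country_table)
print(country_of(4))
```

Each country name is stored once, which saves space, but answering "which country is this location in" takes an extra lookup, which is the browsing cost the slide refers to.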

Example - Snowflake Schema

[Figure: snowflake schema with the fact table Telephone calls (sum ($), number of calls) and the dimensions Service used (service name), Time (date), Sales Dimension (seller name), and Customer (customer name, address). Outer hierarchy tables shown: Service group, Month, Quarter, Year, Office, Region, Income group.]

Typical OLAP Operations

- Roll up (drill-up): summarize data by climbing up a hierarchy or by dimension reduction
- Drill down (roll down): reverse of roll-up, from higher level summary to lower level summary or detailed data, or introducing new dimensions
- Slice and dice: project and select
- Pivot (rotate): reorient the cube, visualization, 3D to series of 2D planes
- Other operations
  - drill across: involving (across) more than one fact table
  - drill through: through the bottom level of the cube to its back-end relational tables (using SQL)
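The core operations can be sketched on the Sales facts of the earlier example, held as a dictionary keyed by (LKey, PKey, TKey). This is a minimal illustration, not an OLAP server: the function names are invented here, and roll-up is shown only in its dimension-reduction form, not as climbing a hierarchy.

```python
from collections import defaultdict

# Sales facts in slide order (rid4..rid21): TKey varies slowest, PKey fastest.
quantities = [5, 7, 4, 8, 3, 5, 20, 10, 30, 10, 9, 7, 5, 10, 8, 20, 50, 30]
keys = [(l, p, t) for t in (1, 2) for l in (1, 2, 3) for p in (1, 2, 3)]
cube = dict(zip(keys, quantities))     # (LKey, PKey, TKey) -> Qnt

def roll_up(cells, keep):
    """Sum out every dimension whose position is not listed in `keep`."""
    result = defaultdict(int)
    for coords, qnt in cells.items():
        result[tuple(coords[d] for d in keep)] += qnt
    return dict(result)

def slice_(cells, dim, value):
    """Select the cells where one dimension has a fixed value."""
    return {c: q for c, q in cells.items() if c[dim] == value}

def dice(cells, allowed):
    """Select a sub-cube; `allowed` maps dimension position -> set of values."""
    return {c: q for c, q in cells.items()
            if all(c[d] in values for d, values in allowed.items())}

by_city_month = roll_up(cube, keep=(0, 2))   # roll up: drop the product dimension
by_city = roll_up(cube, keep=(0,))           # roll up further: drop time as well
january = slice_(cube, dim=2, value=1)       # slice: TKey = 1
sub_cube = dice(cube, {0: {2, 3}, 1: {1, 2}})  # dice: two cities, two products

# Pivot: swap the axes of the two-dimensional city/month view.
by_month_city = {(t, l): q for (l, t), q in by_city_month.items()}

print(by_city)
print(by_city_month)
```

Drill-down is the reverse direction: going from `by_city` back to `by_city_month`, and from there to the full cube.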

Approaches to OLAP Servers

Relational OLAP (ROLAP)
- Uses a relational or extended-relational DBMS to store and manage warehouse data

Multidimensional OLAP (MOLAP)
- Array-based storage structure (n-dimensional array)
- Direct access to array data structure
- Good indexing properties
- Poor storage utilisation when the data is sparse
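The MOLAP properties listed above can be illustrated with a flat array that reserves one slot per possible cell. This is a sketch under assumptions: the dimension sizes follow the example tables (3 locations, 3 products, 4 months), and the row-major layout is one possible choice, not something the slides prescribe.

```python
# Dimension sizes: 3 locations x 3 products x 4 months.
N_LOC, N_PROD, N_TIME = 3, 3, 4
EMPTY = None

# One flat array holds every possible cell, whether or not a fact exists.
cells = [EMPTY] * (N_LOC * N_PROD * N_TIME)

def offset(lkey, pkey, tkey):
    """Row-major position of a cell; keys are 1-based as in the example."""
    return ((lkey - 1) * N_PROD + (pkey - 1)) * N_TIME + (tkey - 1)

def store(lkey, pkey, tkey, qnt):
    cells[offset(lkey, pkey, tkey)] = qnt

def lookup(lkey, pkey, tkey):
    # Direct access: the position is computed, so no index search is needed.
    return cells[offset(lkey, pkey, tkey)]

store(1, 1, 1, 5)     # Stockholm, product 1, Jan
store(3, 2, 2, 50)    # Paris, product 2, Feb

used = sum(1 for c in cells if c is not EMPTY)
utilisation = used / len(cells)
print(lookup(3, 2, 2), used, len(cells), utilisation)
```

With only two facts stored, 34 of the 36 slots are empty, which is the poor storage utilisation for sparse data that the slide mentions.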

Bitmap Indexing
- An effective indexing technique for attributes with low-cardinality domains
- There is a distinct bit vector BV for each value V of the domain
- Example: the attribute sex has the values M and F. A table of 100 million people needs 2 lists of 100 million bits.
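The storage figure in the example works out as follows. This is a small sketch assuming uncompressed bit vectors and 1 MB = 10^6 bytes; neither assumption is stated on the slide.

```python
rows = 100_000_000        # people in the table
distinct_values = 2       # M and F, one bit vector per value

bits = rows * distinct_values
megabytes = bits / 8 / 1_000_000   # 8 bits per byte, 10**6 bytes per MB

print(bits, megabytes)
```

The index size grows with the number of distinct values, which is why the technique suits low-cardinality attributes.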

Bitmap Index

Base Table
Cust  Region  Rating
C1    N       H
C2    S       M
C3    W       L
C4    W       H
C5    S       L
C6    W       L
C7    W       H

Region Index
RowId  N  S  E  W
1      1  0  0  0
2      0  1  0  0
3      0  0  0  1
4      0  0  0  1
5      0  1  0  0
6      0  0  0  1
7      0  0  0  1

Rating Index
RowId  H  M  L
1      1  0  0
2      0  1  0
3      0  0  1
4      1  0  0
5      0  0  1
6      0  0  1
7      1  0  0

SELECT Cust FROM BaseTable WHERE Region = 'W' AND Rating = 'L'

Bitmap Index

The query is answered by combining bit vectors instead of scanning the base table (bits listed for rows 1 to 7):

Region = W    0 0 1 1 0 1 1
AND
Rating = L    0 0 1 0 1 1 0
=             0 0 1 0 0 1 0

Rows 3 and 6 qualify, i.e. customers C3 and C6.
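The same evaluation can be sketched in code. This is a minimal illustration using Python integers as bit vectors; the function name is invented, and values that never occur (such as region E) simply get no vector here, whereas the slide shows an all-zero column.

```python
base_table = [
    ("C1", "N", "H"), ("C2", "S", "M"), ("C3", "W", "L"), ("C4", "W", "H"),
    ("C5", "S", "L"), ("C6", "W", "L"), ("C7", "W", "H"),
]

def build_bitmap_index(rows, column):
    """Return {value: int}, where bit i of the int is set if row i has that value."""
    index = {}
    for row_number, row in enumerate(rows):
        value = row[column]
        index[value] = index.get(value, 0) | (1 << row_number)
    return index

region_index = build_bitmap_index(base_table, column=1)
rating_index = build_bitmap_index(base_table, column=2)

# WHERE Region = 'W' AND Rating = 'L' becomes one bitwise AND of two vectors.
hits = region_index["W"] & rating_index["L"]

# Translate the set bits back into base-table rows.
customers = [row[0] for i, row in enumerate(base_table) if (hits >> i) & 1]
print(customers)
```

An OR or NOT condition in the WHERE clause would map to the corresponding bitwise operation in the same way.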

Join Index
- Join index, roughly: JI(Cf, R-id), where D(Cd, R-id, ...) ⋈_{Cd=Cf} F(Cf, R-id, ...), with D a dimension table and F the fact table
- Traditional indices map the values to a list of record ids.
- In a data warehouse, a join index relates the values of the dimensions of a star schema to rows in the fact table.
- Join indices can span multiple dimensions.

Join Index - Ex1

Using the Location and Sales tables of the example, a join index on city relates each Location key to the rids of the matching Sales rows. The slide lists the entries for CityK 1 and 2:

CityJI
CityK  Rid
1      rid4
1      rid5
1      rid6
1      rid13
1      rid14
1      rid15
2      rid7
2      rid8
2      rid9
2      rid16
2      rid17
2      rid18

Join Index - Ex2

A join index can also span two dimensions. Using the same example tables, the City-Product join index relates each (Location key, Product key) pair to the rids of the matching Sales rows. The slide lists the entries for CityK 1:

City-Product JI
CityK  PrdK  Rid
1      1     rid4
1      1     rid13
1      2     rid5
1      2     rid14
1      3     rid6
1      3     rid15
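Both join indices can be built from the Sales rows of the example with one small helper. This is a sketch, not a DBMS implementation: the function and variable names are invented, and the index is held as a plain dictionary from dimension key values to fact-row rids.

```python
from collections import defaultdict

# Sales rows rid4..rid21 in slide order: rid -> (LKey, PKey, TKey, Qnt).
quantities = [5, 7, 4, 8, 3, 5, 20, 10, 30, 10, 9, 7, 5, 10, 8, 20, 50, 30]
sales = {}
rid_number = 4
for tkey in (1, 2):
    for lkey in (1, 2, 3):
        for pkey in (1, 2, 3):
            sales[f"rid{rid_number}"] = (lkey, pkey, tkey, quantities[rid_number - 4])
            rid_number += 1

def build_join_index(fact_rows, key_positions):
    """Map each combination of dimension keys to the rids of matching fact rows."""
    index = defaultdict(list)
    for rid, row in fact_rows.items():
        index[tuple(row[p] for p in key_positions)].append(rid)
    return dict(index)

city_ji = build_join_index(sales, key_positions=(0,))             # CityJI
city_product_ji = build_join_index(sales, key_positions=(0, 1))   # City-Product JI

# Using the index: total quantity sold in Stockholm (CityK 1) without
# scanning the fact rows of the other cities.
stockholm_total = sum(sales[rid][3] for rid in city_ji[(1,)])

print(city_ji[(1,)])
print(city_product_ji[(1, 2)])
print(stockholm_total)
```

The entries for CityK 1 and for the (1, PrdK) pairs match the rows listed on the slides.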

Summary
- Data warehouse: a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision-making process
- A multi-dimensional model of a data warehouse
  - Star schema, snowflake schema, fact constellations
  - A data cube consists of dimensions & measures
- OLAP operations: drilling, rolling, slicing, dicing and pivoting
- OLAP servers: ROLAP, MOLAP
