
Data Mining and Warehousing (203105431) : Sandeep Jangir, Assistant Professor

The document provides an overview of key concepts related to data warehousing and online analytical processing (OLAP). It discusses the characteristics of a data warehouse as being subject-oriented, integrated, time-variant and non-volatile. It also describes the extraction, transformation and loading (ETL) process, multidimensional data models such as star schemas and snowflake schemas, and different types of OLAP servers and operations including roll up, drill down and slice and dice.

Uploaded by

Harsha Gangwani

Data Mining and Warehousing

(203105431)
Sandeep Jangir, Assistant Professor
Department of Computer Science & Engineering
The Course Outline
Chapter 1 : Introduction to data mining (DM):

Chapter 2: Overview and Concepts of Data Warehousing and Business Intelligence
Chapter 3: Data Warehousing and Online Analytical Processing
Chapter 4: Data Pre-processing:
Chapter 5: Mining Frequent Patterns, Associations, and Correlations:
Chapter 6: Classification
Chapter 7: Clustering:
Chapter 8: Applications
CHAPTER-3
Data Warehousing and OLAP
Outline
1.1 Introduction to Data Warehousing
1.2 Multitier Architecture
1.3 ETL Process
1.4 Multidimensional Data Model
1.5 Data Warehouse Models
1.6 OLAP Server

Image source : Google


Introduction to Data Warehousing

• A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision-making process.

Figure 3.1: Representation of Data Warehouse
Subject-Oriented

• A data warehouse can be used to analyze a particular subject area.
• For example, "sales" can be a particular subject.
• Organized around major subjects, such as customer, product, sales



Integrated

• A data warehouse integrates data from multiple data sources.
- Relational databases, flat files, on-line transaction records
• Data cleaning and data integration techniques are applied.
- Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
- E.g., Hotel price: currency, tax, breakfast covered, etc.
- When data is moved to the warehouse, it is converted.



Time-Variant

• The time horizon for the data warehouse is significantly longer than that of operational systems.
- Operational database: current-value data
- Data warehouse data: provides information from a historical perspective (e.g., past 5-10 years)
• Every key structure in the data warehouse contains an element of time, explicitly or implicitly.
- The key of operational data may or may not contain a "time element".



Non-Volatile

• A physically separate store of data transformed from the operational environment
• Operational update of data does not occur in the data warehouse environment
- Does not require transaction processing, recovery, and concurrency control mechanisms
- Requires only two operations in data accessing: initial loading of data and access of data



Why a Separate Data Warehouse?

• High performance for both systems
- DBMS tuned for OLTP: access methods, indexing, concurrency control, recovery
- Warehouse tuned for OLAP: complex OLAP queries, multidimensional view, consolidation

• Different functions and different data
- Missing data: decision support (DS) requires historical data, which operational DBs do not typically maintain
- Data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources
- Data quality: different sources typically use inconsistent data representations, codes, and formats, which have to be reconciled

• Note: More and more systems perform OLAP analysis directly on relational databases.



Types of Data Warehouse Architecture
Single-tier Data Warehouse Architecture

Two-tier Data Warehouse Architecture

Three-tier Data Warehouse Architecture



ETL Process in Data Warehouse

Figure 3.4: ETL Process
ETL Process (Contd...)

• Data extraction
- Get data from multiple, heterogeneous, and external sources

• Data cleaning
- Detect errors in the data and rectify them when possible

• Data transformation
- Convert data from legacy or host format to warehouse format



ETL Process (Contd.....)

• Load
- Sort, summarize, consolidate, compute views, check integrity, and build
indices and partitions

• Refresh
- Propagate the updates from the data sources to the warehouse
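
The extraction, cleaning, transformation, and load steps listed above can be sketched in a few lines of Python. This is only a minimal illustration: the two sources, their field names, the `sales` table, and the rule "drop rows with a missing amount" are assumptions made for the example, not part of the slides.

```python
import sqlite3

# Hypothetical records from two heterogeneous sources (illustrative values only).
SOURCE_A = [
    {"product": " Laptop ", "region": "East", "amount": "1200.50"},
    {"product": "Phone", "region": "West", "amount": None},  # missing amount
]
SOURCE_B = [{"item": "Tablet", "area": "WEST", "amt_cents": 45000}]


def extract():
    """Extraction: collect rows from every source, tagged with their origin."""
    return [("A", r) for r in SOURCE_A] + [("B", r) for r in SOURCE_B]


def clean(tagged_rows):
    """Cleaning: drop rows whose amount field is missing."""
    amount_field = {"A": "amount", "B": "amt_cents"}
    return [(src, r) for src, r in tagged_rows if r.get(amount_field[src]) is not None]


def transform(tagged_rows):
    """Transformation: convert each source format to (product, region, amount)."""
    out = []
    for src, r in tagged_rows:
        if src == "A":
            out.append((r["product"].strip(), r["region"].lower(), float(r["amount"])))
        else:
            out.append((r["item"].strip(), r["area"].lower(), r["amt_cents"] / 100))
    return out


def load(conn, rows):
    """Load: insert the rows and build an index on the region column."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (product TEXT, region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_sales_region ON sales (region)")
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database standing in for the warehouse
load(conn, transform(clean(extract())))
loaded = conn.execute("SELECT product, region, amount FROM sales ORDER BY product").fetchall()
print(loaded)
```

A refresh would rerun the same pipeline on only the rows that changed at the sources; that step is left out here to keep the sketch short.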
Multidimensional Model

• Sales volume as a function of product, month, and region

• Dimensions: Product, Location, Time

Figure 3.7: Multidimensional Model
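
As a rough sketch, "sales volume as a function of product, month, and region" can be modelled as a mapping from a (product, month, region) key to a measure. The records and the helper name `sales_volume` below are hypothetical and exist only for this illustration.

```python
from collections import defaultdict

# Hypothetical sales records: (product, month, region, units sold).
records = [
    ("TV", "Jan", "North", 10),
    ("TV", "Jan", "South", 5),
    ("TV", "Feb", "North", 7),
    ("Phone", "Jan", "North", 20),
    ("Phone", "Jan", "North", 3),
]

# The cube maps one cell (product, month, region) to its total units sold.
cube = defaultdict(int)
for product, month, region, units in records:
    cube[(product, month, region)] += units


def sales_volume(product, month, region):
    """Look up one cell of the cube; empty cells count as zero."""
    return cube.get((product, month, region), 0)


print(sales_volume("Phone", "Jan", "North"))
print(sales_volume("TV", "Mar", "North"))
```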
Conceptual Modelling in Data Warehouses

• There are three types of schema:

- Star schema
- Snowflake schema
- Fact constellation schema



Star Schema

• A star schema has two types of table:
- Fact table
- Dimension table
• A fact table in the middle is connected to a set of dimension tables.
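
A minimal sketch of a star schema, using Python's built-in `sqlite3` module: one fact table references three dimension tables, and a query joins them to summarize the measure. The table names, column names, and rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables (hypothetical names and contents)
CREATE TABLE dim_product  (product_id  INTEGER PRIMARY KEY, name    TEXT);
CREATE TABLE dim_location (location_id INTEGER PRIMARY KEY, city    TEXT);
CREATE TABLE dim_time     (time_id     INTEGER PRIMARY KEY, quarter TEXT);

-- Fact table in the middle: one foreign key per dimension plus a measure
CREATE TABLE fact_sales (
    product_id  INTEGER REFERENCES dim_product(product_id),
    location_id INTEGER REFERENCES dim_location(location_id),
    time_id     INTEGER REFERENCES dim_time(time_id),
    units_sold  INTEGER
);

INSERT INTO dim_product  VALUES (1, 'TV'), (2, 'Phone');
INSERT INTO dim_location VALUES (1, 'Delhi'), (2, 'Mumbai');
INSERT INTO dim_time     VALUES (1, 'Q1'), (2, 'Q2');
INSERT INTO fact_sales   VALUES (1, 1, 1, 10), (1, 2, 1, 4), (2, 1, 2, 6), (1, 1, 2, 3);
""")

# Total units per product and quarter: join the fact table to two dimensions.
star_rows = conn.execute("""
    SELECT p.name, t.quarter, SUM(f.units_sold)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_time    t ON t.time_id    = f.time_id
    GROUP BY p.name, t.quarter
    ORDER BY p.name, t.quarter
""").fetchall()
print(star_rows)
```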
Example of Star Schema

Figure: Star Schema
Snowflake Schema
• A refinement of star schema where some dimensional hierarchy is
normalized into a set of smaller dimension tables, forming a shape
similar to snowflake
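
To illustrate the normalization, the sketch below splits a location dimension into a city table and a country table, so a query by country needs one extra join. The schema and data are hypothetical and use Python's built-in `sqlite3` module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The location hierarchy is normalized into two smaller dimension tables.
CREATE TABLE dim_country (country_id INTEGER PRIMARY KEY, country TEXT);
CREATE TABLE dim_city (
    city_id    INTEGER PRIMARY KEY,
    city       TEXT,
    country_id INTEGER REFERENCES dim_country(country_id)
);
CREATE TABLE fact_sales (
    city_id    INTEGER REFERENCES dim_city(city_id),
    units_sold INTEGER
);

INSERT INTO dim_country VALUES (1, 'India'), (2, 'Canada');
INSERT INTO dim_city    VALUES (1, 'Delhi', 1), (2, 'Mumbai', 1), (3, 'Toronto', 2);
INSERT INTO fact_sales  VALUES (1, 10), (2, 5), (3, 7), (1, 2);
""")

# Reaching the country level takes two joins: fact -> city -> country.
snowflake_rows = conn.execute("""
    SELECT co.country, SUM(f.units_sold)
    FROM fact_sales f
    JOIN dim_city    ci ON ci.city_id    = f.city_id
    JOIN dim_country co ON co.country_id = ci.country_id
    GROUP BY co.country
    ORDER BY co.country
""").fetchall()
print(snowflake_rows)
```

The trade-off shown here is typical: the normalized dimension stores each country name once, at the cost of an additional join per query.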



Example of Snowflake Schema

Figure: Snowflake Schema
Fact Constellations Schema

• Multiple fact tables share dimension tables. The result can be viewed as a collection of stars and is therefore called a galaxy schema or fact constellation.



Fact Constellations Schema (Contd.....)

Figure: Snowflake Schema
Example of Fact Constellations Schema

Figure: Fact Constellations Schema
Data Warehouse Model

• Data warehouse models are categorized into three types:
- Enterprise warehouse
- Data mart
- Virtual warehouse



Data Warehouse Model (Contd.....)

• Enterprise warehouse
- Collects all of the information about subjects spanning the entire
organization

• Data Mart
- A subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific, selected groups, such as a marketing data mart.
- Independent vs. dependent (sourced directly from the warehouse) data mart
Data Warehouse Model (Contd.....)

• Virtual warehouse
- A set of views over operational databases
- Only some of the possible summary views may be materialized
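
A small sketch of the idea, assuming a hypothetical operational table `orders`: the virtual warehouse is a view that is recomputed on every query, while a "materialized" summary is emulated here by copying the view into a table (SQLite has no native materialized views, so this copy is an assumption of the example).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical operational table
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL);
INSERT INTO orders (region, amount) VALUES ('east', 100.0), ('east', 50.0), ('west', 70.0);

-- Virtual warehouse: a summary view over the operational data
CREATE VIEW v_sales_by_region AS
    SELECT region, SUM(amount) AS total FROM orders GROUP BY region;

-- Emulated materialized summary: a stored snapshot of the view
CREATE TABLE m_sales_by_region AS SELECT * FROM v_sales_by_region;

-- A later operational update
INSERT INTO orders (region, amount) VALUES ('west', 30.0);
""")

view_rows = conn.execute(
    "SELECT region, total FROM v_sales_by_region ORDER BY region").fetchall()
snapshot_rows = conn.execute(
    "SELECT region, total FROM m_sales_by_region ORDER BY region").fetchall()
print(view_rows)      # reflects the later insert
print(snapshot_rows)  # still holds the totals from snapshot time
```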
Concept Hierarchies

• Reduce the data size by collecting low-level concepts (such as 43 for age) and replacing them with higher-level concepts (categorical values such as middle age or senior).

• For numeric data, the following techniques can be used:
- Binning
- Histogram analysis
Binning (Contd.....)

• Binning is the process of changing numerical variables into categorical counterparts.

• The number of categorical counterparts depends on the number of bins specified by the user.
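
A minimal sketch of binning a numeric age into categories. The cut points (30 and 60) and the three labels are assumptions chosen for illustration; in practice the user specifies them.

```python
from bisect import bisect_right

# Hypothetical, user-specified cut points and labels (three bins).
CUT_POINTS = [30, 60]
LABELS = ["young", "middle age", "senior"]


def bin_age(age):
    """Replace a numeric age with the label of the bin it falls into."""
    return LABELS[bisect_right(CUT_POINTS, age)]


ages = [22, 43, 60, 75]
binned = [bin_age(a) for a in ages]
print(binned)
```

With these cut points an age of exactly 60 falls into the last bin, because `bisect_right` places a value equal to a cut point in the bin to its right.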
Histogram (Contd.....)

• A histogram partitions the values of an attribute X into disjoint ranges called buckets.

• There are several partitioning rules:
- Equal-frequency partitioning
- Equal-width partitioning
- Clustering
Partitioning Rules of Histogram Analysis (Contd.....)
• Equal-frequency partitioning:
- Partition the values so that each bucket holds roughly the same number of occurrences from the data set.

• Equal-width partitioning:
- Partition the values into ranges of a fixed width determined by the number of bins, e.g., one bucket covers the values 0-20.

• Clustering:
- Group similar values together.
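
The first two rules can be compared on the same data with a short sketch. The function names and the sample values are hypothetical, and the equal-width version assumes the values are not all identical.

```python
def equal_width_bins(values, n_bins):
    """Split the value range [min, max] into n_bins ranges of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins  # assumes hi > lo
    bins = [[] for _ in range(n_bins)]
    for v in values:
        # The maximum value would land one past the end, so clamp it.
        idx = min(int((v - lo) / width), n_bins - 1)
        bins[idx].append(v)
    return bins


def equal_frequency_bins(values, n_bins):
    """Sort the values and cut them into n_bins groups of (nearly) equal size."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // n_bins:(i + 1) * n // n_bins] for i in range(n_bins)]


values = [0, 4, 12, 16, 16, 18, 24, 26, 28]
print(equal_width_bins(values, 3))
print(equal_frequency_bins(values, 3))
```

Equal-width buckets can hold very different numbers of values, while equal-frequency buckets hold the same count but span ranges of different widths.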
OLAP Server (Contd.....)
• Relational OLAP (ROLAP)
- Uses a relational or extended-relational DBMS to store and manage warehouse data, plus OLAP middleware
- Includes optimization of the DBMS backend, implementation of aggregation navigation logic, and additional tools and services
- Greater scalability
OLAP Server (Contd.....)
• Multidimensional OLAP (MOLAP)
- Sparse array-based multidimensional storage engine
- Fast indexing to pre-computed summarized data

• Hybrid OLAP (HOLAP) (e.g., Microsoft SQL Server)
- Flexibility, e.g., low level: relational, high level: array

• Specialized SQL servers (e.g., Redbricks)
- Specialized support for SQL queries over star/snowflake schemas
Roll Up

• Roll up (drill-up)
- Summarize data
- By climbing up a hierarchy or by dimension reduction

Figure 3.7.2.a: Roll Up
Drill Down

• Drill down (roll down): reverse of roll-up
• From higher-level summary to lower-level summary or detailed data, or introducing new dimensions

Figure 3.7.2.b: Drill Down
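
Roll-up and drill-down can be sketched on a tiny cube held in a dictionary. The city-to-country hierarchy, the cell values, and the function names are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical concept hierarchy for the location dimension: city -> country.
CITY_TO_COUNTRY = {"Delhi": "India", "Mumbai": "India", "Toronto": "Canada"}

# Base cells: (city, quarter) -> units sold.
base = {
    ("Delhi", "Q1"): 10,
    ("Mumbai", "Q1"): 5,
    ("Toronto", "Q1"): 7,
    ("Delhi", "Q2"): 3,
}


def roll_up_location(cells):
    """Roll up by climbing the hierarchy: city level -> country level."""
    out = defaultdict(int)
    for (city, quarter), units in cells.items():
        out[(CITY_TO_COUNTRY[city], quarter)] += units
    return dict(out)


def roll_up_drop_time(cells):
    """Roll up by dimension reduction: aggregate the time dimension away."""
    out = defaultdict(int)
    for (city, _quarter), units in cells.items():
        out[city] += units
    return dict(out)


def drill_down(cells, country, quarter):
    """Drill down: return the city-level cells behind one country-level summary."""
    return {
        key: units
        for key, units in cells.items()
        if CITY_TO_COUNTRY[key[0]] == country and key[1] == quarter
    }


by_country = roll_up_location(base)
by_city = roll_up_drop_time(base)
india_q1 = drill_down(base, "India", "Q1")
print(by_country, by_city, india_q1)
```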
Slice and Dice

• Slice and dice: project and select
- Slice selects a single value on one dimension of a given cube and provides a new sub-cube. Here, slice is performed on the dimension "time" using the criterion time = "Q1".
- Dice selects two or more dimensions from a given cube and provides a new sub-cube.
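
A minimal sketch of both operations on a cube keyed by (product, quarter, city). The cell values and the function names `slice_cube` and `dice_cube` are hypothetical.

```python
# Hypothetical cube: (product, quarter, city) -> units sold.
cube = {
    ("TV", "Q1", "Delhi"): 10,
    ("TV", "Q2", "Delhi"): 3,
    ("Phone", "Q1", "Mumbai"): 6,
    ("Phone", "Q1", "Delhi"): 2,
    ("TV", "Q1", "Toronto"): 7,
}


def slice_cube(cells, quarter):
    """Slice: fix the time dimension to one value and drop it from the key."""
    return {(p, c): u for (p, q, c), u in cells.items() if q == quarter}


def dice_cube(cells, products, quarters, cities):
    """Dice: keep the cells whose value on every dimension is in the selection."""
    return {
        key: u
        for key, u in cells.items()
        if key[0] in products and key[1] in quarters and key[2] in cities
    }


q1_slice = slice_cube(cube, "Q1")
tv_delhi = dice_cube(cube, {"TV"}, {"Q1", "Q2"}, {"Delhi"})
print(q1_slice)
print(tv_delhi)
```

The slice has one dimension fewer than the original cube, while the dice keeps all three dimensions but restricts each to a subset of its values.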
Slice

Figure 3.7.2.c: Slice
Dice

Figure 3.7.2.d: Dice
Pivot (Rotate)

• Reorient the cube for visualization, e.g., turn a 3D cube into a series of 2D planes

Figure 3.7.2.e: Pivot
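
For a single 2D plane, a pivot amounts to swapping the row and column axes. The sketch below assumes a hypothetical table stored as nested dictionaries (rows are products, columns are cities).

```python
# Hypothetical 2D view of the cube: product (rows) x city (columns).
table = {
    "TV": {"Delhi": 10, "Mumbai": 4},
    "Phone": {"Delhi": 2, "Mumbai": 6},
}


def pivot(view):
    """Rotate the view so that the former columns become the rows."""
    rotated = {}
    for row_key, row in view.items():
        for col_key, value in row.items():
            rotated.setdefault(col_key, {})[row_key] = value
    return rotated


rotated = pivot(table)
print(rotated)
```

Pivoting twice returns the original view, since no values are aggregated or dropped.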
OLAP AND OLTP



www.paruluniversity.ac.in
