Data Mining and Warehousing (203105431) : Sandeep Jangir, Assistant Professor
Department of Computer Science & Engineering
The Course Outline
Chapter 1 : Introduction to data mining (DM):
Figure 3.1: Representation of Data Warehouse
Subject-Oriented
• The time horizon for the data warehouse is significantly longer than that
of operational systems
- Operational database: current value data
- Data warehouse data: provide information from a historical perspective
(e.g., past 5-10 years)
• Every key structure in the data warehouse
- Contains an element of time, explicitly or implicitly
- But the key of operational data may or may not contain “time element”
• Note: There are more and more systems which perform OLAP analysis
directly on relational databases
Figure 3.4: ETL Process
ETL Process (Contd...)
• Data extraction
- Get data from multiple, heterogeneous, and external sources
• Data cleaning
- Detect errors in the data and rectify them when possible
• Data transformation
- Convert data from legacy or host format to warehouse format
• Load
- Sort, summarize, consolidate, compute views, check integrity, and build
indices and partitions
• Refresh
- Propagate the updates from the data sources to the warehouse
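The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the source records, the field names (`cust`, `amount`, `date`), and the function names are all hypothetical, and plain Python lists stand in for the source systems and the warehouse.

```python
from collections import defaultdict

# Hypothetical source records in two different "legacy" formats.
SOURCE_A = [{"cust": "alice", "amount": "120.50", "date": "2023-01-05"},
            {"cust": "bob", "amount": "n/a", "date": "2023-01-06"}]
SOURCE_B = [("CAROL", 80, "2023-01-05")]


def extract():
    """Extraction: gather rows from multiple heterogeneous sources."""
    rows = list(SOURCE_A)
    rows += [{"cust": c, "amount": a, "date": d} for c, a, d in SOURCE_B]
    return rows


def clean(rows):
    """Cleaning: drop rows whose amount cannot be parsed as a number."""
    good = []
    for row in rows:
        try:
            float(row["amount"])
        except (TypeError, ValueError):
            continue
        good.append(row)
    return good


def transform(rows):
    """Transformation: convert every row to one warehouse format."""
    return [{"customer": r["cust"].title(),
             "amount": float(r["amount"]),
             "date": r["date"]} for r in rows]


def load(rows):
    """Load: sort the rows and build a summary (total amount per date)."""
    facts = sorted(rows, key=lambda r: (r["date"], r["customer"]))
    summary = defaultdict(float)
    for r in facts:
        summary[r["date"]] += r["amount"]
    return facts, dict(summary)


facts, daily_totals = load(transform(clean(extract())))
print(daily_totals)
```

In this sketch a refresh would simply rerun the same steps on newly arrived source rows; a real warehouse would usually propagate only the changes.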
Multi Dimensional Model
Figure 3.7: Multidimensional Model
Conceptual Modelling in Data Warehouses
- Star schema
- Snowflake Schema
- Fact constellations Schema
Figure: Star Schema
Snowflake Schema
• A refinement of star schema where some dimensional hierarchy is
normalized into a set of smaller dimension tables, forming a shape
similar to snowflake
Figure: Snowflake Schema
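To make the normalization concrete, the sketch below builds the same item dimension twice with Python's built-in `sqlite3` module: once as a single star-style table and once as a snowflake-style pair of tables. The table and column names (`item_star`, `item_snow`, `supplier`) and the sample rows are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Star-style dimension: supplier attributes repeated on every item row.
CREATE TABLE item_star (
    item_key INTEGER PRIMARY KEY,
    item_name TEXT,
    supplier_name TEXT,
    supplier_type TEXT
);
-- Snowflake-style: the supplier level is normalized into its own table.
CREATE TABLE supplier (
    supplier_key INTEGER PRIMARY KEY,
    supplier_name TEXT,
    supplier_type TEXT
);
CREATE TABLE item_snow (
    item_key INTEGER PRIMARY KEY,
    item_name TEXT,
    supplier_key INTEGER REFERENCES supplier(supplier_key)
);
INSERT INTO item_star VALUES (1, 'Laptop', 'Acme', 'wholesale'),
                             (2, 'Mouse', 'Acme', 'wholesale');
INSERT INTO supplier VALUES (10, 'Acme', 'wholesale');
INSERT INTO item_snow VALUES (1, 'Laptop', 10), (2, 'Mouse', 10);
""")

# The star table answers the question directly.
star_rows = conn.execute(
    "SELECT item_name, supplier_type FROM item_star ORDER BY item_key"
).fetchall()

# The snowflake tables need one extra join to reach the supplier level.
snow_rows = conn.execute(
    "SELECT i.item_name, s.supplier_type "
    "FROM item_snow AS i JOIN supplier AS s USING (supplier_key) "
    "ORDER BY i.item_key"
).fetchall()

print(star_rows == snow_rows)
```

Both queries return the same rows. The snowflake form stores each supplier once, at the cost of an additional join at query time.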
Fact Constellations Schema
Figure: Snowflake Schema
Example of Fact Constellations Schema
Figure: Fact Constellations Schema
Data Warehouse Model
• Enterprise warehouse
- Collects all of the information about subjects spanning the entire
organization
• Data Mart
- A subset of corporate-wide data that is of value to specific groups of users.
Its scope is confined to specific, selected groups, such as a marketing data mart
- Independent vs. dependent (directly from warehouse) data mart
Data Warehouse Model (Contd.....)
• Virtual warehouse
- A set of views over operational databases
- Only some of the possible summary views may be materialized
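A virtual warehouse can be imitated with an SQL view over an operational table, as in the sketch below. The `orders` table, the `sales_by_region` view, and the sample rows are hypothetical. SQLite has no native materialized views, so materialization is emulated here by copying the view's result into an ordinary table.

```python
import sqlite3

QUERY = "SELECT region, total FROM {} ORDER BY region"

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical operational table.
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL);
INSERT INTO orders VALUES (1, 'north', 100.0),
                          (2, 'north', 50.0),
                          (3, 'south', 70.0);
-- Summary view: computed from the operational data on every query.
CREATE VIEW sales_by_region AS
    SELECT region, SUM(amount) AS total FROM orders GROUP BY region;
-- Emulated materialization: store the view's current result as a table.
CREATE TABLE sales_by_region_mat AS SELECT * FROM sales_by_region;
""")

# A new operational row arrives after the summary was materialized.
conn.execute("INSERT INTO orders VALUES (4, 'south', 30.0)")

live = conn.execute(QUERY.format("sales_by_region")).fetchall()
stored = conn.execute(QUERY.format("sales_by_region_mat")).fetchall()
print(live)
print(stored)
```

The view reflects the new order immediately, while the stored copy stays at its old totals until it is refreshed. This is the trade-off behind materializing only some of the summary views.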
Concept Hierarchies
• Reduce the data size by collecting low-level concepts (such as 43 for age)
and replacing them with high-level concepts (categorical values such as
middle age or senior).
• A histogram partitions the values of an attribute X into disjoint ranges
called buckets.
• Clustering:
- Grouping the similar data together.
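The age example above can be sketched as follows. The bucket boundaries and the labels other than "middle age" and "senior" are assumptions chosen for illustration; the source does not specify them.

```python
from collections import Counter

# Hypothetical disjoint ranges (buckets) for the attribute "age".
AGE_BUCKETS = [(0, 29, "young"), (30, 59, "middle age"), (60, 200, "senior")]


def generalize_age(age):
    """Replace a low-level value with its higher-level concept."""
    for low, high, label in AGE_BUCKETS:
        if low <= age <= high:
            return label
    raise ValueError(f"age out of range: {age}")


ages = [43, 25, 67, 38, 71, 19]
labels = [generalize_age(a) for a in ages]

# Histogram over the buckets: how many values fall into each range.
histogram = Counter(labels)
print(labels)
print(histogram)
```

After generalization the attribute takes only three distinct values instead of one per age, which is where the reduction in data size comes from.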
OLAP Server (Contd.....)
• Relational OLAP (ROLAP)
- Use relational or extended-relational DBMS to store and manage
warehouse data and OLAP middleware
- Include optimization of DBMS backend, implementation of aggregation
navigation logic, and additional tools and services
- Greater scalability
OLAP Server (Contd.....)
• Multidimensional OLAP (MOLAP)
- Sparse array-based multidimensional storage engine
- Fast indexing to pre-computed summarized data
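The two MOLAP points can be illustrated with a toy cube. This is only a sketch of the idea, not how any particular MOLAP engine works: a dictionary keyed by cell coordinates stands in for the sparse array, and the `"*"` marker for "all values of this dimension" is a convention assumed here.

```python
from itertools import product

ALL = "*"  # assumed marker meaning "all values of this dimension"

# Sparse storage: only non-empty cells are kept, keyed by (item, region).
cells = {("pen", "north"): 4, ("pen", "south"): 6, ("ink", "north"): 3}


def precompute(cells):
    """Build every summary a cell contributes to, once, ahead of queries."""
    summary = {}
    for (item, region), value in cells.items():
        # Each cell feeds 4 aggregates: itself, per item, per region, total.
        for key in product((item, ALL), (region, ALL)):
            summary[key] = summary.get(key, 0) + value
    return summary


cube = precompute(cells)

# Queries are now direct lookups instead of scans over the base data.
print(cube[("pen", ALL)])
print(cube[(ALL, "north")])
print(cube[(ALL, ALL)])
print(cube.get(("ink", "south"), 0))
```

The empty cell ("ink", "south") takes no storage, and each summarized value is found with a single key lookup.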
• Roll up (drill-up)
- Summarize data
- By climbing up a hierarchy or by dimension reduction
Figure 3.7.2.a: Roll Up
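Both forms of roll-up can be shown on a small fact table. The rows, the city-to-country hierarchy, and the `aggregate` helper below are hypothetical and exist only for this sketch.

```python
from collections import defaultdict

# Hypothetical fact rows: (city, country, quarter, units_sold).
SALES = [
    ("Toronto", "Canada", "Q1", 10),
    ("Vancouver", "Canada", "Q1", 5),
    ("Toronto", "Canada", "Q2", 7),
    ("Chicago", "USA", "Q1", 8),
    ("New York", "USA", "Q2", 12),
]


def aggregate(rows, key):
    """Sum units_sold for each group produced by the key function."""
    totals = defaultdict(int)
    for row in rows:
        totals[key(row)] += row[3]
    return dict(totals)


# Base level: one total per (city, quarter).
by_city_quarter = aggregate(SALES, lambda r: (r[0], r[2]))

# Roll-up by climbing the location hierarchy: city -> country.
by_country_quarter = aggregate(SALES, lambda r: (r[1], r[2]))

# Roll-up by dimension reduction: drop the time dimension entirely.
by_country = aggregate(SALES, lambda r: r[1])

print(by_country_quarter)
print(by_country)
```

Drill-down is the reverse navigation: moving from `by_country` back to `by_country_quarter` or `by_city_quarter` adds detail instead of removing it.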
Drill Down
Figure 3.7.2.b: Drill Down
Slice and Dice
Figure 3.7.2.c: Slice
Dice
Figure 3.7.2.d: Dice
Pivot (Rotate)
Figure 3.7.2.e: Pivot
OLAP AND OLTP