Module-1
To
29-11-2024
Department of Information Science and Engg
1
Transform Here
Modules and High Level Topics
Detailed Syllabus – Module Wise
Module-1: Data warehousing and OLAP : Basic Concepts: Data
Warehousing: A multitier Architecture, Data warehouse models: Enterprise
warehouse, Data mart and virtual warehouse, Extraction, Transformation and
loading, Data Cube: A multidimensional data model, Stars, Snowflakes and
Fact constellations: Schemas for multidimensional Data models, Dimensions:
The role of concept Hierarchies, Measures: Their Categorization and
computation, Typical OLAP Operations
Module-3: Association Analysis: Problem Definition, Frequent Itemset
Generation, Rule Generation, Alternative Methods for Generating Frequent
Itemsets, FP-Growth Algorithm, Evaluation of Association Patterns.
Course Outcomes
Text Books:
1. Pang-Ning Tan, Michael Steinbach, Vipin Kumar: Introduction to Data Mining,
Pearson, First Impression, 2014.
2. Jiawei Han, Micheline Kamber, Jian Pei: Data Mining: Concepts and Techniques, 3rd
Edition, Morgan Kaufmann Publishers, 2012.
Reference Books:
1. Sam Anahory, Dennis Murray: Data Warehousing in the Real World, Pearson, Tenth
Impression, 2012.
2. Michael J. Berry, Gordon S. Linoff: Mastering Data Mining, Wiley, Second
Edition, 2012.
We will deep dive into DWH & DM
Module-1: Data Warehousing & Modelling: Basic Concepts: Data Warehousing:
A multitier Architecture.
Basic Definitions
Data: Raw facts that can be recorded or acquired and that have an implicit
meaning. Examples: age, color, name.
What is a Data Warehouse?
■ Defined in many different ways, but not rigorously.
■ A decision support database that is maintained separately from
the organization’s operational database
■ Supports information processing by providing a solid platform of
consolidated, historical data for analysis.
■ “A data warehouse is a subject-oriented, integrated, time-variant, and
nonvolatile collection of data in support of management’s decision-
making process.”— W. H. Inmon
■ Data warehousing:
■ The process of constructing and using data warehouses
Data Warehouse - Subject-Oriented
■ Organized around major subjects, such as customer,
product, sales
■ Focusing on the modeling and analysis of data for decision
makers, not on daily operations or transaction processing.
■ Provide a simple and concise view around particular subject
issues by excluding data that are not useful in the decision
support process
recovery
■ Warehouse - tuned for OLAP: complex OLAP queries, multidimensional
view, consolidation
■ Different functions and different data:
■ missing data: Decision support (DS) requires historical data which
operational DBs do not typically maintain
■ data consolidation: DS requires consolidation (aggregation,
summarization) of data from heterogeneous sources
■ data quality: different sources typically use inconsistent data
representations, codes and formats which have to be reconciled
■ Note: There are more and more systems which perform OLAP analysis directly
on relational databases
Data Warehouse: A Multi-Tiered Architecture
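[Figure: multi-tiered data warehouse architecture. Operational DBs and other
sources are extracted, transformed, loaded, and refreshed into the data
warehouse and data marts, with a monitor and integrator and a metadata
repository. An OLAP server serves the front-end tools for analysis, query,
reports, and data mining.]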
a) A relational OLAP (ROLAP) model
(i.e., an extended relational DBMS that maps operations on
multidimensional data to standard relational operations); or
entire organization
■ Data Mart
■ a subset of corporate-wide data that is of value to a specific
materialized
■ A virtual warehouse is easy to build but requires
excess capacity on operational database servers
“What are the pros and cons of the top-down and bottom-up
approaches to data warehouse development?”
■ The top-down development of an enterprise warehouse
serves as a systematic solution and minimizes integration
problems.
■ However, it is expensive, takes a long time to develop,
and lacks flexibility due to the difficulty in achieving
consistency and consensus for a common data model for the
entire organization.
sources
■ Data cleaning
■ detect errors in the data and rectify them when possible
■ Data transformation
■ convert data from legacy or host format to warehouse
format
■ Load
■ sort, summarize, consolidate, compute views, check integrity, and
build indices and partitions
■ Refresh
■ propagate the updates from the data sources to the warehouse
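The steps listed above can be sketched as a small pipeline. This is only an illustrative sketch: the two sources (SOURCE_A, SOURCE_B), their records, and the function names are hypothetical, and cleaning is simplified to dropping rows with a missing amount rather than rectifying them.

```python
from collections import defaultdict

# Hypothetical source records: two sources with inconsistent formats.
SOURCE_A = [
    {"item": "TV", "city": "Pune", "amount": "1200.50"},
    {"item": "Phone", "city": "pune ", "amount": "800"},
    {"item": "TV", "city": "Delhi", "amount": None},  # missing value
]
SOURCE_B = [
    {"product": "tv", "location": "DELHI", "sales": 300.0},
]

def extract():
    """Gather rows from both sources into one common (still dirty) shape."""
    rows = [(r["item"], r["city"], r["amount"]) for r in SOURCE_A]
    rows += [(r["product"], r["location"], r["sales"]) for r in SOURCE_B]
    return rows

def clean(rows):
    """Detect errors; in this sketch rows with a missing amount are dropped."""
    return [r for r in rows if r[2] is not None]

def transform(rows):
    """Convert the source formats to one warehouse format."""
    return [(item.strip().lower(), city.strip().title(), float(amount))
            for item, city, amount in rows]

def load(rows, warehouse):
    """Sort the rows and summarize them into an (item, city) -> total view."""
    for item, city, amount in sorted(rows):
        warehouse[(item, city)] += amount
    return warehouse

def refresh(warehouse, new_rows):
    """Propagate new source rows into the already loaded warehouse."""
    return load(transform(clean(new_rows)), warehouse)

warehouse = load(transform(clean(extract())), defaultdict(float))
refresh(warehouse, [("TV", "Pune", "99.5")])

for key, total in sorted(warehouse.items()):
    print(key, total)
```

After the refresh, the summarized view holds one total per (item, city) pair, with the inconsistent spellings "TV"/"tv" and "pune "/"Pune" reconciled into single keys.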
Metadata Repository
Metadata is the data that defines warehouse objects. It stores:
■ Description of the structure of the data warehouse
■ schema, view, dimensions, hierarchies, derived data definitions,
data mart locations and contents.
■ Operational metadata
■ data lineage (history of migrated data and transformation path),
currency of data (active, archived, or purged),
monitoring information (warehouse usage statistics, error reports,
audit trails)
■ The algorithms used for summarization
■ which include measure and dimension definition algorithms, data on
■ Summary
■ For example, the 4-D cuboid in Figure 4.4 is the base cuboid for the
given time, item, location, and supplier dimensions.
Lattice of cuboids, making up a 4-D data cube for time, item, location, and supplier. Each cuboid
represents a different degree of summarization.
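As a rough illustration of the lattice, the cuboids of an n-dimensional cube can be enumerated as the subsets of its dimensions. The sketch below assumes the four dimensions named above; the function name cuboid_lattice is hypothetical.

```python
from itertools import combinations

dimensions = ("time", "item", "location", "supplier")

def cuboid_lattice(dims):
    """Return {k: [cuboids with k dimensions]} for every subset of dims."""
    return {k: list(combinations(dims, k)) for k in range(len(dims), -1, -1)}

lattice = cuboid_lattice(dimensions)
total = sum(len(level) for level in lattice.values())

for k, cuboids in lattice.items():
    print(f"{k}-D cuboids ({len(cuboids)}):", cuboids)
print("total cuboids:", total)
```

For four dimensions this gives 2^4 = 16 cuboids: one 4-D base cuboid, four 3-D, six 2-D, four 1-D, and one 0-D cuboid (the fully summarized one, often called the apex cuboid).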
Stars, Snowflakes, and Fact Constellations:
Schemas for Multidimensional Data Models
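[Figure: star schema. The central fact table holds the keys branch_key and
location_key and the measures units_sold, dollars_sold, and avg_sales. The
branch dimension table has branch_key, branch_name, and branch_type; the
location dimension table has location_key, street, city, state_or_province,
and country.]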
Snowflake Schema:
■ The snowflake schema is a variant of the star
schema model, where some dimension tables are
normalized, thereby further splitting the data into
additional tables.
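[Figure: snowflake schema. The fact table holds the keys branch_key and
location_key and the measures units_sold, dollars_sold, and avg_sales. The
branch dimension table has branch_key, branch_name, and branch_type. The
location dimension table (location_key, street, city_key) is normalized: a
separate city table holds city_key, city, state_or_province, and country.]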
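The normalization of the location dimension can be sketched with an in-memory SQLite database. The table and column names follow the snowflake figure; the sample rows, the reduced column set of the sales table, and the query are assumptions made only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Normalized part of the location dimension.
CREATE TABLE city (
    city_key INTEGER PRIMARY KEY,
    city TEXT,
    state_or_province TEXT,
    country TEXT
);
-- The location dimension now refers to city instead of repeating it.
CREATE TABLE location (
    location_key INTEGER PRIMARY KEY,
    street TEXT,
    city_key INTEGER REFERENCES city(city_key)
);
-- Fact table, reduced to one key and two measures for brevity.
CREATE TABLE sales (
    location_key INTEGER REFERENCES location(location_key),
    units_sold INTEGER,
    dollars_sold REAL
);
-- Hypothetical sample rows.
INSERT INTO city VALUES (1, 'Bengaluru', 'Karnataka', 'India'),
                        (2, 'Mysuru', 'Karnataka', 'India');
INSERT INTO location VALUES (10, 'MG Road', 1),
                            (11, 'Brigade Road', 1),
                            (12, 'Sayyaji Rao Road', 2);
INSERT INTO sales VALUES (10, 5, 500.0), (11, 3, 300.0), (12, 2, 150.0);
""")

# Summarizing by city needs one more join than in a star schema,
# where city would be a column of the location table itself.
rows = conn.execute("""
SELECT c.city, SUM(s.units_sold), SUM(s.dollars_sold)
FROM sales s
JOIN location l ON s.location_key = l.location_key
JOIN city c ON l.city_key = c.city_key
GROUP BY c.city
ORDER BY c.city
""").fetchall()

for row in rows:
    print(row)
```

The trade-off visible here is the usual one: the city attributes are stored once per city rather than once per location, at the cost of an extra join at query time.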
Fact Constellation
Sophisticated applications may require multiple fact tables to share dimension
tables. This kind of schema can be viewed as a collection of stars, and hence is
called a galaxy schema or a fact constellation.
Example: this schema specifies two fact tables, sales and shipping.
The sales table definition is identical to that of the star schema.
The shipping table has five dimensions, or keys (item_key, time_key, shipper_key,
from_location, and to_location), and two measures (cost and units_shipped).
A fact constellation schema allows dimension tables to be shared between fact
tables.
For example, the dimension tables for time, item, and location are shared
between the sales and shipping fact tables.
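A minimal sketch of dimension sharing, again using an in-memory SQLite database: only the item dimension is modeled, and the fact tables are cut down to a single key plus their measures. The sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Shared dimension table (only item is shown to keep the sketch short).
CREATE TABLE item (item_key INTEGER PRIMARY KEY, item_name TEXT);

-- First fact table: sales.
CREATE TABLE sales (
    item_key INTEGER REFERENCES item(item_key),
    units_sold INTEGER,
    dollars_sold REAL
);

-- Second fact table: shipping, with the measures cost and units_shipped.
CREATE TABLE shipping (
    item_key INTEGER REFERENCES item(item_key),
    cost REAL,
    units_shipped INTEGER
);

-- Hypothetical sample rows.
INSERT INTO item VALUES (1, 'TV'), (2, 'Phone');
INSERT INTO sales VALUES (1, 4, 2000.0), (1, 1, 500.0), (2, 6, 1800.0);
INSERT INTO shipping VALUES (1, 50.0, 5), (2, 30.0, 4), (2, 15.0, 2);
""")

# Both fact tables are summarized through the same item dimension.
sold = conn.execute("""
SELECT i.item_name, SUM(s.units_sold)
FROM item i JOIN sales s ON s.item_key = i.item_key
GROUP BY i.item_name ORDER BY i.item_name
""").fetchall()

shipped = conn.execute("""
SELECT i.item_name, SUM(sh.units_shipped)
FROM item i JOIN shipping sh ON sh.item_key = i.item_key
GROUP BY i.item_name ORDER BY i.item_name
""").fetchall()

print("sold:", sold)
print("shipped:", shipped)
```

Because both fact tables reference the same item_key values, their summaries line up item by item and can be compared directly.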
Example of Fact Constellation
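[Figure: fact constellation. Dimension tables shown: time (time_key, day,
day_of_the_week, month, quarter, year) and item (item_key, item_name, brand,
type, supplier_type). The Sales Fact Table includes time_key, item_key, and
branch_key; the Shipping Fact Table includes time_key, item_key, shipper_key,
and from_location.]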
Lattice: a regular geometrical arrangement of points or objects over an
area or in space.
Specification of hierarchies
■ Schema hierarchy
URL: https://www2.cs.sfu.ca/CourseCentral/459/han/tutorial/tutorial.html
■ Algebraic, and
■ Holistic
■ Other operations
■ Drill Across: involving (across) more than one fact table
■ Drill Through: through the bottom level of the cube to its back-end
relational tables (using SQL)
[Figure: data warehouse development, showing data marts, distributed data
marts, the enterprise data warehouse, and the multi-tier data warehouse.]
warehouses
■ ODBC, OLEDB, Web accessing, service facilities,
reporting and OLAP tools
■ OLAP-based exploratory data analysis
Conclusion
We have studied the following concepts in today's class:
1. Topics of Module-1
2. Learning Objectives
3. Basic Definitions of database approaches
4. Database system environment
5. Main Characteristics of the Database Approach
6. Advantages of using the DBMS Approach
7. Historical Development of Database Technology
8. Database Languages and Architectures
9. Schemas versus Instances
10. Reflections
Contact Details:
Dr.Manjunath T N
Professor and Dean – ER
Department of Information Science and Engg
BMS Institute of Technology and Management
Mobile: +91-9900130748
E-Mail: [email protected] / [email protected]