Data - Mining - Warehousing Unit I

The document outlines the concepts of Data Warehousing and Online Analytical Processing (OLAP), detailing the differences between operational database systems and data warehouses. It explains data mining techniques, examples of their applications, and various data warehouse schemas like Star, Snowflake, and Fact Constellation. Additionally, it covers OLAP operations, the role of concept hierarchies, measures, and indexing methods to enhance data retrieval efficiency.

Uploaded by

aarsha.br

23AD406 Data Mining & Data Warehousing

Unit - I
SYLLABUS - UNIT I
DATA WAREHOUSING AND ONLINE ANALYTICAL
PROCESSING

Data Warehouse, Operational Database Systems versus Data
Warehouses, A Multitiered Architecture, A Multidimensional
Data Model, Stars, Snowflakes and Fact Constellations:
Schemas, Role of Concept Hierarchies, Measures, OLAP
Operations, From Online Analytical Processing to
Multidimensional Data Mining, Indexing OLAP Data.
What is Data Mining?

• Data Mining is the process of discovering patterns, trends,
and useful insights from large datasets.
• It uses techniques from statistics, machine learning,
and artificial intelligence.
• It helps businesses make data-driven decisions.
Examples of Data Mining

• Banking: Detects unusual transactions to
prevent fraud.
• E-commerce (Amazon, Flipkart): Suggests products based
on previous purchases.
• Marketing & Advertising: Segments customers based on
buying behavior.
• Healthcare: Identifies disease patterns from patient records.
What is a Data Warehouse?

• A Data Warehouse is a centralized storage system that
collects and stores structured data from different sources.
• It supports decision-making and business analytics.
• Its data is organized for fast querying.
Examples of Data Warehousing

• Retail: Analyzing customer purchases over several years (for
example, to predict stock demand from past sales).
• Stock Market: Tracking trends for investment decisions.
• Healthcare: Analyzing patient records for research.
• Government: Storing census data for policy planning.
Operational Database Systems versus
Data Warehouses

• An Operational Database System, also called an OLTP (Online
Transaction Processing) system, is a database designed to manage
real-time transactions efficiently.
• It supports day-to-day business operations like banking
transactions, online shopping, and ticket bookings.
What is OLTP?

• Online Transaction Processing (OLTP) systems handle
real-time data.
• They process short, fast, and frequent transactions in a
database system.
• They perform CRUD operations (Create, Read, Update, Delete).
Examples of OLTP

• Banking systems – Processing withdrawals, deposits, fund
transfers
• E-commerce websites – Managing product purchases, cart
updates
• Hospital management systems – Booking patient
appointments
• Airline reservation systems – Booking or canceling tickets
What is OLAP?

• Online Analytical Processing (OLAP) is designed for
analyzing large amounts of historical data to support
decision-making and business intelligence.
• It processes complex queries on large datasets.
• It uses a multidimensional data model (fact and dimension tables).
OLTP vs. OLAP (Comparison)

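• OLTP: Fast, real-time transactions (CRUD operations).
• OLAP: Read-heavy, complex queries for business analysis.
• OLTP: Normalized databases, which minimize redundancy and keep
updates consistent.
• OLAP: Denormalized databases, which accept some redundancy to
reduce joins and speed up reads.
[Figures: normalized database example; denormalized database example]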
Real-Life Examples of OLTP & OLAP

• OLTP: Bank withdrawals, ATM transactions, e-commerce
orders.
• OLAP: Business sales analysis, trend forecasting, customer
insights.
A Multi-Tiered Architecture in Data
Warehousing
What is a Multi-Tiered Architecture?
• A multi-tiered architecture in a data warehouse is a layered design
that organizes data storage and processing into different levels.
• It helps in efficient data management, security, and scalability.
Three-Tier Architecture of a Data Warehouse:
A typical Data Warehouse is divided into three main tiers:
1. Bottom Tier
2. Middle Tier
3. Top Tier
1. Bottom Tier – Data Sources and ETL Layer
Data Sources: The data source tier consists of multiple operational
systems (OLTP systems, flat files, external databases, etc.) from which
data is pulled. The data here may be structured (transactional data),
semi-structured, or unstructured (raw data logs, social media feeds).
ETL (Extract, Transform, Load):
Extract: Data is extracted from multiple heterogeneous sources,
including databases, files, APIs, etc.
Transform: This phase involves cleaning, filtering, and transforming
the data to fit into a unified structure. This includes data
standardization, removing duplicates, and applying business rules.
Load: After the transformation, the clean data is loaded into the data
warehouse (usually in a structured form like star schema or snowflake
schema).
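The three ETL phases above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the sample records, the `sales` table, and the rule of skipping rows with non-numeric amounts are all assumptions made for the example.

```python
import sqlite3

# Hypothetical raw records, as they might arrive from different source systems.
RAW_ROWS = [
    {"customer": " Alice ", "amount": "120.50", "date": "2024-01-15"},
    {"customer": "BOB", "amount": "80", "date": "2024-01-16"},
    {"customer": " Alice ", "amount": "120.50", "date": "2024-01-15"},  # duplicate
    {"customer": "Carol", "amount": "n/a", "date": "2024-01-17"},       # invalid amount
]


def extract():
    """Extract: a real pipeline would read from databases, files, or APIs."""
    return list(RAW_ROWS)


def transform(rows):
    """Transform: standardize names, drop invalid amounts, remove duplicates."""
    seen = set()
    clean = []
    for row in rows:
        name = row["customer"].strip().title()
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # assumed business rule: skip rows with non-numeric amounts
        record = (name, amount, row["date"])
        if record in seen:
            continue
        seen.add(record)
        clean.append(record)
    return clean


def load(rows, conn):
    """Load: write the cleaned rows into a warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL, sale_date TEXT)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract()), conn)
loaded = conn.execute(
    "SELECT customer, amount, sale_date FROM sales ORDER BY sale_date"
).fetchall()
print(loaded)  # [('Alice', 120.5, '2024-01-15'), ('Bob', 80.0, '2024-01-16')]
```

Keeping each phase in its own function mirrors the Extract, Transform, Load split and makes it easy to swap a source or a cleaning rule without touching the others.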
Star Schema:
The Star Schema is the simplest type of data warehouse schema.
It consists of a central fact table connected to multiple dimension
tables, forming a star-like shape.
🔹 Structure
Fact Table: Contains numeric data (measures) related to business
operations, such as sales revenue, number of units sold, etc.
Dimension Tables: Contain descriptive data (attributes), such as
time, product details, customer details, etc.
Denormalized Data: The dimension tables are not split into sub-
tables; they contain redundant data to improve query
performance.
[Figure: Star Schema]
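As a rough sketch of the structure described above, the following builds a tiny star schema in an in-memory SQLite database. The table names (`fact_sales`, `dim_product`, `dim_time`), columns, and rows are hypothetical and chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptive attributes, kept flat (denormalized).
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_time (time_id INTEGER PRIMARY KEY, year INTEGER, month TEXT);

-- Fact table: numeric measures plus one foreign key per dimension.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    time_id INTEGER REFERENCES dim_time(time_id),
    units_sold INTEGER,
    revenue REAL
);

INSERT INTO dim_product VALUES (1, 'Galaxy S23', 'Electronics'), (2, 'Notebook', 'Stationery');
INSERT INTO dim_time VALUES (1, 2024, 'January'), (2, 2024, 'February');
INSERT INTO fact_sales VALUES (1, 1, 2, 1600.0), (1, 2, 1, 800.0), (2, 1, 10, 50.0);
""")

# One join per dimension is enough to analyze a measure by a descriptive attribute.
revenue_by_category = conn.execute("""
    SELECT p.category, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
print(revenue_by_category)  # [('Electronics', 2400.0), ('Stationery', 50.0)]
```

The category name is stored directly in `dim_product`, so it is repeated for every product in that category. That redundancy is the price of needing only a single join.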
Snowflake Schema:
The Snowflake Schema is an extension of the Star Schema,
where dimension tables are normalized into multiple related
tables.
🔹 Structure
Fact Table: Same as in Star Schema.
Normalized Dimension Tables: Each dimension table is split
into smaller related tables. This reduces redundancy but increases
complexity.
[Figure: Snowflake Schema]
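To make the normalization concrete, here is a small sketch in which a product dimension is split into a product table and a separate category table. All table names and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The category attribute is moved out of the product dimension into its own table.
CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE dim_product (
    product_id INTEGER PRIMARY KEY,
    name TEXT,
    category_id INTEGER REFERENCES dim_category(category_id)
);
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    revenue REAL
);

INSERT INTO dim_category VALUES (1, 'Electronics'), (2, 'Stationery');
INSERT INTO dim_product VALUES (1, 'Phone', 1), (2, 'Laptop', 1), (3, 'Notebook', 2);
INSERT INTO fact_sales VALUES (1, 800.0), (2, 1200.0), (3, 5.0), (3, 7.5);
""")

# Reaching the category name now takes two joins instead of one.
snowflake_revenue = conn.execute("""
    SELECT c.category_name, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_category AS c ON c.category_id = p.category_id
    GROUP BY c.category_name
    ORDER BY c.category_name
""").fetchall()
print(snowflake_revenue)  # [('Electronics', 2000.0), ('Stationery', 12.5)]
```

Each category name is stored once, which reduces redundancy, but every query that needs it pays for the extra join.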
Fact Constellation Schema (Galaxy Schema)
A Fact Constellation Schema, also known as a Galaxy Schema, is
a data warehouse schema that consists of multiple fact tables
sharing common dimension tables.
• It is more complex than Star and Snowflake schemas.
• It is used when multiple business processes are analyzed in a
single data warehouse.
• Since it consists of multiple fact tables, it allows
multidimensional analysis across different subjects.
• A Fact Constellation Schema looks like a collection of Star
Schemas where different fact tables share the same dimension
tables.
[Figure: Example of a Fact Constellation Schema]
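A minimal sketch of the idea, assuming two hypothetical business processes (sales and shipping) that share one product dimension:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One shared dimension table.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);

-- Two fact tables, one per business process, both referencing the same dimension.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    units_sold INTEGER
);
CREATE TABLE fact_shipping (
    product_id INTEGER REFERENCES dim_product(product_id),
    units_shipped INTEGER
);

INSERT INTO dim_product VALUES (1, 'Phone'), (2, 'Laptop');
INSERT INTO fact_sales VALUES (1, 5), (1, 3), (2, 4);
INSERT INTO fact_shipping VALUES (1, 6), (2, 4);
""")

# Each fact table is aggregated separately and lined up on the shared dimension.
# Joining both fact tables directly would multiply rows and inflate the sums.
sold_vs_shipped = conn.execute("""
    SELECT p.name,
           (SELECT SUM(s.units_sold) FROM fact_sales AS s
             WHERE s.product_id = p.product_id) AS sold,
           (SELECT SUM(h.units_shipped) FROM fact_shipping AS h
             WHERE h.product_id = p.product_id) AS shipped
    FROM dim_product AS p
    ORDER BY p.product_id
""").fetchall()
print(sold_vs_shipped)  # [('Phone', 8, 6), ('Laptop', 4, 4)]
```

The shared dimension is what allows measures from different subjects to be compared side by side.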
Star Schema vs. Snowflake Schema vs. Fact Constellation

Feature            Star Schema                 Snowflake Schema         Fact Constellation
Complexity         Simple                      Moderate                 High
Normalization      Denormalized (flat tables)  Normalized (sub-tables)  Hybrid
Query Performance  Faster (fewer joins)        Moderate (more joins)    Moderate (many more joins)
Storage Required   High                        Less                     High
2. Middle Tier – Data Warehouse and OLAP Servers
• Data Warehouse: This is the core storage area where all the
cleaned and transformed data resides. It is usually structured
for fast querying and analysis, optimized for reporting and
decision-making.
• OLAP Server: The OLAP server allows for complex
multidimensional analysis of the data stored in the
warehouse.
• It supports operations like drill-down (viewing data at more
granular levels), roll-up (aggregating data at higher levels),
and slice-and-dice (looking at the data from different
perspectives).
• The OLAP server is often designed to be query-efficient,
meaning it stores pre-aggregated data to speed up
common analytical queries.
3. Top Tier – Front-End Tools and Applications
BI Tools: This tier is where business users interact with the data. The
data is visualized through dashboards, reports, and ad-hoc queries.
Common front-end tools include:
• Tableau: Used for visualizing data in an interactive dashboard
format.
• Power BI: A Microsoft tool that integrates easily with other
Microsoft products and provides interactive reports.
• QlikView: A tool that allows users to analyze data in different
visual formats.
• Data Exploration and Reporting: The front-end tier also allows
for data exploration, enabling users to query the data warehouse and
analyze different dimensions such as time, location, product, etc.
• Users: This layer is used by a variety of users including data
analysts, business analysts, and management for decision-making.
They may perform tasks like reporting on sales data, forecasting
future trends, or analyzing customer behavior.
Role of Concept Hierarchies in Data Warehousing
A concept hierarchy organizes data into multiple levels of abstraction, making it
easier to analyze and interpret. It supports OLAP operations like drill-down,
roll-up, and slice-and-dice.
Examples of Concept Hierarchies:
• Geographical Hierarchy:
Country → State → City → Locality
Example: India → Tamil Nadu → Chennai → Anna Nagar
• Time Hierarchy:
Year → Quarter → Month → Week → Day
Example: 2024 → Q1 → January → Week 3 → 15th Jan
• Product Hierarchy:
Category → Subcategory → Brand → Model
Example: Electronics → Mobile → Samsung → Galaxy S23
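A concept hierarchy can be represented as a simple child-to-parent mapping, which is enough to roll a measure up by one or more levels. The sketch below assumes a small geographical hierarchy and made-up sales figures; the names `PARENT`, `ancestor`, and `roll_up` are illustrative, not part of any library.

```python
from collections import defaultdict

# Hypothetical hierarchy: each member maps to its parent one level up
# (City -> State -> Country).
PARENT = {
    "Chennai": "Tamil Nadu",
    "Coimbatore": "Tamil Nadu",
    "Kochi": "Kerala",
    "Tamil Nadu": "India",
    "Kerala": "India",
}


def ancestor(member, levels_up):
    """Climb the hierarchy by the given number of levels."""
    for _ in range(levels_up):
        member = PARENT[member]
    return member


def roll_up(sales, levels_up):
    """Sum a measure at a higher level of the hierarchy."""
    totals = defaultdict(int)
    for member, amount in sales.items():
        totals[ancestor(member, levels_up)] += amount
    return dict(totals)


city_sales = {"Chennai": 100, "Coimbatore": 40, "Kochi": 60}
state_sales = roll_up(city_sales, 1)
country_sales = roll_up(city_sales, 2)
print(state_sales)    # {'Tamil Nadu': 140, 'Kerala': 60}
print(country_sales)  # {'India': 200}
```

Drill-down is the reverse direction: starting from the country total and returning to the state or city figures that produced it.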
Types of Concept Hierarchies:
Explicit Concept Hierarchy:
• Predefined manually by users or database designers.
• Stored as part of the database schema.
• Clearly structured with defined levels.
Implicit Concept Hierarchy:
• Derived automatically from existing data.
• Not predefined in the schema.
• Hierarchy is determined based on relationships within the
dataset.
Measures
In Data Warehousing and OLAP systems, measures are
the quantitative values or metrics that are analyzed
across different dimensions. Measures are what users are
typically interested in when querying data, as they
represent the actual numeric values that need to be
aggregated, analyzed, or summarized.
Characteristics of Measures:
• Quantitative Data: Measures are numeric values like
sales, profit, quantity, revenue, etc.
• Aggregated Data: In OLAP cubes, measures are usually
aggregated across different dimensions. For example,
summing up sales for a specific product or calculating the
average revenue for a particular time period.
• Aggregate Functions: Measures are often subjected to
mathematical operations such as sum, average, count,
max, and min.
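The aggregate functions named above can be illustrated on a small, made-up list of sales rows grouped by one dimension (product). The data and the `aggregate` helper are assumptions for this example.

```python
from statistics import mean

# Hypothetical rows: (product, sale amount). The amount is the measure.
SALES = [
    ("Laptop", 1200.0),
    ("Laptop", 1000.0),
    ("Phone", 800.0),
    ("Phone", 600.0),
    ("Phone", 700.0),
]


def aggregate(rows):
    """Group the measure by product and apply the common aggregate functions."""
    grouped = {}
    for product, amount in rows:
        grouped.setdefault(product, []).append(amount)
    return {
        product: {
            "sum": sum(values),
            "avg": mean(values),
            "count": len(values),
            "max": max(values),
            "min": min(values),
        }
        for product, values in grouped.items()
    }


summary = aggregate(SALES)
print(summary["Laptop"])
# {'sum': 2200.0, 'avg': 1100.0, 'count': 2, 'max': 1200.0, 'min': 1000.0}
```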
Common Operations on Measures:
1. Aggregation: Measures are often aggregated at
different levels of the dimension. For example, if we
are analyzing sales across regions and time periods,
the system might aggregate sales by region or month.
Example: The total sales for a specific region in a
year (sum of sales across months).
2. Roll-up (Aggregation): Measures are aggregated
from a lower level to a higher level (for example, from
months to a year).
3. Drill-down: Break down measures into lower levels
of detail.
4. Slice: Selects a subset of the data by fixing one
dimension to a single value, so the measure can be
examined for just that value.
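The four operations on measures listed above can be tried on a small in-memory set of fact rows. This is only a sketch: the rows, the column layout, and the `group_sum` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical fact rows: (region, year, month, sales)
FACTS = [
    ("North", 2023, "Jan", 10),
    ("North", 2023, "Feb", 15),
    ("North", 2024, "Jan", 20),
    ("South", 2024, "Jan", 5),
    ("South", 2024, "Feb", 30),
]
COLUMNS = ("region", "year", "month", "sales")


def group_sum(rows, *dims):
    """Sum the sales measure over the named dimensions."""
    positions = [COLUMNS.index(d) for d in dims]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[p] for p in positions)] += row[-1]
    return dict(totals)


# Aggregation: total sales per region and year.
by_region_year = group_sum(FACTS, "region", "year")

# Roll-up: move from the month level to the year level.
yearly = group_sum(FACTS, "year")

# Drill-down: return to the more detailed month level.
monthly = group_sum(FACTS, "year", "month")

# Slice: fix the year dimension to a single value, then analyze the rest.
slice_2024 = [row for row in FACTS if row[1] == 2024]
sales_2024_by_region = group_sum(slice_2024, "region")

print(yearly)                # {(2023,): 25, (2024,): 55}
print(sales_2024_by_region)  # {('North',): 20, ('South',): 35}
```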
Common OLAP Operations:
1. Roll-Up
Example: From monthly sales to yearly sales.
2. Drill-Down
Example: From yearly sales to monthly sales or daily
sales.
3. Slice
Example: Showing all data for a single year (one
dimension fixed to one value).
4. Dice
Example: Extracting data for selected regions and product
types within a particular year (conditions on two or more
dimensions).
5. Pivot
Example: Changing the view from "Region by Time" to
"Product by Region."
From Online Analytical Processing (OLAP) to
Multidimensional Data Mining

OLAP is primarily used for analyzing structured data stored in
data warehouses. It allows for fast retrieval and interactive
analysis of multidimensional data using operations like drill-
down, roll-up, slicing, and dicing. However, OLAP is limited to
summarization and does not provide deep insights beyond
aggregation.

Aggregation is the process of summarizing data by grouping
and calculating values like totals, averages, or counts.
Multidimensional Data Mining (MDM)

Multidimensional Data Mining applies machine learning and
statistical techniques to analyze OLAP data beyond
aggregation. It helps in:
✔ Finding hidden patterns in large datasets.
✔ Predicting future trends (e.g., sales forecasting).
✔ Detecting anomalies (e.g., fraud detection).

Key Techniques in Multidimensional Data Mining:

1. Classification: Assigns data to predefined categories.
Example: Categorizing customers as ‘loyal’ or ‘high-risk’
based on purchase behavior.
2. Clustering: Groups similar data points together.
Example: Identifying customer segments based on spending
patterns.
3. Association Rule Mining: Identifies relationships between
variables.
Example: "People who buy laptops also buy laptop bags."
4. Anomaly Detection: Identifies unusual patterns.
Example: Detecting fraudulent transactions in a bank.
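As one concrete instance of these techniques, the association rule "laptop → laptop bag" can be scored with the two standard measures, support and confidence. The transactions below are invented for illustration, and the helper names are hypothetical.

```python
# Hypothetical shopping baskets, one set of items per transaction.
TRANSACTIONS = [
    {"laptop", "laptop bag", "mouse"},
    {"laptop", "laptop bag"},
    {"laptop", "mouse"},
    {"phone", "charger"},
    {"laptop", "laptop bag", "charger"},
]


def count(itemset):
    """Number of transactions that contain every item in the itemset."""
    return sum(1 for basket in TRANSACTIONS if itemset <= basket)


def support(itemset):
    """Fraction of all transactions that contain the itemset."""
    return count(itemset) / len(TRANSACTIONS)


def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    return count(antecedent | consequent) / count(antecedent)


rule_support = support({"laptop", "laptop bag"})          # 3 of 5 baskets
rule_confidence = confidence({"laptop"}, {"laptop bag"})  # 3 of the 4 laptop baskets
print(rule_support, rule_confidence)  # 0.6 0.75
```

A rule is usually kept only if both values exceed thresholds chosen by the analyst. Those thresholds are a modeling decision and are not fixed by the method.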
Indexing OLAP Data
What is Indexing?
Indexing is a technique used in databases to speed up data
retrieval. It works like a book index, helping the system find
data quickly without scanning the entire database.
Types of Indexing:
1. Bitmap Indexing
2. B-Tree Index
3. Hash Index
4. Clustered Index
Bitmap Index
Bitmap indexing is a technique used in database
management systems (DBMS) to improve the performance of
read-mostly queries over large datasets. A bitmap index
stores, for each distinct value of a column, a bit vector with
one bit per row that records whether the row holds that value.
It works best on columns with few distinct values.
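A bitmap index can be sketched with plain Python integers used as bit vectors, where bit i is set when row i holds a given value. The rows and function names are hypothetical. Real systems also compress the bitmaps, which this sketch omits.

```python
# Hypothetical table rows; the row position doubles as the bit position.
ROWS = [
    {"region": "North", "product": "Laptop"},
    {"region": "South", "product": "Phone"},
    {"region": "North", "product": "Phone"},
    {"region": "East", "product": "Laptop"},
]


def build_bitmap_index(rows, column):
    """Map each distinct column value to an integer whose set bits mark matching rows."""
    index = {}
    for position, row in enumerate(rows):
        index[row[column]] = index.get(row[column], 0) | (1 << position)
    return index


def matching_rows(bitmap):
    """Decode a bitmap back into a list of row positions."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


region_index = build_bitmap_index(ROWS, "region")
product_index = build_bitmap_index(ROWS, "product")

# A multi-condition filter becomes a bitwise AND of two bitmaps.
north_phones = region_index["North"] & product_index["Phone"]
print(matching_rows(north_phones))  # [2]
```

Combining conditions with bitwise AND and OR, rather than scanning rows, is the reason bitmap indexes suit analytical queries that filter on several low-cardinality columns.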
B-Tree Index
A B-Tree is a specialized m-way tree designed to optimize data
access, especially on disk-based storage systems.
In a B-Tree of order m, each node can have up to m children
and m-1 keys, allowing it to efficiently manage large datasets.
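The lookup path through a B-Tree can be sketched with a small hand-built tree of order m = 3 (at most 3 children and 2 keys per node). Insertion and node splitting are deliberately left out. The `Node` class and `search` function are illustrative names, not a library API.

```python
from bisect import bisect_left


class Node:
    def __init__(self, keys, children=None):
        self.keys = keys                # sorted keys, at most m - 1 of them
        self.children = children or []  # empty for a leaf, else len(keys) + 1 subtrees


def search(node, key):
    """Return True if the key is stored in the subtree rooted at node."""
    while True:
        i = bisect_left(node.keys, key)  # first position whose key is >= the search key
        if i < len(node.keys) and node.keys[i] == key:
            return True
        if not node.children:            # reached a leaf without finding the key
            return False
        node = node.children[i]          # descend into the subtree that covers the key


# Hand-built tree of order 3: one root with two keys and three leaf children.
root = Node([20, 40], [Node([5, 10]), Node([25, 30]), Node([45, 50])])

print(search(root, 30))  # True
print(search(root, 35))  # False
```

Each step down the tree discards all but one subtree, so the number of nodes visited grows with the height of the tree rather than with the number of records.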
Hash Index
Hashing in DBMS is a technique for locating a data record quickly,
largely independent of the size of the database. For large databases
containing millions of records, an ordered index can become slow
because each lookup has to traverse the index structure, whereas a
hash function computes the bucket address directly from the key.
Static Hashing
In static hashing, the hash function always maps a given key to the
same bucket address, and the number of buckets is fixed. For example,
for a data record with employee_id = 106 and the mod-5 hash function
H(x) = x % 5, the operation is:
H(106) = 106 % 5 = 1.
This indicates that the data record should be placed in, or searched
for in, bucket 1 of the hash table.
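The mod-5 example can be written out as a small static hash index. The bucket count of 5 comes from the example above, while the employee names and the `insert` and `lookup` helpers are made up for illustration.

```python
NUM_BUCKETS = 5  # fixed bucket count, as in static hashing


def bucket_for(employee_id):
    """Hash function from the example: the key modulo the number of buckets."""
    return employee_id % NUM_BUCKETS


buckets = [[] for _ in range(NUM_BUCKETS)]


def insert(employee_id, record):
    buckets[bucket_for(employee_id)].append((employee_id, record))


def lookup(employee_id):
    """Search only the one bucket the hash function points to."""
    for key, record in buckets[bucket_for(employee_id)]:
        if key == employee_id:
            return record
    return None


insert(106, "Asha")
insert(111, "Ravi")   # 111 % 5 is also 1, so it shares bucket 1 with 106
insert(107, "Meena")

print(bucket_for(106))  # 1
print(lookup(111))      # Ravi
```

Keys that land in the same bucket are stored together and separated by a short scan, which is why the bucket count and hash function need to spread keys evenly.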
Clustered Index
A Clustered Index determines the physical order of the data in a
table. When a clustered index is created on a column, SQL
Server reorders the data in the table based on that index.
Because the data is physically stored in the order of the
clustered index, a table can only have one clustered index.
Typically, a clustered index is created on the primary key by
default.
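As a rough analogy only (this is not how SQL Server lays out its pages), keeping rows sorted by their key shows why a clustered index makes range queries cheap: the matching rows sit next to each other. The rows and the `range_scan` helper are hypothetical.

```python
from bisect import bisect_left, bisect_right

# Hypothetical rows, stored in key order the way a clustered index would keep them.
rows = sorted([(107, "Meena"), (101, "Asha"), (105, "Ravi"), (103, "Kumar")])
keys = [key for key, _ in rows]


def range_scan(low, high):
    """Return all rows with low <= key <= high as one contiguous slice."""
    return rows[bisect_left(keys, low):bisect_right(keys, high)]


print(range_scan(102, 105))  # [(103, 'Kumar'), (105, 'Ravi')]
```

Because the rows can be held in only one physical order, there can be only one such ordering per table, matching the one-clustered-index rule stated above.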
