
Data - Mining - Warehousing Unit I

The document outlines the concepts of Data Warehousing and Online Analytical Processing (OLAP), detailing the differences between operational database systems and data warehouses. It explains data mining techniques, examples of their applications, and various data warehouse schemas like Star, Snowflake, and Fact Constellation. Additionally, it covers OLAP operations, the role of concept hierarchies, measures, and indexing methods to enhance data retrieval efficiency.

Uploaded by

aarsha.br
Copyright
© All Rights Reserved

23AD406 Data Mining & Data Warehousing

Unit - I
SYLLABUS - UNIT I
DATA WAREHOUSING AND ONLINE ANALYTICAL PROCESSING

Data Warehouse, Operational Database Systems versus Data
Warehouses, A Multi-tiered Architecture, A Multidimensional
Data Model, Stars, Snowflakes and Fact Constellations:
Schemas, Role of Concept Hierarchies, Measures, OLAP
Operations, From Online Analytical Processing to
Multidimensional Data Mining, Indexing OLAP Data.
What is Data Mining?

• Data Mining is the process of discovering patterns, trends,
and useful insights from large datasets.
• It uses techniques from statistics, machine learning, and
artificial intelligence.
• It helps businesses make data-driven decisions.
Examples of Data Mining

• Banking: Detects unusual transactions to prevent fraud.
• E-commerce (Amazon, Flipkart): Suggests products based
on previous purchases.
• Marketing & Advertising: Segments customers based on
buying behavior.
• Healthcare: Identifies disease patterns from patient records.
What is a Data Warehouse?

• A Data Warehouse is a centralized storage system that
collects and stores structured data from different sources.
• It supports decision-making and business analytics.
• Its data is organized for fast querying.
Examples of Data Warehousing

• Retail: Analyzing customer purchases over several years
(for example, to predict stock demand from past sales).
• Stock Market: Tracking trends for investment decisions.
• Healthcare: Analyzing patient records for research.
• Government: Storing census data for policy planning.
Operational Database Systems versus Data Warehouses

• An Operational Database System, also called an OLTP (Online
Transaction Processing) system, is a database designed to
manage real-time transactions efficiently.
• It supports day-to-day business operations such as banking
transactions, online shopping, and ticket bookings.
What is OLTP?

• Online Transaction Processing (OLTP) systems handle
real-time data.
• They process short, fast, and frequent transactions in a
database system.
• They perform CRUD operations (Create, Read, Update, Delete).
Examples of OLTP

• Banking systems: Processing withdrawals, deposits, and fund
transfers.
• E-commerce websites: Managing product purchases and cart
updates.
• Hospital management systems: Booking patient
appointments.
• Airline reservation systems: Booking or canceling tickets.
What is OLAP?

• Online Analytical Processing (OLAP) is designed for
analyzing large amounts of historical data to support
decision-making and business intelligence.
• It processes complex queries on large datasets.
• It uses a multidimensional data model (fact and dimension tables).
OLTP vs. OLAP (Comparison)

• OLTP: Fast, real-time transactions (CRUD operations).
• OLAP: Read-heavy, complex queries for business analysis.
• OLTP: Normalized databases, which avoid redundancy and
keep updates consistent.
• OLAP: Denormalized databases, which need fewer joins for
analytical queries.
Normalized databases:

Denormalized Databases:
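The difference between the two layouts can be sketched in a few lines of Python. This is only an illustration: the tables, field names, and values below are made up for the example and do not come from the slides.

```python
# Normalized layout (typical for OLTP): each fact is stored once and
# related data is reached through keys. All data here is hypothetical.
customers = {1: {"name": "Asha", "city": "Chennai"}}
products = {10: {"name": "Laptop", "price": 50000}}
orders = [
    {"order_id": 100, "customer_id": 1, "product_id": 10, "qty": 1},
    {"order_id": 101, "customer_id": 1, "product_id": 10, "qty": 2},
]

def denormalize(orders, customers, products):
    """Join the three tables into one wide table with repeated values."""
    rows = []
    for o in orders:
        c = customers[o["customer_id"]]
        p = products[o["product_id"]]
        rows.append({
            "order_id": o["order_id"],
            "customer_name": c["name"],   # repeated on every order row
            "city": c["city"],            # repeated on every order row
            "product_name": p["name"],
            "amount": p["price"] * o["qty"],
        })
    return rows

# Denormalized layout (typical for OLAP): no lookups needed at query time.
wide_orders = denormalize(orders, customers, products)
```

In the wide table the customer and product details are duplicated on every row, which costs storage but lets an analytical query read one table without joins.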
Real-Life Examples of OLTP & OLAP

• OLTP: Bank withdrawals, ATM transactions, e-commerce
orders.
• OLAP: Business sales analysis, trend forecasting, customer
insights.
A Multi-Tiered Architecture in Data Warehousing
What is a Multi-Tiered Architecture?
• A multi-tiered architecture in a data warehouse is a layered design
that organizes data storage and processing into different levels.
• It helps in efficient data management, security, and scalability.
Three-Tier Architecture of a Data Warehouse:
A typical Data Warehouse is divided into three main tiers:
1. Bottom Tier
2. Middle Tier
3. Top Tier
1. Bottom Tier – Data Sources and ETL Layer
Data Sources: The data source tier consists of multiple operational
systems (OLTP systems, flat files, external databases, etc.) from which
data is pulled. The data may be structured (transactional records),
semi-structured (raw data logs), or unstructured (social media feeds).
ETL (Extract, Transform, Load):
Extract: Data is extracted from multiple heterogeneous sources,
including databases, files, APIs, etc.
Transform: This phase involves cleaning, filtering, and transforming
the data to fit into a unified structure. This includes data
standardization, removing duplicates, and applying business rules.
Load: After the transformation, the clean data is loaded into the data
warehouse (usually in a structured form like star schema or snowflake
schema).
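The three ETL phases can be sketched as three small Python functions. This is a minimal illustration, not a production pipeline: the source records, the "amount is required" business rule, and the `sales` table are all assumptions made up for the example, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Hypothetical raw records from two heterogeneous sources.
source_a = [{"id": 1, "name": " Asha ", "amount": "100.50"},
            {"id": 2, "name": "Ravi", "amount": "200"}]
source_b = [{"id": 2, "name": "ravi", "amount": "200"},   # duplicate of id 2
            {"id": 3, "name": "Meena", "amount": None}]   # violates the rule below

def extract():
    """Extract: pull records from every source."""
    return source_a + source_b

def transform(records):
    """Transform: apply a business rule, remove duplicates, standardize."""
    clean, seen = [], set()
    for r in records:
        if r["amount"] is None:   # assumed business rule: amount is required
            continue
        if r["id"] in seen:       # keep the first record seen for each id
            continue
        seen.add(r["id"])
        clean.append((r["id"], r["name"].strip().title(), float(r["amount"])))
    return clean

def load(rows, conn):
    """Load: write the clean rows into a structured warehouse table."""
    conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, name TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract()), conn)
loaded = conn.execute("SELECT id, name, amount FROM sales ORDER BY id").fetchall()
```

Only two rows reach the table: the duplicate of id 2 and the record with a missing amount are dropped during the transform step.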
Star Schema:
The Star Schema is the simplest type of data warehouse schema.
It consists of a central fact table connected to multiple dimension
tables, forming a star-like shape.
🔹 Structure
Fact Table: Contains numeric data (measures) related to business
operations, such as sales revenue, number of units sold, etc.
Dimension Tables: Contain descriptive data (attributes), such as
time, product details, customer details, etc.
Denormalized Data: The dimension tables are not split into sub-
tables; they contain redundant data to improve query
performance.
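A small star schema can be tried out with Python's built-in sqlite3 module. The table names (`fact_sales`, `dim_product`, `dim_time`), columns, and rows below are hypothetical and chosen only to illustrate one fact table joined directly to its dimension tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptive attributes, kept flat (denormalized).
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
CREATE TABLE dim_time (time_id INTEGER PRIMARY KEY, year INTEGER, month TEXT);

-- Fact table: numeric measures plus one foreign key per dimension.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    time_id INTEGER REFERENCES dim_time(time_id),
    units_sold INTEGER,
    revenue REAL
);

INSERT INTO dim_product VALUES (1, 'Galaxy S23', 'Electronics'), (2, 'Notebook', 'Stationery');
INSERT INTO dim_time VALUES (1, 2024, 'January'), (2, 2024, 'February');
INSERT INTO fact_sales VALUES (1, 1, 2, 1500.0), (1, 2, 1, 750.0), (2, 1, 10, 50.0);
""")

# One join from the fact table to a dimension is enough to analyze by category.
revenue_by_category = conn.execute("""
    SELECT p.category, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
```

Because `category` lives directly in `dim_product`, the query needs a single join. In a snowflake design the category would sit in its own table and the same query would need one more join.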
[Figure: Star Schema diagram]
Snowflake Schema:
The Snowflake Schema is an extension of the Star Schema,
where dimension tables are normalized into multiple related
tables.
🔹 Structure
Fact Table: Same as in Star Schema.
Normalized Dimension Tables: Each dimension table is split
into smaller related tables. This reduces redundancy but increases
complexity.
[Figure: Snowflake Schema diagram]
Fact Constellation Schema (Galaxy Schema)
A Fact Constellation Schema, also known as a Galaxy Schema, is
a data warehouse schema that consists of multiple fact tables
sharing common dimension tables.
• It is more complex than Star and Snowflake schemas.
• It is used when multiple business processes are analyzed in a
single data warehouse.
• Since it consists of multiple fact tables, it allows
multidimensional analysis across different subjects.
• A Fact Constellation Schema looks like a collection of Star
Schemas where different fact tables share the same dimension
tables.
[Figure: example of a Fact Constellation Schema]
Star Schema vs. Snowflake Schema vs. Fact Constellation

Feature           | Star Schema                | Snowflake Schema        | Fact Constellation
Complexity        | Simple                     | Moderate                | High
Normalization     | Denormalized (flat tables) | Normalized (sub-tables) | Hybrid
Query Performance | Faster (fewer joins)       | Moderate (more joins)   | Moderate (many more joins)
Storage Required  | High                       | Less                    | High
2. Middle Tier – Data Warehouse and OLAP Servers
• Data Warehouse: This is the core storage area where all the
cleaned and transformed data resides. It is usually structured
for fast querying and analysis, optimized for reporting and
decision-making.
• OLAP Server: The OLAP server allows for complex
multidimensional analysis of the data stored in the
warehouse.
• It supports operations like drill-down (viewing data at more
granular levels), roll-up (aggregating data at higher levels),
and slice-and-dice (looking at the data from different
perspectives).
• The OLAP server is often designed to be query-efficient:
it stores pre-aggregated data to speed up common analytical
queries.
3. Top Tier – Front-End Tools and Applications
BI Tools: This tier is where business users interact with the data. The
data is visualized through dashboards, reports, and ad-hoc queries.
Common front-end tools include:
• Tableau: Used for visualizing data in an interactive dashboard
format.
• Power BI: A Microsoft tool that integrates easily with other
Microsoft products and provides interactive reports.
• QlikView: A tool that allows users to analyze data in different
visual formats.
Data Exploration and Reporting: The front-end tier also allows
for data exploration, enabling users to query the data warehouse and
analyze different dimensions such as time, location, product, etc.
Users: This layer is used by a variety of users, including data
analysts, business analysts, and management, for decision-making.
They may perform tasks like reporting on sales data, forecasting
future trends, or analyzing customer behavior.
Role of Concept Hierarchies in Data Warehousing
A concept hierarchy organizes data into multiple levels of abstraction, making it
easier to analyze and interpret. It helps in OLAP operations like Drill-Down,
Roll-Up, and Slice-and-Dice by structuring the data into levels.
Example of Concept Hierarchy
• Geographical Hierarchy:
Country → State → City → Locality
Example: India → Tamil Nadu → Chennai → Anna Nagar
• Time Hierarchy:
Year → Quarter → Month → Week → Day
Example: 2024 → Q1 → January → Week 3 → 15th Jan
• Product Hierarchy:
Category → Subcategory → Brand → Model
Example: Electronics → Mobile → Samsung → Galaxy S23
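One way to see how a concept hierarchy supports roll-up is to store each level as a child-to-parent mapping and sum a measure upward. The sketch below uses a geographical hierarchy; the cities other than Chennai and all sales figures are made up for illustration.

```python
from collections import defaultdict

# Explicit hierarchy stored as child -> parent mappings (City -> State -> Country).
city_to_state = {"Chennai": "Tamil Nadu", "Coimbatore": "Tamil Nadu", "Kochi": "Kerala"}
state_to_country = {"Tamil Nadu": "India", "Kerala": "India"}

# Hypothetical measure at the lowest level of the hierarchy.
sales_by_city = {"Chennai": 120, "Coimbatore": 80, "Kochi": 50}

def roll_up(values, parent_of):
    """Aggregate a measure one level up the hierarchy by summing children."""
    totals = defaultdict(int)
    for child, value in values.items():
        totals[parent_of[child]] += value
    return dict(totals)

sales_by_state = roll_up(sales_by_city, city_to_state)
sales_by_country = roll_up(sales_by_state, state_to_country)
```

Drill-down is the reverse direction: starting from `sales_by_country`, the same mappings identify which states, and then which cities, contribute to the total.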
Types of Concept Hierarchies:
Explicit Concept Hierarchy
• Predefined manually by users or database designers.
• Stored as part of the database schema.
• Clearly structured with defined levels.
Implicit Concept Hierarchy
• Derived automatically from existing data.
• Not predefined in the schema.
• Hierarchy is determined based on relationships within the
dataset.
Measures
In Data Warehousing and OLAP systems, measures are
the quantitative values or metrics that are analyzed
across different dimensions. Measures are what users are
typically interested in when querying data, as they
represent the actual numeric values that need to be
aggregated, analyzed, or summarized.
Characteristics of Measures:
• Quantitative Data: Measures are numeric values like
sales, profit, quantity, revenue, etc.
• Aggregated Data: In OLAP cubes, measures are usually
aggregated across different dimensions, for example by
summing sales for a specific product or calculating the
average revenue for a particular time period.
• Aggregate Functions: Measures are often subjected to
mathematical operations such as sum, average, count,
max, and min.
Common Operations on Measures:
1. Aggregation: Measures are often aggregated at
different levels of a dimension. For example, if we
are analyzing sales across regions and time periods,
the system might aggregate sales by region or by month.
Example: The total sales for a specific region in a
year (sum of sales across months).
2. Roll-up: Measures are aggregated from a lower level
to a higher level of a dimension.
3. Drill-down: Measures are broken down into lower levels
of detail.
4. Slice: Isolates a subset of the data by fixing a single
value of one dimension, so the measure is viewed for that
value only.
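The four operations above can be sketched over a tiny in-memory fact list in plain Python. The regions, dates, and sales figures are hypothetical, and the `aggregate` helper is an illustrative name, not part of any OLAP product.

```python
from collections import defaultdict

# Hypothetical sales facts: (region, year, month, sales)
facts = [
    ("South", 2024, "Jan", 100),
    ("South", 2024, "Feb", 150),
    ("North", 2024, "Jan", 200),
    ("North", 2023, "Dec", 250),
]
REGION, YEAR, MONTH, SALES = range(4)

def aggregate(rows, *dims):
    """Sum the sales measure grouped by the given dimension columns."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[d] for d in dims)] += row[SALES]
    return dict(totals)

# Drill-down view: the most detailed level (region, year, month).
by_region_month = aggregate(facts, REGION, YEAR, MONTH)

# Roll-up: months are aggregated into years, then years are dropped entirely.
by_region_year = aggregate(facts, REGION, YEAR)
by_region = aggregate(facts, REGION)

# Slice: fix one dimension to a single value, then analyze what remains.
south_only = [row for row in facts if row[REGION] == "South"]
south_by_month = aggregate(south_only, YEAR, MONTH)
```

Moving from `by_region_month` to `by_region_year` is a roll-up, and reading the same pair in the opposite direction is a drill-down.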
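Common OLAP Operations:
1. Roll-Up: Aggregates data to a higher level of a hierarchy.
Example: From monthly sales to yearly sales.
2. Drill-Down: Breaks data down to a lower level of a hierarchy.
Example: From yearly sales to monthly or daily sales.
3. Slice: Fixes a single value of one dimension.
Example: Showing all data for one particular year.
4. Dice: Selects specific values on two or more dimensions.
Example: Extracting data for selected regions and product
types in a particular year.
5. Pivot: Rotates the axes of the view.
Example: Changing the view from "Region by Time" to
"Product by Region."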
From Online Analytical Processing (OLAP) to Multidimensional Data Mining

OLAP is primarily used for analyzing structured data stored in
data warehouses. It allows for fast retrieval and interactive
analysis of multidimensional data using operations like
drill-down, roll-up, slicing, and dicing. However, OLAP is limited
to summarization and does not provide deep insights beyond
aggregation.

Aggregation is the process of summarizing data by grouping
and calculating values like totals, averages, or counts.
Multidimensional Data Mining (MDM)

Multidimensional Data Mining applies machine learning and
statistical techniques to analyze OLAP data beyond
aggregation. It helps in:
✔ Finding hidden patterns in large datasets.
✔ Predicting future trends (e.g., sales forecasting).
✔ Detecting anomalies (e.g., fraud detection).

Key Techniques in Multidimensional Data Mining:

1. Classification: Assigns data to predefined categories.
Example: Categorizing customers as ‘loyal’ or ‘high-risk’
based on purchase behavior.
2. Clustering: Groups similar data points together.
Example: Identifying customer segments based on spending
patterns.
3. Association Rule Mining: Identifies relationships between
variables.
Example: "People who buy laptops also buy laptop bags."
4. Anomaly Detection: Identifies unusual patterns.
Example: Detecting fraudulent transactions in a bank.
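As a small worked example of association rule mining, the rule "laptop → laptop bag" can be scored with the two standard measures, support and confidence. The baskets below are made up for illustration, and this sketch only evaluates one given rule; it does not search for rules the way a full algorithm such as Apriori would.

```python
# Hypothetical market baskets, one set of items per transaction.
transactions = [
    {"laptop", "laptop bag", "mouse"},
    {"laptop", "laptop bag"},
    {"laptop", "mouse"},
    {"phone", "charger"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Of the baskets containing the antecedent, the fraction that also contain the consequent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

rule_support = support({"laptop", "laptop bag"}, transactions)
rule_confidence = confidence({"laptop"}, {"laptop bag"}, transactions)
```

Here two of the four baskets contain both items, so the support is 0.5. Three baskets contain a laptop and two of those also contain a laptop bag, so the confidence is 2/3.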
Indexing OLAP Data
What is Indexing?
Indexing is a technique used in databases to speed up data
retrieval. It works like a book index, helping the system find
data quickly without scanning the entire database.
Types of Indexing:
1. Bitmap Index
2. B-Tree Index
3. Hash Index
4. Clustered Index
Bitmap Index
Bitmap indexing is a technique used in database
management systems (DBMS) to improve the performance of
read-mostly queries over large datasets. A bitmap index keeps
one bit vector per distinct value of a column; bit i is set when
row i holds that value. It works best on columns with few
distinct values.
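A bitmap index can be sketched in Python by using one integer per distinct value as the bit vector, where bit i is set when row i holds that value. The rows and column names are hypothetical, and real systems typically compress these vectors, which this sketch does not attempt.

```python
from collections import defaultdict

# Hypothetical table rows; the row number is the position in this list.
rows = [
    {"region": "South", "product": "Mobile"},   # row 0
    {"region": "North", "product": "Laptop"},   # row 1
    {"region": "South", "product": "Laptop"},   # row 2
    {"region": "East",  "product": "Mobile"},   # row 3
]

def build_bitmap_index(rows, column):
    """Map each distinct value to an int whose bit i is 1 if row i has that value."""
    index = defaultdict(int)
    for i, row in enumerate(rows):
        index[row[column]] |= 1 << i
    return dict(index)

def matching_rows(bitmap):
    """Decode a bitmap back into the list of row numbers whose bit is set."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

region_idx = build_bitmap_index(rows, "region")
product_idx = build_bitmap_index(rows, "product")

# region = 'South' AND product = 'Laptop' becomes a single bitwise AND.
south_and_laptop = region_idx["South"] & product_idx["Laptop"]
```

Combining conditions with bitwise AND and OR, without touching the table rows, is what makes bitmap indexes attractive for OLAP-style filters.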
B-Tree Index
A B-Tree is a specialized m-way tree designed to optimize data
access, especially on disk-based storage systems.
In a B-Tree of order m, each node can have up to m children
and m-1 keys, allowing it to efficiently manage large datasets.
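Searching a B-Tree can be sketched as follows. This is a lookup-only illustration over a hand-built tree of order 3: the `Node` class and the keys are assumptions made for the example, and insertion with node splitting is deliberately left out.

```python
from bisect import bisect_left

class Node:
    """A B-Tree node: sorted keys, plus len(keys) + 1 children unless it is a leaf."""
    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children or []   # empty list means this node is a leaf

def search(node, key):
    """Return True if key is stored in the subtree rooted at node."""
    i = bisect_left(node.keys, key)      # position of the first key >= the target
    if i < len(node.keys) and node.keys[i] == key:
        return True
    if not node.children:                # reached a leaf without finding the key
        return False
    return search(node.children[i], key) # descend into the only subtree that can hold it

# Hand-built B-Tree of order 3: each node has at most 3 children and 2 keys.
root = Node([20, 40], [Node([5, 10]), Node([25, 30]), Node([45, 50])])

found_30 = search(root, 30)
found_35 = search(root, 35)
```

Each step visits one node and discards all other subtrees, so the number of nodes read grows with the height of the tree rather than with the number of keys.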
Hash Index
Hashing in DBMS is a technique to quickly locate a data record in a
database irrespective of the size of the database. For large databases
containing thousands or millions of records, searching for a specific
record through an ordered index takes several steps, whereas a hash
function computes the record's bucket address directly from the key.
Static Hashing
In static hashing, the number of buckets is fixed, so the hash function
always maps a given key to the same bucket address. For example,
suppose we have a data record with employee_id = 106 and the hash
function is mod 5, that is, h(x) = x % 5, where x is the id. Then:
h(106) = 106 % 5 = 1.
This indicates that the data record should be placed in, or searched
for in, bucket 1 of the hash table.
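The mod-5 example can be turned into a minimal static hash index in Python. The record strings and the choice to chain colliding keys inside a bucket are assumptions for illustration; the slides do not say how collisions are handled.

```python
NUM_BUCKETS = 5   # fixed up front: this is what makes the hashing "static"

def h(key):
    """Hash function from the example: the remainder of the key divided by 5."""
    return key % NUM_BUCKETS

buckets = [[] for _ in range(NUM_BUCKETS)]

def insert(key, record):
    """Place the record in the bucket chosen by the hash function."""
    buckets[h(key)].append((key, record))

def lookup(key):
    """Search only the one bucket the key hashes to."""
    for stored_key, record in buckets[h(key)]:
        if stored_key == key:
            return record
    return None

insert(106, "employee 106")   # 106 % 5 == 1, so this goes to bucket 1
insert(111, "employee 111")   # 111 % 5 == 1 as well: a collision, chained in bucket 1
```

A lookup for employee 106 inspects only bucket 1, however many records the other buckets hold.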
Clustered Index
A Clustered Index determines the physical order of the data in a
table. When a clustered index is created on a column, SQL
Server reorders the data in the table based on that index.
Because the data is physically stored in the order of the
clustered index, a table can only have one clustered index.
Typically, a clustered index is created on the primary key by
default.
