DM-M1-PPT v1.11
Data Mining Syllabus
Module – 1 (Introduction to Data Mining and Data Warehousing)
Module – 3 (Advanced Classification and Cluster Analysis)
Module – 1 (Introduction to Data Mining and Data Warehousing)
1.1 Data warehouse - Differences between Operational Database Systems and Data Warehouses
1.2 Multidimensional data model - Warehouse schema, OLAP Operations
1.3 Data Warehouse Architecture
1.1 Data warehouse - Differences between Operational Database Systems and Data Warehouses
◼ 301 List out the three major features of data warehouse. (3)
◼ 401 List and explain any two applications of data warehouse. (3)
Evolution of Database Technology – Not in Syllabus
What is a Data Warehouse?
◼ A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.
◼ These properties are the major features of a data warehouse.
Data Warehouse—Subject-Oriented
Data Warehouse—Integrated
Data Warehouse—Time Variant
Data Warehouse—Nonvolatile
◼ A physically separate store of data transformed from the
operational environment
◼ Operational update of data does not occur in the data
warehouse environment
◼ Does not require transaction processing, recovery, and
concurrency control mechanisms
◼ Requires only two operations in data accessing:
◼ initial loading of data and access of data
1.2 Multidimensional data model - Warehouse schema, OLAP Operations (1)
Feature           | OLAP                                                    | OLTP
Data Volume       | Large volumes (TB) of historical data (data warehouse). | Smaller volumes (GB) of current, operational data.
Operations        | Complex queries involving aggregations and joins.       | Simple and short transactions like CRUD operations.
Response Time     | Optimized for query speed, not immediate responses.     | Requires quick responses for real-time transactions.
Database Design   | Denormalized schema (star or snowflake schema).         | Normalized schema to minimize redundancy (ER-based).
Concurrency       | Low concurrency due to analytical workloads.            | High concurrency to handle multiple user requests.
Data Type         | Read-intensive, historical and summary data.            | Write-intensive, operational and detailed data.
Example Use Cases | Business intelligence, trend analysis, forecasting.     | Banking systems, order processing, inventory systems.
Tools             | Data warehouses (e.g., Snowflake, BigQuery).            | Transactional databases (e.g., MySQL, PostgreSQL).
Data Updates      | Periodic updates (ETL processes).                       | Continuous, real-time updates.
1.2 Multidimensional data model - Warehouse schema, OLAP Operations (2)
◼ 201 Illustrate the multi-dimensional data model with a neat
figure. (3)
2 marks; Figure – 1 mark
◼ 211 b) List and illustrate the schemas used for the physical
representation of the multidimensional data with examples. (8)
3 schema - a star schema, a snowflake schema, or a fact constellation
schema – 3 marks
Each with explanation – 1 mark each, figures and example – 2 marks
◼ 312 a) Explain the differences between star schema and
snowflake schema in a data warehouse (6)
◼ 402 Describe the similarities and the differences of star schema
and snowflake schema – 3 marks
1.2(a) Multidimensional data model - Warehouse schema
From Tables to Data Cubes
Example of Star Schema
Sales Fact Table: time_key, item_key, branch_key, location_key (foreign keys); measures: units_sold, dollars_sold, avg_sales
Dimension tables:
◼ time: time_key, day, day_of_the_week, month, quarter, year
◼ item: item_key, item_name, brand, type, supplier_type
◼ branch: branch_key, branch_name, branch_type
◼ location: location_key, street, city, state_or_province, country
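A minimal DDL sketch of this star schema (a sketch, not the textbook's code; the _dim table names and column types are assumptions, and the branch and location dimensions follow the same pattern):

CREATE TABLE time_dim (
  time_key INT PRIMARY KEY,
  day INT,
  day_of_the_week VARCHAR(10),
  month INT,
  quarter INT,
  year INT
);

CREATE TABLE item_dim (
  item_key INT PRIMARY KEY,
  item_name VARCHAR(50),
  brand VARCHAR(30),
  type VARCHAR(30),
  supplier_type VARCHAR(30)
);

-- The fact table references every dimension table and carries the measures
CREATE TABLE sales_fact (
  time_key INT REFERENCES time_dim (time_key),
  item_key INT REFERENCES item_dim (item_key),
  branch_key INT,    -- would reference branch_dim
  location_key INT,  -- would reference location_dim
  units_sold INT,
  dollars_sold DECIMAL(12, 2),
  avg_sales DECIMAL(12, 2)
);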
Example of Snowflake Schema
Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
Dimension tables (normalized):
◼ time: time_key, day, day_of_the_week, month, quarter, year
◼ item: item_key, item_name, brand, type, supplier_key → supplier: supplier_key, supplier_type
◼ branch: branch_key, branch_name, branch_type
◼ location: location_key, street, city_key → city: city_key, city, state_or_province, country
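The only structural change from the star schema is that some dimension tables are normalized into smaller tables. A sketch of the item/supplier split (table names and types are assumptions):

CREATE TABLE supplier_dim (
  supplier_key INT PRIMARY KEY,
  supplier_type VARCHAR(30)
);

-- supplier_type moves out of item; item now holds only a foreign key
CREATE TABLE item_dim (
  item_key INT PRIMARY KEY,
  item_name VARCHAR(50),
  brand VARCHAR(30),
  type VARCHAR(30),
  supplier_key INT REFERENCES supplier_dim (supplier_key)
);

This reduces redundancy in the dimension tables at the cost of extra joins at query time.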
Example of Fact Constellation
Multiple fact tables share dimension tables:
◼ Sales Fact Table: time_key, item_key, branch_key, ... (as in the star schema)
◼ Shipping Fact Table: time_key, item_key, shipper_key, from_location, ...
◼ Shared dimensions include time (time_key, day, day_of_the_week, month, quarter, year) and item (item_key, item_name, brand, type, supplier_type)
Data Cube Aggregation
[Figure: sales data summed over the dimensions Product, Quarter, and Country, producing the lattice of cuboids below.]
◼ 0-D (apex) cuboid: all
◼ 1-D cuboids: Product; Quarter; Country
◼ 2-D cuboids: (Product, Quarter); (Product, Country); (Quarter, Country)
◼ 3-D (base) cuboid: (Product, Quarter, Country)
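All of these cuboids can be produced in one statement with standard SQL grouping; a minimal sketch, assuming a hypothetical sales_fact table with product, quarter, and country columns (GROUP BY CUBE is SQL:1999 and is supported by PostgreSQL, Oracle, and SQL Server):

SELECT product, quarter, country, SUM(dollars_sold) AS total_sales
FROM sales_fact
GROUP BY CUBE (product, quarter, country);
-- NULLs mark aggregated-away dimensions: the all-NULL row is the 0-D apex
-- cuboid, and rows with only product non-NULL form a 1-D cuboid.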
Cube: A Lattice of Cuboids
[Figure: the lattice of cuboids for the dimensions time, item, location, and supplier, from the 0-D (apex) cuboid "all" through the 1-D and 2-D cuboids to the 3-D cuboids (time, item, location), (time, item, supplier), (time, location, supplier), (item, location, supplier), and the 4-D base cuboid.]
1.2(b) Multidimensional data model - OLAP Operations
1.2 Multidimensional data model - Warehouse schema, OLAP Operations (3)
Typical OLAP Operations
1. Roll-Up (Drill-Up)
◼ Summarizes data by climbing up a hierarchy or by reducing
dimensions. It allows users to view data at a higher level of
abstraction.
◼ Example: In a sales data cube, roll up from `city` to
`region` or `country`. This aggregates the data at a
broader level.
2. Drill-Down (Roll-Down)
◼ The reverse of roll-up. It moves from higher-level summarized
data to lower-level detailed data or introduces new dimensions
for detailed analysis.
◼ Example: In a sales data cube, drill down from `region` to `city` to see the granular data.
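In SQL terms, roll-up is a GROUP BY at a coarser level of the dimension hierarchy; a hedged sketch, assuming hypothetical sales_fact and location_dim tables joined on location_key:

-- Roll-up: aggregate sales from the city level up to the country level
SELECT l.country, SUM(f.dollars_sold) AS total_sales
FROM sales_fact f
JOIN location_dim l ON f.location_key = l.location_key
GROUP BY l.country;
-- Drill-down is the reverse: GROUP BY l.country, l.city restores city detail.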
3. Slice and Dice
◼ Slice: Selects a single value from one dimension to create a
sub-cube.
◼ Example: Filter sales data for `Year = 2023`.
◼ Dice: Filters data based on multiple dimension values,
resulting in a sub-cube with specific data.
◼ Example: Filter sales data for `Year = 2023` and `Product = Electronics`.
4. Pivot (Rotate)
◼ Reorients the data cube for better visualization, typically
converting 3D data into 2D views (e.g., switching rows and
columns in a table).
◼ Example: Rotate a data cube to view `products` on rows and `regions` on columns.
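Slice and dice correspond to WHERE filters on dimension attributes; a minimal sketch over the same hypothetical star schema (time_dim and item_dim are assumed names):

-- Slice: fix a single dimension value (Year = 2023)
SELECT f.*
FROM sales_fact f
JOIN time_dim t ON f.time_key = t.time_key
WHERE t.year = 2023;

-- Dice: fix values on several dimensions at once
SELECT f.*
FROM sales_fact f
JOIN time_dim t ON f.time_key = t.time_key
JOIN item_dim i ON f.item_key = i.item_key
WHERE t.year = 2023 AND i.type = 'Electronics';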
5. Drill Across
◼ Retrieves data involving more than one fact table.
◼ Combine `sales` data with `inventory` data to analyze
stock levels and sales performance together.
6. Drill Through
◼ Navigates through the bottom level of the cube to access
back-end relational tables, often using SQL queries.
◼ From aggregated sales data, drill through to view the
raw transaction records in a store.
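Drill-across amounts to joining two fact tables on their shared dimension keys; a hedged sketch, assuming hypothetical sales_fact and inventory_fact tables that share time_key and item_key:

SELECT s.time_key, s.item_key,
       SUM(s.units_sold) AS units_sold,
       SUM(i.units_in_stock) AS units_in_stock
FROM sales_fact s
JOIN inventory_fact i
  ON s.time_key = i.time_key
 AND s.item_key = i.item_key
GROUP BY s.time_key, s.item_key;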
Fig. 3.10 Typical OLAP Operations
1.2 Multidimensional data model - Warehouse schema, OLAP Operations (4)
3.1.2 (b)
Given: a data warehouse consisting of the dimensions time, doctor, and patient, and the measures count and charge, where charge is the fee that a doctor charges a patient for a visit.
i. Draw a schema diagram for the above data warehouse using one of the schemas [star, snowflake, fact constellation].
3.1.2 (b)
ii. Starting with the base cuboid [day, doctor, patient], what specific OLAP operations should be performed in order to list the total fee collected by each doctor in 2004?
Given
The base cuboid [day, doctor, patient],
list the total fee collected by each doctor in 2004
1. *Slice* the data for the year 2004:
2. *Roll up* the data along the “patient” dimension:
3. *Aggregation*: on the “charge” measure, for each doctor. This will
give the total fee collected by each doctor in 2004. (PTO)
1. *Slice* the data for the year 2004:
This operation will select only the data for the year 2004 from the “time”
dimension. After this operation, the “time” dimension will be reduced to
just the year-level, and the cuboid becomes [doctor, patient,
year=2004]
2. *Roll up* the data along the “patient” dimension:
The “patient” dimension can be aggregated to summarize the fee by
doctor. This will result in the cuboid “[doctor, year=2004]” with the total
charges for each doctor.
3. *Aggregation*:
After the *roll-up* operation, you will perform an *aggregation*
(summing) on the “charge” measure, for each doctor. This will give the
total fee collected by each doctor in 2004.
Final Result: List of doctors with their corresponding total charge
collected in 2004.
Not In Syllabus
To obtain the same list, write an SQL query assuming the data are
stored in a relational database with the schema fee (day, month, year,
doctor, hospital, patient, count, charge)
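One possible answer (a sketch against the given fee schema; exact syntax may vary by DBMS):

SELECT doctor, SUM(charge) AS total_fee
FROM fee
WHERE year = 2004
GROUP BY doctor;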
1.2 Multidimensional data model - Warehouse schema, OLAP Operations (5)
◼ 412 b) Suppose that a data warehouse for a university consists of
the following four dimensions: student, course, semester, and
instructor, and two measures: count and avg_grade. (8)
(i) Draw a snowflake schema diagram for the data warehouse.
(ii) Starting with the base cuboid, what specific OLAP operations should one perform in order to list the average grade of CS courses for each University student?
i) a snowflake schema diagram for the data warehouse – 4 marks
[marks can be given for the correct steps to the solution]
(ii) specific OLAP operations should one perform in order to list the
average grade of CS courses for each University student – 4 marks
4.1.2 (b)
(i) Draw a snowflake schema diagram for the data warehouse.
◼ Base Cuboid [student, course, semester, instructor]:
1. Selection: Slice Operation
Apply a slice operation to filter for CS courses in the "course"
dimension (course.department = "CS"). This reduces the dataset to
include only data related to Computer Science courses.
2. Projection: Drill-Down or Roll-up
Perform a drill-down operation on the "student" dimension to move
from a higher-level aggregation (e.g., university-level or department-
level data) to individual student-level data. This ensures that data is
listed for individual students.
3. Aggregation:
Use the *grade* measure to calculate the average grade obtained by
each student across all CS courses he/she has registered for. This is
achieved by aggregating the data across the other dimensions
(“semester” and “instructor”).
Final Output:
The result will be a two-dimensional listing with students as rows and their corresponding average grades for CS courses.
Not In Syllabus
Step 1: Roll-Up on Course from Course_ID to Department
◼ Operation Explanation:
◼ Aggregate the data from individual courses (e.g., CS101,
CS102) to the department level (e.g., CS). This simplifies
the dataset to focus on departments rather than specific
courses.
◼ Schema After Roll-Up:
Not In Syllabus
Step 2: Dice on Course and Student with Department = "CS" and University = "BigUniversity"
◼ Operation Explanation:
◼ Apply a filter to include only the relevant data:
◼ Department = "CS" ensures only Computer Science courses
are included.
◼ University = "BigUniversity" ensures only students from Big
University are included.
◼ Schema After Dice:
Not In Syllabus
Step 3: Drill-Down on Student from University to
Student_Name
◼ Operation Explanation:
◼ Navigate from the higher University level (e.g.,
BigUniversity) to the individual Student_Name level (e.g.,
Student_A, Student_C). This ensures granularity at the
student level for analysis.
◼ Schema After Drill-Down
Not In Syllabus
Step 4: Aggregation on Grade
◼ Operation Explanation:
◼ Group by Student_Name and calculate the average of the
Grade measure for all CS courses taken by each student.
◼ Schema After Aggregation:
Student Avg_Grade
Student_A 87.5
Student_C 92.0
Not in Syllabus
Corresponding SQL query
SELECT student_id, AVG(grade)
FROM data_warehouse_table
WHERE course = 'CS'
GROUP BY student_id
This query effectively implements the *slice*, *drill-down*, and
*aggregation* operations specified in the OLAP process.
1.3 Data Warehouse Architecture
Why a Separate Data Warehouse?
◼ High performance for both systems
◼ DBMS— tuned for OLTP: access methods, indexing, concurrency
control, recovery
◼ Warehouse—tuned for OLAP: complex OLAP queries,
multidimensional view, consolidation
◼ Different functions and different data:
◼ missing data: Decision support requires historical data which
operational DBs do not typically maintain
◼ data consolidation: DS requires consolidation (aggregation,
summarization) of data from heterogeneous sources
◼ data quality: different sources typically use inconsistent data
representations, codes and formats which have to be reconciled
◼ Note: There are more and more systems which perform OLAP
analysis directly on relational databases
Data Warehouse: A Multi-Tiered Architecture
See Section 4.1.4 of “Data Mining: Concepts and Techniques”
1. The bottom tier is a warehouse database server that is generally a
relational database system. Back-end tools and utilities are used to feed
data into the bottom tier from operational databases or other external
sources (e.g., customer profile information provided by external
consultants). These tools and utilities perform data extraction, cleaning,
and transformation (e.g., to merge similar data from different sources
into a unified format), as well as load and refresh functions to update
the data warehouse. The data are extracted using application program
interfaces known as gateways. A gateway is supported by the underlying
DBMS and allows client programs to generate SQL code to be executed
at a server. This tier also contains a metadata repository, which stores
information about the data warehouse and its contents.
2. The middle tier is an OLAP server that is typically implemented using either a ROLAP model or a MOLAP model.
3. The top tier is a front-end client layer, which contains query and
reporting tools, analysis tools, and data mining tools (e.g., identifying
patterns and trends, prediction etc.)
Data Warehouse: A Multi-Tiered Architecture
[Figure: operational DBs and other sources feed the data warehouse and data marts through Extract/Transform/Load/Refresh, coordinated by a monitor & integrator with a metadata repository; an OLAP server forms the middle tier; the top tier serves queries, reports, analysis, and data mining.]
Three Data Warehouse Models
◼ Enterprise warehouse
◼ collects all of the information about subjects spanning the entire organization
◼ Data Mart
◼ a subset of corporate-wide data that is of value to a specific group of users (e.g., a marketing data mart)
◼ Virtual warehouse
◼ a set of views over operational databases; only some of the possible summary views may be materialized
Extraction, Transformation, and Loading (ETL)
◼ Data extraction
◼ get data from multiple, heterogeneous, and external sources
◼ Data cleaning
◼ detect errors in the data and rectify them when possible
◼ Data transformation
◼ convert data from legacy or host format to warehouse format
◼ Load
◼ sort, summarize, consolidate, compute views, check integrity,
and build indices and partitions
◼ Refresh
◼ periodic updates from the data sources to the warehouse
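A minimal sketch of the load step in SQL, assuming hypothetical staging_sales and sales_fact tables; the transformation shown (summarizing staged rows per time and item) is illustrative only:

-- Load: consolidate cleaned, transformed staging rows into the warehouse
INSERT INTO sales_fact (time_key, item_key, units_sold, dollars_sold)
SELECT s.time_key, s.item_key, SUM(s.quantity), SUM(s.amount)
FROM staging_sales s
GROUP BY s.time_key, s.item_key;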
Metadata Repository
◼ Metadata is the data defining warehouse objects. It stores:
◼ Description of the structure of the data warehouse
◼ schema, view, dimensions, hierarchies, derived data definitions, data mart locations and contents
◼ Operational meta-data
◼ data lineage (history of migrated data and transformation path),
currency of data (active, archived, or purged), monitoring
information (warehouse usage statistics, error reports, audit trails)
◼ The algorithms used for summarization
◼ The mapping from operational environment to the data warehouse
◼ Data related to system performance
◼ warehouse schema, view and derived data definitions
◼ Business data
◼ business terms and definitions, ownership of data, charging policies
OLAP Server Architectures
◼ Relational OLAP (ROLAP)
◼ Use relational or extended-relational DBMS to store and manage
warehouse data and OLAP middleware
◼ Include optimization of DBMS backend, implementation of
aggregation navigation logic, and additional tools and services
◼ ROLAP is suited for environments with large volumes of
transactional data and the need for scalability.
◼ Multidimensional OLAP (MOLAP)
◼ Sparse array-based multidimensional storage engine
◼ Fast indexing to pre-computed summarized data
◼ MOLAP excels at providing fast performance for complex
analytical queries by leveraging pre-aggregated multidimensional
cubes.
OLAP Server Architectures
◼ Hybrid OLAP (HOLAP) (e.g., Microsoft SQLServer)
◼ Flexibility, e.g., low-level: relational, high-level: array
◼ HOLAP combines the best of both worlds, offering flexibility and
scalability by using relational storage for detailed data and
multidimensional cubes for fast aggregation.
◼ Specialized SQL servers (e.g., Redbricks)
◼ Specialized support for SQL queries over star/snowflake schemas.
Implements Query optimization for complex, multidimensional
analytical queries using techniques such as:
◼ Query rewriting: transforming SQL queries to improve performance.
◼ Indexing: Creating specialized indexes to optimize the retrieval
of multidimensional data from star and snowflake schemas.
◼ Materialized Views: Pre-computing and storing query results to
speed up response times for frequently requested data.
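For example, a materialized view that pre-computes a frequently requested aggregate (a sketch; the syntax is PostgreSQL/Oracle-style, and the table names are the assumed star schema from earlier):

CREATE MATERIALIZED VIEW sales_by_state AS
SELECT l.state_or_province, SUM(f.dollars_sold) AS total_sales
FROM sales_fact f
JOIN location_dim l ON f.location_key = l.location_key
GROUP BY l.state_or_province;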
1.4 Data Warehousing to Data Mining
1.5 Data Mining Concepts and Applications
1.6 Knowledge Discovery in Database Vs Data mining
Why Data Mining?
What Is Data Mining?
Knowledge Discovery (KDD) Process
◼ This is a view from typical database systems and data warehousing communities
◼ Data mining plays an essential role in the knowledge discovery process
[Figure: databases → data cleaning and data integration → selection of task-relevant data → data mining → pattern evaluation.]
The knowledge discovery process consists of the following steps:
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved
from the database)
4. Data transformation (where data are transformed and consolidated
into forms appropriate for mining by performing summary or aggregation
operations)
5. Data mining (an essential process where intelligent methods are
applied to extract data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge based on interestingness measures)
7. Knowledge presentation (visualization and knowledge representation
techniques are used to present mined knowledge to users)
Steps 1 through 4 are different forms of data preprocessing, where data
are prepared for mining. The data mining step may interact with the user
or a knowledge base. The interesting patterns are presented to the user
and may be stored as new knowledge in the knowledge base
Example: A Web Mining Framework
Applications of Data Mining
◼ Web page analysis: from web page classification, clustering to
PageRank & HITS algorithms
◼ Collaborative analysis & recommender systems
◼ Basket data analysis to targeted marketing
◼ Biological and medical data analysis: classification, cluster analysis
(microarray data analysis), biological sequence analysis, biological
network analysis
◼ Data mining and software engineering (e.g., IEEE Computer, Aug.
2009 issue)
◼ From major dedicated data mining systems/tools (e.g., SAS, MS SQL-
Server Analysis Manager, Oracle Data Mining Tools) to invisible data
mining
Data Mining in Business Intelligence
[Figure: a pyramid of increasing potential to support business decisions, from data exploration (statistical summary, querying, and reporting) through data mining up to decision making by the end user.]
◼ https://www.javatpoint.com/data-mining-architecture
Architecture of typical data mining system
Data Source:
◼ The sources of data are databases, data warehouses, the World Wide Web (WWW), text files, and other documents.
ETL Process
Architecture of typical data mining system
Knowledge Base:
◼ The knowledge base is used to guide the search and to evaluate the interestingness of the resulting patterns.
◼ The knowledge base may even contain user views and data from user experiences.
1.8 Data Mining Functionalities
Data Mining Function: (1) Generalization
◼ Data entries can be associated with classes or concepts, such as "computers" or "printers" for items and "big spenders" or "budget spenders" for customers.
◼ Descriptions of these classes or concepts, known as
class/concept descriptions, can be derived using data
characterization and data discrimination.
◼ Data Characterization involves summarizing the general
characteristics or features of a target class (e.g., software
products with a 10% sales increase).
◼ Data Discrimination focuses on comparing the general features
of a target class against one or more contrasting classes (e.g.,
comparing software products with sales increases versus
decreases).
Data Mining Function: (2) Association
◼ Frequent patterns (or frequent itemsets)
◼ What items are frequently purchased together in your
Walmart?
◼ Association, correlation vs. causality
◼ A typical association rule
◼ Milk → Bread [0.5%, 75%] (support, confidence)
◼ Are strongly associated items also strongly correlated?
◼ How to mine such patterns and rules efficiently in large
datasets?
◼ How to use such patterns for classification, clustering, and
other applications?
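Support and confidence for a rule such as Milk → Bread can be computed directly; a hedged sketch, assuming a hypothetical transactions(tid, item) table (scalar-subquery syntax as in PostgreSQL):

-- support = P(Milk and Bread together); confidence = P(Bread | Milk)
WITH milk AS (
  SELECT DISTINCT tid FROM transactions WHERE item = 'Milk'
), milk_and_bread AS (
  SELECT DISTINCT t.tid
  FROM transactions t
  JOIN milk m ON t.tid = m.tid
  WHERE t.item = 'Bread'
)
SELECT
  100.0 * (SELECT COUNT(*) FROM milk_and_bread)
        / (SELECT COUNT(DISTINCT tid) FROM transactions) AS support_pct,
  100.0 * (SELECT COUNT(*) FROM milk_and_bread)
        / (SELECT COUNT(*) FROM milk) AS confidence_pct;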
Data Mining Function: (3) Classification
Data Mining Function: (4) Cluster Analysis
Data Mining Function: (5) Outlier Analysis
◼ Outlier analysis
◼ Outlier: A data object that does not comply with the general
behavior of the data
◼ Noise or exception?
◼ Methods: statistics, clustering, classification, regression
analysis, …
◼ Useful in fraud detection, rare events analysis
Data Mining Function: (6) Trend Analysis
Data Mining Function: (7) Data Visualization
1.9 Data Mining Issues
◼ 412 a) Describe different issues in data mining. (6)
Any six issues in data mining – 1 mark each
Major Issues in Data Mining