
Module – 1 (Introduction to Data Mining and Data Warehousing)

Major Sources for This Material:

Data Mining: Concepts and Techniques (3rd ed.) — Jiawei Han, Micheline Kamber, and Jian Pei
Data Mining and Warehousing - YouTube
Data Mining Syllabus

Module – 1 (Introduction to Data Mining and Data Warehousing)
Data warehouse - Differences between Operational Database Systems and Data Warehouses, Multidimensional data model - Warehouse schema, OLAP Operations, Data Warehouse Architecture, Data Warehousing to Data Mining, Data Mining Concepts and Applications, Knowledge Discovery in Databases vs. Data Mining, Architecture of a typical data mining system, Data Mining Functionalities, Data Mining Issues.

Module – 2 (Data Preprocessing)
Data Preprocessing - Need for data preprocessing, Data Cleaning - Missing values, Noisy data, Data Integration and Transformation, Data Reduction - Data cube aggregation, Attribute subset selection, Dimensionality reduction, Numerosity reduction, Discretization and concept hierarchy generation.
Data Mining Syllabus

Module – 3 (Advanced Classification and Cluster Analysis)
Classification - Introduction, Decision tree construction principle, Splitting indices - Information Gain, Gini index, Decision tree construction algorithms - ID3, Decision tree construction with presorting - SLIQ, Classification Accuracy - Precision, Recall.
Introduction to clustering - Clustering Paradigms, Partitioning Algorithm - PAM, Hierarchical Clustering - DBSCAN, Categorical Clustering - ROCK.

Module – 4 (Association Rule Analysis)
Association Rules - Introduction, Methods to discover Association rules, Apriori (Level-wise algorithm), Partition Algorithm, Pincer Search Algorithm, Dynamic Itemset Counting Algorithm, FP-tree Growth Algorithm.
Module – 1 (Introduction to Data Mining and Data Warehousing)
1.1 Data warehouse - Differences between Operational Database Systems and Data Warehouses
1.2 Multidimensional data model - Warehouse schema, OLAP Operations
1.3 Data Warehouse Architecture
1.4 Data Warehousing to Data Mining
1.5 Data Mining Concepts and Applications
1.6 Knowledge Discovery in Databases vs. Data Mining
1.7 Architecture of a typical data mining system
1.8 Data Mining Functionalities
1.9 Data Mining Issues
1.1 Data warehouse - Differences between Operational Database Systems and Data Warehouses

◼ 301 List out the three major features of a data warehouse. (3)
◼ 401 List and explain any two applications of a data warehouse. (3)
Evolution of Database Technology – Not in Syllabus

◼ 1960s: Hierarchical database systems (e.g., IMS); network DBMS
◼ 1970s: Relational DBMS (RDBMS) introduced
◼ 1980s: RDBMS popularity; OODBMS; application-oriented DBs (spatial data, scientific, and engineering applications)
◼ 1990s:
  ◼ Data Warehousing: consolidate large volumes of data for analysis
  ◼ Data Mining: extract patterns and insights from large datasets
  ◼ Multimedia Databases: images, audio, video, and other media
  ◼ Web Databases: optimized for the internet and dynamic web content
◼ 2000s:
  ◼ Stream data management and mining
  ◼ Data mining advances
  ◼ Web technology and global information systems
What is a Data Warehouse?
◼ A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process.
◼ This definition names the four key features of a data warehouse: subject-oriented, integrated, time-variant, and nonvolatile.
Data Warehouse—Subject-Oriented

◼ Organized around major subjects, such as customer, product, sales
◼ Focused on the modeling and analysis of data for decision makers, not on daily operations or transaction processing
◼ Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process
Data Warehouse—Integrated

◼ Constructed by integrating multiple, heterogeneous data sources
  ◼ relational databases, flat files, on-line transaction records
◼ Data cleaning and data integration techniques are applied
  ◼ Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
  ◼ E.g., hotel price: currency, tax, whether breakfast is covered, etc.
◼ When data is moved to the warehouse, it is converted
Data Warehouse—Time Variant

◼ The time horizon for the data warehouse is significantly longer than that of operational systems
  ◼ Operational database: current value data
  ◼ Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
◼ Every key structure in the data warehouse contains an element of time, explicitly or implicitly
  ◼ But the key of operational data may or may not contain a "time element"
Data Warehouse—Nonvolatile
◼ A physically separate store of data transformed from the operational environment
◼ Operational update of data does not occur in the data warehouse environment
  ◼ Does not require transaction processing, recovery, or concurrency control mechanisms
  ◼ Requires only two operations in data accessing: initial loading of data and access of data
1.2 Multidimensional data model - Warehouse schema, OLAP Operations (1)

◼ 202 List the differences between OLAP and OLTP. (3)
◼ 302 Describe the similarities and differences between OLTP and OLAP. (3)
Feature           | OLAP                                                    | OLTP
Data Volume       | Large volumes (TB) of historical data (data warehouse)  | Smaller volumes (GB) of current, operational data
Operations        | Complex queries involving aggregations and joins        | Simple and short transactions (CRUD operations)
Response Time     | Optimized for query speed, not immediate responses      | Requires quick responses for real-time transactions
Database Design   | Denormalized schema (star or snowflake schema)          | Normalized schema to minimize redundancy (ER-based)
Concurrency       | Low concurrency due to analytical workloads             | High concurrency to handle multiple user requests
Data Type         | Read-intensive, historical and summary data             | Write-intensive, operational and detailed data
Example Use Cases | Business intelligence, trend analysis, forecasting      | Banking systems, order processing, inventory systems
Tools             | Data warehouses (e.g., Snowflake, BigQuery)             | Transactional databases (e.g., MySQL, PostgreSQL)
Data Updates      | Periodic updates (ETL processes)                        | Continuous, real-time updates
1.2 Multidimensional data model - Warehouse schema, OLAP Operations (2)
◼ 201 Illustrate the multi-dimensional data model with a neat figure. (3)
  2 marks; figure – 1 mark
◼ 211 b) List and illustrate the schemas used for the physical representation of multidimensional data with examples. (8)
  3 schemas - star schema, snowflake schema, fact constellation schema – 3 marks; each with explanation – 1 mark each; figures and examples – 2 marks
◼ 312 a) Explain the differences between star schema and snowflake schema in a data warehouse. (6)
◼ 402 Describe the similarities and the differences of star schema and snowflake schema. (3)
1.2(a) Multidimensional data model - Warehouse schema

Chapter 4: Data Warehousing and On-line Analytical Processing, Data Mining: Concepts and Techniques
From Tables to Data Cubes

◼ A data warehouse is based on a multidimensional data model which views data in the form of a data cube
◼ A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions
  ◼ Dimension tables, such as item (item_name, brand, type), or time (day, week, month, quarter, year)
  ◼ Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables
◼ In data warehousing literature, an n-D base cube is called a base cuboid. The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.
Physical Modeling of Data Warehouses

◼ Modeling data warehouses: dimensions & measures
  ◼ Star schema: a fact table in the middle connected to a set of dimension tables
  ◼ Snowflake schema: a refinement of the star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
  ◼ Fact constellation: multiple fact tables share dimension tables; viewed as a collection of stars, therefore also called a galaxy schema
Example of Star Schema

[Figure: star schema for sales]
Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
Dimension tables:
  time (time_key, day, day_of_the_week, month, quarter, year)
  item (item_key, item_name, brand, type, supplier_type)
  branch (branch_key, branch_name, branch_type)
  location (location_key, street, city, state_or_province, country)
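The same star schema can also be sketched as SQL DDL. The following is a minimal illustration rather than the textbook's definition; column types are assumed, and branch/location follow the same pattern:

  -- Dimension tables: one denormalized table per dimension
  CREATE TABLE time_dim (
      time_key        INT PRIMARY KEY,
      day             INT,
      day_of_the_week VARCHAR(10),
      month           INT,
      quarter         INT,
      year            INT
  );

  CREATE TABLE item_dim (
      item_key      INT PRIMARY KEY,
      item_name     VARCHAR(100),
      brand         VARCHAR(50),
      type          VARCHAR(50),
      supplier_type VARCHAR(50)  -- supplier info kept inline: no normalization
  );

  -- branch_dim and location_dim are defined analogously

  -- Fact table: one foreign key per dimension plus the measures
  CREATE TABLE sales_fact (
      time_key     INT REFERENCES time_dim(time_key),
      item_key     INT REFERENCES item_dim(item_key),
      branch_key   INT,
      location_key INT,
      units_sold   INT,
      dollars_sold DECIMAL(12,2),
      avg_sales    DECIMAL(12,2)
  );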
Example of Snowflake Schema

[Figure: snowflake schema for sales]
Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
Dimension tables (hierarchies normalized out):
  time (time_key, day, day_of_the_week, month, quarter, year)
  item (item_key, item_name, brand, type, supplier_key) → supplier (supplier_key, supplier_type)
  branch (branch_key, branch_name, branch_type)
  location (location_key, street, city_key) → city (city_key, city, state_or_province, country)
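Structurally, the only change from the star schema is that hierarchies inside a dimension are normalized into separate tables. A minimal DDL sketch of the location split (column types assumed):

  -- Snowflake: city attributes move out of location into their own table
  CREATE TABLE city_dim (
      city_key          INT PRIMARY KEY,
      city              VARCHAR(50),
      state_or_province VARCHAR(50),
      country           VARCHAR(50)
  );

  CREATE TABLE location_dim (
      location_key INT PRIMARY KEY,
      street       VARCHAR(100),
      city_key     INT REFERENCES city_dim(city_key)  -- was inline in the star schema
  );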
Example of Fact Constellation

[Figure: fact constellation (galaxy) schema]
Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location; measures: dollars_cost, units_shipped
Shared dimension tables: time, item, branch, location; plus shipper (shipper_key, shipper_name, location_key, shipper_type)
Multi-Dimensional Data Model - Data Cube and Concept Hierarchy
◼ Sales volume as a function of product, month, and region
◼ Dimensions: Product, Location, Time
◼ Hierarchical summarization paths:
  Product: Industry → Category → Product
  Location: Region → Country → City → Office
  Time: Year → Quarter → Month → Day, with an alternative Week → Day path
Data Cube Aggregation

[Figure: 3-D sales cube with dimensions product (TV, PC, VCR), quarter (1Qtr-4Qtr), and country (U.S.A., Canada, Mexico); summing along a dimension yields aggregates such as the total annual sales of TVs in the U.S.A., up to the grand total (all, all, all)]
Cuboids Corresponding to the Cube

0-D (apex) cuboid: all
1-D cuboids: (product), (quarter), (country)
2-D cuboids: (product, quarter), (product, country), (quarter, country)
3-D (base) cuboid: (product, quarter, country)

(See "Data Mining: Concepts and Techniques", Section 5.2.1, Multiway Array Aggregation for Full Cube Computation)
Cube: A Lattice of Cuboids

0-D (apex) cuboid: all
1-D cuboids: (time), (item), (location), (supplier)
2-D cuboids: (time, item), (time, location), (time, supplier), (item, location), (item, supplier), (location, supplier)
3-D cuboids: (time, item, location), (time, item, supplier), (time, location, supplier), (item, location, supplier)
4-D (base) cuboid: (time, item, location, supplier)
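In SQL terms, the lattice corresponds to grouping the fact table by every subset of the dimension keys. A sketch using the standard CUBE grouping operator, assuming a fact table with one key column per dimension of this 4-D cube:

  -- Computes all 2^4 = 16 cuboids in one query; NULL in a grouping
  -- column marks a cuboid that has aggregated that dimension away.
  SELECT time_key, item_key, location_key, supplier_key,
         SUM(dollars_sold) AS dollars_sold
  FROM sales_fact
  GROUP BY CUBE (time_key, item_key, location_key, supplier_key);
  -- The all-NULL row is the 0-D apex cuboid; grouping by all four
  -- keys gives the 4-D base cuboid.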
1.2(b) Multidimensional data model - OLAP Operations

Chapter 4: Data Warehousing and On-line Analytical Processing, Data Mining: Concepts and Techniques
1.2 Multidimensional data model - Warehouse schema, OLAP Operations (3)

◼ 212 b) Illustrate different OLAP operations in a multidimensional data model with examples. List the differences between ROLAP, MOLAP, and HOLAP.
  Roll-up, drill-down, slice & dice, pivot (rotate) with examples – 5 marks; ROLAP, MOLAP, and HOLAP - any three differences – 1 mark each
◼ 311 a) Explain different OLAP operations on multidimensional data with suitable examples. (7)
  Different OLAP operations on multidimensional data – 4 marks; explanation with suitable examples – 3 marks
Typical OLAP Operations
1. Roll-Up (Drill-Up)
◼ Summarizes data by climbing up a hierarchy or by reducing
dimensions. It allows users to view data at a higher level of
abstraction.
◼ Example: In a sales data cube, roll up from `city` to
`region` or `country`. This aggregates the data at a
broader level.
2. Drill-Down (Roll-Down)
◼ The reverse of roll-up. It moves from higher-level summarized
data to lower-level detailed data or introduces new dimensions
for detailed analysis.
◼ In a sales data cube, drill down from `region` to `city` to
see the granular data.
3. Slice and Dice
◼ Slice: Selects a single value from one dimension to create a
sub-cube.
◼ Example: Filter sales data for `Year = 2023`.
◼ Dice: Filters data based on multiple dimension values,
resulting in a sub-cube with specific data.
◼ Filter sales data for `Year = 2023` and `Product =
Electronics`.
4. Pivot (Rotate)
◼ Reorients the data cube for better visualization, typically
converting 3D data into 2D views (e.g., switching rows and
columns in a table).
◼ Rotate a data cube to view `products` on rows and
`regions` on columns.

5. Drill Across
◼ Retrieves data involving more than one fact table.
◼ Combine `sales` data with `inventory` data to analyze
stock levels and sales performance together.
6. Drill Through
◼ Navigates through the bottom level of the cube to access
back-end relational tables, often using SQL queries.
◼ From aggregated sales data, drill through to view the
raw transaction records in a store.
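On a relational (ROLAP) star schema, most of these operations reduce to SQL. A hedged sketch against the sales tables sketched earlier (join keys, and a location dimension with city and country columns, are assumed):

  -- Roll-up: climb the location hierarchy from city to country
  SELECT l.country, SUM(f.dollars_sold) AS total_sales
  FROM sales_fact f
  JOIN location_dim l ON f.location_key = l.location_key
  GROUP BY l.country;

  -- Drill-down: the reverse, regroup at the finer city level
  SELECT l.country, l.city, SUM(f.dollars_sold) AS total_sales
  FROM sales_fact f
  JOIN location_dim l ON f.location_key = l.location_key
  GROUP BY l.country, l.city;

  -- Slice: fix a single value on one dimension (year = 2023)
  SELECT f.*
  FROM sales_fact f
  JOIN time_dim t ON f.time_key = t.time_key
  WHERE t.year = 2023;

  -- Dice: fix values on two or more dimensions
  SELECT f.*
  FROM sales_fact f
  JOIN time_dim t ON f.time_key = t.time_key
  JOIN item_dim i ON f.item_key = i.item_key
  WHERE t.year = 2023 AND i.type = 'Electronics';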

[Fig. 3.10: Typical OLAP operations on multidimensional data]
1.2 Multidimensional data model - Warehouse schema, OLAP Operations (4)

◼ 312 b) Suppose that a data warehouse consists of the three dimensions time, doctor, and patient, and the two measures count and charge, where charge is the fee that a doctor charges a patient for a visit. (8)
  i. Draw a schema diagram for the above data warehouse using a star schema. (4)
  ii. Starting with the base cuboid [day, doctor, patient], what specific OLAP operations should be performed in order to list the total fee collected by each doctor in 2004? (4)
3.1.2 (b)
i. Draw a schema diagram for the above data warehouse using one of the schemas [star, snowflake, fact constellation].

The base cuboid is [day, doctor, patient].
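For reference, one way to write this star schema as DDL. This is a sketch only; dimension attributes beyond the keys are assumed:

  CREATE TABLE time_dim    (time_key INT PRIMARY KEY, day INT, month INT, year INT);
  CREATE TABLE doctor_dim  (doctor_key INT PRIMARY KEY, doctor_name VARCHAR(100), specialty VARCHAR(50));
  CREATE TABLE patient_dim (patient_key INT PRIMARY KEY, patient_name VARCHAR(100), city VARCHAR(50));

  -- Fact table: one row per (day, doctor, patient) visit combination
  CREATE TABLE fee_fact (
      time_key    INT REFERENCES time_dim(time_key),
      doctor_key  INT REFERENCES doctor_dim(doctor_key),
      patient_key INT REFERENCES patient_dim(patient_key),
      visit_count INT,            -- measure: count
      charge      DECIMAL(10,2)   -- measure: fee charged for a visit
  );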
3.1.2 (b)
ii. Starting with the base cuboid [day, doctor, patient], what specific OLAP operations should be performed in order to list the total fee collected by each doctor in 2004?

Given the base cuboid [day, doctor, patient], to list the total fee collected by each doctor in 2004:
1. Slice the data for the year 2004.
2. Roll up the data along the "patient" dimension.
3. Aggregation: sum the "charge" measure for each doctor. This gives the total fee collected by each doctor in 2004. (PTO)
1. Slice the data for the year 2004:
This operation selects only the data for the year 2004 from the "time" dimension. After this operation, the "time" dimension is reduced to the year level, and the cuboid becomes [doctor, patient, year=2004].
2. Roll up the data along the "patient" dimension:
The "patient" dimension is aggregated away to summarize the fee by doctor. This results in the cuboid [doctor, year=2004] with the total charges for each doctor.
3. Aggregation:
After the roll-up operation, perform an aggregation (summing) on the "charge" measure for each doctor. This gives the total fee collected by each doctor in 2004.
Final result: a list of doctors with their corresponding total charge collected in 2004.
Not in Syllabus
To obtain the same list, write an SQL query assuming the data are stored in a relational database with the schema fee(day, month, year, doctor, hospital, patient, count, charge):

  SELECT doctor, SUM(charge)
  FROM fee
  WHERE year = 2004
  GROUP BY doctor;
1.2 Multidimensional data model - Warehouse schema, OLAP Operations (5)
◼ 412 b) Suppose that a data warehouse for a university consists of the following four dimensions: student, course, semester, and instructor, and two measures: count and avg_grade. (8)
  (i) Draw a snowflake schema diagram for the data warehouse.
  (ii) Starting with the base cuboid, what specific OLAP operations should one perform in order to list the average grade of CS courses for each university student?
  (i) a snowflake schema diagram for the data warehouse – 4 marks [marks can be given for the correct steps to the solution]
  (ii) specific OLAP operations one should perform in order to list the average grade of CS courses for each university student – 4 marks
4.1.2 (b)
(i) Draw a snowflake schema diagram for the data warehouse.

The base cuboid is [student, course, semester, instructor].
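A DDL sketch of one possible snowflake design (attribute names and types assumed). The course dimension is normalized so that department sits in its own table, which is what the slice on course.department in part (ii) relies on:

  CREATE TABLE department_dim (dept_key INT PRIMARY KEY, dept_name VARCHAR(50));
  CREATE TABLE course_dim     (course_key INT PRIMARY KEY, course_name VARCHAR(100),
                               dept_key INT REFERENCES department_dim(dept_key));
  CREATE TABLE student_dim    (student_key INT PRIMARY KEY, student_name VARCHAR(100),
                               university VARCHAR(100));
  CREATE TABLE semester_dim   (semester_key INT PRIMARY KEY, semester VARCHAR(20), year INT);
  CREATE TABLE instructor_dim (instructor_key INT PRIMARY KEY, instructor_name VARCHAR(100));

  -- Fact table: one row per (student, course, semester, instructor)
  CREATE TABLE enrollment_fact (
      student_key    INT REFERENCES student_dim(student_key),
      course_key     INT REFERENCES course_dim(course_key),
      semester_key   INT REFERENCES semester_dim(semester_key),
      instructor_key INT REFERENCES instructor_dim(instructor_key),
      enroll_count   INT,           -- measure: count ("count" itself is a reserved word)
      avg_grade      DECIMAL(5,2)   -- measure: avg_grade
  );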
4.1.2 (b)
ii. Starting with the base cuboid [student, course, semester, instructor], what specific OLAP operations (e.g., roll-up from semester to year) should one perform in order to list the average grade of CS courses for each Big University student?
◼ Base cuboid [student, course, semester, instructor]:
Start with the base cuboid, which contains data at the most granular level, including all four dimensions (student, course, semester, and instructor) and the two measures (count and avg_grade).
◼ Base cuboid [student, course, semester, instructor]:
1. Selection: slice operation
Apply a slice operation to filter for CS courses in the "course" dimension (course.department = "CS"). This reduces the dataset to include only data related to Computer Science courses.
2. Projection: drill-down or roll-up
Perform a drill-down operation on the "student" dimension to move from a higher-level aggregation (e.g., university-level or department-level data) to individual student-level data. This ensures that data is listed for individual students.
3. Aggregation:
Use the grade measure to calculate the average grade obtained by each student across all CS courses he/she has registered for. This is achieved by aggregating the data across the other dimensions ("semester" and "instructor").
Final output: a two-dimensional representation with students as rows and their corresponding average grade for CS courses.
Not in Syllabus
Step 1: Roll-up on course from Course_ID to Department
◼ Operation explanation:
  ◼ Aggregate the data from individual courses (e.g., CS101, CS102) to the department level (e.g., CS). This simplifies the dataset to focus on departments rather than specific courses.
◼ Schema after roll-up:

  Student   | Department | Semester    | Instructor | Grade
  Student_A | CS         | Fall 2024   | Prof_X     | 85
  Student_A | CS         | Spring 2024 | Prof_Y     | 90
  Student_B | MATH       | Fall 2024   | Prof_Z     | 78
  Student_C | CS         | Spring 2024 | Prof_X     | 92
Not in Syllabus
Step 2: Dice on course and student with Department = "CS" and University = "BigUniversity"
◼ Operation explanation:
  ◼ Apply a filter to include only the relevant data:
    ◼ Department = "CS" ensures only Computer Science courses are included.
    ◼ University = "BigUniversity" ensures only students from Big University are included.
◼ Schema after dice:

  Student   | Department | Semester    | Instructor | Grade
  Student_A | CS         | Fall 2024   | Prof_X     | 85
  Student_A | CS         | Spring 2024 | Prof_Y     | 90
  Student_C | CS         | Spring 2024 | Prof_X     | 92
Not in Syllabus
Step 3: Drill-down on student from University to Student_Name
◼ Operation explanation:
  ◼ Navigate from the higher University level (e.g., BigUniversity) to the individual Student_Name level (e.g., Student_A, Student_C). This ensures granularity at the student level for analysis.
◼ Schema after drill-down:

  Student   | Department | Semester    | Instructor | Grade
  Student_A | CS         | Fall 2024   | Prof_X     | 85
  Student_A | CS         | Spring 2024 | Prof_Y     | 90
  Student_C | CS         | Spring 2024 | Prof_X     | 92
Not in Syllabus
Step 4: Aggregation on grade
◼ Operation explanation:
  ◼ Group by Student_Name and calculate the average of the Grade measure for all CS courses taken by each student.
◼ Schema after aggregation:

  Student   | Avg_Grade
  Student_A | 87.5
  Student_C | 92.0
Not in Syllabus
Corresponding SQL query:

  SELECT student_id, AVG(grade)
  FROM data_warehouse_table
  WHERE department = 'CS'
  GROUP BY student_id;

This query effectively implements the slice, drill-down, and aggregation operations specified in the OLAP process. (The filter is on the department level of the course dimension, matching step 1 above.)
1.3 Data Warehouse Architecture

◼ 211 a) Explain the three-tier architecture of the data warehouse with a neat figure. (6)
  Figure – 2 marks based on correctness; each tier description – 4 marks
Why a Separate Data Warehouse?
◼ High performance for both systems
◼ DBMS— tuned for OLTP: access methods, indexing, concurrency
control, recovery
◼ Warehouse—tuned for OLAP: complex OLAP queries,
multidimensional view, consolidation
◼ Different functions and different data:
◼ missing data: Decision support requires historical data which
operational DBs do not typically maintain
◼ data consolidation: DS requires consolidation (aggregation,
summarization) of data from heterogeneous sources
◼ data quality: different sources typically use inconsistent data
representations, codes and formats which have to be reconciled
◼ Note: There are more and more systems which perform OLAP
analysis directly on relational databases
Data Warehouse: A Multi-Tiered Architecture
See Section 4.1.4 of “Data Mining: Concepts and Techniques”
1. The bottom tier is a warehouse database server that is generally a
relational database system. Back-end tools and utilities are used to feed
data into the bottom tier from operational databases or other external
sources (e.g., customer profile information provided by external
consultants). These tools and utilities perform data extraction, cleaning,
and transformation (e.g., to merge similar data from different sources
into a unified format), as well as load and refresh functions to update
the data warehouse. The data are extracted using application program
interfaces known as gateways. A gateway is supported by the underlying
DBMS and allows client programs to generate SQL code to be executed
at a server. This tier also contains a metadata repository, which stores
information about the data warehouse and its contents.
2. The middle tier is an OLAP server that is typically implemented using either a ROLAP or a MOLAP model.
3. The top tier is a front-end client layer, which contains query and reporting tools, analysis tools, and data mining tools (e.g., for identifying patterns and trends, prediction, etc.).
Data Warehouse: A Multi-Tiered Architecture

[Figure: operational DBs and other sources feed Extract/Transform/Load/Refresh processes into the data warehouse and data marts (data storage tier, with monitor & integrator and a metadata repository); OLAP servers form the OLAP engine tier; query, reports, analysis, and data mining tools form the front-end tier]
Data Warehouse Models
◼ Enterprise warehouse
  ◼ collects all of the information about subjects spanning the entire organization
◼ Data mart
  ◼ a subset of corporate-wide data that is of value to a specific group of users; its scope is confined to specific, selected groups, such as a marketing data mart
  ◼ independent vs. dependent (sourced directly from the warehouse) data marts
◼ Virtual warehouse
  ◼ a set of views over operational databases
  ◼ only some of the possible summary views may be materialized
Extraction, Transformation, and Loading (ETL)
◼ Data extraction
  ◼ get data from multiple, heterogeneous, and external sources
◼ Data cleaning
  ◼ detect errors in the data and rectify them when possible
◼ Data transformation
  ◼ convert data from legacy or host format to warehouse format
◼ Load
  ◼ sort, summarize, consolidate, compute views, check integrity, and build indices and partitions
◼ Refresh
  ◼ periodic updates from the data sources to the warehouse
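In a SQL-centric pipeline, the transform-and-load step often boils down to an INSERT ... SELECT from a staging area. A minimal sketch with hypothetical staging and dimension tables:

  -- Load: conform staged operational rows to warehouse keys,
  -- cleaning obvious errors on the way in
  INSERT INTO sales_fact (time_key, item_key, location_key, units_sold, dollars_sold)
  SELECT t.time_key,
         i.item_key,
         l.location_key,
         s.qty,
         s.amount_usd                 -- currency already converted in staging
  FROM staging_sales s
  JOIN time_dim t ON t.day = s.sale_day AND t.month = s.sale_month
                 AND t.year = s.sale_year
  JOIN item_dim i ON i.item_name = s.product_name
  JOIN location_dim l ON l.city = s.store_city
  WHERE s.qty > 0;                    -- data cleaning: drop impossible rows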

Metadata Repository
◼ Meta data is the data defining warehouse objects. It stores:
◼ Description of the structure of the data warehouse
◼ schema, view, dimensions, hierarchies, derived data defn, data
mart locations and contents
◼ Operational meta-data
◼ data lineage (history of migrated data and transformation path),
currency of data (active, archived, or purged), monitoring
information (warehouse usage statistics, error reports, audit trails)
◼ The algorithms used for summarization
◼ The mapping from operational environment to the data warehouse
◼ Data related to system performance
◼ warehouse schema, view and derived data definitions
◼ Business data
◼ business terms and definitions, ownership of data, charging policies

OLAP Server Architectures
◼ Relational OLAP (ROLAP)
◼ Use relational or extended-relational DBMS to store and manage
warehouse data and OLAP middleware
◼ Include optimization of DBMS backend, implementation of
aggregation navigation logic, and additional tools and services
◼ ROLAP is suited for environments with large volumes of
transactional data and the need for scalability.
◼ Multidimensional OLAP (MOLAP)
◼ Sparse array-based multidimensional storage engine
◼ Fast indexing to pre-computed summarized data
◼ MOLAP excels at providing fast performance for complex
analytical queries by leveraging pre-aggregated multidimensional
cubes.

OLAP Server Architectures
◼ Hybrid OLAP (HOLAP) (e.g., Microsoft SQLServer)
◼ Flexibility, e.g., low-level: relational, high-level: array
◼ HOLAP combines the best of both worlds, offering flexibility and
scalability by using relational storage for detailed data and
multidimensional cubes for fast aggregation.
◼ Specialized SQL servers (e.g., Redbricks)
◼ Specialized support for SQL queries over star/snowflake schemas.
Implements Query optimization for complex, multidimensional
analytical queries using techniques such as:
◼ Rewriting SQL queries to improve performance.
◼ Indexing: Creating specialized indexes to optimize the retrieval
of multidimensional data from star and snowflake schemas.
◼ Materialized Views: Pre-computing and storing query results to
speed up response times for frequently requested data.
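As an illustration of the materialized-view idea, a pre-aggregated summary can be declared once and refreshed periodically. A sketch in PostgreSQL-style syntax, reusing the earlier sales table names:

  -- Persist the country-by-year summary so frequent queries read
  -- the stored result instead of re-aggregating the fact table
  CREATE MATERIALIZED VIEW sales_by_country_year AS
  SELECT l.country, t.year, SUM(f.dollars_sold) AS total_sales
  FROM sales_fact f
  JOIN location_dim l ON f.location_key = l.location_key
  JOIN time_dim t ON f.time_key = t.time_key
  GROUP BY l.country, t.year;

  -- Re-run as part of the periodic warehouse refresh cycle
  REFRESH MATERIALIZED VIEW sales_by_country_year;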

1.4 Data Warehousing to Data Mining
1.5 Data Mining Concepts and Applications
1.6 Knowledge Discovery in Databases vs. Data Mining

◼ 311 b) Illustrate the various stages in the knowledge discovery process with a diagram. (7)
  Explaining the various stages in the knowledge discovery process – 4 marks; explanation with diagram – 3 marks
◼ 401 a) Explain the knowledge discovery process (KDD) in databases for finding useful information and patterns in data. (3)
◼ 411 a) Explain the knowledge discovery process (KDD) in databases for finding useful information and patterns in data. (7)
Why Data Mining?

◼ The explosive growth of data: from terabytes to petabytes
◼ Data collection and data availability
  ◼ Automated data collection tools, database systems, the Web, a computerized society
◼ Major sources of abundant data
  ◼ Business: Web, e-commerce, transactions, stocks, …
  ◼ Science: remote sensing, bioinformatics, scientific simulation, …
  ◼ Society and everyone: news, digital cameras, YouTube
◼ We are drowning in data, but starving for knowledge!
◼ "Necessity is the mother of invention" - data mining: automated analysis of massive data sets
What Is Data Mining?

◼ Data mining (knowledge discovery from data)
  ◼ Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
  ◼ Data mining: a misnomer?
◼ Alternative names
  ◼ Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
◼ Watch out: is everything "data mining"?
  ◼ Simple search and query processing
  ◼ (Deductive) expert systems
Knowledge Discovery (KDD) Process

◼ This is a view from typical database systems and data warehousing communities
◼ Data mining plays an essential role in the knowledge discovery process

[Figure: databases → data cleaning and data integration → data warehouse → selection → task-relevant data → data mining → pattern evaluation]
The knowledge discovery process consists of the following steps:
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the database)
4. Data transformation (where data are transformed and consolidated into forms appropriate for mining by performing summary or aggregation operations)
5. Data mining (an essential process where intelligent methods are applied to extract data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge, based on interestingness measures)
7. Knowledge presentation (visualization and knowledge representation techniques are used to present mined knowledge to users)
Steps 1 through 4 are different forms of data preprocessing, where data are prepared for mining. The data mining step may interact with the user or a knowledge base. The interesting patterns are presented to the user and may be stored as new knowledge in the knowledge base.
Example: A Web Mining Framework

◼ Web mining usually involves
  ◼ Data cleaning
  ◼ Data integration from multiple sources
  ◼ Warehousing the data
  ◼ Data cube construction
  ◼ Data selection for data mining
  ◼ Data mining
  ◼ Presentation of the mining results
  ◼ Patterns and knowledge to be used or stored in a knowledge base
Applications of Data Mining
◼ Web page analysis: from web page classification, clustering to
PageRank & HITS algorithms
◼ Collaborative analysis & recommender systems
◼ Basket data analysis to targeted marketing
◼ Biological and medical data analysis: classification, cluster analysis
(microarray data analysis), biological sequence analysis, biological
network analysis
◼ Data mining and software engineering (e.g., IEEE Computer, Aug.
2009 issue)
◼ From major dedicated data mining systems/tools (e.g., SAS, MS SQL-
Server Analysis Manager, Oracle Data Mining Tools) to invisible data
mining

Data Mining in Business Intelligence

[Figure: pyramid of layers with increasing potential to support business decisions, and the typical user at each layer]
  Decision making - end user
  Data presentation, visualization techniques - business analyst
  Data mining, information discovery - data analyst
  Data exploration: statistical summary, querying, and reporting - data analyst
  Data preprocessing/integration, data warehouses - DBA
  Data sources: paper, files, web documents, scientific experiments, database systems
1.7 Architecture of a typical data mining system

◼ https://www.javatpoint.com/data-mining-architecture
Architecture of a typical data mining system

Data sources:
◼ The sources of data include databases, data warehouses, the World Wide Web (WWW), text files, and other documents.
ETL Process

◼ Extract: the data comes from multiple, heterogeneous, and external sources
◼ Transform:
  ◼ Detect errors in the data and rectify them where possible
  ◼ Convert all data to the warehouse format
  ◼ Sort, summarize, consolidate, compute views, check integrity, and build indices and partitions
◼ Then load the data
◼ Refresh: periodic updates from the data sources to the warehouse are necessary to keep the data in sync
Architecture of a typical data mining system

Database or data warehouse server:
◼ The database or data warehouse server is responsible for fetching the relevant data, based on the user's data mining request.
Data mining engine:
◼ The data mining engine is a major component of any data mining system.
◼ It contains several modules for performing data mining tasks, including characterization, association, classification, clustering, prediction, time-series analysis, and outlier analysis.
Architecture of a typical data mining system

Pattern evaluation module:
◼ The pattern evaluation module collaborates with the data mining engine to focus the search on patterns of interest.
User interface:
◼ The graphical user interface (GUI) module communicates between the data mining system and the user.
◼ This module helps the user easily and efficiently use the system without knowing the complexity of the process.
◼ It cooperates with the data mining system when the user specifies a query or a task, and displays the results.
Architecture of a typical data mining system

Knowledge base:
◼ The knowledge base is used to guide the search and to evaluate the interestingness of the resulting patterns.
◼ It may even contain user views and data from user experiences that can be helpful in the data mining process.
◼ The data mining engine may receive inputs from the knowledge base to make the results more accurate and reliable.
◼ The pattern evaluation module regularly interacts with the knowledge base to get inputs, and also to update it.
1.8 Data Mining Functionalities

◼ 212 a) List and explain various data mining functionalities. (6)
  Data mining functionalities - class/concept description, mining frequent patterns, classification and regression, clustering, outlier analysis – 1 mark each with description
◼ 412 a) Describe different issues in data mining. (6)
  Any six issues in data mining – 1 mark each
Data Mining Function: (1) Generalization
◼ Data entries can be associated with classes or concepts, such as "computers" or "printers" for items and "big spenders" or "budget spenders" for customers.
◼ Descriptions of these classes or concepts, known as class/concept descriptions, can be derived using data characterization and data discrimination.
◼ Data characterization involves summarizing the general characteristics or features of a target class (e.g., software products with a 10% sales increase).
◼ Data discrimination focuses on comparing the general features of a target class against one or more contrasting classes (e.g., comparing software products whose sales increased versus those whose sales decreased).
Data Mining Function: (2) Association
◼ Frequent patterns (or frequent itemsets)
◼ What items are frequently purchased together in your
Walmart?
◼ Association, correlation vs. causality
◼ A typical association rule
◼ Milk → Bread [0.5%, 75%] (support, confidence)
◼ Are strongly associated items also strongly correlated?
◼ How to mine such patterns and rules efficiently in large
datasets?
◼ How to use such patterns for classification, clustering, and
other applications?
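Support and confidence for a single candidate rule can be checked directly in SQL. A sketch against a hypothetical transactions(tid, item) table:

  -- support(Milk -> Bread)    = P(Milk and Bread among all transactions)
  -- confidence(Milk -> Bread) = P(Bread | Milk)
  WITH milk AS (
      SELECT DISTINCT tid FROM transactions WHERE item = 'Milk'
  ), milk_and_bread AS (
      SELECT DISTINCT t.tid
      FROM transactions t
      JOIN milk m ON t.tid = m.tid
      WHERE t.item = 'Bread'
  )
  SELECT (SELECT COUNT(*) FROM milk_and_bread) * 1.0
           / (SELECT COUNT(DISTINCT tid) FROM transactions) AS support,
         (SELECT COUNT(*) FROM milk_and_bread) * 1.0
           / (SELECT COUNT(*) FROM milk) AS confidence;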

Data Mining Function: (3) Classification

◼ Classification (supervised learning)
  ◼ Construct models (functions) based on some training examples
  ◼ Describe and distinguish classes or concepts for future prediction
    ◼ E.g., classify countries based on (climate), or classify cars based on (gas mileage)
  ◼ Predict unknown class labels
◼ Typical methods
  ◼ Decision trees, naïve Bayesian classification, support vector machines, neural networks, rule-based classification, pattern-based classification, logistic regression, …
◼ Typical applications
  ◼ Credit card fraud detection, direct marketing, classifying stars, diseases, web-pages, …
Data Mining Function: (4) Cluster Analysis

◼ Unsupervised learning (i.e., the class label is unknown)
◼ Group data to form new categories (i.e., clusters), e.g., cluster houses to find distribution patterns
◼ Principle: maximize intra-class similarity and minimize inter-class similarity
◼ Methods
  ◼ Hierarchical, partitioning, density-based, and grid-based methods
  ◼ Clustering graphs and network data
◼ Applications
  ◼ Customer segmentation, social network analysis, recommendation systems, information retrieval, information security, computational biology
  ◼ Data stream clustering: network intrusion detection, transaction streams, phone records, web click-streams, weather monitoring
Data Mining Function: (5) Outlier Analysis

◼ Outlier: a data object that does not comply with the general behavior of the data
◼ Noise or exception?
◼ Methods: statistics, clustering, classification, regression analysis, …
◼ Useful in fraud detection and rare-event analysis
Data Mining Function: (6) Trend Analysis

◼ Identifies patterns over time, including trend shifts and sequential patterns.
◼ Example: stock market analysis to predict future price movements based on historical data.
Data Mining Function: (7) Data Visualization

◼ Presents mined patterns using graphs, charts, and interactive tools for better interpretation.
◼ Example: heatmaps of website user activity, highlighting frequently clicked areas.
1.9 Data Mining Issues
◼ 412 a) Describe different issues in data mining. (6)
  Any six issues in data mining – 1 mark each
Major Issues in Data Mining

Refer: Section 1.7, Major Issues in Data Mining, of "Data Mining: Concepts and Techniques", Jiawei Han
1. Diversity of data types
  ◼ Handling complex types of data
  ◼ Mining dynamic, networked, and global data repositories
2. Data quality issues
  ◼ Handling noise, uncertainty, and incompleteness of data
3. Data preprocessing
  ◼ Cleaning, integration, and transformation
Major Issues in Data Mining

4. Efficiency and scalability
  ◼ Mining knowledge in multi-dimensional space
  ◼ Efficiency and scalability of data mining algorithms
  ◼ Parallel, distributed, stream, and incremental mining methods
5. Mining methodology
  ◼ Mining various and new kinds of knowledge
  ◼ Data mining: an interdisciplinary effort
  ◼ Boosting the power of discovery in a networked environment
  ◼ Pattern evaluation and pattern- or constraint-guided mining
Major Issues in Data Mining

6. Data mining and society
  ◼ Social impacts of data mining
  ◼ Privacy-preserving data mining
  ◼ Invisible data mining (recommendation systems, smart assistants, fraud detection, etc.)
7. User interaction
  ◼ Interactive mining
  ◼ Incorporation of background knowledge
  ◼ Presentation and visualization of data mining results
