Lecture 3.2.1 3.2.2

The document outlines the course objectives and outcomes for a Data Mining and Warehousing course, focusing on cluster analysis and various data types used in this analysis. It covers the importance of clustering methods, their applications, and the requirements for effective clustering in data mining. Additionally, it includes references for further reading on data mining concepts and techniques.

Uploaded by

Anshul Kunwar


APEX INSTITUTE OF TECHNOLOGY

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

Data Mining and Warehousing (22CSH-380)

Faculty: Dr. Preeti Khera (E16576)

Lecture – 3.2.1 & 3.2.2

Cluster Analysis: Data types in cluster analysis

DISCOVER . LEARN . EMPOWER

Data Mining and Warehousing : Course Objectives

COURSE OBJECTIVES
The Course aims to:

1. Develop an understanding of the key concepts of data mining and of how to extract
useful characteristics from data using data pre-processing techniques.
2. Demonstrate methods to apply and analyze relevant attributes, perform statistical
measures to look for meaningful variation in data, and mine association rules for
transactional datasets.
3. Teach the use and application of data mining techniques such as classification, decision
trees, neural networks, and back propagation in various applications.

COURSE OUTCOMES
On completion of this course, the students shall be able to:-

CO1: Understand the concept of data mining and the usage of various tools for data warehousing and data mining.
CO2: Demonstrate the strengths and weaknesses of different methods of meaningful data mining.
CO3: Apply association rule, classification, and clustering algorithms for large data sets.
CO4: Evaluate and employ correct data mining techniques depending on characteristics of the dataset.
CO5: Verify and formulate the performance of various data mining techniques according to the dataset.

Unit-3 Syllabus

Unit-3
What is Classification & Prediction, Issues regarding Classification and prediction,
Decision tree, Bayesian Classification, Classification by Back propagation,
Multilayer feed-forward Neural Network, Back propagation Algorithm,
Classification methods K-nearest neighbor classifiers, Genetic Algorithm.
Cluster Analysis: Data types in cluster analysis, Categories of clustering methods,
Partitioning methods. Hierarchical Clustering- CURE and Chameleon. Density Based
Methods-DBSCAN, OPTICS. Grid Based Methods- STING, CLIQUE.
Model Based Method –Statistical Approach, Neural Network approach, Outlier
Analysis

Table of Contents
• Cluster Analysis
• Data types in Cluster Analysis
Cluster Analysis
• Cluster: a collection of data objects
• Similar to one another within the same cluster
• Dissimilar to the objects in other clusters
• Cluster analysis
• Finding similarities between data according to the
characteristics found in the data and grouping similar data
objects into clusters
• Unsupervised learning: no predefined classes
• Typical applications
• As a stand-alone tool to get insight into data distribution
• As a preprocessing step for other algorithms

June 1, 2025 Data Mining: Concepts and Techniques 6
Clustering: Rich Applications and
Multidisciplinary Efforts

• Pattern Recognition
• Spatial Data Analysis
• Create thematic maps in GIS by clustering feature spaces
• Detect spatial clusters or for other spatial mining tasks
• Image Processing
• Economic Science (especially market research)
• WWW
• Document classification
• Cluster Weblog data to discover groups of similar access
patterns
Examples of Clustering Applications

• Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
• Land use: Identification of areas of similar land use in an earth observation database
• Insurance: Identifying groups of motor insurance policy holders with a high average claim cost
• City planning: Identifying groups of houses according to their house type, value, and geographical location
• Earthquake studies: Observed earthquake epicenters should be clustered along continental faults
Quality: What Is Good Clustering?

• A good clustering method will produce high quality clusters with

• high intra-class similarity
• low inter-class similarity
• The quality of a clustering result depends on both the similarity
measure used by the method and its implementation
• The quality of a clustering method is also measured by its ability
to discover some or all of the hidden patterns


Measure the Quality of Clustering

• Dissimilarity/similarity metric: similarity is expressed in terms of a distance function d(i, j), which is typically a metric
• There is a separate “quality” function that measures the “goodness” of a cluster.
• The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, ratio, and vector variables.
• Weights should be associated with different variables based on applications and data semantics.
• It is hard to define “similar enough” or “good enough”
• The answer is typically highly subjective.
Requirements of Clustering in Data Mining

• Scalability
• Ability to deal with different types of attributes
• Ability to handle dynamic data
• Discovery of clusters with arbitrary shape
• Minimal requirements for domain knowledge to determine input
parameters
• Able to deal with noise and outliers
• Insensitive to order of input records
• High dimensionality
• Incorporation of user-specified constraints
• Interpretability and usability
Data Structures

• Data matrix (two modes)

$$
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots & & \vdots & & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots & & \vdots & & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
$$

• Dissimilarity matrix (one mode)

$$
\begin{bmatrix}
0 \\
d(2,1) & 0 \\
d(3,1) & d(3,2) & 0 \\
\vdots & \vdots & \vdots \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
$$
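The relationship between the two structures can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `dissimilarity_matrix`, the sample `data_matrix`, and the choice of Euclidean distance for d(i, j) are all assumptions made for the example.

```python
from math import sqrt


def dissimilarity_matrix(data):
    """Build the lower-triangular n x n dissimilarity matrix of an n x p data matrix.

    Euclidean distance is used for d(i, j) purely for illustration.
    """
    rows = []
    for i in range(len(data)):
        row = []
        for j in range(i + 1):  # only j <= i, so the diagonal entry is d(i, i) = 0
            row.append(sqrt(sum((a - b) ** 2 for a, b in zip(data[i], data[j]))))
        rows.append(row)
    return rows


# Hypothetical data matrix: n = 3 objects (rows), p = 2 variables (columns).
data_matrix = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]

for row in dissimilarity_matrix(data_matrix):
    print(row)
```

Only the lower triangle is stored because d(i, j) = d(j, i), which is why the slide shows the dissimilarity matrix as a one-mode, triangular structure.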

Type of data in clustering analysis

• Interval-scaled variables

• Binary variables

• Nominal, ordinal, and ratio variables

• Variables of mixed types


Interval-valued variables

• Standardize data
• Calculate the mean absolute deviation:

$$ s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right) $$

where

$$ m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right) $$

• Calculate the standardized measurement (z-score)

$$ z_{if} = \frac{x_{if} - m_f}{s_f} $$

• Using mean absolute deviation is more robust than using standard deviation
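The standardization steps above can be sketched as follows. This is an illustrative example only: the function name `standardize` and the sample `ages` values are hypothetical, and the sketch assumes the variable is not constant (otherwise s_f would be zero).

```python
def standardize(values):
    """Standardize one variable f using the mean absolute deviation s_f."""
    n = len(values)
    m_f = sum(values) / n                         # mean of variable f
    s_f = sum(abs(x - m_f) for x in values) / n   # mean absolute deviation
    # Assumes s_f != 0, i.e. the variable takes at least two distinct values.
    z = [(x - m_f) / s_f for x in values]
    return m_f, s_f, z


# Hypothetical measurements for one variable; 110.0 plays the role of an outlier.
ages = [20.0, 30.0, 40.0, 50.0, 110.0]
m_f, s_f, z = standardize(ages)
print(m_f, s_f)
print(z)
```

Because the deviations are not squared, the outlier inflates s_f less than it would inflate the standard deviation, so its z-score stays comparatively large and easier to spot.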
Similarity and Dissimilarity
Between Objects

• Distances are normally used to measure the similarity or dissimilarity between two data objects
• Some popular ones include: Minkowski distance:

$$ d(i,j) = \sqrt[q]{|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q} $$

where i = (x_{i1}, x_{i2}, …, x_{ip}) and j = (x_{j1}, x_{j2}, …, x_{jp}) are two p-dimensional data objects, and q is a positive integer
• If q = 1, d is Manhattan distance

$$ d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}| $$
Similarity and Dissimilarity
Between Objects (Cont.)

• If q = 2, d is Euclidean distance:

$$ d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2} $$

• Properties
• d(i,j) ≥ 0
• d(i,i) = 0
• d(i,j) = d(j,i)
• d(i,j) ≤ d(i,k) + d(k,j)
• Also, one can use weighted distance, parametric Pearson product moment correlation, or other dissimilarity measures
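A single function covers the Minkowski family, with Manhattan (q = 1) and Euclidean (q = 2) as special cases. The sketch below is illustrative: the function name `minkowski` and the sample points are assumptions, not part of the slides.

```python
def minkowski(x, y, q):
    """Minkowski distance of order q between two p-dimensional objects."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of variables")
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


# Hypothetical 2-dimensional objects.
i = (1.0, 2.0)
j = (4.0, 6.0)
k = (2.0, 2.0)

print(minkowski(i, j, 1))  # Manhattan: |1 - 4| + |2 - 6|
print(minkowski(i, j, 2))  # Euclidean: sqrt(3^2 + 4^2)

# The triangle inequality d(i, j) <= d(i, k) + d(k, j) holds for this example.
print(minkowski(i, j, 2) <= minkowski(i, k, 2) + minkowski(k, j, 2))
```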

Binary Variables
• A contingency table for binary data

                     Object j
                  1        0        sum
Object i    1     a        b        a + b
            0     c        d        c + d
          sum     a + c    b + d    p

• Distance measure for symmetric binary variables:

$$ d(i,j) = \frac{b + c}{a + b + c + d} $$

• Distance measure for asymmetric binary variables:

$$ d(i,j) = \frac{b + c}{a + b + c} $$

• Jaccard coefficient (similarity measure for asymmetric binary variables):

$$ sim_{Jaccard}(i,j) = \frac{a}{a + b + c} $$
Dissimilarity between Binary Variables

• Example

Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack   M        Y       N       P        N        N        N
Mary   F        Y       N       P        N        P        N
Jim    M        Y       P       N        N        N        N

• gender is a symmetric attribute
• the remaining attributes are asymmetric binary
• let the values Y and P be set to 1, and the value N be set to 0

$$ d(\text{jack}, \text{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33 $$

$$ d(\text{jack}, \text{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67 $$

$$ d(\text{jim}, \text{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75 $$
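The three distances in this example can be checked with a short script. It is a sketch under the slide's own encoding (Y and P become 1, N becomes 0) and uses only the six asymmetric attributes, so gender is left out. The helper names `contingency_counts`, `asymmetric_binary_distance`, and `jaccard_similarity` are chosen for this illustration.

```python
def contingency_counts(x, y):
    """Count a (1,1), b (1,0), c (0,1) and d (0,0) over paired binary values."""
    a = sum(1 for u, v in zip(x, y) if (u, v) == (1, 1))
    b = sum(1 for u, v in zip(x, y) if (u, v) == (1, 0))
    c = sum(1 for u, v in zip(x, y) if (u, v) == (0, 1))
    d = sum(1 for u, v in zip(x, y) if (u, v) == (0, 0))
    return a, b, c, d


def asymmetric_binary_distance(x, y):
    """d(i, j) = (b + c) / (a + b + c); negative matches d are ignored."""
    a, b, c, _ = contingency_counts(x, y)
    return (b + c) / (a + b + c)  # undefined when a + b + c == 0


def jaccard_similarity(x, y):
    """sim(i, j) = a / (a + b + c), the complement of the distance above."""
    a, b, c, _ = contingency_counts(x, y)
    return a / (a + b + c)


# Fever, Cough, Test-1, Test-2, Test-3, Test-4 with Y/P -> 1 and N -> 0.
patients = {
    "Jack": [1, 0, 1, 0, 0, 0],
    "Mary": [1, 0, 1, 0, 1, 0],
    "Jim": [1, 1, 0, 0, 0, 0],
}

for first, second in [("Jack", "Mary"), ("Jack", "Jim"), ("Jim", "Mary")]:
    dist = asymmetric_binary_distance(patients[first], patients[second])
    print(first, second, round(dist, 2))
```

The smallest distance is between Jack and Mary, which suggests they are the most likely pair to share a similar condition, while Jim and Mary are the least alike.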
Nominal Variables

• A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green
• Method 1: Simple matching
• m: # of matches, p: total # of variables

$$ d(i,j) = \frac{p - m}{p} $$

• Method 2: use a large number of binary variables
• creating a new binary variable for each of the M nominal states
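Both methods can be sketched briefly. The objects, attribute values, and function names (`simple_matching_distance`, `one_hot`) below are hypothetical and serve only to illustrate the two approaches.

```python
def simple_matching_distance(x, y):
    """Method 1: d(i, j) = (p - m) / p, where m is the number of matching variables."""
    p = len(x)
    m = sum(1 for u, v in zip(x, y) if u == v)
    return (p - m) / p


def one_hot(value, states):
    """Method 2: one new binary variable per nominal state."""
    return [1 if value == state else 0 for state in states]


# Hypothetical objects described by p = 3 nominal variables.
obj_i = ("red", "small", "round")
obj_j = ("red", "large", "round")
print(simple_matching_distance(obj_i, obj_j))  # one mismatch out of three variables

# A colour variable with M = 4 states becomes four binary variables.
colours = ["red", "yellow", "blue", "green"]
print(one_hot("blue", colours))
```

After the one-hot step, the binary-variable measures from the previous slides can be applied to the new columns.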

Ordinal Variables

• An ordinal variable can be discrete or continuous
• Order is important, e.g., rank
• Can be treated like interval-scaled
• replace x_{if} by their rank $r_{if} \in \{1, \ldots, M_f\}$
• map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by

$$ z_{if} = \frac{r_{if} - 1}{M_f - 1} $$

• compute the dissimilarity using methods for interval-scaled variables
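The rank mapping can be illustrated with a small example. The state names and the function name `ordinal_to_interval` are hypothetical, and the sketch assumes the variable has at least two ordered states so that M_f - 1 is not zero.

```python
def ordinal_to_interval(values, ordered_states):
    """Map ordinal values onto [0, 1] via z = (r - 1) / (M_f - 1)."""
    # Ranks run from 1 (lowest state) to M_f (highest state).
    rank = {state: r for r, state in enumerate(ordered_states, start=1)}
    m_f = len(ordered_states)  # assumes m_f >= 2
    return [(rank[v] - 1) / (m_f - 1) for v in values]


# Hypothetical ordinal variable with M_f = 3 ordered states.
states = ["bronze", "silver", "gold"]
observed = ["gold", "bronze", "silver", "gold"]
print(ordinal_to_interval(observed, states))
```

The resulting values can then be fed to any interval-scaled distance, such as the Manhattan or Euclidean distance.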
Ratio-Scaled Variables

• Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at exponential scale, such as $Ae^{Bt}$ or $Ae^{-Bt}$
• Methods:
• treat them like interval-scaled variables: not a good choice! (Why? The scale can be distorted.)
• apply logarithmic transformation

$$ y_{if} = \log(x_{if}) $$

• treat them as continuous ordinal data and treat their ranks as interval-scaled
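A short example shows why the logarithmic transformation helps. The constants A = 2 and B = 0.5 and the function name `log_transform` are assumptions made for this sketch. On the raw scale the gaps between consecutive measurements keep growing, while after the transformation they are constant and equal to B.

```python
import math


def log_transform(values):
    """Apply y = log(x) to positive ratio-scaled measurements."""
    return [math.log(x) for x in values]


# Hypothetical ratio-scaled variable x = A * e^(B * t) with A = 2 and B = 0.5.
A, B = 2.0, 0.5
raw = [A * math.exp(B * t) for t in range(4)]
transformed = log_transform(raw)

raw_gaps = [b - a for a, b in zip(raw, raw[1:])]
log_gaps = [b - a for a, b in zip(transformed, transformed[1:])]
print(raw_gaps)  # increasing gaps on the original scale
print(log_gaps)  # approximately constant gaps of B after the transformation
```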
Variables of Mixed Types
• A database may contain all the six types of variables
• symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio
• One may use a weighted formula to combine their effects

$$ d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}} $$

• the indicator $\delta_{ij}^{(f)}$ is 0 if $x_{if}$ or $x_{jf}$ is missing, or if $x_{if} = x_{jf} = 0$ and f is asymmetric binary; otherwise it is 1
• f is binary or nominal: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, or $d_{ij}^{(f)} = 1$ otherwise
• f is interval-based: use the normalized distance
• f is ordinal or ratio-scaled
• compute ranks $r_{if}$ and

$$ z_{if} = \frac{r_{if} - 1}{M_f - 1} $$

• and treat $z_{if}$ as interval-scaled
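The weighted formula can be sketched as below. Several details are assumptions for this illustration rather than statements from the slide: the indicator δ is taken to be 0 for missing values and for 0/0 matches on asymmetric binary variables (the usual textbook convention), the "normalized distance" for interval variables is taken to be the absolute difference divided by the variable's range, and ordinal or ratio-scaled values are assumed to have already been mapped to numbers. The names `mixed_dissimilarity`, `kinds`, and `ranges` are hypothetical.

```python
def mixed_dissimilarity(x, y, kinds, ranges):
    """Combine per-variable dissimilarities d_f, weighted by an indicator delta_f."""
    numerator = 0.0
    denominator = 0.0
    for f, kind in enumerate(kinds):
        xf, yf = x[f], y[f]
        if xf is None or yf is None:
            continue  # delta = 0: a value is missing
        if kind == "asymmetric_binary" and xf == 0 and yf == 0:
            continue  # delta = 0: negative match carries no information
        if kind in ("nominal", "symmetric_binary", "asymmetric_binary"):
            d_f = 0.0 if xf == yf else 1.0
        else:
            # "interval", or ordinal/ratio values already converted to numbers.
            # Assumed normalization: divide by the variable's range (max - min).
            d_f = abs(xf - yf) / ranges[f]
        numerator += d_f
        denominator += 1.0
    return numerator / denominator  # undefined if every delta is 0


# Hypothetical objects with four variables of different types.
kinds = ["nominal", "asymmetric_binary", "interval", "ordinal"]
ranges = [None, None, 50.0, 1.0]  # range only matters for numeric variables
obj_i = ["red", 1, 30.0, 0.0]
obj_j = ["blue", 0, 40.0, 0.5]
print(mixed_dissimilarity(obj_i, obj_j, kinds, ranges))
```

For the sample objects the per-variable dissimilarities are 1, 1, 0.2 and 0.5, so the combined value is their average over the four contributing variables.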
Vector Objects

• Vector objects: keywords in documents, gene features in micro-arrays, etc.
• Broad applications: information retrieval, biological taxonomy, etc.
• Cosine measure
• A variant: Tanimoto coefficient
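The formulas for these two measures did not survive in the slide text. The sketch below uses their standard definitions, which are assumed here rather than taken from the slide: cosine similarity is x·y / (‖x‖ ‖y‖), and the Tanimoto coefficient is x·y / (x·x + y·y − x·y). The sample vectors and function names are hypothetical.

```python
from math import sqrt


def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def cosine_similarity(x, y):
    """Cosine of the angle between x and y (assumes neither is the zero vector)."""
    return dot(x, y) / (sqrt(dot(x, x)) * sqrt(dot(y, y)))


def tanimoto_coefficient(x, y):
    """Shared weight divided by the total weight held by either vector."""
    xy = dot(x, y)
    return xy / (dot(x, x) + dot(y, y) - xy)


# Hypothetical keyword-presence vectors for two documents.
doc_a = [1, 1, 0, 0]
doc_b = [1, 1, 1, 1]
print(cosine_similarity(doc_a, doc_b))
print(tanimoto_coefficient(doc_a, doc_b))
```

For binary vectors such as these, the Tanimoto coefficient reduces to the Jaccard coefficient: shared keywords divided by keywords present in either document.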

Summary
• Cluster Analysis
• Clustering requirements
• Data types in cluster analysis

Assignment
• Discuss the various data types used in cluster analysis, with examples.
• Discuss the various requirements of clustering in data mining.
• Explain the concept of clustering.

References
TEXT BOOKS
T1: Tan, Steinbach and Vipin Kumar. Introduction to Data Mining, Pearson Education, 2016.
T2: Zaki MJ, Meira Jr W, Meira W. Data mining and machine learning: Fundamental concepts and algorithms.
Cambridge University Press; 2020 Jan 30.
T3: King RS. Cluster analysis and data mining: An introduction. Mercury Learning and Information; 2015 May
12.

REFERENCE BOOKS
R1: Pei, Han and Kamber. Data Mining: Concepts and Techniques, Elsevier, 2011.
R2: Halgamuge SK, Wang L, editors. Classification and clustering for knowledge discovery. Springer Science
& Business Media; 2005 Sep 2.
R3: Bhatia P. Data mining and data warehousing: principles and practical techniques. Cambridge University
Press; 2019 Jun 27.

JOURNALS
• https://www.igi-global.com/journal/international-journal-data-warehousing-mining/1085
• https://www.springer.com/journal/41060
• https://link.springer.com/journal/10618

RESEARCH PAPERS
• Alasadi SA, Bhaya WS. Review of data preprocessing techniques in data mining. Journal of Engineering and Applied
Sciences. 2017 Sep;12(16):4102-7.
• Freitas AA. A survey of evolutionary algorithms for data mining and knowledge discovery. In Advances in evolutionary
computing: theory and applications 2003 Jan 1 (pp. 819-845). Berlin, Heidelberg: Springer Berlin Heidelberg.
• Kumbhare TA, Chobe SV. An overview of association rule mining algorithms. International Journal of Computer
Science and Information Technologies. 2014 Feb;5(1):927-30.
• Srivastava S. Weka: a tool for data preprocessing, classification, ensemble, clustering and association rule mining.
International Journal of Computer Applications. 2014 Jan 1;88(10).
• Dol SM, Jawandhiya PM. Classification technique and its combination with clustering and association rule mining in
educational data mining—A survey. Engineering Applications of Artificial Intelligence. 2023 Jun 1; 122:106071.

• WEB LINK
https://medium.com/@palshuvam007/types-of-data-in-cluster-analysis-85eb83ea3d9f

• VIDEO LINK
https://youtu.be/93GNQajqJh0
THANK YOU

For queries
Email: [email protected]
