Cluster Analysis: G Sreenivas

Cluster analysis is the process of grouping data into clusters so that objects within a cluster are similar to each other and dissimilar to objects in other clusters. There are two main types of clustering methods: partitioning methods which construct various partitions and evaluate them, such as k-means clustering; and hierarchical methods which create a hierarchical decomposition of the data using some criterion, such as agglomerative nesting (AGNES) and divisive analysis (DIANA). The quality of clustering is measured by high intra-class similarity and low inter-class similarity.

Cluster analysis

G SREENIVAS
Cluster Analysis

● What is Cluster Analysis?
● Types of Data in Cluster Analysis
● A Categorization of Major Clustering Methods
● Partitioning Methods
● Hierarchical Methods
What is Cluster Analysis?

● Clustering:
  Clustering is the process of grouping a data set so that
  the similarity between objects within a cluster is
  maximized while the similarity between objects of
  different clusters is minimized.

● Cluster:
  A cluster is a collection of data objects that are similar
  to one another within the same cluster and dissimilar
  to the objects in other clusters.
What Is Good Clustering?

● A good clustering method will produce high-quality clusters with
  ○ high intra-class similarity
  ○ low inter-class similarity
● The quality of a clustering result depends on both the similarity
  measure used by the method and its implementation.
● The quality of a clustering method is also measured by its ability
  to discover some or all of the hidden patterns.
Data Structures
● Most main-memory-based clustering algorithms operate on one of
  the two following data structures.
● Data matrix (object-by-variable structure):
  an n × p matrix whose n rows are the objects and whose p columns
  are the variables.
● Dissimilarity matrix (object-by-object structure):
  an n × n matrix whose entry (i, j) stores the dissimilarity d(i, j)
  between objects i and j; since d(i, j) = d(j, i) and d(i, i) = 0,
  only the lower triangle is needed.
Measure the Quality of Clustering

● Dissimilarity/Similarity metric:
  similarity is expressed in terms of a distance function,
  which is typically a metric: d(i, j).
● There is a separate “quality” function that measures the
  “goodness” of a cluster.
● Weights should be associated with different variables
  based on the application and data semantics.
● It is hard to define “similar enough” or “good enough”
  ○ the answer is typically highly subjective.
Similarity and Dissimilarity Between Objects

● Distances are normally used to measure the similarity or
  dissimilarity between two data objects.
● A popular one is the Minkowski distance:

    d(i, j) = ( |x_i1 − x_j1|^q + |x_i2 − x_j2|^q + … + |x_ip − x_jp|^q )^(1/q)

  where i = (x_i1, x_i2, …, x_ip) and j = (x_j1, x_j2, …, x_jp) are
  two p-dimensional data objects, and q is a positive integer.
● If q = 1, d is the Manhattan distance:

    d(i, j) = |x_i1 − x_j1| + |x_i2 − x_j2| + … + |x_ip − x_jp|

● If q = 2, d is the Euclidean distance:

    d(i, j) = sqrt( |x_i1 − x_j1|² + |x_i2 − x_j2|² + … + |x_ip − x_jp|² )

● Properties
  ○ d(i, j) ≥ 0
  ○ d(i, i) = 0
  ○ d(i, j) = d(j, i)
  ○ d(i, j) ≤ d(i, k) + d(k, j)
● One can also use a weighted distance, the parametric Pearson
  product-moment correlation, or other dissimilarity measures.
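As a minimal illustration, the Minkowski distance and its q = 1 (Manhattan) and q = 2 (Euclidean) special cases can be computed as follows; the two sample points are made up for illustration:

```python
def minkowski(i, j, q):
    # d(i, j) = (sum over features f of |x_if - x_jf|^q)^(1/q)
    return sum(abs(a - b) ** q for a, b in zip(i, j)) ** (1.0 / q)

i, j = (1.0, 2.0), (4.0, 6.0)
print(minkowski(i, j, 1))  # Manhattan: |1-4| + |2-6| = 7.0
print(minkowski(i, j, 2))  # Euclidean: sqrt(3^2 + 4^2) = 5.0
```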
Finding a Centroid
Using the following equation, we can find the centroid of k
n-dimensional points:

    centroid = ( (x_11 + x_21 + … + x_k1)/k, …, (x_1n + x_2n + … + x_kn)/k )

i.e., the mean of each coordinate across the k points.

Let’s find the centroid of three 2-D points, say (2,4), (5,2), (8,9):

    centroid = ( (2+5+8)/3, (4+2+9)/3 ) = (5, 5)
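The same coordinate-wise mean can be sketched in a few lines of Python, using the slide’s three example points:

```python
def centroid(points):
    # Mean of each coordinate over the k points.
    k = len(points)
    return tuple(sum(coord) / k for coord in zip(*points))

print(centroid([(2, 4), (5, 2), (8, 9)]))  # (5.0, 5.0)
```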
Major Clustering Approaches

● Partitioning algorithms:
  construct various partitions and then evaluate them by
  some criterion
  ○ K-means, K-medoids
● Hierarchical algorithms:
  create a hierarchical decomposition of the set of data (or
  objects) using some criterion
  ○ CURE, Chameleon, BIRCH
The K-Means Clustering Method
● K-means algorithm:
● Input: the number of clusters k and a database consisting
  of n objects.
● Output: a set of k clusters.
● 1. Arbitrarily choose k objects as the initial cluster centers.
● 2. Repeat
  ■ (re)assign each object to the cluster to which the object
    is most similar, based on the mean value of the objects
    in the cluster;
  ■ update the cluster means, i.e., calculate the mean value
    of the objects for each cluster.
● 3. Until no change.
The K-Means Clustering Method
● The process iterates until the criterion function converges:

    E = Σ_{i=1..k} Σ_{p ∈ C_i} |p − m_i|²

● E is the sum of squared error over all objects in the
  database, p is a point in space, and m_i is the mean of
  cluster C_i.
● The algorithm tries to determine k partitions that minimize
  the squared-error function.
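The steps above can be sketched in plain Python. This is a minimal sketch, not the book's reference implementation; the six 2-D sample points and the fixed random seed are illustrative assumptions:

```python
import random

def dist2(p, q):
    # Squared Euclidean distance |p - q|^2.
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)            # 1. arbitrary initial centers
    for _ in range(max_iter):
        # 2a. (re)assign each object to the cluster with the nearest mean
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda c: dist2(p, centers[c]))].append(p)
        # 2b. update the cluster means (keep the old center if a cluster empties)
        new_centers = [mean(cl) if cl else centers[i]
                       for i, cl in enumerate(clusters)]
        if new_centers == centers:             # 3. until no change
            break
        centers = new_centers
    # squared-error criterion E = sum_i sum_{p in C_i} |p - m_i|^2
    sse = sum(dist2(p, centers[i]) for i, cl in enumerate(clusters) for p in cl)
    return clusters, centers, sse

pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
clusters, centers, sse = kmeans(pts, k=2)
print(sorted(len(c) for c in clusters))  # [3, 3]
```

On these two well-separated groups the algorithm converges to one cluster per group regardless of which two points are drawn as initial centers.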
The K-Means Clustering Method

● Example
  1. We pick k = 2 centers at random.
  2. We cluster our data around these center points.
  3. We recalculate centers based on our current clusters.
  4. We re-cluster our data around our new center points.
  5. We repeat the last two steps until no more data points
     move to a different cluster.
Hierarchical Clustering
● Uses a distance matrix as the clustering criterion. This method
  does not require the number of clusters k as an input, but it
  needs a termination condition.

[Figure: five objects a, b, c, d, e. AGNES (AGglomerative NESting)
merges them bottom-up over steps 0–4; DIANA (DIvisive ANAlysis)
splits them in the inverse order, steps 4 down to 0.]
● Agglomerative example, cutting the dendrogram at successive levels:
  ○ Level 2: k = 7 clusters
  ○ Level 3: k = 6 clusters
  ○ Level 4: k = 5 clusters
  ○ Level 5: k = 4 clusters
  ○ Level 6: k = 3 clusters
  ○ Level 7: k = 2 clusters
  ○ Level 8: k = 1 cluster
AGNES (Agglomerative Nesting)

● Introduced in Kaufmann and Rousseeuw (1990)
● Uses the single-link method and the dissimilarity matrix
● Merges the nodes that have the least dissimilarity
● Goes on in a non-descending fashion
● Eventually all nodes belong to the same cluster
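The merge loop can be sketched as follows, a minimal single-link version; the four sample points are illustrative, not from the slides:

```python
from math import dist  # Euclidean distance, Python 3.8+

def agnes_single_link(points, k):
    # Start with every object in its own cluster.
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # Single-link dissimilarity: minimum distance between any
        # member of one cluster and any member of the other.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Merge the pair of clusters with the least dissimilarity.
        clusters[i] += clusters.pop(j)
    return clusters

print(agnes_single_link([(0, 0), (0, 1), (5, 5), (5, 6)], k=2))
```

Run to completion (k = 1), all objects end up in a single cluster, as the slide states.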
DIANA (Divisive Analysis)

● Introduced in Kaufmann and Rousseeuw (1990)
● Inverse order of AGNES
● A cluster is split according to some principle,
  ○ for example, the maximum Euclidean distance between the
    closest neighboring objects.
● Eventually each node forms a cluster on its own
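One divisive step can be sketched as below. This is a deliberate simplification of DIANA's actual splinter-group procedure: here the two most mutually distant objects seed the split and every other object joins the nearer seed. The sample points are illustrative:

```python
from math import dist  # Euclidean distance, Python 3.8+

def split_cluster(cluster):
    # Seed the split with the two most mutually distant objects.
    a, b = max(((p, q) for p in cluster for q in cluster if p != q),
               key=lambda pq: dist(*pq))
    left, right = [a], [b]
    # Assign every remaining object to the nearer seed.
    for p in cluster:
        if p != a and p != b:
            (left if dist(p, a) <= dist(p, b) else right).append(p)
    return left, right

print(split_cluster([(0, 0), (1, 0), (9, 0), (10, 0)]))
```

Applying such a split recursively until every object stands alone mirrors the slide's "eventually each node forms a cluster on its own".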
A Dendrogram Shows How the Clusters
are Merged Hierarchically

Decompose the data objects into several levels of nested
partitionings (a tree of clusters), called a dendrogram.

A clustering of the data objects is obtained by cutting the
dendrogram at the desired level; each connected component then
forms a cluster.
Examples of Clustering Applications
● Marketing: Help marketers discover distinct groups in their
customer bases, and then use this knowledge to develop
targeted marketing programs
● Land use: Identification of areas of similar land use in an
earth observation database
● City-planning: Identifying groups of houses according to
their house type, value, and geographical location
● Earthquake studies: observed earthquake epicenters
  should be clustered along continental faults
● Biology
○ Plant and animal taxonomies
○ Categorize genes with similar functionality
Cluster Analysis

● References:
  1. Jiawei Han and Micheline Kamber, "Data Mining: Concepts and
     Techniques", Morgan Kaufmann, Chapter 8.
  2. http://en.wikipedia.org/wiki/Cluster_analysis
  3. http://home.dei.polimi.it/matteucc/clustering/tutorial_html/hierarchical.html
THANK YOU
