Cluster Analysis: G Sreenivas

Cluster analysis is the process of grouping data into clusters so that objects within a cluster are similar to each other and dissimilar to objects in other clusters. There are two main types of clustering methods: partitioning methods which construct various partitions and evaluate them, such as k-means clustering; and hierarchical methods which create a hierarchical decomposition of the data using some criterion, such as agglomerative nesting (AGNES) and divisive analysis (DIANA). The quality of clustering is measured by high intra-class similarity and low inter-class similarity.

Cluster analysis

G SREENIVAS
Cluster Analysis

● What is Cluster Analysis?
● Types of Data in Cluster Analysis
● A Categorization of Major Clustering Methods
● Partitioning Methods
● Hierarchical Methods
What is Cluster Analysis?

● Clustering:
  Clustering is the process of grouping a data set so that
  the similarity between objects within a cluster is
  maximized while the similarity between objects of
  different clusters is minimized.

● Cluster:
  A cluster is a collection of data objects that are similar
  to one another within the same cluster and dissimilar
  to the objects in other clusters.
What Is Good Clustering?

● A good clustering method will produce high-quality clusters with
  ○ high intra-class similarity
  ○ low inter-class similarity
● The quality of a clustering result depends on both the similarity
  measure used by the method and its implementation.
● The quality of a clustering method is also measured by its ability
  to discover some or all of the hidden patterns.
Data Structures
● Most main-memory-based clustering algorithms operate on one of
  the two following data structures.
● Data matrix (object-by-variable structure):
  an n × p matrix whose n rows are the objects and whose p columns
  are the variables.
● Dissimilarity matrix (object-by-object structure):
  an n × n matrix whose entry (i, j) stores the dissimilarity d(i, j)
  between objects i and j; since d(i, j) = d(j, i) and d(i, i) = 0,
  only the lower triangle is needed.
Measure the Quality of Clustering

● Dissimilarity/Similarity metric:
  similarity is expressed in terms of a distance function,
  which is typically a metric: d(i, j).
● There is a separate “quality” function that measures the
  “goodness” of a cluster.
● Weights should be associated with different variables
  based on the application and data semantics.
● It is hard to define “similar enough” or “good enough”
  ○ the answer is typically highly subjective.
Similarity and Dissimilarity Between Objects

● Distances are normally used to measure the similarity or
  dissimilarity between two data objects.
● A popular one is the Minkowski distance:

    d(i, j) = ( |x_i1 − x_j1|^q + |x_i2 − x_j2|^q + … + |x_ip − x_jp|^q )^(1/q)

  where i = (x_i1, x_i2, …, x_ip) and j = (x_j1, x_j2, …, x_jp) are
  two p-dimensional data objects, and q is a positive integer.
● If q = 1, d is the Manhattan distance:

    d(i, j) = |x_i1 − x_j1| + |x_i2 − x_j2| + … + |x_ip − x_jp|

● If q = 2, d is the Euclidean distance:

    d(i, j) = sqrt( |x_i1 − x_j1|² + |x_i2 − x_j2|² + … + |x_ip − x_jp|² )

● Properties
  ○ d(i, j) ≥ 0
  ○ d(i, i) = 0
  ○ d(i, j) = d(j, i)
  ○ d(i, j) ≤ d(i, k) + d(k, j)
● One can also use a weighted distance, the parametric Pearson
  product-moment correlation, or other dissimilarity measures.
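As a minimal illustration, the Minkowski distance and its q = 1 (Manhattan) and q = 2 (Euclidean) special cases can be computed as follows; the two sample points are made up for illustration:

```python
def minkowski(i, j, q):
    # d(i, j) = (sum over features f of |x_if - x_jf|^q)^(1/q)
    return sum(abs(a - b) ** q for a, b in zip(i, j)) ** (1.0 / q)

i, j = (1.0, 2.0), (4.0, 6.0)
print(minkowski(i, j, 1))  # Manhattan: |1-4| + |2-6| = 7.0
print(minkowski(i, j, 2))  # Euclidean: sqrt(3^2 + 4^2) = 5.0
```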
Finding a Centroid
Using the following equation, we can find the centroid of k
n-dimensional points:

    centroid = ( (x_11 + x_21 + … + x_k1)/k, …, (x_1n + x_2n + … + x_kn)/k )

i.e., the mean of each coordinate across the k points.

Let’s find the centroid of three 2-D points, say (2,4), (5,2), (8,9):

    centroid = ( (2+5+8)/3, (4+2+9)/3 ) = (5, 5)
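The same coordinate-wise mean can be sketched in a few lines of Python, using the slide’s three example points:

```python
def centroid(points):
    # Mean of each coordinate over the k points.
    k = len(points)
    return tuple(sum(coord) / k for coord in zip(*points))

print(centroid([(2, 4), (5, 2), (8, 9)]))  # (5.0, 5.0)
```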
Major Clustering Approaches

● Partitioning algorithms:
  construct various partitions and then evaluate them by
  some criterion
  ○ K-means, K-medoids
● Hierarchical algorithms:
  create a hierarchical decomposition of the set of data (or
  objects) using some criterion
  ○ CURE, Chameleon, BIRCH
The K-Means Clustering Method
● K-means algorithm:
● Input: the number of clusters k and a database consisting
  of n objects.
● Output: a set of k clusters.
● 1. Arbitrarily choose k objects as the initial cluster centers.
● 2. Repeat
  ■ (re)assign each object to the cluster to which the object
    is most similar, based on the mean value of the objects
    in the cluster;
  ■ update the cluster means, i.e., calculate the mean value
    of the objects for each cluster.
● 3. Until no change.
The K-Means Clustering Method
● The process iterates until the criterion function converges:

    E = Σ_{i=1..k} Σ_{p ∈ C_i} |p − m_i|²

● E is the sum of squared error over all objects in the
  database, p is a point in space, and m_i is the mean of
  cluster C_i.
● The algorithm tries to determine k partitions that minimize
  the squared-error function.
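The steps above can be sketched in plain Python. This is a minimal sketch, not the book's reference implementation; the six 2-D sample points and the fixed random seed are illustrative assumptions:

```python
import random

def dist2(p, q):
    # Squared Euclidean distance |p - q|^2.
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)            # 1. arbitrary initial centers
    for _ in range(max_iter):
        # 2a. (re)assign each object to the cluster with the nearest mean
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda c: dist2(p, centers[c]))].append(p)
        # 2b. update the cluster means (keep the old center if a cluster empties)
        new_centers = [mean(cl) if cl else centers[i]
                       for i, cl in enumerate(clusters)]
        if new_centers == centers:             # 3. until no change
            break
        centers = new_centers
    # squared-error criterion E = sum_i sum_{p in C_i} |p - m_i|^2
    sse = sum(dist2(p, centers[i]) for i, cl in enumerate(clusters) for p in cl)
    return clusters, centers, sse

pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
clusters, centers, sse = kmeans(pts, k=2)
print(sorted(len(c) for c in clusters))  # [3, 3]
```

On these two well-separated groups the algorithm converges to one cluster per group regardless of which two points are drawn as initial centers.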
The K-Means Clustering Method

● Example
  1. We pick k = 2 centers at random.
  2. We cluster our data around these center points.
  3. We recalculate centers based on our current clusters.
  4. We re-cluster our data around our new center points.
  5. We repeat the last two steps until no more data points
     move to a different cluster.
Hierarchical Clustering
● Uses a distance matrix as the clustering criterion. This method
  does not require the number of clusters k as an input, but it
  needs a termination condition.

[Figure: five objects a, b, c, d, e. AGNES (AGglomerative NESting)
merges them bottom-up over steps 0–4; DIANA (DIvisive ANAlysis)
splits them in the inverse order, steps 4 down to 0.]
● Agglomerative example, cutting the dendrogram at successive levels:
  ○ Level 2: k = 7 clusters
  ○ Level 3: k = 6 clusters
  ○ Level 4: k = 5 clusters
  ○ Level 5: k = 4 clusters
  ○ Level 6: k = 3 clusters
  ○ Level 7: k = 2 clusters
  ○ Level 8: k = 1 cluster
AGNES (Agglomerative Nesting)

● Introduced in Kaufmann and Rousseeuw (1990)
● Uses the single-link method and the dissimilarity matrix
● Merges the nodes that have the least dissimilarity
● Goes on in a non-descending fashion
● Eventually all nodes belong to the same cluster
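The merge loop can be sketched as follows, a minimal single-link version; the four sample points are illustrative, not from the slides:

```python
from math import dist  # Euclidean distance, Python 3.8+

def agnes_single_link(points, k):
    # Start with every object in its own cluster.
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # Single-link dissimilarity: minimum distance between any
        # member of one cluster and any member of the other.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Merge the pair of clusters with the least dissimilarity.
        clusters[i] += clusters.pop(j)
    return clusters

print(agnes_single_link([(0, 0), (0, 1), (5, 5), (5, 6)], k=2))
```

Run to completion (k = 1), all objects end up in a single cluster, as the slide states.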
DIANA (Divisive Analysis)

● Introduced in Kaufmann and Rousseeuw (1990)
● Inverse order of AGNES
● A cluster is split according to some principle,
  ○ for example, the maximum Euclidean distance between the
    closest neighboring objects.
● Eventually each node forms a cluster on its own
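One divisive step can be sketched as below. This is a deliberate simplification of DIANA's actual splinter-group procedure: here the two most mutually distant objects seed the split and every other object joins the nearer seed. The sample points are illustrative:

```python
from math import dist  # Euclidean distance, Python 3.8+

def split_cluster(cluster):
    # Seed the split with the two most mutually distant objects.
    a, b = max(((p, q) for p in cluster for q in cluster if p != q),
               key=lambda pq: dist(*pq))
    left, right = [a], [b]
    # Assign every remaining object to the nearer seed.
    for p in cluster:
        if p != a and p != b:
            (left if dist(p, a) <= dist(p, b) else right).append(p)
    return left, right

print(split_cluster([(0, 0), (1, 0), (9, 0), (10, 0)]))
```

Applying such a split recursively until every object stands alone mirrors the slide's "eventually each node forms a cluster on its own".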
A Dendrogram Shows How the Clusters
are Merged Hierarchically

Decompose the data objects into several levels of nested
partitionings (a tree of clusters), called a dendrogram.

A clustering of the data objects is obtained by cutting the
dendrogram at the desired level; each connected component then
forms a cluster.
Examples of Clustering Applications
● Marketing: Help marketers discover distinct groups in their
customer bases, and then use this knowledge to develop
targeted marketing programs
● Land use: Identification of areas of similar land use in an
earth observation database
● City-planning: Identifying groups of houses according to
their house type, value, and geographical location
● Earthquake studies: observed earthquake epicenters
  should be clustered along continental faults
● Biology
○ Plant and animal taxonomies
○ Categorize genes with similar functionality
Cluster Analysis

● References:
  1. Jiawei Han and Micheline Kamber, "Data Mining: Concepts and
     Techniques", Morgan Kaufmann, Chapter 8.
  2. http://en.wikipedia.org/wiki/Cluster_analysis
  3. http://home.dei.polimi.it/matteucc/clustering/tutorial_html/hierarchical.html
THANK YOU
