Cluster Analysis: G Sreenivas


Cluster analysis

G SREENIVAS
Cluster Analysis

●What is Cluster Analysis?
●Types of Data in Cluster Analysis
●A Categorization of Major Clustering Methods
●Partitioning Methods
●Hierarchical Methods
What is Cluster Analysis?

● Clustering :
Clustering is the process of grouping a data set in a way
that the similarity between data within a cluster is
maximized while the similarity between data of different
clusters is minimized.

● Clusters :
A cluster is a collection of data objects that are similar
to one another within the same cluster and are
dissimilar to the objects in other clusters.
What Is Good Clustering?

● A good clustering method will produce high-quality clusters with
○ high intra-class similarity
○ low inter-class similarity
● The quality of a clustering result depends on both the
similarity measure used by the method and its
implementation.
● The quality of a clustering method is also measured by
its ability to discover some or all of the hidden patterns.
Data Structures
● Most of the main-memory-based clustering algorithms
operate on either of the two following data structures.
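● Data matrix (object-by-variable structure):
an n × p matrix in which each of the n objects (rows) is described by p variables (columns).
● Dissimilarity matrix (object-by-object structure):
an n × n table whose entry d(i, j) is the dissimilarity between objects i and j.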
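
The two structures are related: a dissimilarity matrix can be derived from a data matrix. The following is a minimal sketch, assuming Euclidean distance as the dissimilarity measure (the slide does not fix one); the function name `dissimilarity_matrix` and the sample data are illustrative only.

```python
import math

def dissimilarity_matrix(data):
    """Build the n x n dissimilarity matrix from an n x p data matrix.

    Assumption: dissimilarity is the Euclidean distance between rows.
    """
    n = len(data)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):
            # The matrix is symmetric, so each pair is computed once.
            d[i][j] = d[j][i] = math.dist(data[i], data[j])
    return d

# Three objects (rows) described by two variables (columns).
data = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
D = dissimilarity_matrix(data)
for row in D:
    print(row)
```

Because d(i, j) = d(j, i) and d(i, i) = 0, only the lower triangle actually carries information, which is why the structure is often drawn as a triangular table.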
Measure the Quality of Clustering

● Dissimilarity/similarity metric:
Similarity is expressed in terms of a distance function d(i, j), which
is typically a metric.
● There is a separate “quality” function that measures the “goodness” of a cluster.
● Weights should be associated with different variables based on applications and data semantics.
● It is hard to define “similar enough” or “good enough”
○ the answer is typically highly subjective.
Similarity and Dissimilarity Between Objects

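●Distances are normally used to measure the similarity or dissimilarity between two data objects
●Some popular ones include: Minkowski distance:
$$d(i,j) = \left( |x_{i1}-x_{j1}|^q + |x_{i2}-x_{j2}|^q + \cdots + |x_{ip}-x_{jp}|^q \right)^{1/q}$$
where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are
two p-dimensional data objects, and q is a positive
integer
●If q = 1, d is Manhattan distance:
$$d(i,j) = |x_{i1}-x_{j1}| + |x_{i2}-x_{j2}| + \cdots + |x_{ip}-x_{jp}|$$
●If q = 2, d is Euclidean distance:
$$d(i,j) = \sqrt{ |x_{i1}-x_{j1}|^2 + |x_{i2}-x_{j2}|^2 + \cdots + |x_{ip}-x_{jp}|^2 }$$
● Properties
○ d(i,j) ≥ 0
○ d(i,i) = 0
○ d(i,j) = d(j,i)
○ d(i,j) ≤ d(i,k) + d(k,j)
● One can also use weighted distance, the parametric Pearson
product-moment correlation, or other dissimilarity
measures.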
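
The Minkowski distance and its two special cases can be sketched in a few lines. This is an illustration only; the function name `minkowski` and the sample points are assumptions, not part of the slides.

```python
def minkowski(x, y, q):
    """Minkowski distance of order q between two p-dimensional objects."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of variables")
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

i = (1.0, 2.0)
j = (4.0, 6.0)

manhattan = minkowski(i, j, 1)  # q = 1: |1-4| + |2-6|
euclidean = minkowski(i, j, 2)  # q = 2: sqrt(3^2 + 4^2)
print(manhattan, euclidean)
```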
Finding a Centroid
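Using the following equation, we can find the centroid c of k
n-dimensional points x_1, …, x_k; each coordinate of c is the mean of that coordinate over the k points:
$$c = \frac{1}{k} \sum_{i=1}^{k} x_i$$

Let’s find the centroid of three 2-D points, say (2,4), (5,2), and (8,9):
$$c = \left( \frac{2+5+8}{3}, \frac{4+2+9}{3} \right) = (5, 5)$$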
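
The centroid is a coordinate-wise mean, so it can be computed for any number of points and dimensions. A short sketch (the helper name `centroid` is illustrative), applied to the three points of the example:

```python
def centroid(points):
    """Coordinate-wise mean of k points that all have the same dimension."""
    k = len(points)
    # zip(*points) groups the first coordinates together, then the second, ...
    return tuple(sum(coords) / k for coords in zip(*points))

c = centroid([(2, 4), (5, 2), (8, 9)])
print(c)  # (5.0, 5.0)
```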
Major Clustering Approaches

●Partitioning algorithms: Construct various partitions and then evaluate them by some criterion
○K-means, K-medoids
●Hierarchical algorithms: Create a hierarchical decomposition of the set of data (or objects) using some criterion
○CURE, Chameleon, BIRCH
The K-Means Clustering Method
●K-means Algorithm:
●Input: the number of clusters k and a database
containing n objects.
●Output: a set of k clusters.
●1. Arbitrarily choose k objects as the initial
cluster centers.
●2. Repeat
■ (re)assign each object to the cluster to which the
object is most similar, based on the mean value
of the objects in the cluster;
■ update the cluster means, i.e., calculate the
mean value of the objects in each cluster.
●3. Until no change
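
The steps above can be sketched as follows. This is a minimal illustration, not a production implementation: it assumes Euclidean distance as the similarity measure, uses a seeded random choice for the arbitrary initial objects, and keeps the old center if a cluster becomes empty. The names `k_means`, `seed`, and `max_iter` are assumptions introduced here.

```python
import math
import random

def k_means(points, k, seed=0, max_iter=100):
    """Plain k-means on a list of equal-length tuples.

    Returns (assignment, centers), where assignment[i] is the index of
    the cluster that points[i] belongs to.
    """
    rng = random.Random(seed)
    # Step 1: arbitrarily choose k objects as the initial cluster centers.
    centers = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Step 2a: (re)assign each object to the nearest cluster mean.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Step 3: stop when no object changes cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 2b: update each cluster mean.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # assumption: an empty cluster keeps its old center
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return assignment, centers

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centers = k_means(points, k=2)
print(labels)
print(centers)
```

The cluster indices themselves are arbitrary; what matters is which objects share an index. On this toy data the first three points end up together and the last three end up together.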
The K-Means Clustering Method
●The process iterates until the criterion
function converges:
$$E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - m_i|^2$$
●E is the sum of the squared error over all objects in the database, p is the point in space representing a given object, and m_i is the
mean of cluster C_i.
●The algorithm tries to determine the k partitions that
minimize the squared-error function.
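
The criterion E can be evaluated directly from its definition: for every cluster, add up the squared Euclidean distance from each member to the cluster mean. A small sketch, with `squared_error` and the sample clusters as illustrative assumptions:

```python
def squared_error(clusters):
    """Sum of squared distances from each point to the mean of its cluster.

    `clusters` is a list of non-empty clusters, each a list of
    equal-length tuples.
    """
    total = 0.0
    for members in clusters:
        # m_i: coordinate-wise mean of the cluster C_i
        mean = [sum(v) / len(members) for v in zip(*members)]
        for p in members:
            total += sum((x - m) ** 2 for x, m in zip(p, mean))
    return total

clusters = [
    [(1, 1), (3, 1)],    # mean (2, 1): each point contributes 1
    [(8, 8), (8, 10)],   # mean (8, 9): each point contributes 1
]
E = squared_error(clusters)
print(E)  # 4.0
```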
The K-Means Clustering Method

●Example 1

1. We pick k = 2 centers at random.
2. We cluster our data around these center points.
3. We recalculate the centers based on our current clusters.
4. We re-cluster our data around our new center points.
5. We repeat the last two steps until no more data points are moved into a different cluster.
Hierarchical Clustering
● Uses a distance matrix as the clustering criterion. This method
does not require the number of clusters k as an input,
but it does need a termination condition.
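[Figure: hierarchical clustering of five objects a, b, c, d, e. Agglomerative nesting (AGNES) proceeds from Step 0 to Step 4, merging the objects until a single cluster remains; divisive analysis (DIANA) runs through the same steps in the opposite direction, from Step 0 to Step 4, splitting the single cluster back into individual objects.]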
Agglomerative, Level 2, k = 7 clusters.
Agglomerative, Level 3, k = 6 clusters.
Agglomerative, Level 4, k = 5 clusters.
Agglomerative, Level 5, k = 4 clusters.
Agglomerative, Level 6, k = 3 clusters.
Agglomerative, Level 7, k = 2 clusters.
Agglomerative, Level 8, k = 1 cluster.
AGNES (Agglomerative Nesting)

● Introduced in Kaufman and Rousseeuw (1990)
● Uses the single-link method and the dissimilarity matrix
● Merges the nodes that have the least dissimilarity
● Goes on in a non-descending fashion
● Eventually all nodes belong to the same cluster
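
A minimal sketch of single-link agglomerative merging in the spirit of AGNES, assuming Euclidean distance between points as the dissimilarity. It is a naive illustration rather than the published algorithm's implementation, and the function name `agnes_single_link` and the sample points are assumptions.

```python
import math

def agnes_single_link(points):
    """Merge clusters bottom-up until one cluster remains.

    Returns the merge history as (cluster_a, cluster_b, dissimilarity),
    where clusters are sorted lists of object indices.
    """
    clusters = [frozenset([i]) for i in range(len(points))]
    merges = []

    def link(a, b):
        # Single link: dissimilarity of the closest pair across the clusters.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > 1:
        # Pick the pair of clusters with the least dissimilarity.
        d, ia, ib = min(
            (link(a, b), ia, ib)
            for ia, a in enumerate(clusters)
            for ib, b in enumerate(clusters)
            if ia < ib
        )
        merges.append((sorted(clusters[ia]), sorted(clusters[ib]), d))
        merged = clusters[ia] | clusters[ib]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (ia, ib)]
        clusters.append(merged)
    return merges

points = [(0, 0), (0, 1), (5, 0), (5, 1.5), (20, 0)]
history = agnes_single_link(points)
for a, b, d in history:
    print(a, "+", b, "at dissimilarity", d)
```

With single link the merge dissimilarities never decrease from one step to the next, which matches the "non-descending fashion" noted above.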
DIANA (Divisive Analysis)

● Introduced in Kaufman and Rousseeuw (1990)
● Works in the inverse order of AGNES
● The cluster is split according to some principle.
○ For example, the maximum Euclidean distance between the closest
neighboring objects in the cluster.
● Eventually each node forms a cluster on its own
A Dendrogram Shows How the Clusters are Merged Hierarchically

Decompose the data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram.

A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then
forms a cluster.
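
Cutting a dendrogram can be sketched with a union-find structure: apply only the merges that happen at or below the chosen level, then read off the connected components. The merge list below is a hypothetical example for five objects, and the name `cut_dendrogram` and the `(a, b, height)` encoding (one representative object from each merged cluster) are assumptions made for illustration.

```python
def cut_dendrogram(n, merges, height):
    """Return the clusters obtained by cutting the tree at `height`.

    `merges` holds (a, b, h): the clusters containing objects a and b
    were merged at dissimilarity h.
    """
    parent = list(range(n))

    def find(x):
        # Follow parent links to the representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, h in merges:
        if h <= height:            # merges above the cut are ignored
            parent[find(a)] = find(b)

    groups = {}
    for obj in range(n):
        groups.setdefault(find(obj), []).append(obj)
    return sorted(groups.values())

# Hypothetical merge history for objects 0..4.
merges = [(0, 1, 1.0), (2, 3, 1.5), (0, 2, 5.0), (0, 4, 15.0)]

low_cut = cut_dendrogram(5, merges, height=2.0)
high_cut = cut_dendrogram(5, merges, height=10.0)
print(low_cut)   # [[0, 1], [2, 3], [4]]
print(high_cut)  # [[0, 1, 2, 3], [4]]
```

Lowering the cut yields more, smaller clusters; raising it yields fewer, larger ones, so the cut level plays the role that k plays in k-means.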
Examples of Clustering Applications
● Marketing: Help marketers discover distinct groups in their
customer bases, and then use this knowledge to develop
targeted marketing programs
● Land use: Identification of areas of similar land use in an
earth observation database
● City-planning: Identifying groups of houses according to
their house type, value, and geographical location
● Earthquake studies: Observed earthquake epicenters
should be clustered along continental faults
● Biology
○ Plant and animal taxonomies
○ Categorize genes with similar functionality
Cluster Analysis

● References:
1. Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Chapter 8, Morgan Kaufmann
2. http://en.wikipedia.org/wiki/Cluster_analysis
3. http://home.dei.polimi.it//matteucc/clustering/tutorial html/heirarchical.html
THANK YOU
