Clustering in data mining
Uploaded by Bhavani Viswa

CLUSTERING

 Cluster Analysis
 Types of Data in Cluster Analysis
 A Categorization of Major Clustering Methods

Cluster Analysis
 A cluster is a collection of data objects that are:
 similar to one another within the same cluster,
 dissimilar to the objects in other clusters.

 Cluster analysis finds similarities between data objects according to the characteristics found in the data and groups similar objects into clusters.

 Cluster analysis is used to form groups (clusters) of similar records based on several measures made on these records.

 Clustering is unsupervised learning, i.e., there are no predefined classes.

 Cluster analysis has been applied in many areas, including astronomy, archaeology, medicine, chemistry, education, psychology, linguistics and sociology.
Application of Clustering
 Medical image databases
 Pattern recognition
 Spatial data analysis
 Detecting spatial clusters and other spatial mining tasks
 Image processing
 Economic science
 WWW
Requirements of Clustering in Data Mining

 The following are typical requirements of clustering in data mining:

 Scalability.
 Ability to deal with different types of attributes.
 Discovery of clusters with arbitrary shape.
 Ability to deal with noisy data.
 Minimal requirements for domain knowledge to determine input parameters.
 Insensitivity to the order of input records.
 Ability to handle high dimensionality.
 Constraint-based clustering.
 Interpretability and usability.
TYPES OF DATA IN CLUSTER ANALYSIS

 Types of data in cluster analysis are:

 Interval-scaled variables. Example: salary, height.
 Binary variables. Example: gender (Male/Female), has_cancer (True/False).
 Nominal (categorical) variables. Example: religion (Christian, Muslim, Buddhist, Hindu, etc.).
 Ordinal variables. Example: military rank (soldier, sergeant, captain, etc.).
 Ratio-scaled variables. Example: population growth (1, 10, 100, 1000, ...).
 Variables of mixed types.

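As an illustration (not from the original slides), the dissimilarity measure depends on the variable type. A minimal Python sketch for two of the types above, interval-scaled and nominal variables, with invented example values:

```python
import math

def euclidean(p, q):
    """Dissimilarity for interval-scaled variables:
    Euclidean distance between two data objects."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def simple_matching_dissimilarity(p, q):
    """Dissimilarity for nominal/binary variables:
    fraction of attributes on which the two objects differ."""
    return sum(a != b for a, b in zip(p, q)) / len(p)

# Interval-scaled: (salary in $1000s, height in cm)
print(euclidean((50, 170), (54, 173)))  # 5.0
# Nominal: (gender, religion)
print(simple_matching_dissimilarity(("M", "Hindu"), ("F", "Hindu")))  # 0.5
```

In practice, interval-scaled variables are usually standardized first so that one large-ranged attribute (e.g. salary) does not dominate the distance.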

CATEGORISATION OF MAJOR CLUSTERING METHODS

 We can divide the clustering methods into two main groups:
 Hierarchical methods
 Partitioning methods

 Three additional main categories:
 Density-based methods
 Model-based methods
 Grid-based methods
Partitioning Methods
Definition: Given a database of n objects, a partitioning method constructs k partitions of the data, where each partition represents a cluster and k <= n.

It classifies the data into k groups, which satisfy the following requirements:
1. Each group must contain at least one object.
2. Each object must belong to exactly one group.

Partitioning algorithms:
 K-means
 K-medoids
K-means Partitioning
 Each cluster is represented by the mean value of the objects in the cluster. Hence, it is known as a centroid-based technique.

Working method:
 First, it randomly selects k of the objects, each of which initially represents a cluster mean.
 Each of the remaining objects is assigned to the cluster to which it is most similar, based on the distance between the object and the cluster mean.
 Then, it computes the new mean for each cluster.
 This process continues until the criterion function converges.
K-Means Clustering
The k-means algorithm is implemented as below:
Input: number of clusters k, and a database of n objects
Output: a set of k clusters that minimizes the squared error

Choose k objects as the initial cluster centers
Repeat
  (Re)assign each object to the cluster to which it is most similar, based on the mean value of the objects in the cluster
  Update the cluster means
Until no change
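The loop above can be sketched in plain Python. This is a minimal illustration, not a production implementation; the tuple point representation and the toy data set are invented for the example:

```python
import math
import random

def kmeans(objects, k, seed=0):
    """Plain k-means sketch: points are tuples; returns k cluster lists."""
    random.seed(seed)
    means = random.sample(objects, k)          # choose k initial centers
    while True:
        # (Re)assign each object to the nearest cluster mean
        clusters = [[] for _ in range(k)]
        for p in objects:
            i = min(range(k), key=lambda j: math.dist(p, means[j]))
            clusters[i].append(p)
        # Update the cluster means (keep the old mean if a cluster is empty)
        new_means = [
            tuple(sum(c) / len(c) for c in zip(*cl)) if cl else means[j]
            for j, cl in enumerate(clusters)
        ]
        if new_means == means:                 # until no change
            return clusters
        means = new_means

data = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
print(sorted(sorted(c) for c in kmeans(data, k=2)))
```

On this toy data the two tight groups of points end up in separate clusters regardless of which two objects are drawn as initial centers.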
K-Means Clustering Method
(Illustration, K = 2: arbitrarily choose K objects as initial cluster centers; assign each object to the most similar center; update the cluster means; reassign and update until no object changes cluster.)
K-Means Method

Strength: relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations. Normally, k, t << n.
Comment: often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms.

Weaknesses:
 Applicable only when the mean is defined, so not directly usable for categorical data
 Need to specify k, the number of clusters, in advance
 Unable to handle noisy data and outliers
 Not suitable for discovering clusters with non-convex shapes
Variations of the K-Means Method
A few variants of k-means differ in:
 Selection of the initial k means
 Dissimilarity calculations
 Strategies to calculate cluster means

Handling categorical data: k-modes
 Replacing means of clusters with modes
 Using new dissimilarity measures to deal with categorical objects
 A mixture of categorical and numerical data: k-prototype method

Expectation Maximization
 Assigns objects to clusters based on the probability of membership

Scalability of k-means
 Compressible, discardable, or to be maintained in main memory
 Clustering features
Problem of the K-Means Method
The k-means algorithm is sensitive to outliers:
 an object with an extremely large value may substantially distort the distribution of the data.

 K-Medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster.

(Illustration: effect of an outlier on the cluster mean vs. the medoid.)
K-Medoids Clustering Method
PAM (Partitioning Around Medoids)
 starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering
 all pairs are analyzed for replacement
 PAM works effectively for small data sets, but does not scale well to large data sets

CLARA
CLARANS
K-Medoids
Input: k, and a database of n objects
Output: a set of k clusters
Method:
 Arbitrarily choose k objects as initial medoids
 Repeat
  Assign each remaining object to the cluster with the nearest medoid
  Randomly select a non-medoid object orandom
  Compute the cost S of swapping a medoid oj with orandom
  If S < 0, swap oj with orandom to form the new set of k medoids
 Until no change

Working principle: minimize the sum of the dissimilarities between each object and its corresponding reference point; that is, an absolute-error criterion is used:

E = Σ(j=1..k) Σ(p ∈ Cj) |p − oj|
K-Medoids
 Case 1: p currently belongs to medoid oj. If oj is replaced by orandom as a medoid and p is closest to one of the oi with i ≠ j, then p is reassigned to oi.
 Case 2: p currently belongs to medoid oj. If oj is replaced by orandom as a medoid and p is closest to orandom, then p is reassigned to orandom.
 Case 3: p currently belongs to medoid oi (i ≠ j). If oj is replaced by orandom as a medoid and p is still closest to oi, the assignment does not change.
 Case 4: p currently belongs to medoid oi (i ≠ j). If oj is replaced by orandom as a medoid and p is closest to orandom, then p is reassigned to orandom.
K-Medoids

 After reassignment, the difference in squared error E is calculated. The total cost of swapping is the sum of the costs incurred by all non-medoid objects.
 If the total cost is negative, oj is replaced with orandom, since E will be reduced.
K-medoids Algorithm
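A simplified sketch of the PAM-style swap loop in Python. Instead of the per-object case analysis above, this greedy variant simply recomputes the full absolute-error cost for each candidate swap and accepts any swap that lowers it; the data set is invented for illustration:

```python
import math
from itertools import product

def total_cost(objects, medoids):
    """Absolute-error criterion: sum of distances from each object
    to its nearest medoid."""
    return sum(min(math.dist(p, m) for m in medoids) for p in objects)

def pam(objects, k):
    """PAM sketch: swap a medoid with a non-medoid whenever the
    swap reduces the total cost; stop when no swap improves it."""
    medoids = objects[:k]                       # arbitrary initial medoids
    improved = True
    while improved:
        improved = False
        for m, o in product(list(medoids), objects):
            if o in medoids:
                continue
            candidate = [o if x == m else x for x in medoids]
            if total_cost(objects, candidate) < total_cost(objects, medoids):
                medoids = candidate             # cost decreased: accept swap
                improved = True
    return medoids

data = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
print(sorted(pam(data, 2)))
```

Because every accepted swap strictly decreases the cost and there are finitely many medoid sets, the loop terminates; the quadratic cost per pass is why PAM does not scale to large data sets.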
Hierarchical Methods
Agglomerative hierarchical clustering:
 Each object initially represents a cluster of its own. Clusters are then successively merged until the desired cluster structure is obtained.

Divisive hierarchical clustering:
 All objects initially belong to one cluster. The cluster is then divided into sub-clusters, which are successively divided into their own sub-clusters. This process continues until the desired cluster structure is obtained.
Hierarchical Methods

 Hierarchical clustering methods can be further divided according to the way the similarity measure is calculated:

 Single-link clustering (also called the connectedness, minimum, or nearest-neighbour method)
 Complete-link clustering (also called the diameter, maximum, or furthest-neighbour method)
 Average-link clustering (also called the minimum variance method)
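The agglomerative scheme with single-link distance can be sketched as follows. This is a naive illustration with invented data (real implementations maintain an updatable distance matrix instead of rescanning all pairs):

```python
import math

def single_link(points, target_k):
    """Agglomerative clustering sketch with single-link (nearest
    neighbour) distance: start with one cluster per object and
    merge the closest pair until target_k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > target_k:
        # Find the pair of clusters with the smallest single-link
        # distance, i.e. the minimum distance over all cross pairs
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(p, q)
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters[j]     # merge the closest pair
        del clusters[j]
    return clusters

pts = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
print(sorted(sorted(c) for c in single_link(pts, 3)))
```

Swapping the inner `min` for `max` gives complete-link clustering, and for the mean of all cross-pair distances gives average-link.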
Density-Based Clustering Methods

 These methods are used to discover clusters with arbitrary shape. There are three methods:

 DBSCAN: grows clusters according to a density-based connectivity analysis.
 OPTICS: produces a cluster ordering obtained from a wide range of parameter settings.
 DENCLUE: clusters objects based on a set of density distribution functions.
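A minimal sketch of the DBSCAN idea (eps-neighbourhood, core points, noise); the parameter values and data set below are invented for illustration:

```python
import math

def dbscan(points, eps, min_pts):
    """DBSCAN sketch: a core point has at least min_pts points within
    eps; clusters grow from core points, the rest become noise (-1)."""
    labels = {p: None for p in points}

    def neighbours(p):
        return [q for q in points if math.dist(p, q) <= eps]

    cluster_id = 0
    for p in points:
        if labels[p] is not None:
            continue
        seeds = neighbours(p)
        if len(seeds) < min_pts:
            labels[p] = -1              # noise (may become a border point later)
            continue
        labels[p] = cluster_id          # p is a core point: start a new cluster
        queue = [q for q in seeds if q != p]
        while queue:
            q = queue.pop()
            if labels[q] == -1:
                labels[q] = cluster_id  # former noise becomes a border point
            if labels[q] is not None:
                continue
            labels[q] = cluster_id
            qn = neighbours(q)
            if len(qn) >= min_pts:      # only core points extend the cluster
                queue.extend(qn)
        cluster_id += 1
    return labels

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8), (8, 9), (9, 8), (20, 20)]
labels = dbscan(pts, eps=1.5, min_pts=3)
print(labels[(20, 20)])                 # -1: the isolated point is noise
```

Unlike k-means, no cluster count is specified in advance, and arbitrarily shaped dense regions are found while isolated objects are labelled as noise.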
Grid-Based Clustering
 It uses a multiresolution grid data structure.
 It quantizes the object space into a finite number of cells that form a grid structure.
 All clustering operations are performed on that grid structure.

 Approaches to grid-based clustering:
 STING: explores statistical information stored in the grid cells.
 WaveCluster: uses a wavelet transform to cluster objects.
 CLIQUE: represents a grid- and density-based approach for clustering in high-dimensional data space.
STING: STatistical Information Grid

 In this technique, the spatial area is divided into rectangular cells.
 Each cell at a high level is partitioned to form a number of cells at the next lower level.
 Statistical information regarding the attributes in each grid cell is precomputed and stored.
 These statistical parameters are useful for query processing.
Wave Cluster

 It is a multiresolution clustering algorithm.
 It first summarizes the data by imposing a multidimensional grid structure on the data space.
 It then transforms the original feature space with a wavelet transformation to find dense regions.
