Lecture 6

What is Cluster Analysis?

• Cluster: a collection of data objects
  – Similar to one another within the same cluster
  – Dissimilar to the objects in other clusters
• Cluster analysis
  – Finding similarities between data according to the characteristics found in the data, and grouping similar data objects into clusters
• Unsupervised learning: no predefined classes
• Typical applications
  – As a stand-alone tool to gain insight into the data distribution
  – As a preprocessing step for other algorithms

Clustering: Rich Applications and Multidisciplinary Efforts

• Pattern Recognition
• Spatial Data Analysis
  – Create thematic maps in GIS by clustering feature spaces
  – Detect spatial clusters or support other spatial mining tasks
• Image Processing
• Economic Science (especially market research)
• WWW
  – Document classification
  – Cluster Weblog data to discover groups of similar access patterns

Examples of Clustering Applications

• Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
• Land use: identification of areas of similar land use in an earth observation database
• Insurance: identifying groups of motor insurance policy holders with a high average claim cost
• City planning: identifying groups of houses according to their house type, value, and geographical location
• Earthquake studies: observed earthquake epicenters should be clustered along continental faults

Quality: What Is Good Clustering?

• A good clustering method will produce high-quality clusters with
  – high intra-class similarity
  – low inter-class similarity
• The quality of a clustering result depends on both the similarity measure used by the method and its implementation
• The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns

Measure the Quality of Clustering

• Dissimilarity/similarity metric: similarity is expressed in terms of a distance function, typically a metric d(i, j)
• There is a separate "quality" function that measures the "goodness" of a cluster
• The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, ratio, and vector variables
• Weights should be associated with different variables based on the application and data semantics
• It is hard to define "similar enough" or "good enough"
  – the answer is typically highly subjective
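
To make the role of the distance function and per-variable weights concrete, here is a minimal sketch of a weighted Euclidean distance in Python; the function name and the example weights are illustrative assumptions, not part of the lecture.

    import math

    def weighted_euclidean(x, y, w):
        # d(i, j) with per-variable weights: a larger weight makes that
        # variable count more toward the dissimilarity.
        return math.sqrt(sum(wk * (xk - yk) ** 2
                             for xk, yk, wk in zip(x, y, w)))

    # Example: weight the second variable twice as heavily as the first.
    print(weighted_euclidean((35.0, 50.0), (40.0, 52.0), (1.0, 2.0)))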

Requirements of Clustering in Data Mining

• Scalability
• Ability to deal with different types of attributes
• Ability to handle dynamic data
• Discovery of clusters with arbitrary shape
• Minimal requirements for domain knowledge to determine input parameters
• Ability to deal with noise and outliers
• Insensitivity to the order of input records
• Ability to handle high dimensionality
• Incorporation of user-specified constraints
• Interpretability and usability

Typical Alternatives to Calculate the Distance between Clusters

• Single link: smallest distance between an element in one cluster and an element in the other, i.e., $dis(K_i, K_j) = \min\, dis(t_{ip}, t_{jq})$
• Complete link: largest distance between an element in one cluster and an element in the other, i.e., $dis(K_i, K_j) = \max\, dis(t_{ip}, t_{jq})$
• Average: average distance between an element in one cluster and an element in the other, i.e., $dis(K_i, K_j) = \mathrm{avg}\, dis(t_{ip}, t_{jq})$
• Centroid: distance between the centroids of two clusters, i.e., $dis(K_i, K_j) = dis(C_i, C_j)$
• Medoid: distance between the medoids of two clusters, i.e., $dis(K_i, K_j) = dis(M_i, M_j)$
  – Medoid: one chosen, centrally located object in the cluster
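
A minimal Python sketch of the first four of these inter-cluster distances, assuming Euclidean distance between points (the helper names are illustrative):

    import math
    from itertools import product

    def dist(p, q):
        return math.dist(p, q)  # Euclidean distance between two points

    def single_link(Ki, Kj):
        return min(dist(p, q) for p, q in product(Ki, Kj))

    def complete_link(Ki, Kj):
        return max(dist(p, q) for p, q in product(Ki, Kj))

    def average_link(Ki, Kj):
        return sum(dist(p, q) for p, q in product(Ki, Kj)) / (len(Ki) * len(Kj))

    def centroid_link(Ki, Kj):
        ci = [sum(c) / len(Ki) for c in zip(*Ki)]  # centroid of Ki
        cj = [sum(c) / len(Kj) for c in zip(*Kj)]  # centroid of Kj
        return dist(ci, cj)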

Centroid, Radius and Diameter of a Cluster (for numerical data sets)

• Centroid: the "middle" of a cluster

    $C_m = \frac{\sum_{i=1}^{N} t_{ip}}{N}$

• Radius: square root of the average distance from any point of the cluster to its centroid

    $R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_{ip} - c_m)^2}{N}}$

• Diameter: square root of the average mean squared distance between all pairs of points in the cluster

    $D_m = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (t_{ip} - t_{jq})^2}{N(N-1)}}$

Partitioning Algorithms: Basic Concept

• Partitioning method: construct a partition of a database D of n objects into a set of k clusters such that the sum of squared distances is minimized (a small evaluation sketch follows this list):

    $E = \sum_{m=1}^{k} \sum_{t_{mi} \in K_m} (C_m - t_{mi})^2$

• Given k, find a partition into k clusters that optimizes the chosen partitioning criterion
  – Global optimum: exhaustively enumerate all partitions
  – Heuristic methods: the k-means and k-medoids algorithms
  – k-means (MacQueen'67): each cluster is represented by the center of the cluster
  – k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw'87): each cluster is represented by one of the objects in the cluster
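
The criterion E is easy to evaluate for a candidate partition; a small sketch, assuming each cluster is given as an (n_m, d) NumPy array:

    import numpy as np

    def sse(clusters):
        # E = sum over clusters of the squared distances of each
        # object to its cluster centroid C_m.
        total = 0.0
        for K in clusters:
            C = K.mean(axis=0)
            total += ((K - C) ** 2).sum()
        return total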

The K-Means Clustering Method

• Given k, the k-means algorithm is implemented in four steps:
  – Partition the objects into k nonempty subsets
  – Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster)
  – Assign each object to the cluster with the nearest seed point
  – Go back to Step 2; stop when the assignment no longer changes
• Example (K = 2): arbitrarily choose K objects as the initial cluster centers; assign each object to the most similar center; update the cluster means; then reassign objects and update the means again, repeating until no object changes cluster. A minimal sketch follows.

[Figure: a sequence of scatter plots (axes 0 to 10) showing the assign / update-the-means / reassign iterations for K = 2.]
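
A minimal NumPy sketch of these steps (random initial centers chosen from the objects; the function name and signature are illustrative, not from the lecture):

    import numpy as np

    def k_means(X, k, max_iter=100, seed=0):
        X = np.asarray(X, dtype=float)
        rng = np.random.default_rng(seed)
        # Arbitrarily choose k objects as the initial cluster centers.
        centers = X[rng.choice(len(X), size=k, replace=False)].copy()
        assign = None
        for _ in range(max_iter):
            # Assign each object to the cluster with the nearest center.
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            new_assign = d2.argmin(axis=1)
            if assign is not None and (new_assign == assign).all():
                break                       # no reassignment: converged
            assign = new_assign
            # Update each center to the mean of its cluster.
            for m in range(k):
                members = X[assign == m]
                if len(members) > 0:        # keep an empty cluster's old center
                    centers[m] = members.mean(axis=0)
        return centers, assign

    # Usage: centers, labels = k_means(points, k=2)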

Comments on the K-Means Method

• Strength: relatively efficient: O(tkn), where n is the number of objects, k is the number of clusters, and t is the number of iterations; normally, k, t << n
• Comparison: PAM: O(k(n−k)²); CLARA: O(ks² + k(n−k))
• Comment: often terminates at a local optimum; the global optimum may be found using techniques such as deterministic annealing and genetic algorithms
• Weaknesses
  – Applicable only when a mean is defined; what about categorical data?
  – Need to specify k, the number of clusters, in advance
  – Unable to handle noisy data and outliers
  – Not suitable for discovering clusters with non-convex shapes

Variations of the K-Means Method

• A few variants of k-means differ in
  – Selection of the initial k means
  – Dissimilarity calculations
  – Strategies for calculating cluster means
• Handling categorical data: k-modes (Huang'98)
  – Replacing means of clusters with modes
  – Using new dissimilarity measures to deal with categorical objects
  – Using a frequency-based method to update modes of clusters
  – A mixture of categorical and numerical data: the k-prototype method
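
A sketch of two of the k-modes ingredients named above, the simple matching dissimilarity and the frequency-based mode update (illustrative only, not Huang's full algorithm):

    from collections import Counter

    def mismatch(x, y):
        # Dissimilarity for categorical objects: count of differing attributes.
        return sum(a != b for a, b in zip(x, y))

    def mode_of(cluster):
        # Frequency-based update: per attribute, take the most frequent value.
        return tuple(Counter(col).most_common(1)[0][0] for col in zip(*cluster))

    cluster = [("red", "suv"), ("red", "sedan"), ("blue", "suv")]
    print(mode_of(cluster))                           # ('red', 'suv')
    print(mismatch(("red", "suv"), ("blue", "suv")))  # 1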


Summary

• Cluster analysis groups objects based on their similarity and has wide applications
• Measures of similarity can be computed for various types of data
• Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods
• Outlier detection and analysis are very useful for fraud detection, etc., and can be performed by statistical, distance-based, or deviation-based approaches
• There are still many open research issues in cluster analysis
