L 8 Clustering

Clustering

© M. Shahbaz – 2006


Lecture Outline
• What is Clustering
• Supervised and Unsupervised Classification
• Types of Clustering Algorithms
• Most Common Techniques
• Areas of Applications
• Discussion
• Result



Clustering - Definition

─ Process of grouping similar items together
─ Objects within a cluster should be very similar to each other, but…
─ Should be very different from the objects of other clusters
─ In other words, intra-cluster similarity between objects is high and inter-cluster similarity is low
─ Important human activity, used from early childhood to distinguish between different items such as cars and cats, animals and plants, etc.
Supervised and Unsupervised Classification

─ What is Classification?
─ What is Supervised Classification/Learning?
─ What is Unsupervised Classification/Learning?
─ SOM – Self Organizing Maps
Types of Clustering Algorithms

─ Clustering has been a popular area of research
─ Several methods and techniques have been developed to determine natural grouping among the objects

Jain, A. K., Murty, M. N., and Flynn, P. J., Data Clustering: A Review. ACM Computing Surveys, 1999. 31: pp. 264-323.

Jain, A. K. and Dubes, R. C., Algorithms for Clustering Data. 1988, Englewood Cliffs, NJ: Prentice Hall. ISBN 013022278X.
Types of Clustering Algorithms

Clustering
• Hierarchical Methods
  – Agglomerative Algorithms
  – Divisive Algorithms
• Partitioning Methods
  – Relocation Algorithms
  – Probabilistic Clustering
  – K-medoids Methods
  – K-means Methods
  – Density-Based Algorithms (Density-Based Connectivity Clustering; Density Functions Clustering)
• Grid-Based Methods
• Algorithms Used in Machine Learning
  – Gradient Descent and Artificial Neural Networks
  – Evolutionary Methods
• Clustering Algorithms For High Dimensional Data
  – Subspace Clustering
  – Projection Techniques
  – Co-Clustering Techniques
Classification vs. Clustering

Classification:
Supervised learning: learns a method for predicting the instance class from pre-labeled (classified) instances

Clustering:
Unsupervised learning: finds “natural” grouping of instances given un-labeled data
Clustering Evaluation

• Manual inspection
• Benchmarking on existing labels
• Cluster quality measures
– distance measures
– high similarity within a cluster, low across clusters
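One rough way to check the last point (high similarity within a cluster, low across clusters) is to compare average pairwise distances. The sketch below is an illustration only: the function names and the toy data are hypothetical and not from the lecture, and Euclidean distance is assumed as the distance measure.

```python
import math
from itertools import combinations


def mean_intra_cluster_distance(clusters):
    """Average distance between pairs of points that share a cluster."""
    dists = [math.dist(p, q)
             for cluster in clusters
             for p, q in combinations(cluster, 2)]
    return sum(dists) / len(dists)


def mean_inter_cluster_distance(clusters):
    """Average distance between pairs of points from different clusters."""
    dists = [math.dist(p, q)
             for a, b in combinations(clusters, 2)
             for p in a
             for q in b]
    return sum(dists) / len(dists)


# Hypothetical toy data: two tight, well-separated groups.
clusters = [
    [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)],
    [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0)],
]
intra = mean_intra_cluster_distance(clusters)
inter = mean_inter_cluster_distance(clusters)
print(f"intra = {intra:.2f}, inter = {inter:.2f}")
```

For a good clustering the intra-cluster average should be clearly smaller than the inter-cluster average.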
The Distance Function

• Simplest case: one numeric attribute A
– Distance(X,Y) = |A(X) – A(Y)|
• Several numeric attributes:
– Distance(X,Y) = Euclidean distance between X and Y
• Are all attributes equally important?
– Weighting the attributes might be necessary
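The three cases above can be sketched with a single function. This is a minimal illustration, not the lecture's own definition: the name `distance` and the `weights` parameter are hypothetical, and a weighted Euclidean distance is assumed as one possible way to weight attributes.

```python
import math


def distance(x, y, weights=None):
    """Weighted Euclidean distance between two equal-length numeric tuples.

    With no weights every attribute counts equally (plain Euclidean
    distance). With a single attribute this reduces to |x - y|.
    """
    if weights is None:
        weights = [1.0] * len(x)
    return math.sqrt(sum(w * (a - b) ** 2
                         for w, a, b in zip(weights, x, y)))


one_attribute = distance((3,), (7,))                      # |3 - 7|
euclidean = distance((0, 0), (3, 4))                      # all attributes equal
first_only = distance((0, 0), (3, 4), weights=(1, 0))     # ignore 2nd attribute
print(one_attribute, euclidean, first_only)
```

Setting a weight to zero removes that attribute from the comparison; larger weights make differences in that attribute count for more.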
Simple Clustering: K-means

Works with numeric data only

1) Pick a number (K) of cluster centers (at random)
2) Assign every item to its nearest cluster center (e.g. using Euclidean distance)
3) Move each cluster center to the mean of its assigned items
4) Repeat steps 2 and 3 until convergence (change in cluster assignments less than a threshold)
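The four steps above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a reference implementation: the function name `kmeans` and its parameters are hypothetical, the random centers in step 1 are assumed to be drawn from the data points themselves, and the convergence threshold in step 4 is taken to be zero (stop when no item changes cluster).

```python
import math
import random


def kmeans(points, k, initial_centers=None, max_iter=100, seed=None):
    """Plain K-means on a list of numeric tuples.

    Returns (centers, assignment), where assignment[i] is the index of
    the cluster center that points[i] belongs to.
    """
    # Step 1: pick K cluster centers. Sampling K data points at random
    # is an assumption; the slide only says "at random".
    if initial_centers is None:
        centers = random.Random(seed).sample(points, k)
    else:
        centers = list(initial_centers)

    assignment = None
    for _ in range(max_iter):
        # Step 2: assign every item to its nearest center (Euclidean).
        new_assignment = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Step 4: stop once no item changes cluster (threshold of zero).
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: move each center to the mean of its assigned items.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(c) / len(members)
                                   for c in zip(*members))
    return centers, assignment


# Hypothetical toy data; fixed initial centers keep the run repeatable.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, assignment = kmeans(points, k=2,
                             initial_centers=[(0, 0), (10, 10)])
print(centers, assignment)
```

The result depends on the initial centers: with unlucky starting points K-means can settle on a poor grouping, which is why it is often run several times with different random starts.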
K-means example, step 1

[Figure: scatter plot of points on X and Y axes. Pick 3 initial cluster centers k1, k2, k3 (randomly).]
K-means example, step 2

[Figure: scatter plot on X and Y axes. Assign each point to the closest cluster center (k1, k2 or k3).]
K-means example, step 3

[Figure: scatter plot on X and Y axes. Move each cluster center (k1, k2, k3) to the mean of each cluster.]
K-means example, step 4

[Figure: scatter plot on X and Y axes. Reassign points that are now closest to a different new cluster center. Q: Which points are reassigned?]
K-means example, step 4 …

[Figure: scatter plot on X and Y axes. A: three points (shown with animation in the original slides).]
K-means example, step 4b

[Figure: scatter plot on X and Y axes. Re-compute cluster means.]
K-means example, step 5

[Figure: scatter plot on X and Y axes. Move cluster centers k1, k2, k3 to cluster means.]
Squared Error Criterion
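Only the title of this slide survives. Assuming it refers to the usual within-cluster sum of squared errors (the squared distance of every item to the mean of its cluster, summed over all clusters), a minimal sketch looks like this. The function name `squared_error` and the toy data are hypothetical.

```python
import math


def squared_error(points, centers, assignment):
    """Sum of squared Euclidean distances from each point to its center."""
    return sum(math.dist(p, centers[a]) ** 2
               for p, a in zip(points, assignment))


# Hypothetical toy data: two clusters on a line, each point 1 unit
# away from its cluster mean.
points = [(0, 0), (2, 0), (10, 0), (12, 0)]
centers = [(1, 0), (11, 0)]
assignment = [0, 0, 1, 1]
error = squared_error(points, centers, assignment)
print(error)
```

Lower values mean tighter clusters; under this assumption, K-means can be read as a heuristic that tries to reduce this quantity for a fixed K.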
Pros and cons of K-Means
K-means variations

• K-medoids – instead of the mean, use the medoid of each cluster (its most centrally located item); in the one-attribute examples below it coincides with the median
– Mean of 1, 3, 5, 7, 9 is 5
– Mean of 1, 3, 5, 7, 1009 is 205
– Median of 1, 3, 5, 7, 1009 is 5
– Median advantage: not affected by extreme values
• For large databases, use sampling
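The numbers above can be reproduced with Python's standard `statistics` module. The helper `medoid` is a hypothetical illustration for a single numeric attribute: it returns the actual data item with the smallest total distance to all the others, which is the kind of representative K-medoids uses.

```python
from statistics import mean, median


def medoid(values):
    """Data item with the smallest total absolute distance to the rest."""
    return min(values, key=lambda c: sum(abs(c - v) for v in values))


clean = [1, 3, 5, 7, 9]
with_outlier = [1, 3, 5, 7, 1009]

print(mean(clean))            # mean without the extreme value
print(mean(with_outlier))     # mean pulled far to the right by 1009
print(median(with_outlier))   # median ignores the extreme value
print(medoid(with_outlier))   # medoid is an actual item of the data
```

A single extreme value moves the mean a long way, while the median and the medoid stay in the middle of the bulk of the data.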
k-Medoids
K-means clustering summary

Advantages
• Simple, understandable
• Items automatically assigned to clusters

Disadvantages
• Must pick number of clusters beforehand
• All items forced into a cluster
• Too sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of data
Clustering Summary
• Unsupervised
• Many approaches
– K-means – simple, sometimes useful
  • K-medoids is less sensitive to outliers
– Hierarchical clustering – works for symbolic attributes
– Can be used to fill in missing values
New Centroid for Cluster 2: (A3+B1+B2+B3+C2)/5 = (6, 6)
New Centroid for Cluster 3: (A2+C1)/2 = (1.5, 3.5)
