Clustering

Supervised learning vs. unsupervised learning
• Supervised learning: discover patterns in the data that relate data attributes with a target (class) attribute.
• These patterns are then utilized to predict the values of the target attribute in future data instances.
• Unsupervised learning: the data have no target attribute.
• We want to explore the data to find some intrinsic structures in them.
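To make the contrast concrete, here is a minimal sketch assuming scikit-learn is available; the tiny arrays X and y are made-up illustrative data:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

# Data attributes X; target (class) attribute y (illustrative values).
X = np.array([[1.0, 2.0], [1.2, 1.9], [8.0, 8.1], [7.9, 8.3]])
y = np.array([0, 0, 1, 1])

# Supervised: learn patterns relating X to y, then predict y for new instances.
clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict([[1.1, 2.1]]))

# Unsupervised: no target attribute; explore intrinsic structure in X alone.
km = KMeans(n_clusters=2, n_init=10).fit(X)
print(km.labels_)  # cluster IDs; the label numbering is arbitrary
```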

Clustering:
• Clustering is a technique for finding similarity groups in data, called clusters.
• That is, it groups data instances that are similar to (near) each other into one cluster, and data instances that are very different (far away) from each other into different clusters.
• Clustering is often called an unsupervised learning task, as no class values denoting an a priori grouping of the data instances are given, which is the case in supervised learning.
• For historical reasons, clustering is often considered synonymous with unsupervised learning.
• In fact, association rule mining is also unsupervised.
An illustration
• The data set has three natural groups of data points, i.e., 3 natural clusters.
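The figure itself is not reproduced here; as a substitute, a short sketch that generates a comparable toy data set with three natural groups (all values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
# 50 points scattered around each of the 3 centers -> 3 natural clusters
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in centers])
print(X.shape)  # (150, 2)
```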

What is clustering for?
• Let us see some real-life examples.
• Example 1: group people of similar sizes together to make "small", "medium" and "large" T-shirts.
• Example 2: in marketing, segment customers according to their similarities,
• to do targeted marketing.
• Example 3: given a collection of text documents, we want to organize them according to their content similarities,
• to produce a topic hierarchy.

Applications:
• In fact, clustering is one of the most utilized data mining techniques.
• It has a long history and has been used in almost every field, e.g., medicine, psychology, botany, sociology, biology, archeology, marketing, insurance, libraries, etc.
• In recent years, due to the rapid increase of online documents, text clustering has become important.

Common ways to represent clusters:
• 1. Use the centroid of each cluster to represent the cluster.
• Compute the radius and standard deviation of the cluster to determine its spread in each dimension.
• The centroid representation alone works well if the clusters are of hyper-spherical shape.
• If clusters are elongated or are of other shapes, centroids are not sufficient.
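A minimal sketch of this representation with NumPy; the cluster points are made up, and the radius here is taken as the maximum distance to the centroid (one common convention):

```python
import numpy as np

cluster = np.array([[1.0, 2.0], [1.5, 1.8], [0.8, 2.2], [1.2, 2.1]])  # made-up points

centroid = cluster.mean(axis=0)                            # center of the cluster
radius = np.linalg.norm(cluster - centroid, axis=1).max()  # max distance to centroid
spread = cluster.std(axis=0)                               # std deviation per dimension

print(centroid, radius, spread)
```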

2. Using a classification model
• All the data points in a cluster are regarded as having the same class label, e.g., the cluster ID.
• Run a supervised learning algorithm on the data to find a classification model.
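A minimal sketch of this idea, assuming scikit-learn; cluster IDs from k-means serve as class labels for a decision tree (all data is synthetic):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)    # cluster IDs as labels
tree = DecisionTreeClassifier(max_depth=2).fit(X, labels)  # model describing clusters
print(export_text(tree))                                   # human-readable rules
```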

3. Use frequent values to represent the cluster
• This method is mainly for clustering of categorical data (e.g., k-modes clustering).
• It is the main method used in text clustering, where a small set of frequent words in each cluster is selected to represent the cluster.
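A minimal sketch for the text case, using only the standard library; the documents are hypothetical:

```python
from collections import Counter

# One cluster's documents (hypothetical); represent the cluster by its frequent words.
cluster_docs = ["the cat sat on the mat", "a cat and a dog", "the dog chased the cat"]
counts = Counter(w for doc in cluster_docs for w in doc.split())
print([w for w, _ in counts.most_common(3)])  # the 3 most frequent words
```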

Clusters of arbitrary shapes:
• Hyper-elliptical and hyper-spherical clusters are usually easy to represent, using their centroids together with spreads.
• Irregularly shaped clusters are hard to represent. They may not be useful in some applications.
• Using centroids is not suitable (upper figure) in general.
• K-means clusters may be more useful (lower figure), e.g., for making 2 sizes of T-shirts.

Hierarchical Clustering:
• Produces a nested sequence of clusters, a tree, also called a dendrogram.

Explanation:
• At the bottom of the tree, there are 5 clusters (5 data points).
• At the next level, cluster 6 contains data points 1 and 2, and cluster 7 contains data points 4 and 5.
• As we move up the tree, we have fewer and fewer clusters.
• Since the whole clustering tree is stored, the user can choose to view clusters at any level of the tree.
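A minimal sketch of building such a tree with SciPy; the five 1-D points are hypothetical, chosen so that points 1-2 and 4-5 merge first:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

X = np.array([[0.0], [0.4], [2.0], [5.0], [5.3]])  # points 1-2 and 4-5 are close
Z = linkage(X, method='single')  # each row: cluster a, cluster b, distance, new size
dendrogram(Z, labels=['1', '2', '3', '4', '5'])
plt.show()
```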

Types of hierarchical clustering:
• Agglomerative (bottom-up) clustering: builds the dendrogram (tree) from the bottom level,
• merges the most similar (or nearest) pair of clusters, and
• stops when all the data points are merged into a single cluster (i.e., the root cluster).
• Divisive (top-down) clustering: starts with all data points in one cluster, the root.
• Splits the root into a set of child clusters. Each child cluster is recursively divided further.
• Stops when only singleton clusters of individual data points remain, i.e., each cluster contains only a single point.

Agglomerative clustering:
It is more popular than divisive methods.
• At the beginning, each data point forms a cluster (also called a node).
• Merge the nodes/clusters that have the least distance.
• Go on merging.
• Eventually all nodes belong to one cluster.

Agglomerative clustering algorithm:
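The pseudocode on this slide is an image that is not reproduced here; below is a minimal, naive sketch of the generic bottom-up procedure, assuming Euclidean distance and single-link merging (function names are illustrative, and the O(n^3) loop is for clarity, not efficiency):

```python
import numpy as np

def single_link(a, b):
    # cluster distance = distance of the two closest points, one from each cluster
    return min(np.linalg.norm(p - q) for p in a for q in b)

def agglomerate(points):
    clusters = [[p] for p in points]              # every point starts as a singleton
    while len(clusters) > 1:
        # find the pair of clusters with the least distance
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters.pop(j)            # merge the nearest pair (j > i)
        print([len(c) for c in clusters])         # cluster sizes after each merge
    return clusters[0]

agglomerate([np.array(p, dtype=float) for p in [[0, 0], [0, 1], [4, 4], [4, 5], [9, 9]]])
```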

An example: working of the algorithm
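The worked figure is likewise not reproduced; as a substitute, SciPy's linkage output can be printed to trace the merge order on the same five hypothetical points used above:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.array([[0.0], [0.4], [2.0], [5.0], [5.3]])
for step, (a, b, dist, size) in enumerate(linkage(X, method='single'), start=1):
    print(f"step {step}: merge clusters {int(a)} and {int(b)} "
          f"at distance {dist:.2f} (new cluster size {int(size)})")
```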

Measuring the distance of two clusters:
• There are a few ways to measure the distance of two clusters.
• Different measures result in different variations of the algorithm:
• Single link
• Complete link
• Average link
• Centroids
• …

Single link method:
• The distance between two clusters is the distance between the two closest data points in the two clusters, one data point from each cluster.
• It can find arbitrarily shaped clusters, but
• it may suffer from the undesirable "chain effect" caused by noisy points.

Figure: two natural clusters are split into two.
Complete link method:
• The distance between two clusters is the distance of the two furthest data points in the two clusters.
• It is sensitive to outliers because they are far away.

Average link and centroid methods:
• Average link: a compromise between
• the sensitivity of complete-link clustering to outliers and
• the tendency of single-link clustering to form long chains that do not correspond to the intuitive notion of clusters as compact, spherical objects.
• In this method, the distance between two clusters is the average distance of all pairwise distances between the data points in the two clusters.
• Centroid method: in this method, the distance between two clusters is the distance between their centroids.
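A minimal sketch comparing the four distance measures on two small made-up clusters, using SciPy's cdist for the pairwise distances:

```python
import numpy as np
from scipy.spatial.distance import cdist

A = np.array([[0.0, 0.0], [1.0, 0.0]])   # cluster A (made up)
B = np.array([[4.0, 0.0], [6.0, 0.0]])   # cluster B (made up)

D = cdist(A, B)                           # all pairwise distances between A and B
print("single link  :", D.min())          # two closest points         -> 3.0
print("complete link:", D.max())          # two furthest points        -> 6.0
print("average link :", D.mean())         # mean of pairwise distances -> 4.5
print("centroid     :", np.linalg.norm(A.mean(axis=0) - B.mean(axis=0)))  # -> 4.5
```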
