CLUSTERING
Supervised learning vs. unsupervised learning
• Supervised learning: discover patterns in the data that relate data attributes to a target (class) attribute.
• These patterns are then utilized to predict the values of the target attribute
in future data instances.
• Unsupervised learning: The data have no target attribute.
• We want to explore the data to find some intrinsic structures in them.
Clustering:
• Clustering is a technique for finding similarity groups in data, called clusters. That is,
• it groups data instances that are similar to (near) each other into one cluster and data instances that are very different (far away) from each other into different clusters.
• Clustering is often called an unsupervised learning task because no class values denoting an a priori grouping of the data instances are given, as they are in supervised learning.
• For historical reasons, clustering is often considered synonymous with unsupervised learning.
• In fact, association rule mining is also unsupervised.
An illustration
• The data set has three natural groups of data points, i.e., 3
natural clusters.
What is clustering for?
• Let us see some real-life examples
• Example 1: group people of similar sizes together to make “small”, “medium” and “large” T-shirts.
• Example 2: In marketing, segment customers according to
their similarities
• To do targeted marketing.
• Example 3: Given a collection of text documents, we want to
organize them according to their content similarities,
• To produce a topic hierarchy
Applications:
• In fact, clustering is one of the most utilized data mining
techniques.
• It has a long history and has been used in almost every field, e.g., medicine, psychology, botany, sociology, biology, archeology, marketing, insurance, libraries, etc.
• In recent years, due to the rapid increase in the number of online documents, text clustering has become important.
Common ways to represent clusters:
• 1. Use the centroid of each cluster to represent the cluster.
• Compute the radius and standard deviation of the cluster to determine its spread in each dimension (see the sketch below).
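The following is a minimal sketch of this centroid-based representation, assuming NumPy; the function name and toy data are illustrative, not from the original slides.

```python
import numpy as np

def describe_cluster(points):
    """points: (n, d) array of the data points assigned to one cluster."""
    centroid = points.mean(axis=0)                     # center of the cluster
    dists = np.linalg.norm(points - centroid, axis=1)  # distance of each point to the centroid
    radius = dists.max()                               # distance to the farthest member
    spread = points.std(axis=0)                        # standard deviation per dimension
    return centroid, radius, spread

# Example: a small 2-D cluster
cluster = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2]])
print(describe_cluster(cluster))
```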
2. Use a classification model
• All the data points in a cluster are regarded as having the same class label, e.g., the cluster ID.
• Run a supervised learning algorithm on the data to find a classification model (see the sketch below).
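One way this could look in practice, assuming scikit-learn: cluster IDs are treated as class labels and a decision tree is trained to predict them, so the tree's rules summarize each cluster. The data here are random and purely illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier, export_text

X = np.random.rand(100, 2)                     # toy data; stands in for a real data set
cluster_ids = KMeans(n_clusters=3, n_init=10).fit_predict(X)   # cluster IDs become class labels

tree = DecisionTreeClassifier(max_depth=3).fit(X, cluster_ids)
print(export_text(tree))                       # human-readable rules describing the cluster regions
```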
3. Use frequent values to represent the cluster
• This method is mainly for clustering of categorical data (e.g.,
k-modes clustering).
• It is also the main method used in text clustering, where a small set of frequent words in each cluster is selected to represent the cluster (see the sketch below).
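A minimal sketch of the frequent-words representation, using only the Python standard library; the toy documents and stop-word list are made up for illustration.

```python
from collections import Counter

cluster_docs = [
    "clustering groups similar documents together",
    "document clustering finds topic groups",
    "similar documents form one cluster",
]
stop_words = {"the", "a", "one", "together"}

# Count word occurrences across all documents in the cluster, ignoring stop words.
counts = Counter(
    word
    for doc in cluster_docs
    for word in doc.lower().split()
    if word not in stop_words
)
print(counts.most_common(3))   # the few most frequent words label the cluster
```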
Clusters of arbitrary shapes:
• Hyper-elliptical and hyper-spherical clusters
are usually easy to represent, using their
centroid together with spreads.
• Irregular shape clusters are hard to represent.
They may not be useful in some applications.
• Using centroids is not suitable in general (upper figure).
• K-means clusters may be more useful (lower figure), e.g., for making two sizes of T-shirts.
Hierarchical Clustering:
• Produce a nested sequence of clusters, a tree, also called a dendrogram.
Explanation:
• At the bottom of the tree, there are 5 clusters (5 data points).
• At the next level, cluster 6 contains data points 1 and 2, and cluster 7
contains data points 4 and 5.
• As we move up the tree, we have fewer and fewer clusters.
• Since the whole clustering tree is stored, the user can choose to view
clusters at any level of the tree.
Types of hierarchical clustering:
• Agglomerative (bottom up) clustering: It builds the dendrogram
(tree) from the bottom level, and
• merges the most similar (or nearest) pair of clusters
• stops when all the data points are merged into a single cluster (i.e., the
root cluster).
• Divisive (top down) clustering: It starts with all data points in one
cluster, the root.
• Splits the root into a set of child clusters. Each child cluster is recursively
divided further
• stops when only singleton clusters of individual data points remain, i.e., each cluster contains only a single point.
Agglomerative clustering:
It is more popular than divisive methods.
• At the beginning, each data point forms a cluster (also called a node).
• Merge nodes/clusters that have the least distance.
• Go on merging
• Eventually all nodes belong to one cluster
Agglomerative clustering algorithm:
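The algorithm itself appears as a figure in the original slides. The following is a minimal single-link sketch in plain Python, illustrative only and not a reproduction of that figure: start with one cluster per point, repeatedly merge the nearest pair, and record the merge sequence.

```python
import math

def single_link_distance(c1, c2):
    """Distance between two clusters = distance between their two closest points."""
    return min(math.dist(p, q) for p in c1 for q in c2)

def agglomerative(points):
    # At the beginning, each data point forms its own cluster.
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the least distance ...
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: single_link_distance(clusters[ab[0]], clusters[ab[1]]),
        )
        # ... and merge them into one cluster.
        merges.append((clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges   # the merge sequence defines the dendrogram
```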
An example: working of the algorithm
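The worked example in the original slides is also a figure. As a stand-in, here is how SciPy's agglomerative implementation (scipy.cluster.hierarchy.linkage, assumed available) traces the merges on five made-up 1-D points, echoing the five-point dendrogram described earlier.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

points = np.array([[1.0], [1.2], [3.0], [5.0], [5.1]])   # 5 toy data points
Z = linkage(points, method='single')                      # bottom-up, nearest-pair merges

# Each row of Z records one merge: [cluster_a, cluster_b, distance, new cluster size].
print(Z)
```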
Measuring the distance of two clusters:
• There are several ways to measure the distance between two clusters.
• Each results in a different variation of the algorithm.
• Single link
• Complete link
• Average link
• Centroids
• …
Single link method:
• The distance between two clusters
is the distance between two closest
data points in the two clusters, one
data point from each cluster.
• It can find arbitrarily shaped
clusters, but
• It may suffer from an undesirable “chain effect” caused by noisy points.
Complete link method:
• The distance between two clusters is the distance between the two furthest data points in the two clusters.
• It is sensitive to outliers because they are far away.
Average link and centroid methods:
• Average link: A compromise between
• the sensitivity of complete-link clustering to outliers and
• the tendency of single-link clustering to form long chains that do
not correspond to the intuitive notion of clusters as compact,
spherical objects.
• In this method, the distance between two clusters is the average of all pairwise distances between the data points in the two clusters.
• Centroid method: In this method, the distance between two clusters is the distance between their centroids (all four linkage choices are illustrated in the sketch below).
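A minimal sketch, assuming SciPy and NumPy, showing that single, complete, average, and centroid linkage are simply different cluster-distance definitions passed to the same agglomerative procedure; the random data and the choice of cutting the tree into 3 clusters are illustrative only.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(20, 2)   # toy 2-D data

for method in ("single", "complete", "average", "centroid"):
    Z = linkage(X, method=method)                      # build the dendrogram with this linkage
    labels = fcluster(Z, t=3, criterion="maxclust")    # cut the tree into 3 clusters
    print(method, labels)
```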