
CS3121 - Introduction to Data Science

Unsupervised Learning
Dr. Nisansa de Silva,
Department of Computer Science & Engineering
http://nisansads.staff.uom.lk/
Supervised Learning vs. Unsupervised Learning
▪ Supervised learning: discover patterns in the data that
relate data attributes with a target (class) attribute.
– These patterns are then utilized to predict the values of the target
attribute in future data instances.

▪ Unsupervised learning: The data have no target attribute.


– We want to explore the data to find some intrinsic structures in
them.

2
Clustering
▪ The organization of unlabeled data into similarity groups
called clusters.
– It groups data instances that are similar to (near) each other into one cluster and data instances that are very different (far away) from each other into different clusters.
▪ Clustering is often called an unsupervised learning task because no class values denoting an a priori grouping of the data instances are given, as they are in supervised learning.
▪ Due to historical reasons, clustering is often considered
synonymous with unsupervised learning.
– In fact, association rule mining is also unsupervised
▪ This lecture focuses on clustering.

3
Historic Application of Clustering
▪ John Snow, a London physician, plotted the locations of cholera deaths on a map during an outbreak in the 1850s.
▪ The locations indicated that cases were
clustered around certain intersections
where there were polluted wells, thus
exposing both the problem and the
solution.

4
Computer vision application: Image segmentation

https://tariq-hasan.github.io/concepts/computer-vision-semantic-segmentation/

5
An illustration
▪ The data set has three natural groups of data points, i.e., 3 natural clusters.

6
What is clustering for?
▪ Let us see some real-life examples
▪ Example 1: group people of similar sizes together to make “small”, “medium” and “large” T-shirts.
– Tailor-made for each person: too expensive
– One-size-fits-all: does not fit all.
▪ Example 2: In marketing, segment customers according to their similarities
– To do targeted marketing.
▪ Example 3: Given a collection of text documents, we want to organize them according to
their content similarities,
– To produce a topic hierarchy
▪ In fact, clustering is one of the most utilized data mining techniques.
– It has a long history, and used in almost every field, e.g., medicine, psychology, botany, sociology,
biology, archeology, marketing, insurance, libraries, etc.
– In recent years, due to the rapid increase of online documents, text clustering becomes important.

7
What do we need for clustering?
1. Proximity measure. Either,
• Similarity measure $s(x_i, x_k)$: large if $x_i$ and $x_k$ are similar
• Dissimilarity (or distance) measure $d(x_i, x_k)$: small if $x_i$ and $x_k$ are similar
[Figure: a dissimilar pair has large $d$, small $s$; a similar pair has large $s$, small $d$]

2. Criterion function to evaluate a clustering (Clustering quality)

[Figure: good clustering vs. bad clustering]

3. Algorithm to compute clustering (Clustering techniques)

8
Distance (Dissimilarity) Measures
▪ Euclidean distance
– $d(x_i, x_j) = \sqrt{\sum_{k=1}^{d} (x_{i,k} - x_{j,k})^2}$
– Translation invariant

▪ Manhattan (city block) distance
– $d(x_i, x_j) = \sum_{k=1}^{d} |x_{i,k} - x_{j,k}|$
– Approximation of Euclidean distance

▪ They are special cases of the Minkowski distance:
– $d_p(x_i, x_j) = \left( \sum_{k=1}^{d} |x_{i,k} - x_{j,k}|^p \right)^{1/p}$
– where $p$ is a positive integer ($p = 2$ gives Euclidean, $p = 1$ gives Manhattan)
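As an illustration, a minimal NumPy sketch of the Minkowski distance (the function name and the example points are ours, not from the slides):

```python
import numpy as np

def minkowski(x_i, x_j, p=2):
    """Minkowski distance between two d-dimensional points.
    p=2 gives the Euclidean distance, p=1 the Manhattan distance."""
    x_i, x_j = np.asarray(x_i, dtype=float), np.asarray(x_j, dtype=float)
    return np.sum(np.abs(x_i - x_j) ** p) ** (1.0 / p)

a, b = [0.0, 0.0], [3.0, 4.0]
print(minkowski(a, b, p=2))  # 5.0 (Euclidean)
print(minkowski(a, b, p=1))  # 7.0 (Manhattan)
```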

9
Clustering Quality: Cluster Evaluation (a Hard Problem)
▪ Intra-cluster cohesion (compactness):
– Maximized
– Cohesion measures how near the data points in a cluster are to the cluster centroid.
– Sum of squared error (SSE) is a commonly used measure.

▪ Inter-cluster separation (isolation):


– Maximized
– Separation means that different cluster centroids should be far away from one another.

▪ In most applications, expert judgments are still the key


▪ However, the overall quality of a clustering result depends on the algorithm, the
distance function, and the application.

10
How Many Clusters?

3 clusters or 2 clusters?
▪ Possible approaches
1. Fix the number of clusters to 𝑘
2. Find the best clustering according to the criterion function (number of clusters may vary)

11
Clustering Techniques
Clustering
– Hierarchical: Divisive, Agglomerative
– Partitional: Centroid, Model Based, Graph Theoretic, Spectral
– Bayesian: Decision Based, Nonparametric

12
Clustering Techniques

▪ Hierarchical algorithms find successive clusters using previously established clusters. These
algorithms can be either agglomerative (“bottom-up”) or divisive (“top-down”).
– Agglomerative algorithms begin with each element as a separate cluster and merge them into
successively larger clusters.
– Divisive algorithms begin with the whole set and proceed to divide it into successively smaller clusters.

13
Clustering Techniques

▪ Partitional algorithms typically determine all clusters at once, but they can also be used
as divisive algorithms in hierarchical clustering.

14
Clustering Techniques

▪ Bayesian algorithms try to generate a posterior distribution over the collection of all
partitions of the data.

15
Clustering Techniques

▪ K-means is the best-known centroid-based partitional algorithm.

16
K-means algorithm

17
K-means Clustering
▪ K-means (MacQueen, 1967) is a partitional clustering
algorithm

▪ Let the set of data points (or instances) D be {x1, x2, …, xn},
where xi = (xi1, xi2, …, xir) is a vector in a real-valued space X ⊆ R^r,
and r is the number of attributes (dimensions) in the data.

▪ The k-means algorithm partitions the given data into k


clusters.
– Each cluster has a cluster center, called the centroid.
– k is specified by the user

18
K-means algorithm
▪ Given k, the k-means algorithm works as follows (a Python sketch follows the steps below):
1) Randomly choose k data points (seeds) to be the initial centroids (cluster centers)
2) Assign each data point to the closest centroid
3) Re-compute the centroids using the current cluster memberships.
4) If a convergence criterion is not met, go to 2).
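A minimal NumPy sketch of these four steps (the function name, the random seeding, and the centroid-change convergence test are our choices; empty clusters are not handled):

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Plain k-means sketch: X is an (n, r) array, k is the number of clusters."""
    rng = np.random.default_rng(seed)
    # 1) Randomly choose k data points (seeds) as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # 2) Assign each data point to the closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3) Re-compute the centroids from the current cluster memberships.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 4) Stop once the centroids no longer change (a convergence criterion).
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Example: centroids, labels = kmeans(np.random.default_rng(1).normal(size=(100, 2)), k=3)
```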

19
Stopping/Convergence Criterion
1. no (or minimum) re-assignments of data points to different clusters,
2. no (or minimum) change of centroids, or
3. minimum decrease in the sum of squared error (SSE),

$\mathrm{SSE} = \sum_{j=1}^{k} \sum_{x \in C_j} \operatorname{dist}(x, m_j)^2$    (1)

– $C_j$ is the jth cluster
– $m_j$ is the centroid of cluster $C_j$ (the mean vector of all the data points in $C_j$)
– $\operatorname{dist}(x, m_j)$ is the distance between data point $x$ and centroid $m_j$.
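A small sketch of equation (1), assuming Euclidean distance and the (centroids, labels) arrays returned by the kmeans sketch above:

```python
import numpy as np

def sse(X, labels, centroids):
    """Equation (1): sum over clusters of squared distances to the cluster centroid."""
    total = 0.0
    for j, m_j in enumerate(centroids):
        members = X[labels == j]               # points assigned to cluster C_j
        total += np.sum((members - m_j) ** 2)  # squared Euclidean distances to m_j
    return total

# Example: sse(X, labels, centroids) after running the kmeans sketch on X.
```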

20
K-means Clustering Example: Iteration 1: Step 1
▪ Randomly initialize the cluster centers (synaptic weights)

21
K-means Clustering Example: Iteration 1: Step 2
▪ Determine cluster membership for each input (“winner-takes-all” inhibitory circuit)

22
K-means Clustering Example: Iteration 1: Step 3
▪ Re-estimate cluster centers (adapt synaptic weights)

23
K-means Clustering Example: Iteration 1: Result

24
K-means Clustering Example: Iteration 2

25
K-means Clustering Example: Iteration 2: Result

26
Strengths and Weaknesses of k-means
▪ Strengths:
– Simple: easy to understand and to implement.
– Efficient: time complexity is O(tkn), where n is the number of data points, k is the number of clusters, and t is the number of iterations.
– Since both k and t are small, k-means is considered a linear algorithm.

▪ Weaknesses:
– The algorithm is only applicable if the mean is defined.
▪ For categorical data, k-modes is used: the centroid is represented by the most frequent values.
– The user needs to specify k.
– The algorithm is sensitive to outliers.
▪ Outliers are data points that are very far away from other data points.
▪ Outliers could be errors in the data recording or some special data points with very different values.

▪ K-means is the most popular clustering algorithm.


▪ Note that it terminates at a local optimum if SSE is used; the global optimum is hard to find
due to the complexity of the problem.

29
K-means Issues: Difficult to Handle Outliers
▪ One method is to remove some data
points in the clustering process that are
much further away from the centroids
than other data points.
– To be safe, we may want to monitor these
possible outliers over a few iterations and
then decide to remove them.

▪ Another method is to perform random


sampling. Since in sampling we only
choose a small subset of the data
points, the chance of selecting an
outlier is very small.
– Assign the rest of the data points to the
clusters by distance or similarity comparison,
or classification

30
K-means Issues: Sensitivity to Initial Seeds

31
K-means Issues: Special Data Structures
▪ The k-means algorithm is not suitable for discovering clusters that are not hyper-
ellipsoids (or hyper-spheres).

[Figure: two natural clusters vs. the clusters found by k-means]


32
K-means summary
▪ Despite its weaknesses, k-means is still the most popular algorithm due to its simplicity and efficiency.
– Other clustering algorithms have their own lists of weaknesses.

▪ No clear evidence that any other clustering algorithm performs better in general
– although they may be more suitable for some specific types of data or applications.

▪ Comparing different clustering algorithms is a difficult task. No one knows the correct
clusters!

33
Hierarchical Clustering

34
Hierarchical Clustering
▪ Produces a nested sequence of clusters, a tree, also called a dendrogram.
▪ The dendrogram is the preferred way to represent a hierarchical clustering.

36
Hierarchical Clustering
▪ So far we only talked about “flat” clustering.

▪ For some data, hierarchical clustering is more appropriate than “flat” clustering.

37
Example: Biological Taxonomy

38
Types of Hierarchical Clustering
▪ Agglomerative (bottom up) clustering: It builds the dendrogram (tree) from the
bottom level, and
– merges the most similar (or nearest) pair of clusters
– stops when all the data points are merged into a single cluster (i.e., the root cluster).

▪ Divisive (top down) clustering: It starts with all data points in one cluster, the root.
– Splits the root into a set of child clusters. Each child cluster is recursively divided further
– stops when only singleton clusters of individual data points remain, i.e., each cluster with
only a single point

39
Divisive Hierarchical Clustering
▪ Any “flat” algorithm which produces a fixed number of clusters can be used.
▪ Set c = 2, i.e., split each cluster into two at every step (a sketch follows below).
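A toy sketch of this idea, reusing the kmeans function from the earlier sketch as the flat algorithm with c = 2. The fixed split depth is our simplification; a real implementation would choose which cluster to split next, e.g. by its SSE:

```python
import numpy as np

def divisive_bisecting(X, depth=2, seed=0):
    """Divisive ("top-down") clustering sketch: repeatedly split every cluster
    into c = 2 parts using the flat kmeans sketch defined earlier.
    Returns a list of index arrays, one per leaf cluster."""
    clusters = [np.arange(len(X))]            # start with one root cluster
    for _ in range(depth):                    # each pass splits every current cluster
        next_level = []
        for idx in clusters:
            if len(idx) < 2:                  # nothing left to split
                next_level.append(idx)
                continue
            _, labels = kmeans(X[idx], k=2, seed=seed)  # flat algorithm with c = 2
            next_level.append(idx[labels == 0])
            next_level.append(idx[labels == 1])
        clusters = next_level
    return clusters
```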

40
Agglomerative Hierarchical Clustering
It is more popular than divisive methods.
▪ At the beginning, each data point forms a cluster (also called a node).
▪ Merge nodes/clusters that have the least distance.
▪ Go on merging
▪ Eventually all nodes belong to one cluster

41
An example: Working of the Algorithm

42
Measuring the Cluster Distance
▪ Four common ways to measure cluster distance
– Minimum distance: $d_{\min}(D_i, D_j) = \min_{x \in D_i,\, y \in D_j} \lVert x - y \rVert$
– Maximum distance: $d_{\max}(D_i, D_j) = \max_{x \in D_i,\, y \in D_j} \lVert x - y \rVert$
– Average distance: $d_{\mathrm{avg}}(D_i, D_j) = \frac{1}{n_i n_j} \sum_{x \in D_i} \sum_{y \in D_j} \lVert x - y \rVert$
– Mean distance: $d_{\mathrm{mean}}(D_i, D_j) = \lVert \mu_i - \mu_j \rVert$

▪ Each way of measuring the distance between two clusters results in a different variation of the algorithm (see the SciPy sketch below):
– Single link
– Complete link
– Average link
– Centroid
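A short SciPy sketch showing the four variations; the toy data and the choice to cut the tree into three clusters are our own illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))                 # toy 2-D data set

# 'single', 'complete', 'average', and 'centroid' correspond to the
# minimum, maximum, average, and mean cluster distances defined above.
for method in ["single", "complete", "average", "centroid"]:
    Z = linkage(X, method=method)            # (n-1, 4) merge history (the dendrogram)
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters
    print(method, labels)
```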

43
Single Link Method (Nearest Neighbor)
▪ The distance between two clusters is the distance between two closest data points
in the two clusters, one data point from each cluster.
▪ Agglomerative clustering with minimum distance.
▪ Generates minimum spanning tree.
▪ Encourages growth of elongated clusters.
▪ It can find arbitrarily shaped clusters, but
– It may cause the undesirable “chain effect” by noisy points.
[Figure: two natural clusters are split into two]

$d_{\min}(D_i, D_j) = \min_{x \in D_i,\, y \in D_j} \lVert x - y \rVert$

44
Complete Link Method (Farthest Neighbor)
▪ The distance between two clusters is the distance
of two furthest data points in the two clusters.
▪ Agglomerative clustering with maximum distance
▪ Encourages compact clusters
▪ It is sensitive to outliers because they are far away
▪ Does not work if elongated clusters are present
– $d_{\max}(D_1, D_2) < d_{\max}(D_2, D_3)$
– Thus $D_1$ and $D_2$ are merged instead of $D_2$ and $D_3$.

$d_{\max}(D_i, D_j) = \max_{x \in D_i,\, y \in D_j} \lVert x - y \rVert$

45
Average Link and Centroid Methods
▪ Average link: A compromise between
– the sensitivity of complete-link clustering to outliers and
– the tendency of single-link clustering to form long chains that do not correspond to the intuitive
notion of clusters as compact, spherical objects.
– In this method, the distance between two clusters is the average distance of all pair-wise distances
between the data points in two clusters.

▪ Centroid method: In this method, the distance between two clusters is the distance
between their centroids

46
Divisive vs. Agglomerative
▪ Agglomerative is faster to compute, in general
▪ Divisive may be less “blind” to the global structure of the data.

– Divisive: when taking the first step (the split), it has access to all the data and can find the best possible split into 2 parts.
– Agglomerative: when taking the first step (a merge), it does not consider the global structure of the data, only pairwise structure.

48
How to choose a clustering algorithm
▪ Clustering research has a long history. A vast collection of algorithms are available.
– We only introduced several main algorithms.
▪ Choosing the “best” algorithm is a challenge.
– Every algorithm has limitations and works well with certain data distributions.
– It is very hard, if not impossible, to know what distribution the application data follow. The data may not
fully follow any “ideal” structure or distribution required by the algorithms.
– One also needs to decide how to standardize the data, to choose a suitable distance function and to
select other parameter values.
▪ Due to these complexities, the common practice is to
– run several algorithms using different distance functions and parameter settings, and
– then carefully analyze and compare the results.
▪ The interpretation of the results must be based on insight into the meaning of the original
data together with knowledge of the algorithms used.
▪ Clustering is highly application dependent and, to a certain extent, subjective (personal preferences).

49
Cluster Evaluation

50
Cluster Evaluation is a Hard Problem
▪ The quality of a clustering is very hard to evaluate because
– We do not know the correct clusters

▪ Some methods are used:


– User inspection
▪ Study centroids, and spreads
▪ Rules from a decision tree.
▪ For text documents, one can read some documents in clusters.

51
Evaluation Measures: Ground Truth
▪ We use some labeled data (for classification)
▪ Assumption: Each class is a cluster.
▪ After clustering, a confusion matrix is constructed. From the matrix, we compute
various measures: entropy, purity, precision, recall and F-score.
– Let the classes in the data D be C = (c1, c2, …, ck). The clustering method produces k clusters,
which divides D into k disjoint subsets, D1, D2, …, Dk.
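The entropy and purity slides that follow use the standard weighted-over-clusters definitions (as in Bing Liu's Chapter 4); here is a small sketch of those definitions, assuming integer class and cluster labels:

```python
import numpy as np

def purity_and_entropy(class_labels, cluster_labels):
    """Weighted purity and entropy of a clustering against ground-truth classes."""
    class_labels = np.asarray(class_labels)
    cluster_labels = np.asarray(cluster_labels)
    n = len(class_labels)
    total_purity, total_entropy = 0.0, 0.0
    for j in np.unique(cluster_labels):
        members = class_labels[cluster_labels == j]   # classes found inside cluster D_j
        counts = np.bincount(members)
        p = counts[counts > 0] / len(members)         # class proportions within D_j
        weight = len(members) / n                     # |D_j| / |D|
        total_purity += weight * p.max()              # purity: share of the dominant class
        total_entropy += weight * -(p * np.log2(p)).sum()
    return total_purity, total_entropy

# Example: 3 classes, a clustering that mixes two of them.
print(purity_and_entropy([0, 0, 1, 1, 2, 2], [0, 0, 0, 1, 1, 1]))
```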

52
Evaluation Measures: Entropy

53
Evaluation Measures: Purity

54
An example

55
A remark about ground truth evaluation
▪ Commonly used to compare different clustering algorithms.
▪ A real-life data set for clustering has no class labels.
– Thus, although an algorithm may perform very well on some labeled data sets, there is no
guarantee that it will perform well on the actual application data at hand.

▪ The fact that it performs well on some labeled data sets does give us some
confidence in the quality of the algorithm.
▪ This evaluation method is said to be based on external data or information.

56
Evaluation based on internal information
▪ Intra-cluster cohesion (compactness):
– Cohesion measures how near the data points in a cluster are to the cluster centroid.
– Sum of squared error (SSE) is a commonly used measure.

▪ Inter-cluster separation (isolation):


– Separation means that different cluster centroids should be far away from one another.

▪ In most applications, expert judgments are still the key.

57
Indirect evaluation
▪ In some applications, clustering is not the primary task, but used to help
perform another task.
▪ We can use the performance on the primary task to compare clustering
methods.
▪ For instance, in an application, the primary task is to provide
recommendations on book purchasing to online shoppers.
– If we can cluster books according to their features, we might be able to provide better
recommendations.
– We can evaluate different clustering algorithms based on how well they help with the
recommendation task.
– Here, we assume that the recommendation can be reliably evaluated.

58
Summary

59
Summary
▪ Clustering has a long history and is still an active research area
– More are still coming every year.
▪ We only introduced several main algorithms. There are many others, e.g.,
– Density based algorithm
– Sub-space clustering
– Scale-up methods,
– Neural networks based methods
– Fuzzy clustering
– Co-clustering
▪ Clustering is hard to evaluate, but very useful in practice.
– This partially explains why there are still a large number of clustering algorithms being devised
every year.
▪ Clustering is highly application dependent and to some extent subjective.
▪ Competitive learning in neuronal networks performs clustering analysis of the input
data

60
References
▪ “CS583 - Chapter 4: Unsupervised Learning” by Bing Liu
▪ “Class 13 - Unsupervised learning – Clustering” by Shimon Ullman, Tomaso Poggio,
Danny Harari, Daniel Zysman, and Darren Seibert
▪ “Machine Learning” by Andrew Ng

61
