
CS3121 - Introduction to Data Science

Unsupervised Learning
Dr. Nisansa de Silva,
Department of Computer Science & Engineering
http://nisansads.staff.uom.lk/
Supervised Learning vs. Unsupervised Learning
▪ Supervised learning: discover patterns in the data that
relate data attributes with a target (class) attribute.
– These patterns are then utilized to predict the values of the target
attribute in future data instances.

▪ Unsupervised learning: The data have no target attribute.


– We want to explore the data to find some intrinsic structures in
them.

2
Clustering
▪ The organization of unlabeled data into similarity groups
called clusters.
– It groups data instances that are similar to (near) each other into one cluster and data instances that are very different (far away) from each other into different clusters.
▪ Clustering is often called an unsupervised learning task because no class values denoting an a priori grouping of the data instances are given, as they are in supervised learning.
▪ Due to historical reasons, clustering is often considered
synonymous with unsupervised learning.
– In fact, association rule mining is also unsupervised
▪ This lecture focuses on clustering.

3
Historic Application of Clustering
▪ John Snow, a London physician, plotted the locations of cholera deaths on a map during an outbreak in the 1850s.
▪ The locations indicated that cases were
clustered around certain intersections
where there were polluted wells, thus
exposing both the problem and the
solution.

4
Computer vision application: Image segmentation

https://tariq-hasan.github.io/concepts/computer-vision-semantic-segmentation/

5
An illustration
▪ The data set has three natural groups of data points, i.e., 3 natural clusters.

6
What is clustering for?
▪ Let us see some real-life examples
▪ Example 1: group people of similar sizes together to make “small”, “medium” and “large” T-shirts.
– Tailor-made for each person: too expensive
– One-size-fits-all: does not fit all.
▪ Example 2: In marketing, segment customers according to their similarities
– To do targeted marketing.
▪ Example 3: Given a collection of text documents, we want to organize them according to
their content similarities,
– To produce a topic hierarchy
▪ In fact, clustering is one of the most utilized data mining techniques.
– It has a long history, and used in almost every field, e.g., medicine, psychology, botany, sociology,
biology, archeology, marketing, insurance, libraries, etc.
– In recent years, due to the rapid increase of online documents, text clustering becomes important.

7
What do we need for clustering?
1. Proximity measure. Either,
• Similarity measure $s(x_i, x_k)$: large if $x_i$ and $x_k$ are similar
• Dissimilarity (or distance) measure $d(x_i, x_k)$: small if $x_i$ and $x_k$ are similar
[Figure: a dissimilar pair has large $d$, small $s$; a similar pair has large $s$, small $d$]

2. Criterion function to evaluate a clustering (Clustering quality)

[Figure: good clustering vs. bad clustering]

3. Algorithm to compute clustering (Clustering techniques)

8
Distance (Dissimilarity) Measures
▪ Euclidean distance
– $d(x_i, x_j) = \sqrt{\sum_{k=1}^{d} (x_{i,k} - x_{j,k})^2}$
– Translation invariant

▪ Manhattan (city block) distance
– $d(x_i, x_j) = \sum_{k=1}^{d} |x_{i,k} - x_{j,k}|$
– Approximation of Euclidean distance

▪ They are special cases of the Minkowski distance:
– $d_p(x_i, x_j) = \left( \sum_{k=1}^{d} |x_{i,k} - x_{j,k}|^p \right)^{1/p}$
– where $p$ is a positive integer ($p = 2$ gives Euclidean, $p = 1$ gives Manhattan)
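As an illustration, a minimal NumPy sketch of the Minkowski distance (the function name and the example points are ours, not from the slides):

```python
import numpy as np

def minkowski(x_i, x_j, p=2):
    """Minkowski distance between two d-dimensional points.
    p=2 gives the Euclidean distance, p=1 the Manhattan distance."""
    x_i, x_j = np.asarray(x_i, dtype=float), np.asarray(x_j, dtype=float)
    return np.sum(np.abs(x_i - x_j) ** p) ** (1.0 / p)

a, b = [0.0, 0.0], [3.0, 4.0]
print(minkowski(a, b, p=2))  # 5.0 (Euclidean)
print(minkowski(a, b, p=1))  # 7.0 (Manhattan)
```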

9
Clustering Quality: Cluster Evaluation (a Hard Problem)
▪ Intra-cluster cohesion (compactness):
– Maximized
– Cohesion measures how near the data points in a cluster are to the cluster centroid.
– Sum of squared error (SSE) is a commonly used measure.

▪ Inter-cluster separation (isolation):


– Maximized
– Separation means that different cluster centroids should be far away from one another.

▪ In most applications, expert judgments are still the key


▪ However, the overall quality of a clustering result depends on the algorithm, the
distance function, and the application.

10
How Many Clusters?

3 clusters or 2 clusters?
▪ Possible approaches
1. Fix the number of clusters to 𝑘
2. Find the best clustering according to the criterion function (number of clusters may vary)

11
Clustering Techniques
Clustering
– Hierarchical: Divisive, Agglomerative
– Partitional: Centroid, Model Based, Graph Theoretic, Spectral
– Bayesian: Decision Based, Nonparametric

12
Clustering Techniques

▪ Hierarchical algorithms find successive clusters using previously established clusters. These
algorithms can be either agglomerative (“bottom-up”) or divisive (“top-down”).
– Agglomerative algorithms begin with each element as a separate cluster and merge them into
successively larger clusters.
– Divisive algorithms begin with the whole set and proceed to divide it into successively smaller clusters.

13
Clustering Techniques

▪ Partitional algorithms typically determine all clusters at once, but they can also be used
as divisive algorithms in hierarchical clustering.

14
Clustering Techniques

▪ Bayesian algorithms try to generate a posterior distribution over the collection of all
partitions of the data.

15
Clustering Techniques

▪ K-means is the best-known centroid-based partitional algorithm.

16
K-means algorithm

17
K-means Clustering
▪ K-means (MacQueen, 1967) is a partitional clustering
algorithm

▪ Let the set of data points (or instances) D be {x1, x2, …, xn},
where xi = (xi1, xi2, …, xir) is a vector in a real-valued space X ⊆ R^r,
and r is the number of attributes (dimensions) in the data.

▪ The k-means algorithm partitions the given data into k


clusters.
– Each cluster has a cluster center, called the centroid.
– k is specified by the user

18
K-means algorithm
▪ Given k, the k-means algorithm works as follows (a Python sketch follows the steps below):
1) Randomly choose k data points (seeds) to be the initial centroids (cluster centers)
2) Assign each data point to the closest centroid
3) Re-compute the centroids using the current cluster memberships.
4) If a convergence criterion is not met, go to 2).
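A minimal NumPy sketch of these four steps (the function name, the random seeding, and the centroid-change convergence test are our choices; empty clusters are not handled):

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Plain k-means sketch: X is an (n, r) array, k is the number of clusters."""
    rng = np.random.default_rng(seed)
    # 1) Randomly choose k data points (seeds) as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # 2) Assign each data point to the closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3) Re-compute the centroids from the current cluster memberships.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 4) Stop once the centroids no longer change (a convergence criterion).
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Example: centroids, labels = kmeans(np.random.default_rng(1).normal(size=(100, 2)), k=3)
```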

19
Stopping/Convergence Criterion
1. no (or minimum) re-assignments of data points to different clusters,
2. no (or minimum) change of centroids, or
3. minimum decrease in the sum of squared error (SSE),

$\mathrm{SSE} = \sum_{j=1}^{k} \sum_{x \in C_j} \operatorname{dist}(x, m_j)^2$    (1)

– $C_j$ is the jth cluster
– $m_j$ is the centroid of cluster $C_j$ (the mean vector of all the data points in $C_j$)
– $\operatorname{dist}(x, m_j)$ is the distance between data point $x$ and centroid $m_j$.
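A small sketch of equation (1), assuming Euclidean distance and the (centroids, labels) arrays returned by the kmeans sketch above:

```python
import numpy as np

def sse(X, labels, centroids):
    """Equation (1): sum over clusters of squared distances to the cluster centroid."""
    total = 0.0
    for j, m_j in enumerate(centroids):
        members = X[labels == j]               # points assigned to cluster C_j
        total += np.sum((members - m_j) ** 2)  # squared Euclidean distances to m_j
    return total

# Example: sse(X, labels, centroids) after running the kmeans sketch on X.
```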

20
K-means Clustering Example: Iteration 1: Step 1
▪ Randomly initialize the cluster centers (synaptic weights)

21
K-means Clustering Example: Iteration 1: Step 2
▪ Determine cluster membership for each input (“winner-takes-all” inhibitory circuit)

22
K-means Clustering Example: Iteration 1: Step 3
▪ Re-estimate cluster centers (adapt synaptic weights)

23
K-means Clustering Example: Iteration 1: Result

24
K-means Clustering Example: Iteration 2

25
K-means Clustering Example: Iteration 2: Result

26
Strengths and Weaknesses of k-means
▪ Strengths:
– Simple: easy to understand and to implement.
– Efficient: time complexity is O(tkn), where n is the number of data points, k is the number of clusters, and t is the number of iterations.
– Since both k and t are small, k-means is considered a linear algorithm.

▪ Weaknesses:
– The algorithm is only applicable if the mean is defined.
▪ For categorical data, k-modes is used: the centroid is represented by the most frequent values.
– The user needs to specify k.
– The algorithm is sensitive to outliers.
▪ Outliers are data points that are very far away from other data points.
▪ Outliers could be errors in the data recording or some special data points with very different values.

▪ K-means is the most popular clustering algorithm.


▪ Note that it terminates at a local optimum if SSE is used; the global optimum is hard to find
due to the complexity of the problem.

29
K-means Issues: Difficult to Handle Outliers
▪ One method is to remove some data
points in the clustering process that are
much further away from the centroids
than other data points.
– To be safe, we may want to monitor these
possible outliers over a few iterations and
then decide to remove them.

▪ Another method is to perform random


sampling. Since in sampling we only
choose a small subset of the data
points, the chance of selecting an
outlier is very small.
– Assign the rest of the data points to the
clusters by distance or similarity comparison,
or classification

30
K-means Issues: Sensitivity to Initial Seeds

31
K-means Issues: Special Data Structures
▪ The k-means algorithm is not suitable for discovering clusters that are not hyper-
ellipsoids (or hyper-spheres).

[Figure: two natural clusters vs. the clusters found by k-means]


32
K-means summary
▪ Despite its weaknesses, k-means is still the most popular algorithm due to its simplicity and efficiency.
– Other clustering algorithms have their own lists of weaknesses.

▪ No clear evidence that any other clustering algorithm performs better in general
– although they may be more suitable for some specific types of data or applications.

▪ Comparing different clustering algorithms is a difficult task. No one knows the correct
clusters!

33
Hierarchical Clustering

34
Hierarchical Clustering
▪ Produces a nested sequence of clusters, a tree, also called a dendrogram.
▪ The dendrogram is the preferred way to represent a hierarchical clustering.

36
Hierarchical Clustering
▪ So far we only talked about “flat” clustering.

▪ For some data, hierarchical clustering is more appropriate than “flat” clustering.

37
Example: Biological Taxonomy

38
Types of Hierarchical Clustering
▪ Agglomerative (bottom up) clustering: It builds the dendrogram (tree) from the
bottom level, and
– merges the most similar (or nearest) pair of clusters
– stops when all the data points are merged into a single cluster (i.e., the root cluster).

▪ Divisive (top down) clustering: It starts with all data points in one cluster, the root.
– Splits the root into a set of child clusters. Each child cluster is recursively divided further
– stops when only singleton clusters of individual data points remain, i.e., each cluster with
only a single point

39
Divisive Hierarchical Clustering
▪ Any “flat” algorithm which produces a fixed number of clusters can be used.
▪ Set c = 2, i.e., split each cluster into two at every step (a sketch follows below).
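A toy sketch of this idea, reusing the kmeans function from the earlier sketch as the flat algorithm with c = 2. The fixed split depth is our simplification; a real implementation would choose which cluster to split next, e.g. by its SSE:

```python
import numpy as np

def divisive_bisecting(X, depth=2, seed=0):
    """Divisive ("top-down") clustering sketch: repeatedly split every cluster
    into c = 2 parts using the flat kmeans sketch defined earlier.
    Returns a list of index arrays, one per leaf cluster."""
    clusters = [np.arange(len(X))]            # start with one root cluster
    for _ in range(depth):                    # each pass splits every current cluster
        next_level = []
        for idx in clusters:
            if len(idx) < 2:                  # nothing left to split
                next_level.append(idx)
                continue
            _, labels = kmeans(X[idx], k=2, seed=seed)  # flat algorithm with c = 2
            next_level.append(idx[labels == 0])
            next_level.append(idx[labels == 1])
        clusters = next_level
    return clusters
```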

40
Agglomerative Hierarchical Clustering
It is more popular than divisive methods.
▪ At the beginning, each data point forms a cluster (also called a node).
▪ Merge nodes/clusters that have the least distance.
▪ Go on merging
▪ Eventually all nodes belong to one cluster

41
An example: Working of the Algorithm

42
Measuring the Cluster Distance
▪ Four common ways to measure cluster distance
– Minimum distance: $d_{\min}(D_i, D_j) = \min_{x \in D_i,\, y \in D_j} \lVert x - y \rVert$
– Maximum distance: $d_{\max}(D_i, D_j) = \max_{x \in D_i,\, y \in D_j} \lVert x - y \rVert$
– Average distance: $d_{\mathrm{avg}}(D_i, D_j) = \frac{1}{n_i n_j} \sum_{x \in D_i} \sum_{y \in D_j} \lVert x - y \rVert$
– Mean distance: $d_{\mathrm{mean}}(D_i, D_j) = \lVert \mu_i - \mu_j \rVert$

▪ Each way of measuring the distance between two clusters results in a different variation of the algorithm (see the SciPy sketch below):
– Single link
– Complete link
– Average link
– Centroid
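A short SciPy sketch showing the four variations; the toy data and the choice to cut the tree into three clusters are our own illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))                 # toy 2-D data set

# 'single', 'complete', 'average', and 'centroid' correspond to the
# minimum, maximum, average, and mean cluster distances defined above.
for method in ["single", "complete", "average", "centroid"]:
    Z = linkage(X, method=method)            # (n-1, 4) merge history (the dendrogram)
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters
    print(method, labels)
```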

43
Single Link Method (Nearest Neighbor)
▪ The distance between two clusters is the distance between two closest data points
in the two clusters, one data point from each cluster.
▪ Agglomerative clustering with minimum distance.
▪ Generates minimum spanning tree.
▪ Encourages growth of elongated clusters.
▪ It can find arbitrarily shaped clusters, but
– It may cause the undesirable “chain effect” by noisy points.
[Figure: two natural clusters are split into two]

$d_{\min}(D_i, D_j) = \min_{x \in D_i,\, y \in D_j} \lVert x - y \rVert$

44
Complete Link Method (Farthest Neighbor)
▪ The distance between two clusters is the distance
of two furthest data points in the two clusters.
▪ Agglomerative clustering with maximum distance
▪ Encourages compact clusters
▪ It is sensitive to outliers because they are far away
▪ Does not work if elongated clusters are present
– $d_{\max}(D_1, D_2) < d_{\max}(D_2, D_3)$
– Thus $D_1$ and $D_2$ are merged instead of $D_2$ and $D_3$.

$d_{\max}(D_i, D_j) = \max_{x \in D_i,\, y \in D_j} \lVert x - y \rVert$

45
Average Link and Centroid Methods
▪ Average link: A compromise between
– the sensitivity of complete-link clustering to outliers and
– the tendency of single-link clustering to form long chains that do not correspond to the intuitive
notion of clusters as compact, spherical objects.
– In this method, the distance between two clusters is the average distance of all pair-wise distances
between the data points in two clusters.

▪ Centroid method: In this method, the distance between two clusters is the distance
between their centroids

46
Divisive vs. Agglomerative
▪ Agglomerative is faster to compute, in general
▪ Divisive may be less “blind” to the global structure of the data.

– Divisive: when taking the first step (the split), it has access to all the data and can find the best possible split into 2 parts.
– Agglomerative: when taking the first step (a merge), it does not consider the global structure of the data, only pairwise structure.

48
How to choose a clustering algorithm
▪ Clustering research has a long history. A vast collection of algorithms are available.
– We only introduced several main algorithms.
▪ Choosing the “best” algorithm is a challenge.
– Every algorithm has limitations and works well with certain data distributions.
– It is very hard, if not impossible, to know what distribution the application data follow. The data may not
fully follow any “ideal” structure or distribution required by the algorithms.
– One also needs to decide how to standardize the data, to choose a suitable distance function and to
select other parameter values.
▪ Due to these complexities, the common practice is to
– run several algorithms using different distance functions and parameter settings, and
– then carefully analyze and compare the results.
▪ The interpretation of the results must be based on insight into the meaning of the original
data together with knowledge of the algorithms used.
▪ Clustering is highly application dependent and, to a certain extent, subjective (personal preferences).

49
Cluster Evaluation

50
Cluster Evaluation is a Hard Problem
▪ The quality of a clustering is very hard to evaluate because
– We do not know the correct clusters

▪ Some methods are used:


– User inspection
▪ Study centroids, and spreads
▪ Rules from a decision tree.
▪ For text documents, one can read some documents in clusters.

51
Evaluation Measures: Ground Truth
▪ We use some labeled data (for classification)
▪ Assumption: Each class is a cluster.
▪ After clustering, a confusion matrix is constructed. From the matrix, we compute
various measures: entropy, purity, precision, recall and F-score.
– Let the classes in the data D be C = (c1, c2, …, ck). The clustering method produces k clusters,
which divides D into k disjoint subsets, D1, D2, …, Dk.
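The entropy and purity slides that follow use the standard weighted-over-clusters definitions (as in Bing Liu's Chapter 4); here is a small sketch of those definitions, assuming integer class and cluster labels:

```python
import numpy as np

def purity_and_entropy(class_labels, cluster_labels):
    """Weighted purity and entropy of a clustering against ground-truth classes."""
    class_labels = np.asarray(class_labels)
    cluster_labels = np.asarray(cluster_labels)
    n = len(class_labels)
    total_purity, total_entropy = 0.0, 0.0
    for j in np.unique(cluster_labels):
        members = class_labels[cluster_labels == j]   # classes found inside cluster D_j
        counts = np.bincount(members)
        p = counts[counts > 0] / len(members)         # class proportions within D_j
        weight = len(members) / n                     # |D_j| / |D|
        total_purity += weight * p.max()              # purity: share of the dominant class
        total_entropy += weight * -(p * np.log2(p)).sum()
    return total_purity, total_entropy

# Example: 3 classes, a clustering that mixes two of them.
print(purity_and_entropy([0, 0, 1, 1, 2, 2], [0, 0, 0, 1, 1, 1]))
```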

52
Evaluation Measures: Entropy

53
Evaluation Measures: Purity

54
An example

55
A remark about ground truth evaluation
▪ Commonly used to compare different clustering algorithms.
▪ A real-life data set for clustering has no class labels.
– Thus, although an algorithm may perform very well on some labeled data sets, there is no
guarantee that it will perform well on the actual application data at hand.

▪ The fact that it performs well on some labeled data sets does give us some
confidence in the quality of the algorithm.
▪ This evaluation method is said to be based on external data or information.

56
Evaluation based on internal information
▪ Intra-cluster cohesion (compactness):
– Cohesion measures how near the data points in a cluster are to the cluster centroid.
– Sum of squared error (SSE) is a commonly used measure.

▪ Inter-cluster separation (isolation):


– Separation means that different cluster centroids should be far away from one another.

▪ In most applications, expert judgments are still the key.

57
Indirect evaluation
▪ In some applications, clustering is not the primary task, but used to help
perform another task.
▪ We can use the performance on the primary task to compare clustering
methods.
▪ For instance, in an application, the primary task is to provide
recommendations on book purchasing to online shoppers.
– If we can cluster books according to their features, we might be able to provide better
recommendations.
– We can evaluate different clustering algorithms based on how well they help with the
recommendation task.
– Here, we assume that the recommendation can be reliably evaluated.

58
Summary

59
Summary
▪ Clustering has a long history and is still an active research area
– More are still coming every year.
▪ We only introduced several main algorithms. There are many others, e.g.,
– Density based algorithm
– Sub-space clustering
– Scale-up methods,
– Neural networks based methods
– Fuzzy clustering
– Co-clustering
▪ Clustering is hard to evaluate, but very useful in practice.
– This partially explains why there are still a large number of clustering algorithms being devised
every year.
▪ Clustering is highly application dependent and to some extent subjective.
▪ Competitive learning in neuronal networks performs clustering analysis of the input
data

60
References
▪ “CS583 - Chapter 4: Unsupervised Learning” by Bing Liu
▪ “Class 13 - Unsupervised learning – Clustering” by Shimon Ullman, Tomaso Poggio,
Danny Harari, Daniel Zysman, and Darren Seibert
▪ “Machine Learning” by Andrew Ng

61
