4.1 Clustering

The document outlines the syllabus and key concepts of a course on Machine Learning, focusing on clustering techniques such as K-Means and Hierarchical clustering. It discusses the goals of clustering, similarity measures, and the algorithms used in clustering, as well as their strengths and weaknesses. Additionally, it covers practical aspects like determining the number of clusters and the challenges posed by high-dimensional data.


MIT School of Computing

Department of Information Technology

Third Year Engineering

21BTIT504- Fundamentals of Machine Learning


Class - T.Y. (SEM-VII)

Unit - IV

AY 2023-2024 SEM-VI
Unit-IV Syllabus

Clustering, Hierarchical clustering, KNN clustering, K-Means clustering, Bayesian Belief networks, Hidden Markov Model
Introduction to Clustering
Clustering
(Unsupervised Learning)
Given: Examples <x1, x2, …, xn>
Find: A natural clustering (grouping) of the data

Example Applications:
• Identify similar energy-use customer profiles
  <x> = time series of energy usage
• Identify anomalies in user behavior for computer security
  <x> = sequences of user commands
Why cluster?
• Labeling is expensive
• Gain insight into the structure of the data
• Find prototypes in the data
Goal of Clustering
• Given a set of data points, each described by a set of attributes, find clusters such that:
  – Intra-cluster similarity is maximized
  – Inter-cluster similarity is minimized

(Figure: scatter plot of points in feature space F1 vs F2, forming two groups)

• Requires the definition of a similarity measure

What is a natural grouping of these objects?
Slide from Eamonn Keogh

Clustering is subjective

(Figure: the same characters grouped in different ways — Simpson's Family, School Employees, Females, Males)
What is Similarity?
Slide based on one by Eamonn Keogh

Similarity is hard to define, but… "We know it when we see it"
Defining Distance Measures
Slide from Eamonn Keogh
Definition: Let O1 and O2 be two objects from the universe of
possible objects. The distance (dissimilarity) between O1 and
O2 is a real number denoted by D(O1,O2)

(Figure: the names "Peter" and "Piotr", with example distance values 0.23, 3, 342.7)
Slide based on one by Eamonn Keogh

What properties should a distance measure have?

• D(A,B) = D(B,A)                Symmetry
• D(A,A) = 0                     Constancy of Self-Similarity
• D(A,B) = 0 iff A = B           Positivity (Separation)
• D(A,B) ≤ D(A,C) + D(B,C)       Triangle Inequality
Slide based on one by Eamonn Keogh

Intuitions behind desirable distance measure properties

D(A,B) = D(B,A)
Otherwise you could claim “Alex looks like Bob, but Bob looks nothing like
Alex.”

D(A,A) = 0
Otherwise you could claim “Alex looks more like Bob, than Bob does.”
Slide based on one by Eamonn Keogh

Two Types of Clustering


• Partitional algorithms: Construct various partitions and
then evaluate them by some criterion
• Hierarchical algorithms: Create a hierarchical
decomposition of the set of objects using some criterion

(Figure: an example hierarchical dendrogram vs. a partitional clustering of the same objects)
Slide based on one by Eamonn Keogh

Partitional Clustering
• Nonhierarchical, each instance is placed in
exactly one of K non-overlapping clusters.
• Since only one set of clusters is output, the
user normally has to input the desired
number of clusters K.
Slide based on one by Eamonn Keogh

Partition Algorithm 1: k-means


1. Decide on a value for k.
2. Initialize the k cluster centers (randomly, if necessary).
3. Decide the class memberships of the N objects by
assigning them to the nearest cluster center.
4. Re-estimate the k cluster centers, by assuming the
memberships found above are correct.
5. If none of the N objects changed membership in the last
iteration, exit. Otherwise goto 3.
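
A minimal NumPy sketch of these five steps (an illustrative implementation, not the course's reference code; the function name, the random initialization and the Euclidean metric are my own choices):

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means sketch. X: (N, d) array of objects, k: number of clusters."""
    rng = np.random.default_rng(seed)
    # Step 2: initialize the k cluster centers (here: k distinct random data points).
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Step 3: assign each of the N objects to its nearest center (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 5: exit if no object changed membership in the last iteration.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 4: re-estimate each center as the mean of the objects assigned to it.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

Example usage: centers, labels = kmeans(np.asarray(data, dtype=float), k=3).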
K-means Clustering: Steps 1–5
Algorithm: k-means, Distance Metric: Euclidean Distance

(Figures: five snapshots of k-means on a 2-D dataset, axes "expression in condition 1" vs "expression in condition 2". Three cluster centers k1, k2, k3 are placed, each point is assigned to its nearest center, the centers are re-estimated, and the process repeats until the assignments stop changing.)

Slides based on ones by Eamonn Keogh
Comments on k-Means
• Strengths
  – Relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations. Normally, k, t << n.
  – Often terminates at a local optimum.
• Weaknesses
  – Applicable only when a mean is defined; what about categorical data?
  – Need to specify k, the number of clusters, in advance
  – Unable to handle noisy data and outliers
  – Not suited to discovering clusters with non-convex shapes

Slide based on one by Eamonn Keogh


How do we measure similarity?

(Figure: the names "Peter" and "Piotr" again, with example distance values)

Slide based on one by Eamonn Keogh


A generic technique for measuring similarity
To measure the similarity between two objects,
transform one into the other, and measure how
much effort it took. The measure of effort
becomes the distance measure.

The distance between Patty and Selma:


Change dress color, 1 point
Change earring shape, 1 point
Change hair part, 1 point
D(Patty,Selma) = 3
The distance between Marge and Selma:
Change dress color, 1 point
Add earrings, 1 point
Decrease height, 1 point
Take up smoking, 1 point
Lose weight, 1 point
D(Marge,Selma) = 5

This is called the "edit distance" or the "transformation distance".

Slide based on one by Eamonn Keogh


Edit Distance Example

It is possible to transform any string Q into string C using only Substitution, Insertion and Deletion. Assume that each of these operators has a cost associated with it. The similarity between two strings can be defined as the cost of the cheapest transformation from Q to C. (Note that for now we have ignored the issue of how we can find this cheapest transformation.)

How similar are the names "Peter" and "Piotr"?
Assume the following cost function: Substitution 1 unit, Insertion 1 unit, Deletion 1 unit.
D(Peter, Piotr) is 3:

Peter
→ Substitution (i for e): Piter
→ Insertion (o): Pioter
→ Deletion (e): Piotr
Slide based on one by Eamonn Keogh
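
The cheapest transformation can be found with the standard dynamic-programming recurrence; here is a short Python sketch with unit costs (illustrative, not taken from the slides):

def edit_distance(q: str, c: str) -> int:
    """Cost of the cheapest transformation of q into c using substitution, insertion, deletion (1 unit each)."""
    m, n = len(q), len(c)
    # dp[i][j] = cheapest cost of transforming q[:i] into c[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                                   # delete all of q[:i]
    for j in range(n + 1):
        dp[0][j] = j                                   # insert all of c[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if q[i - 1] == c[j - 1] else 1    # substitution cost (0 if characters match)
            dp[i][j] = min(dp[i - 1][j] + 1,           # deletion
                           dp[i][j - 1] + 1,           # insertion
                           dp[i - 1][j - 1] + cost)    # match / substitution
    return dp[m][n]

print(edit_distance("Peter", "Piotr"))   # prints 3, matching the slide's example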
But should we use Euclidean Distance?

(If we plot the reduction in variance per value of K… — see the elbow method below.)

To apply partitional clustering we need to:

▪ Select features to characterize the data

▪ Collect representative data

▪ Choose a clustering algorithm

▪ Specify the number of clusters


Um, what about k?
• Use the Elbow Method to decide on the optimum number of clusters.

(Figures: the same 1-D dataset clustered with k = 1, 2 and 3.)
When k = 1, the objective function is 873.0
When k = 2, the objective function is 173.1
When k = 3, the objective function is 133.6
Slides based on ones by Eamonn Keogh
We can plot the objective function values for k = 1 to 7…

The abrupt change at k = 3 is highly suggestive of three clusters in the data. This technique for determining the number of clusters is known as "knee finding" or "elbow finding".
(Figure: objective function value vs. k; the y-axis runs from 0 to 1.00E+03 and the curve flattens sharply after k = 3.)
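
A hedged sketch of this elbow plot using scikit-learn (assuming scikit-learn and matplotlib are available; "inertia_" is the within-cluster sum of squared distances, i.e. the objective function above, and the data below is only a placeholder):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))    # placeholder data; substitute your own

ks = range(1, 8)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), inertias, marker="o")               # look for the abrupt change (the "elbow")
plt.xlabel("k")
plt.ylabel("Objective function (inertia)")
plt.show()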
High-Dimensional Data poses Problems for Clustering
• Difficult to find true clusters
– Irrelevant and redundant features
– All points are equally close

• Solutions: Dimension Reduction


– Feature subset selection
– Cluster ensembles using random projection (in
a later lecture….)
Stopping Criteria for K-Means Clustering
• There are essentially three stopping criteria that
can be adopted to stop the K-means algorithm:
1. Centroids of newly formed clusters do not change
2. Points remain in the same cluster
3. Maximum number of iterations is reached

Animation: https://shabal.in/visuals/kmeans/1.html
There are various distance measures to calculate similarity between data points or clusters. Some of them are listed below (see the sketch after this list):
• Euclidean distance
• Manhattan distance
• Canberra distance
• Binary distance
• Minkowski distance
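
A minimal NumPy sketch of some of these measures, written from their standard definitions (the "binary distance" here is interpreted as the proportion of mismatching components, which is an assumption on my part):

import numpy as np

def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    return np.sum(np.abs(a - b))

def canberra(a, b):
    denom = np.abs(a) + np.abs(b)
    # Terms with a zero denominator are conventionally treated as 0.
    return np.sum(np.where(denom > 0, np.abs(a - b) / np.where(denom > 0, denom, 1), 0.0))

def minkowski(a, b, p=3):
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

def binary(a, b):
    return np.mean(a != b)   # fraction of positions where two binary vectors differ (assumption)

a, b = np.array([1.0, 2.0, 3.0]), np.array([4.0, 0.0, 3.0])
print(euclidean(a, b), manhattan(a, b), canberra(a, b), minkowski(a, b))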
Solved Example
• Apply the classic K-Means algorithm (K = 2) over the data (185, 72), (170, 56), (168, 60), (179, 68), (182, 72), (188, 77) up to two iterations and show the clusters. Initially choose the first two objects as the initial centroids.
Do Not Solve by this Type/method
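
For checking a hand-worked answer, here is a small NumPy sketch that runs exactly two assignment/update iterations with the first two points as the initial centroids (an illustrative aid only, not the prescribed solution method):

import numpy as np

X = np.array([[185, 72], [170, 56], [168, 60],
              [179, 68], [182, 72], [188, 77]], dtype=float)
centroids = X[:2].copy()                     # first two objects as initial centroids

for iteration in range(2):                   # "up to two iterations"
    # Assign every point to its nearest centroid (Euclidean distance).
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Update each centroid to the mean of its assigned points.
    for j in range(2):
        centroids[j] = X[labels == j].mean(axis=0)
    print(f"Iteration {iteration + 1}: labels = {labels}, centroids =\n{centroids}")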
2. Hierarchical clustering

Hierarchical clustering is one of the types of clustering. It divides the data points into a hierarchy of clusters.
It can be divided into two types: Agglomerative and Divisive clustering.
i) Agglomerative clustering
• Agglomerative clustering follows a bottom-up
approach.
• Each data point is assigned to an individual cluster.
• At each iteration, the clusters are merged together based
upon their similarity and the process repeats until one
cluster or K clusters are formed.
ii) Divisive clustering
• Divisive clustering follows a top-down approach.
• It is the opposite of Agglomerative clustering. In
divisive clustering, all the data points are assigned
to a single cluster.
• At each iteration, the clusters are separated into
other clusters based upon dissimilarity and the
process repeats until we are left with n clusters.
Hierarchical Clustering

The number of dendrograms with n leaves = (2n − 3)! / [2^(n − 2) (n − 2)!]

Number of Leaves   Number of Possible Dendrograms
2                  1
3                  3
4                  15
5                  105
...                ...
10                 34,459,425

Since we cannot test all possible trees, we will have to heuristically search the space of possible trees. We could do this:

Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.

Top-Down (divisive): Starting with all the data in a single cluster, consider every possible way to divide the cluster into two. Choose the best division and recursively operate on both sides.
Slide based on one by Eamonn Keogh
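
A quick check of the dendrogram-count formula above (illustrative only):

from math import factorial

def num_dendrograms(n: int) -> int:
    """Number of possible dendrograms (rooted binary trees) with n labelled leaves."""
    if n < 2:
        return 1
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

for n in (2, 3, 4, 5, 10):
    print(n, num_dendrograms(n))   # 1, 3, 15, 105, 34459425 — matching the table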

Dendrogram: A Useful Tool for Summarizing Similarity Measurements

(Figure: a dendrogram with its parts labelled — Root, Internal Branch, Internal Node, Terminal Branch, Leaf.)

The similarity between two objects in a dendrogram is represented as the height of the lowest internal node they share.

What is the "closest cluster"?
▪ Euclidean Distance
▪ Average
▪ Nearest
▪ Farthest

What is a Dendrogram?
▪ The hierarchical clustering technique can be visualized using a dendrogram.
▪ A dendrogram is a tree-like diagram that records the sequences of merges or splits.
Slide based on one by Eamonn Keogh

One potential use of a dendrogram: detecting outliers

A single isolated branch is suggestive of a data point that is very different from all others.

(Figure: dendrogram with one isolated branch labelled "Outlier".)
Agglomerative and Divisive Clustering
Agglomerative Clustering Algorithm (see the sketch below)
1. Form as many clusters as there are data points (i.e. begin with N clusters)
2. Take the two nearest data points and make them a cluster (now you will be left with N − 1 clusters)
3. Take the two nearest clusters and merge them (now you will be left with N − 2 clusters)
4. Repeat step 3 until there is one cluster.
Slide based on one by Eamonn Keogh
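
A from-scratch sketch of this greedy merging loop, using single linkage as the "nearest" criterion (illustrative only; the function name, the example data and the brute-force search are my own choices, and library implementations are what you would use in practice):

import numpy as np

def agglomerative_single_linkage(X, num_clusters=1):
    clusters = [[i] for i in range(len(X))]                      # step 1: one cluster per point
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)    # pairwise distance matrix
    while len(clusters) > num_clusters:                          # steps 2-4: keep merging
        best = (None, None, np.inf)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance of the two closest members of the two clusters.
                d = min(D[i, j] for i in clusters[a] for j in clusters[b])
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        clusters[a] = clusters[a] + clusters[b]                  # merge the two nearest clusters
        del clusters[b]
    return clusters

X = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.0], [5.1, 4.8], [9.0, 0.5]])
print(agglomerative_single_linkage(X, num_clusters=2))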

We begin with a distance matrix which contains the distances between every pair of objects in our database.

(Figure: a 5 × 5 distance matrix over the pictured objects; its upper triangle is
 0 8 8 7 7
   0 2 4 4
     0 3 3
       0 1
         0
 with two example entries called out: D( , ) = 8 and D( , ) = 1.)
Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.

(Figures: a sequence of animation frames — at every step, consider all possible merges and choose the best one, until the dendrogram is complete.)

This slide and the next 4 are based on slides by Eamonn Keogh

We know how to measure the distance between two objects, but defining the distance between an object and a cluster, or the distance between two clusters, is not obvious. The common choices are listed below (see the sketch after this list):

• Single linkage (nearest neighbor): In this method the distance between two clusters is determined by the distance of the two closest objects (nearest neighbors) in the different clusters.
• Complete linkage (furthest neighbor): In this method, the distances between clusters are determined by the greatest distance between any two objects in the different clusters (i.e., by the "furthest neighbors").
• Group average linkage: In this method, the distance between two clusters is calculated as the average distance between all pairs of objects in the two different clusters.
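
A hedged sketch comparing these three criteria with SciPy (assuming scipy and matplotlib are available; scipy.cluster.hierarchy.linkage accepts method="single", "complete" and "average", matching the definitions above, and the data below is only a placeholder):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.random.default_rng(0).normal(size=(20, 2))    # placeholder data

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, method in zip(axes, ["single", "complete", "average"]):
    Z = linkage(X, method=method)                     # builds the sequence of merges
    dendrogram(Z, ax=ax)                              # draws the resulting dendrogram
    ax.set_title(f"{method} linkage")
plt.tight_layout()
plt.show()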
Measuring the distance between two sub-clusters

(Figures: dendrograms of the same 30 points under single linkage and average linkage.)

The similarity between two objects in a dendrogram is represented as the height of the lowest internal node they share.
Slide based on one by Eamonn Keogh

Hierarchical Clustering Methods Summary

▪ No need to specify the number of clusters in advance
▪ The hierarchical nature maps nicely onto human intuition for some domains
▪ They do not scale well: time complexity of at least O(n²), where n is the total number of objects
▪ Like any heuristic search algorithm, local optima are a problem
▪ Interpretation of results is (very) subjective
