4.1 Clustering

The document outlines the syllabus and key concepts of a course on Machine Learning, focusing on clustering techniques such as K-Means and Hierarchical clustering. It discusses the goals of clustering, similarity measures, and the algorithms used in clustering, as well as their strengths and weaknesses. Additionally, it covers practical aspects like determining the number of clusters and the challenges posed by high-dimensional data.


MIT School of Computing

Department of Information Technology

Third Year Engineering

21BTIT504- Fundamentals of Machine Learning


Class - T.Y. (SEM-VII)

Unit - IV

AY 2023-2024 SEM-VI
Unit-IV Syllabus

Clustering, Hierarchical clustering, KNN clustering, K-Means clustering, Bayesian Belief networks, Hidden Markov Model
Introduction to Clustering
Clustering
(Unsupervised Learning)
Given: Examples <x1, x2, …, xn>
Find: A natural clustering (grouping) of the data

Example Applications:
• Identify similar energy-use customer profiles
  <x> = time series of energy usage
• Identify anomalies in user behavior for computer security
  <x> = sequences of user commands
Why cluster?
• Labeling is expensive
• Gain insight into the structure of the data
• Find prototypes in the data
Goal of Clustering
• Given a set of data points, each described by a set of attributes, find clusters such that:
  – Intra-cluster similarity is maximized
  – Inter-cluster similarity is minimized

(Figure: scatter plot of points in feature space F1 vs F2, forming two groups)

• Requires the definition of a similarity measure

What is a natural grouping of these objects?
Slide from Eamonn Keogh

Clustering is subjective

(Figure: the same characters grouped in different ways — Simpson's Family, School Employees, Females, Males)
What is Similarity?
Slide based on one by Eamonn Keogh

Similarity is hard to define, but… "We know it when we see it"
Defining Distance Measures
Slide from Eamonn Keogh
Definition: Let O1 and O2 be two objects from the universe of
possible objects. The distance (dissimilarity) between O1 and
O2 is a real number denoted by D(O1,O2)

(Figure: the names "Peter" and "Piotr", with example distance values 0.23, 3, 342.7)
Slide based on one by Eamonn Keogh

What properties should a distance measure have?

• D(A,B) = D(B,A)                Symmetry
• D(A,A) = 0                     Constancy of Self-Similarity
• D(A,B) = 0 iff A = B           Positivity (Separation)
• D(A,B) ≤ D(A,C) + D(B,C)       Triangle Inequality
Slide based on one by Eamonn Keogh

Intuitions behind desirable distance measure properties

D(A,B) = D(B,A)
Otherwise you could claim “Alex looks like Bob, but Bob looks nothing like
Alex.”

D(A,A) = 0
Otherwise you could claim “Alex looks more like Bob, than Bob does.”
Slide based on one by Eamonn Keogh

Two Types of Clustering


• Partitional algorithms: Construct various partitions and
then evaluate them by some criterion
• Hierarchical algorithms: Create a hierarchical
decomposition of the set of objects using some criterion

(Figure: an example hierarchical dendrogram vs. a partitional clustering of the same objects)
Slide based on one by Eamonn Keogh

Partitional Clustering
• Nonhierarchical, each instance is placed in
exactly one of K non-overlapping clusters.
• Since only one set of clusters is output, the
user normally has to input the desired
number of clusters K.
Slide based on one by Eamonn Keogh

Partition Algorithm 1: k-means


1. Decide on a value for k.
2. Initialize the k cluster centers (randomly, if necessary).
3. Decide the class memberships of the N objects by
assigning them to the nearest cluster center.
4. Re-estimate the k cluster centers, by assuming the
memberships found above are correct.
5. If none of the N objects changed membership in the last
iteration, exit. Otherwise goto 3.
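
A minimal NumPy sketch of these five steps (an illustrative implementation, not the course's reference code; the function name, the random initialization and the Euclidean metric are my own choices):

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means sketch. X: (N, d) array of objects, k: number of clusters."""
    rng = np.random.default_rng(seed)
    # Step 2: initialize the k cluster centers (here: k distinct random data points).
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Step 3: assign each of the N objects to its nearest center (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 5: exit if no object changed membership in the last iteration.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 4: re-estimate each center as the mean of the objects assigned to it.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

Example usage: centers, labels = kmeans(np.asarray(data, dtype=float), k=3).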
K-means Clustering: Steps 1–5
Algorithm: k-means, Distance Metric: Euclidean Distance

(Figures: five snapshots of k-means on a 2-D dataset, axes "expression in condition 1" vs "expression in condition 2". Three cluster centers k1, k2, k3 are placed, each point is assigned to its nearest center, the centers are re-estimated, and the process repeats until the assignments stop changing.)

Slides based on ones by Eamonn Keogh
Comments on k-Means
• Strengths
  – Relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations. Normally, k, t << n.
  – Often terminates at a local optimum.
• Weaknesses
  – Applicable only when a mean is defined; what about categorical data?
  – Need to specify k, the number of clusters, in advance
  – Unable to handle noisy data and outliers
  – Not suited to discovering clusters with non-convex shapes

Slide based on one by Eamonn Keogh


How do we measure similarity?

(Figure: the names "Peter" and "Piotr" again, with example distance values)

Slide based on one by Eamonn Keogh


A generic technique for measuring similarity
To measure the similarity between two objects,
transform one into the other, and measure how
much effort it took. The measure of effort
becomes the distance measure.

The distance between Patty and Selma:


Change dress color, 1 point
Change earring shape, 1 point
Change hair part, 1 point
D(Patty,Selma) = 3
The distance between Marge and Selma:
Change dress color, 1 point
Add earrings, 1 point
Decrease height, 1 point
Take up smoking, 1 point
Lose weight, 1 point
D(Marge,Selma) = 5

This is called the "edit distance" or the "transformation distance".

Slide based on one by Eamonn Keogh


Edit Distance Example

It is possible to transform any string Q into string C using only Substitution, Insertion and Deletion. Assume that each of these operators has a cost associated with it. The similarity between two strings can be defined as the cost of the cheapest transformation from Q to C. (Note that for now we have ignored the issue of how we can find this cheapest transformation.)

How similar are the names "Peter" and "Piotr"?
Assume the following cost function: Substitution 1 unit, Insertion 1 unit, Deletion 1 unit.
D(Peter, Piotr) is 3:

Peter
→ Substitution (i for e): Piter
→ Insertion (o): Pioter
→ Deletion (e): Piotr
Slide based on one by Eamonn Keogh
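
The cheapest transformation can be found with the standard dynamic-programming recurrence; here is a short Python sketch with unit costs (illustrative, not taken from the slides):

def edit_distance(q: str, c: str) -> int:
    """Cost of the cheapest transformation of q into c using substitution, insertion, deletion (1 unit each)."""
    m, n = len(q), len(c)
    # dp[i][j] = cheapest cost of transforming q[:i] into c[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                                   # delete all of q[:i]
    for j in range(n + 1):
        dp[0][j] = j                                   # insert all of c[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if q[i - 1] == c[j - 1] else 1    # substitution cost (0 if characters match)
            dp[i][j] = min(dp[i - 1][j] + 1,           # deletion
                           dp[i][j - 1] + 1,           # insertion
                           dp[i - 1][j - 1] + cost)    # match / substitution
    return dp[m][n]

print(edit_distance("Peter", "Piotr"))   # prints 3, matching the slide's example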
But should we use Euclidean Distance?

(If we plot the reduction in variance per value of K… — see the elbow method below.)

To apply partitional clustering we need to:

▪ Select features to characterize the data

▪ Collect representative data

▪ Choose a clustering algorithm

▪ Specify the number of clusters


Um, what about k?
• Use the Elbow Method to decide on the optimum number of clusters.

(Figures: the same 1-D dataset clustered with k = 1, 2 and 3.)
When k = 1, the objective function is 873.0
When k = 2, the objective function is 173.1
When k = 3, the objective function is 133.6
Slides based on ones by Eamonn Keogh
We can plot the objective function values for k = 1 to 7…

The abrupt change at k = 3 is highly suggestive of three clusters in the data. This technique for determining the number of clusters is known as "knee finding" or "elbow finding".
(Figure: objective function value vs. k; the y-axis runs from 0 to 1.00E+03 and the curve flattens sharply after k = 3.)
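
A hedged sketch of this elbow plot using scikit-learn (assuming scikit-learn and matplotlib are available; "inertia_" is the within-cluster sum of squared distances, i.e. the objective function above, and the data below is only a placeholder):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))    # placeholder data; substitute your own

ks = range(1, 8)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), inertias, marker="o")               # look for the abrupt change (the "elbow")
plt.xlabel("k")
plt.ylabel("Objective function (inertia)")
plt.show()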
High-Dimensional Data poses Problems for Clustering
• Difficult to find true clusters
– Irrelevant and redundant features
– All points are equally close

• Solutions: Dimension Reduction


– Feature subset selection
– Cluster ensembles using random projection (in
a later lecture….)
Stopping Criteria for K-Means Clustering
• There are essentially three stopping criteria that
can be adopted to stop the K-means algorithm:
1. Centroids of newly formed clusters do not change
2. Points remain in the same cluster
3. Maximum number of iterations is reached

Animation: https://shabal.in/visuals/kmeans/1.html
There are various distance measures to calculate similarity between data points or clusters. Some of them are listed below (see the sketch after this list):
• Euclidean distance
• Manhattan distance
• Canberra distance
• Binary distance
• Minkowski distance
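
A minimal NumPy sketch of some of these measures, written from their standard definitions (the "binary distance" here is interpreted as the proportion of mismatching components, which is an assumption on my part):

import numpy as np

def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    return np.sum(np.abs(a - b))

def canberra(a, b):
    denom = np.abs(a) + np.abs(b)
    # Terms with a zero denominator are conventionally treated as 0.
    return np.sum(np.where(denom > 0, np.abs(a - b) / np.where(denom > 0, denom, 1), 0.0))

def minkowski(a, b, p=3):
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

def binary(a, b):
    return np.mean(a != b)   # fraction of positions where two binary vectors differ (assumption)

a, b = np.array([1.0, 2.0, 3.0]), np.array([4.0, 0.0, 3.0])
print(euclidean(a, b), manhattan(a, b), canberra(a, b), minkowski(a, b))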
Solved Example
• Apply the classic K-Means algorithm (K = 2) over the data (185, 72), (170, 56), (168, 60), (179, 68), (182, 72), (188, 77) up to two iterations and show the clusters. Initially choose the first two objects as the initial centroids.
Do Not Solve by this Type/method
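
For checking a hand-worked answer, here is a small NumPy sketch that runs exactly two assignment/update iterations with the first two points as the initial centroids (an illustrative aid only, not the prescribed solution method):

import numpy as np

X = np.array([[185, 72], [170, 56], [168, 60],
              [179, 68], [182, 72], [188, 77]], dtype=float)
centroids = X[:2].copy()                     # first two objects as initial centroids

for iteration in range(2):                   # "up to two iterations"
    # Assign every point to its nearest centroid (Euclidean distance).
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Update each centroid to the mean of its assigned points.
    for j in range(2):
        centroids[j] = X[labels == j].mean(axis=0)
    print(f"Iteration {iteration + 1}: labels = {labels}, centroids =\n{centroids}")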
2. Hierarchical clustering

Hierarchical clustering is one of the types of clustering. It divides the data points into a hierarchy of clusters.
It can be divided into two types: Agglomerative and Divisive clustering.
i) Agglomerative clustering
• Agglomerative clustering follows a bottom-up
approach.
• Each data point is assigned to an individual cluster.
• At each iteration, the clusters are merged together based
upon their similarity and the process repeats until one
cluster or K clusters are formed.
ii) Divisive clustering
• Divisive clustering follows a top-down approach.
• It is the opposite of Agglomerative clustering. In
divisive clustering, all the data points are assigned
to a single cluster.
• At each iteration, the clusters are separated into
other clusters based upon dissimilarity and the
process repeats until we are left with n clusters.
Hierarchical Clustering

The number of dendrograms with n leaves = (2n − 3)! / [2^(n − 2) (n − 2)!]

Number of Leaves   Number of Possible Dendrograms
2                  1
3                  3
4                  15
5                  105
...                ...
10                 34,459,425

Since we cannot test all possible trees, we will have to heuristically search the space of possible trees. We could do this:

Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.

Top-Down (divisive): Starting with all the data in a single cluster, consider every possible way to divide the cluster into two. Choose the best division and recursively operate on both sides.
Slide based on one by Eamonn Keogh
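
A quick check of the dendrogram-count formula above (illustrative only):

from math import factorial

def num_dendrograms(n: int) -> int:
    """Number of possible dendrograms (rooted binary trees) with n labelled leaves."""
    if n < 2:
        return 1
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

for n in (2, 3, 4, 5, 10):
    print(n, num_dendrograms(n))   # 1, 3, 15, 105, 34459425 — matching the table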

Dendrogram: A Useful Tool for Summarizing Similarity Measurements

(Figure: a dendrogram with its parts labelled — Root, Internal Branch, Internal Node, Terminal Branch, Leaf.)

The similarity between two objects in a dendrogram is represented as the height of the lowest internal node they share.

What is the "closest cluster"?
▪ Euclidean Distance
▪ Average
▪ Nearest
▪ Farthest

What is a Dendrogram?
▪ The hierarchical clustering technique can be visualized using a dendrogram.
▪ A dendrogram is a tree-like diagram that records the sequences of merges or splits.
Slide based on one by Eamonn Keogh

One potential use of a dendrogram: detecting outliers

A single isolated branch is suggestive of a data point that is very different from all others.

(Figure: dendrogram with one isolated branch labelled "Outlier".)
Agglomerative and Divisive Clustering
Agglomerative Clustering Algorithm (see the sketch below)
1. Form as many clusters as there are data points (i.e. begin with N clusters)
2. Take the two nearest data points and make them a cluster (now you will be left with N − 1 clusters)
3. Take the two nearest clusters and merge them (now you will be left with N − 2 clusters)
4. Repeat step 3 until there is one cluster.
Slide based on one by Eamonn Keogh
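
A from-scratch sketch of this greedy merging loop, using single linkage as the "nearest" criterion (illustrative only; the function name, the example data and the brute-force search are my own choices, and library implementations are what you would use in practice):

import numpy as np

def agglomerative_single_linkage(X, num_clusters=1):
    clusters = [[i] for i in range(len(X))]                      # step 1: one cluster per point
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)    # pairwise distance matrix
    while len(clusters) > num_clusters:                          # steps 2-4: keep merging
        best = (None, None, np.inf)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance of the two closest members of the two clusters.
                d = min(D[i, j] for i in clusters[a] for j in clusters[b])
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        clusters[a] = clusters[a] + clusters[b]                  # merge the two nearest clusters
        del clusters[b]
    return clusters

X = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.0], [5.1, 4.8], [9.0, 0.5]])
print(agglomerative_single_linkage(X, num_clusters=2))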

We begin with a distance matrix which contains the distances between every pair of objects in our database.

(Figure: a 5 × 5 distance matrix over the pictured objects; its upper triangle is
 0 8 8 7 7
   0 2 4 4
     0 3 3
       0 1
         0
 with two example entries called out: D( , ) = 8 and D( , ) = 1.)
Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.

(Figures: a sequence of animation frames — at every step, consider all possible merges and choose the best one, until the dendrogram is complete.)

This slide and the next 4 are based on slides by Eamonn Keogh

We know how to measure the distance between two objects, but defining the distance between an object and a cluster, or the distance between two clusters, is not obvious. The common choices are listed below (see the sketch after this list):

• Single linkage (nearest neighbor): In this method the distance between two clusters is determined by the distance of the two closest objects (nearest neighbors) in the different clusters.
• Complete linkage (furthest neighbor): In this method, the distances between clusters are determined by the greatest distance between any two objects in the different clusters (i.e., by the "furthest neighbors").
• Group average linkage: In this method, the distance between two clusters is calculated as the average distance between all pairs of objects in the two different clusters.
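
A hedged sketch comparing these three criteria with SciPy (assuming scipy and matplotlib are available; scipy.cluster.hierarchy.linkage accepts method="single", "complete" and "average", matching the definitions above, and the data below is only a placeholder):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.random.default_rng(0).normal(size=(20, 2))    # placeholder data

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, method in zip(axes, ["single", "complete", "average"]):
    Z = linkage(X, method=method)                     # builds the sequence of merges
    dendrogram(Z, ax=ax)                              # draws the resulting dendrogram
    ax.set_title(f"{method} linkage")
plt.tight_layout()
plt.show()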
Measuring the distance between two sub-clusters

(Figures: dendrograms of the same 30 points under single linkage and average linkage.)

The similarity between two objects in a dendrogram is represented as the height of the lowest internal node they share.
Slide based on one by Eamonn Keogh

Hierarchical Clustering Methods Summary

▪ No need to specify the number of clusters in advance
▪ The hierarchical nature maps nicely onto human intuition for some domains
▪ They do not scale well: time complexity of at least O(n²), where n is the total number of objects
▪ Like any heuristic search algorithm, local optima are a problem
▪ Interpretation of results is (very) subjective
