
Machine Learning CS 4641-B Summer 2020

Lecture 09. Clustering analysis and K-means
Xin Chen

These slides are based on slides from Mahdi Roozbahani


Logistics
• Project proposal
  – Background & motivation
    • Why do people care?
    • More importantly, what are the existing approaches? How do you understand them?
  – Objectives:
    • Based on the background information, what is missing? What is more important? What is your new angle?
• Writing a report/proposal
  – For any figures, plots, and sentences that are not "yours", you need to clearly cite where they come from and who created them.
  – Do not write like this (if you submit it somewhere like a conference, this would be a serious issue):
    • There is a dataset: A.
    • There is a paper/link B, which is about the topic.
    • The figure says C.
  – Everything in the report should be your understanding. (How is the material you cite related to yours?)
• In general, remember to start "small" and "solid".
Outline
• Clustering
• Distance function
• K-means algorithm
• Analysis of K-means
Clustering images

Goal of clustering:
Divide objects into groups so that objects within a group
are more similar to each other than to objects outside the group.
Clustering other objects
Clustering hand digits
Clustering is subjective

What is considered similar/dissimilar?


What is clustering in general?
• First, we need to pick a similarity/dissimilarity function.
• The algorithm figures out the grouping of objects based on the chosen similarity/dissimilarity function:
  – Points within a cluster are similar
  – Points across clusters are not similar
• Issues for clustering:
  – How to represent objects? (vector space? normalization?)
  – What is the similarity/dissimilarity function?
  – What are the algorithm steps?
Outline
• Clustering
• Distance function
• K-means algorithm
• Analysis of K-means
Properties of similarity function
• Desired properties of a dissimilarity function
  – Symmetry: $d(x, y) = d(y, x)$
    • Otherwise you could claim "Alex looks like Bob, but Bob looks nothing like Alex."
  – Positive separability: $d(x, y) = 0$ if and only if $x = y$
    • Otherwise there are objects that are different, but you cannot tell them apart
  – Triangle inequality: $d(x, y) \leq d(x, z) + d(z, y)$
    • Otherwise you could claim "Alex is very similar to Bob, and Alex is very similar to Carl, but Bob is very unlike Carl."
Distance functions for vectors
• Suppose two data points, both in $\mathbb{R}^d$:
  – $x = (x_1, x_2, \ldots, x_d)^T$
  – $y = (y_1, y_2, \ldots, y_d)^T$
• Euclidean distance: $d(x, y) = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2}$
• Minkowski distance: $d(x, y) = \left( \sum_{i=1}^{d} |x_i - y_i|^p \right)^{1/p}$
  – Manhattan distance: $p = 1$, $d(x, y) = \sum_{i=1}^{d} |x_i - y_i|$
  – "inf"-distance: $p = \infty$, $d(x, y) = \max_{i} |x_i - y_i|$
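As a quick illustration, here is a minimal NumPy sketch of these distances (my own sketch; function names are mine, not from the slides):

import numpy as np

def minkowski(x, y, p):
    # Minkowski distance between two vectors in R^d
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def euclidean(x, y):
    return minkowski(x, y, 2)   # p = 2

def manhattan(x, y):
    return minkowski(x, y, 1)   # p = 1

def inf_distance(x, y):
    # limit p -> infinity: the largest coordinate-wise difference
    return np.max(np.abs(x - y))

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])
print(euclidean(x, y), manhattan(x, y), inf_distance(x, y))  # ~3.606 5.0 3.0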
Example
Some problems with Euclidean distance
Hamming Distance
• Manhattan distance is also called Hamming distance when all features are binary
  – Count the number of differences between two binary vectors
  – Example: $x, y \in \{0, 1\}^{17}$ with $d(x, y) = 5$
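In NumPy this is a one-liner (again my own sketch):

import numpy as np

def hamming(x, y):
    # count positions where the two binary vectors differ
    return int(np.sum(x != y))

x = np.array([0, 1, 1, 0, 1])
y = np.array([1, 1, 0, 0, 0])
print(hamming(x, y))  # 3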
Edit distance
• Transform one of the objects into the other, and measure how much effort it takes
• Example with weighted operations: $d(x, y) = 5 \times 1 + 1 \times 3 + 2 \times 1 = 10$
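The classic unit-cost version (Levenshtein distance) can be computed by dynamic programming. This is a minimal sketch of mine, with per-operation costs exposed as parameters so that weighted variants like the slide's example fit the same recurrence:

def edit_distance(s, t, ins=1, dele=1, sub=1):
    # D[i][j] = cost of transforming s[:i] into t[:j]
    m, n = len(s), len(t)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = i * dele
    for j in range(1, n + 1):
        D[0][j] = j * ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else sub
            D[i][j] = min(D[i - 1][j] + dele,      # delete s[i-1]
                          D[i][j - 1] + ins,       # insert t[j-1]
                          D[i - 1][j - 1] + cost)  # substitute or match
    return D[m][n]

print(edit_distance("kitten", "sitting"))  # 3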
Outline
• Clustering
• Distance function
• K-means algorithm
• Analysis of K-means
Results of K-means clustering

Clustering using intensity only and color only


K-Means algorithm

Visualizing K-Means Clustering


K-means algorithm
K-Means step 1
K-Means step 2
K-Means step 3
K-Means step 4
K-Means step 5
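Since the step-by-step figures do not survive in text form, here is a minimal sketch of the standard K-means loop (Lloyd's algorithm: initialize centers, assign each point to its nearest center, recompute each center as its cluster mean, repeat); names and defaults are my own:

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    # X: (n, d) data matrix
    rng = np.random.default_rng(seed)
    # Step 1: initialize centers as k random data points
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to the nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each center as the mean of its cluster
        # (keep the old center if a cluster ends up empty)
        new_centers = np.array([X[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        # Step 4: stop when the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels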
Outline
• Clustering
• Distance function
• K-means algorithm
• Analysis of K-means
Questions
• Will different initialization lead to different results?
  – Yes
  – No
  – Sometimes
• Will the algorithm always stop after some iterations?
  – Yes
  – No (we have to set a maximum number of iterations)
  – Sometimes

=> Yes, it always stops. Does it always converge to an optimum?
=> No, it is likely to converge to a local optimum.
Formal statement of the clustering problem
• Given $n$ data points $\{x^1, x^2, \ldots, x^n\} \subset \mathbb{R}^d$
• Find $k$ cluster centers $\{c^1, c^2, \ldots, c^k\} \subset \mathbb{R}^d$
• And assign each data point $i$ to one cluster, $\pi(i) \in \{1, \ldots, k\}$
• Such that the average squared distance from each data point to its respective cluster center is small:

$$\min \frac{1}{n} \sum_{i=1}^{n} \left\| x^i - c^{\pi(i)} \right\|^2$$
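For concreteness, the objective for a given solution can be computed directly (a sketch, assuming centers and labels as returned by a K-means run like the one sketched earlier):

import numpy as np

def clustering_objective(X, centers, labels):
    # average squared distance from each point to its assigned center
    return np.mean(np.sum((X - centers[labels]) ** 2, axis=1))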
Clustering is NP-Hard
• Find $k$ cluster centers $\{c^1, c^2, \ldots, c^k\} \subset \mathbb{R}^d$ and assign each data point to one cluster, $\pi(i) \in \{1, \ldots, k\}$, minimizing
$$\min \frac{1}{n} \sum_{i=1}^{n} \left\| x^i - c^{\pi(i)} \right\|^2$$
• A search problem over the space of discrete assignments
  – For all n data points together, there are $k^n$ possibilities
  – The cluster assignment determines the cluster centers.
An example
• For all n data points together, there are $k^n$ possibilities, where k is the number of clusters.
• An example:
  – X = {A, B, C}, n = 3 data points, k = 2 clusters: $2^3 = 8$ possible assignments
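A brute-force enumeration makes the $k^n$ blow-up concrete (a sketch; for n = 3 and k = 2 it prints all 8 assignments):

from itertools import product

points = ["A", "B", "C"]  # n = 3
k = 2
for assignment in product(range(k), repeat=len(points)):
    print(dict(zip(points, assignment)))
# 2**3 = 8 assignments; exhaustive search is infeasible for large n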
Convergence of K-Means
• Will the K-Means objective oscillate?
$$\min \frac{1}{n} \sum_{i=1}^{n} \left\| x^i - c^{\pi(i)} \right\|^2$$
• The minimum value of the objective is finite (bounded below by 0).
• Each iteration of the K-means algorithm decreases the objective:
  – The cluster assignment step, $\pi(i) = \operatorname{argmin}_{j=1,\ldots,k} \| x^i - c^j \|^2$ for each data point $i$, decreases the objective.
  – The center adjustment step also decreases the objective.
• A bounded, monotonically decreasing sequence must converge, so the objective cannot oscillate.
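To see the monotone decrease empirically, one can log the objective at every iteration (a sketch of mine on synthetic data, with both steps inlined from the K-means loop sketched earlier):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
k = 3
centers = X[rng.choice(len(X), size=k, replace=False)]
for it in range(10):
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)                      # assignment step
    obj = np.mean(np.sum((X - centers[labels]) ** 2, axis=1))
    print(f"iter {it}: objective = {obj:.4f}")         # never increases
    centers = np.array([X[labels == j].mean(axis=0)    # center adjustment step
                        if np.any(labels == j) else centers[j]
                        for j in range(k)])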
Time Complexity
• Assume computing the distance between two instances is O(d), where d is the dimensionality of the vectors.
• Reassigning clusters for all data points:
  – O(kn) distance computations (when there is one feature)
  – O(knd) (when there are d features)
• Computing centroids: each instance vector gets added once to some centroid (finding the centroid for each feature): O(nd).
• Assume these two steps are each done once for I iterations: O(Iknd).

Slide credit: Ray Mooney.


How to choose $K$?
• Distortion score: compute the sum of squared distances from each point to its assigned center.
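A common heuristic is to compute the distortion score for a range of K values and look for an "elbow" where it stops dropping quickly. The slides stop at the definition; this sketch uses scikit-learn's KMeans, whose inertia_ attribute is exactly this sum of squared distances:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))

for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(f"K = {k}: distortion = {km.inertia_:.2f}")
# pick the K where the curve bends (the "elbow method")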
