
Machine Learning CS 4641-B Summer 2020

Lecture 09. Clustering analysis and K-means
Xin Chen

These slides are based on slides from Mahdi Roozbahani


Logistics
• Project proposal
  – Background & motivation
    • Why do people care?
    • More importantly, what are the existing approaches? How do you understand them?
  – Objectives:
    • Based on the background information, what is missing? What is more important? What is your new angle?
• Writing a report/proposal
  – For any figures, plots, and sentences that are not "yours", you need to clearly cite where they come from and who created them.
  – Do not write like this (if you submit it somewhere like a conference, this would be a serious issue):
    • There is a dataset: A.
    • There is a paper/link B, which is about the topic.
    • The figure says C.
  – Everything in the report should be your understanding. (How is the material you cite related to yours?)
• In general, remember to start "small" and "solid".
Outline
• Clustering
• Distance function
• K-means algorithm
• Analysis of K-means
Clustering images

Goal of clustering:
Divide objects into groups so that objects within a group
are more similar to each other than to objects outside the group.
Clustering other objects
Clustering hand digits
Clustering is subjective

What is considered similar/dissimilar?


What is clustering in general?
• First, we need to pick a similarity/dissimilarity function.
• The algorithm figures out the grouping of objects based on the chosen similarity/dissimilarity function:
  – Points within a cluster are similar
  – Points across clusters are not similar
• Issues for clustering:
  – How to represent objects? (vector space? normalization?)
  – What is the similarity/dissimilarity function?
  – What are the algorithm steps?
Outline
• Clustering
• Distance function
• K-means algorithm
• Analysis of K-means
Properties of similarity function
• Desired properties of a dissimilarity function
  – Symmetry: $d(x, y) = d(y, x)$
    • Otherwise you could claim "Alex looks like Bob, but Bob looks nothing like Alex."
  – Positive separability: $d(x, y) = 0$ if and only if $x = y$
    • Otherwise there are objects that are different, but you cannot tell them apart
  – Triangle inequality: $d(x, y) \leq d(x, z) + d(z, y)$
    • Otherwise you could claim "Alex is very similar to Bob, and Alex is very similar to Carl, but Bob is very unlike Carl."
Distance functions for vectors
• Suppose two data points, both in $\mathbb{R}^d$:
  – $x = (x_1, x_2, \ldots, x_d)^T$
  – $y = (y_1, y_2, \ldots, y_d)^T$
• Euclidean distance: $d(x, y) = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2}$
• Minkowski distance: $d(x, y) = \left( \sum_{i=1}^{d} |x_i - y_i|^p \right)^{1/p}$
  – Manhattan distance: $p = 1$, $d(x, y) = \sum_{i=1}^{d} |x_i - y_i|$
  – "inf"-distance: $p = \infty$, $d(x, y) = \max_{i} |x_i - y_i|$
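As a quick illustration, here is a minimal NumPy sketch of these distances (my own sketch; function names are mine, not from the slides):

import numpy as np

def minkowski(x, y, p):
    # Minkowski distance between two vectors in R^d
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def euclidean(x, y):
    return minkowski(x, y, 2)   # p = 2

def manhattan(x, y):
    return minkowski(x, y, 1)   # p = 1

def inf_distance(x, y):
    # limit p -> infinity: the largest coordinate-wise difference
    return np.max(np.abs(x - y))

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])
print(euclidean(x, y), manhattan(x, y), inf_distance(x, y))  # ~3.606 5.0 3.0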
Example
Some problems with Euclidean distance
Hamming Distance
• Manhattan distance is also called Hamming distance when all features are binary
  – Count the number of differences between two binary vectors
  – Example: $x, y \in \{0, 1\}^{17}$ with $d(x, y) = 5$
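In NumPy this is a one-liner (again my own sketch):

import numpy as np

def hamming(x, y):
    # count positions where the two binary vectors differ
    return int(np.sum(x != y))

x = np.array([0, 1, 1, 0, 1])
y = np.array([1, 1, 0, 0, 0])
print(hamming(x, y))  # 3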
Edit distance
• Transform one of the objects into the other, and measure how much effort it takes
• Example with weighted operations: $d(x, y) = 5 \times 1 + 1 \times 3 + 2 \times 1 = 10$
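The classic unit-cost version (Levenshtein distance) can be computed by dynamic programming. This is a minimal sketch of mine, with per-operation costs exposed as parameters so that weighted variants like the slide's example fit the same recurrence:

def edit_distance(s, t, ins=1, dele=1, sub=1):
    # D[i][j] = cost of transforming s[:i] into t[:j]
    m, n = len(s), len(t)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = i * dele
    for j in range(1, n + 1):
        D[0][j] = j * ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else sub
            D[i][j] = min(D[i - 1][j] + dele,      # delete s[i-1]
                          D[i][j - 1] + ins,       # insert t[j-1]
                          D[i - 1][j - 1] + cost)  # substitute or match
    return D[m][n]

print(edit_distance("kitten", "sitting"))  # 3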
Outline
• Clustering
• Distance function
• K-means algorithm
• Analysis of K-means
Results of K-means clustering

Clustering using intensity only and color only


K-Means algorithm

Visualizing K-Means Clustering


K-means algorithm
K-Means step 1
K-Means step 2
K-Means step 3
K-Means step 4
K-Means step 5
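Since the step-by-step figures do not survive in text form, here is a minimal sketch of the standard K-means loop (Lloyd's algorithm: initialize centers, assign each point to its nearest center, recompute each center as its cluster mean, repeat); names and defaults are my own:

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    # X: (n, d) data matrix
    rng = np.random.default_rng(seed)
    # Step 1: initialize centers as k random data points
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to the nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each center as the mean of its cluster
        # (keep the old center if a cluster ends up empty)
        new_centers = np.array([X[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        # Step 4: stop when the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels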
Outline
• Clustering
• Distance function
• K-means algorithm
• Analysis of K-means
Questions
• Will different initialization lead to different results?
  – Yes
  – No
  – Sometimes
• Will the algorithm always stop after some iterations?
  – Yes
  – No (we have to set a maximum number of iterations)
  – Sometimes

=> Yes, it always stops. Does it always converge to an optimum?
=> No, it is likely to converge to a local optimum.
Formal statement of the clustering problem
• Given $n$ data points $\{x^1, x^2, \ldots, x^n\} \subset \mathbb{R}^d$
• Find $k$ cluster centers $\{c^1, c^2, \ldots, c^k\} \subset \mathbb{R}^d$
• And assign each data point $i$ to one cluster, $\pi(i) \in \{1, \ldots, k\}$
• Such that the average squared distance from each data point to its respective cluster center is small:

$$\min \frac{1}{n} \sum_{i=1}^{n} \left\| x^i - c^{\pi(i)} \right\|^2$$
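For concreteness, the objective for a given solution can be computed directly (a sketch, assuming centers and labels as returned by a K-means run like the one sketched earlier):

import numpy as np

def clustering_objective(X, centers, labels):
    # average squared distance from each point to its assigned center
    return np.mean(np.sum((X - centers[labels]) ** 2, axis=1))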
Clustering is NP-Hard
• Find $k$ cluster centers $\{c^1, c^2, \ldots, c^k\} \subset \mathbb{R}^d$ and assign each data point to one cluster, $\pi(i) \in \{1, \ldots, k\}$, minimizing
$$\min \frac{1}{n} \sum_{i=1}^{n} \left\| x^i - c^{\pi(i)} \right\|^2$$
• A search problem over the space of discrete assignments
  – For all n data points together, there are $k^n$ possibilities
  – The cluster assignment determines the cluster centers.
An example
• For all n data points together, there are $k^n$ possibilities, where k is the number of clusters.
• An example:
  – X = {A, B, C}, n = 3 data points, k = 2 clusters: $2^3 = 8$ possible assignments
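A brute-force enumeration makes the $k^n$ blow-up concrete (a sketch; for n = 3 and k = 2 it prints all 8 assignments):

from itertools import product

points = ["A", "B", "C"]  # n = 3
k = 2
for assignment in product(range(k), repeat=len(points)):
    print(dict(zip(points, assignment)))
# 2**3 = 8 assignments; exhaustive search is infeasible for large n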
Convergence of K-Means
• Will the K-Means objective oscillate?
$$\min \frac{1}{n} \sum_{i=1}^{n} \left\| x^i - c^{\pi(i)} \right\|^2$$
• The minimum value of the objective is finite (bounded below by 0).
• Each iteration of the K-means algorithm decreases the objective:
  – The cluster assignment step, $\pi(i) = \operatorname{argmin}_{j=1,\ldots,k} \| x^i - c^j \|^2$ for each data point $i$, decreases the objective.
  – The center adjustment step also decreases the objective.
• A bounded, monotonically decreasing sequence must converge, so the objective cannot oscillate.
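To see the monotone decrease empirically, one can log the objective at every iteration (a sketch of mine on synthetic data, with both steps inlined from the K-means loop sketched earlier):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
k = 3
centers = X[rng.choice(len(X), size=k, replace=False)]
for it in range(10):
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)                      # assignment step
    obj = np.mean(np.sum((X - centers[labels]) ** 2, axis=1))
    print(f"iter {it}: objective = {obj:.4f}")         # never increases
    centers = np.array([X[labels == j].mean(axis=0)    # center adjustment step
                        if np.any(labels == j) else centers[j]
                        for j in range(k)])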
Time Complexity
• Assume computing the distance between two instances is O(d), where d is the dimensionality of the vectors.
• Reassigning clusters for all data points:
  – O(kn) distance computations (when there is one feature)
  – O(knd) (when there are d features)
• Computing centroids: each instance vector gets added once to some centroid (finding the centroid for each feature): O(nd).
• Assume these two steps are each done once for I iterations: O(Iknd).

Slide credit: Ray Mooney.


How to choose $K$?
• Distortion score: compute the sum of squared distances from each point to its assigned center.
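A common heuristic is to compute the distortion score for a range of K values and look for an "elbow" where it stops dropping quickly. The slides stop at the definition; this sketch uses scikit-learn's KMeans, whose inertia_ attribute is exactly this sum of squared distances:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))

for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(f"K = {k}: distortion = {km.inertia_:.2f}")
# pick the K where the curve bends (the "elbow method")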
