Clustering K-Means

This document discusses clustering algorithms, beginning with an introduction to clustering and its goal of organizing unlabeled data into similarity groups. It then covers K-means clustering in detail, including the K-means algorithm, convergence criteria, examples, strengths, and weaknesses. Specifically, K-means partitions data into k clusters by minimizing distances between data points and their assigned cluster centers.

9.54 Class 13
Unsupervised learning
Clustering

Shimon Ullman + Tomaso Poggio
Danny Harari + Daniel Zysman + Darren Seibert
Outline
• Introduction to clustering
• K-means
• Bag of words (dictionary learning)
• Hierarchical clustering
• Competitive learning (SOM)
What is clustering?
• The organization of unlabeled data into similarity groups called clusters.
• A cluster is a collection of data items that are "similar" to one another and "dissimilar" to data items in other clusters.
Historic application of clustering
Computer vision application:
Image segmentation
What do we need for clustering?
Distance (dissimilarity) measures

• They are special cases of the Minkowski distance:

  d_p(\mathbf{x}_i, \mathbf{x}_j) = \left( \sum_{k=1}^{m} |x_{ik} - x_{jk}|^p \right)^{1/p}

  (p is a positive integer)
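As a concrete illustration, here is a minimal NumPy sketch of the Minkowski distance (the function and variable names are my own, not from the slides):

```python
import numpy as np

def minkowski_distance(x_i, x_j, p):
    """Minkowski distance between two m-dimensional points.

    p = 1 gives the Manhattan distance, p = 2 the Euclidean distance.
    """
    return np.sum(np.abs(x_i - x_j) ** p) ** (1.0 / p)

x_i = np.array([1.0, 2.0, 3.0])
x_j = np.array([4.0, 0.0, 3.0])
print(minkowski_distance(x_i, x_j, p=1))  # Manhattan: 5.0
print(minkowski_distance(x_i, x_j, p=2))  # Euclidean: ~3.606
```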
Cluster evaluation (a hard problem)
• Intra-cluster cohesion (compactness):
– Cohesion measures how near the data points in a cluster are to the cluster centroid.
– Sum of squared error (SSE) is a commonly used measure.
• Inter-cluster separation (isolation):
– Separation means that different cluster centroids should be far away from one another.
• In most applications, expert judgment is still the key.
How many clusters?
Clustering techniques
[Figures: a taxonomy of clustering techniques, repeated over three slides; the visible branch labels include "Divisive" and "K-means"]
K-Means clustering
• K-means (MacQueen, 1967) is a partitional clustering algorithm.
• Let the set of data points D be {x1, x2, …, xn}, where xi = (xi1, xi2, …, xir) is a vector in X ⊆ R^r, and r is the number of dimensions.
• The k-means algorithm partitions the given data into k clusters:
– Each cluster has a cluster center, called the centroid.
– k is specified by the user.
K-means algorithm
• Given k, the k-means algorithm works as follows (a code sketch follows the list):
1. Choose k (random) data points (seeds) to be the initial centroids (cluster centers).
2. Assign each data point to the closest centroid.
3. Re-compute the centroids using the current cluster memberships.
4. If a convergence criterion is not met, repeat steps 2 and 3.
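A minimal NumPy sketch of these four steps (the function name, the convergence test via np.allclose, and the max_iter cap are my own choices; the slides do not prescribe them):

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Naive k-means on an (n, r) data array X; k is user-specified."""
    rng = np.random.default_rng(seed)
    # Step 1: choose k random data points (seeds) as initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to the closest centroid
        # (Euclidean distance from every point to every centroid).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: re-compute each centroid as the mean of its members
        # (empty clusters are not handled in this sketch).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop when the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```

The np.allclose test implements the "no change of centroids" stopping rule from the next slide; any of the three criteria listed there could be substituted.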
K-means convergence (stopping) criterion
• no (or minimum) re-assignment of data points to different clusters, or
• no (or minimum) change of centroids, or
• minimum decrease in the sum of squared error (SSE),

  SSE = \sum_{j=1}^{k} \sum_{\mathbf{x} \in C_j} d(\mathbf{x}, m_j)^2

– C_j is the jth cluster,
– m_j is the centroid of cluster C_j (the mean vector of all the data points in C_j),
– d(x, m_j) is the (Euclidean) distance between data point x and centroid m_j.
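Expressed in code, the SSE criterion is a short function (a sketch; it assumes the X, labels, and centroids arrays produced by a k-means pass such as the one above):

```python
import numpy as np

def sse(X, labels, centroids):
    """Sum of squared (Euclidean) distances from each point x to the
    centroid m_j of its assigned cluster C_j."""
    return sum(np.sum((X[labels == j] - centroids[j]) ** 2)
               for j in range(len(centroids)))
```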
K-means clustering example
[Figures: a worked k-means run shown over six slides, with panels labeled step 1 through step 3 followed by further iterations]
Why use K-means?
• Strengths:
– Simple: easy to understand and to implement.
– Efficient: time complexity is O(tkn), where n is the number of data points, k is the number of clusters, and t is the number of iterations.
– Since both k and t are typically small, k-means is considered a linear-time algorithm.
• K-means is the most popular clustering algorithm.
• Note that it terminates at a local optimum if SSE is used; the global optimum is hard to find because the exact minimization problem is computationally intractable.
Weaknesses of K-means
• The algorithm is only applicable if the mean is defined.
– For categorical data, the k-modes variant is used: the centroid is represented by the most frequent values.
• The user needs to specify k.
• The algorithm is sensitive to outliers.
– Outliers are data points that are very far away from other data points.
– Outliers could be errors in the data recording or some special data points with very different values.
Outliers
Dealing with outliers
• Remove data points that are much further away from the centroids than other data points (see the sketch after this list).
– To be safe, we may want to monitor these possible outliers over a few iterations and then decide whether to remove them.
• Perform random sampling: by choosing a small subset of the data points, the chance of selecting an outlier is much smaller.
– Assign the rest of the data points to the clusters by distance or similarity comparison, or by classification.
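A hedged sketch of the first idea: flag points whose distance to their assigned centroid is unusually large relative to their cluster (the mean-plus-three-standard-deviations cutoff is an illustrative choice, not something the slides specify):

```python
import numpy as np

def flag_outliers(X, labels, centroids, n_std=3.0):
    """Mark points much further from their centroid than their cluster peers."""
    # Distance from each point to the centroid of its own cluster.
    dists = np.linalg.norm(X - centroids[labels], axis=1)
    outliers = np.zeros(len(X), dtype=bool)
    for j in range(len(centroids)):
        in_j = labels == j
        cutoff = dists[in_j].mean() + n_std * dists[in_j].std()
        outliers |= in_j & (dists > cutoff)
    # Candidates to monitor over a few iterations before removal.
    return outliers
```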
Sensitivity to initial seeds
[Figures: two runs of k-means with different random selections of seeds (centroids), each showing iteration 1 and iteration 2 and ending in different clusterings]

Special data structures
• The k-means algorithm is not suitable for discovering
clusters that are not hyper-ellipsoids (or hyper-spheres).
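To see this limitation concretely, here is a small sketch using scikit-learn's two-moons dataset (scikit-learn is my choice of tooling here, not something the slides use). Because the two arcs are non-convex, k-means separates the points with a roughly straight boundary, so each predicted cluster mixes points from both moons:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

# Two interleaved half-circles: clusters that are not hyper-ellipsoids.
X, true_labels = make_moons(n_samples=300, noise=0.05, random_state=0)

pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Each predicted cluster contains points from both true moons.
for j in (0, 1):
    counts = [(true_labels[pred == j] == m).sum() for m in (0, 1)]
    print(f"predicted cluster {j}: {counts[0]} from moon 0, {counts[1]} from moon 1")
```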
K-means summary
• Despite its weaknesses, k-means is still the most popular algorithm, due to its simplicity and efficiency.
• There is no clear evidence that any other clustering algorithm performs better in general.
• Comparing different clustering algorithms is a difficult task: no one knows the correct clusters!
