Lecture 6: Clustering

The document discusses clustering techniques in machine learning, including hierarchical clustering, k-means clustering, and evaluating clustering results. It covers linkage metrics, choosing the number of clusters k, initializing cluster centroids, and dealing with outliers. Examples are provided to illustrate k-means clustering and evaluating clusters.


Clustering

Part of the content of this class is adapted from online materials. In particular:
1. Introduction to Computational Thinking and Data Science, by Prof. Eric Grimson, Prof. John Guttag, and Dr. Ana Bell, MIT.
https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-0002-introduction-to-computational-thinking-and-data-science-fall-2016/
2. Unsupervised Learning: Clustering, by Shimon Ullman, Tomaso Poggio, Danny Harari, Daniel Zysman, and Darren Seibert, MIT.
http://www.mit.edu/~9.54/fall14/slides/Class13.pdf
Machine learning paradigm
• Observe a set of examples: the training data
• Infer something about the process that generated that data
• Use the inference to make predictions about previously unseen data: the test data
• Supervised: given a set of feature/label pairs, find a rule that predicts the label associated with a previously unseen input
• Unsupervised: given a set of feature vectors (without labels), group them into "natural clusters"
What is Clustering?
What do we need for Clustering?
Distance Measures
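The slide above names distance measures without showing a formula. A common family, and the one used in the MIT 6.0002 material this deck cites, is the Minkowski metric, which gives Manhattan distance at p = 1 and Euclidean distance at p = 2. A minimal sketch (the function name is illustrative):

```python
def minkowski_dist(v1, v2, p):
    """Minkowski distance between two equal-length feature vectors.

    p = 1 gives Manhattan (city-block) distance; p = 2 gives Euclidean.
    """
    return sum(abs(a - b) ** p for a, b in zip(v1, v2)) ** (1 / p)

# Euclidean vs. Manhattan distance between the same pair of points
print(minkowski_dist((0, 0), (3, 4), 2))  # 5.0
print(minkowski_dist((0, 0), (3, 4), 1))  # 7.0
```

Which p to use matters: Manhattan distance weights each feature's difference equally, while Euclidean distance is dominated by the largest per-feature differences.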
Clustering is an Optimization Problem

• variability(c) = the sum of squared distances from each example in cluster c to the cluster's mean
• dissimilarity(C) = the sum of variability(c) over all clusters c in the clustering C — the objective we want to minimize
• Why not divide variability by the size of the cluster (as variance does)?
• Without the division, clusters with more points look less cohesive under this measure. That is deliberate: a big incoherent cluster is worse than a small one — big and bad is worse than small and bad.
• If one wants to compare the coherence of two clusters of different sizes, one does need to divide the variability of each cluster by its size.
• Is the optimization problem simply finding a C that minimizes dissimilarity(C)?
• No — otherwise we could put each example in its own cluster and drive the dissimilarity to zero.
• We need a constraint, e.g.
• a minimum distance between clusters
• a fixed number of clusters k
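The objective above can be sketched directly; this follows the definitions used in the cited 6.0002 lecture (function names are illustrative):

```python
def variability(cluster):
    """Sum of squared distances from each point to the cluster's mean.

    Deliberately NOT divided by cluster size, so a large incoherent
    cluster is penalized more than a small one.
    """
    if not cluster:
        return 0.0
    dim = len(cluster[0])
    mean = [sum(p[d] for p in cluster) / len(cluster) for d in range(dim)]
    return sum(sum((p[d] - mean[d]) ** 2 for d in range(dim)) for p in cluster)

def dissimilarity(clustering):
    """Objective to minimize, subject to a constraint such as a fixed k."""
    return sum(variability(c) for c in clustering)

tight = [(0, 0), (0, 1), (1, 0)]   # cohesive cluster
loose = [(0, 0), (5, 5), (10, 0)]  # spread-out cluster
print(variability(tight) < variability(loose))  # True
```

Note that minimizing `dissimilarity` over all possible clusterings without a constraint is trivial: one point per cluster gives zero, which is why a constraint on k or inter-cluster distance is required.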
Clustering Techniques
Hierarchical clustering
Linkage metrics
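The standard linkage metrics for hierarchical (agglomerative) clustering define the distance between two clusters in terms of pairwise point distances. A minimal sketch of the three usual choices (function names are illustrative):

```python
from itertools import product

def euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def single_linkage(c1, c2, dist=euclidean):
    """Distance between the CLOSEST pair of points, one from each cluster."""
    return min(dist(p, q) for p, q in product(c1, c2))

def complete_linkage(c1, c2, dist=euclidean):
    """Distance between the FARTHEST pair of points, one from each cluster."""
    return max(dist(p, q) for p, q in product(c1, c2))

def average_linkage(c1, c2, dist=euclidean):
    """Mean distance over all cross-cluster point pairs."""
    pairs = list(product(c1, c2))
    return sum(dist(p, q) for p, q in pairs) / len(pairs)

a = [(0, 0), (1, 0)]
b = [(3, 0), (6, 0)]
print(single_linkage(a, b))    # 2.0  (from (1,0) to (3,0))
print(complete_linkage(a, b))  # 6.0  (from (0,0) to (6,0))
```

Agglomerative clustering starts with every point in its own cluster and repeatedly merges the two clusters with the smallest linkage distance; the choice of linkage changes which merges happen and therefore the shape of the resulting dendrogram.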
Example of hierarchical clustering
Clustering Algorithms
K-means Algorithm
An Example: Step 1
Step 2:
Step 3:
Result of first iteration
Second Iteration
Result of Second Iteration
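The iteration the slides walk through (pick initial centroids, assign points, recompute centroids, repeat) can be sketched as plain Python; this is a minimal version of Lloyd's algorithm, not the deck's exact code:

```python
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Plain k-means: assign each point to its nearest centroid, move each
    centroid to the mean of its cluster, repeat until assignments stabilize."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # Step 1: pick k initial centroids
    assignment = None
    for _ in range(max_iters):
        # Step 2: assign every point to its nearest centroid (squared distance)
        new_assignment = [
            min(range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            for p in points
        ]
        if new_assignment == assignment:       # converged: no point changed cluster
            break
        assignment = new_assignment
        # Step 3: move each centroid to the mean of its assigned points
        for i in range(k):
            members = [p for p, c in zip(points, assignment) if c == i]
            if members:
                centroids[i] = tuple(sum(d) / len(members) for d in zip(*members))
    return centroids, assignment

pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centroids, labels = kmeans(pts, k=2)
print(labels[:3], labels[3:])  # the two obvious groups get distinct labels
```

Each iteration can only decrease (or leave unchanged) the total dissimilarity, so the loop always terminates, though only at a local optimum.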
Why Use K-means?
Issues with K-means
• Choosing the "wrong" k can lead to strange results
• Consider k = 3 on data with two natural groups
• The result can depend on the initial centroids:
• the number of iterations needed
• even the final clustering
• The greedy algorithm can land in different local optima
• The algorithm is sensitive to outliers
Dealing with Outliers
How to choose K
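One common heuristic for choosing k (not necessarily the only one this deck covers) is the elbow method: run k-means for several values of k, record the total dissimilarity, and pick the k where the curve stops dropping sharply. A self-contained sketch using a compact k-means (names are illustrative):

```python
import random

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_cost(points, k, iters=50, seed=0):
    """Run plain k-means and return the total within-cluster sum of squared
    distances (the dissimilarity being minimized)."""
    centroids = random.Random(seed).sample(points, k)
    for _ in range(iters):
        labels = [min(range(k), key=lambda i: sq_dist(p, centroids[i]))
                  for p in points]
        for i in range(k):
            members = [p for p, l in zip(points, labels) if l == i]
            if members:
                centroids[i] = tuple(sum(d) / len(members) for d in zip(*members))
    return sum(sq_dist(p, centroids[l]) for p, l in zip(points, labels))

# Three well-separated groups: the cost drops sharply up to k = 3,
# then flattens -- the "elbow" suggests k = 3.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (20, 0), (21, 0), (20, 1)]
for k in (1, 2, 3, 4):
    print(k, round(kmeans_cost(pts, k), 2))
```

The cost always reaches zero at k = number of points, so one looks for the bend in the curve rather than the minimum.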
Sensitivity to Initial Seeds
Mitigating dependence on initial centroids
An Example
Data Sample
Class Example
Class Cluster
Evaluating a clustering
Patients
K-means
Examining results
Result
How many positives are there?
A Hypothesis
Testing multiple values of K
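The evaluation idea in the slides above — cluster the patients, then check how the positive outcomes are distributed across clusters — can be sketched as a small helper; the function name and the toy labels are illustrative, not the deck's actual data:

```python
from collections import defaultdict

def fraction_positive(labels, outcomes):
    """For each cluster label, the fraction of its members that are positive.

    `outcomes` holds 1 for a positive example and 0 otherwise. A clustering
    is informative if the positives concentrate in a few clusters rather
    than being spread evenly.
    """
    totals, positives = defaultdict(int), defaultdict(int)
    for label, outcome in zip(labels, outcomes):
        totals[label] += 1
        positives[label] += outcome
    return {label: positives[label] / totals[label] for label in totals}

labels = [0, 0, 0, 1, 1, 1]      # cluster assignment per patient
outcomes = [0, 0, 1, 1, 1, 0]    # 1 = positive case
print(fraction_positive(labels, outcomes))  # cluster 1 holds most positives
```

Comparing these per-cluster fractions against the overall positive rate, and repeating for several values of k, is one way to judge whether the clusters found capture anything about the outcome.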
