Clustering
1
The Problem of Clustering
Given a set of points, with a notion of
distance between points, group the points
into some number of clusters, so that
members of a cluster are in some sense as
close to each other as possible.
2
Example
[Figure: a 2-D scatter of points (drawn as x's) that falls naturally into a few well-separated clusters.]
3
Problems With Clustering
Clustering in two dimensions looks easy.
Clustering small amounts of data looks easy.
The Curse of Dimensionality
Many applications involve not 2, but 10 or
10,000 dimensions.
4
Clustering Evaluation
Manual inspection
Benchmarking on existing labels
Cluster quality measures
distance measures
high similarity within a cluster, low across clusters
5
Distance Measures
Each clustering problem is based on some
kind of “distance” between points.
Two major classes of distance measure:
1. Euclidean
2. Non-Euclidean
6
Euclidean Vs. Non-Euclidean
A Euclidean space has some number of real-
valued dimensions and “dense” points.
There is a notion of “average” of two points.
A Euclidean distance is based on the locations of
points in such a space.
A Non-Euclidean distance is based on
properties of points, but not their “location” in a
space.
7
Some Euclidean Distances
L2 norm : d(x,y) = square root of the sum of
the squares of the differences between x and
y in each dimension.
The most common notion of “distance.”
L1 norm : sum of the differences in each
dimension.
Manhattan distance = distance if you had to travel
along coordinates only.
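A minimal Python sketch of the two norms described above (illustrative, not from the original slides):

import math

def l2_distance(x, y):
    # Euclidean (L2) norm: square root of the sum of squared per-dimension differences.
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def l1_distance(x, y):
    # Manhattan (L1) norm: sum of absolute per-dimension differences.
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

# Hypothetical points for illustration.
print(l2_distance((0, 0), (3, 4)))  # 5.0
print(l1_distance((0, 0), (3, 4)))  # 7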
8
Non-Euclidean Distances
Jaccard distance for sets = 1 minus ratio of
sizes of intersection and union.
Jaccard(x, y) = 1 − |x ∩ y| / |x ∪ y|
Cosine distance = angle between vectors
from the origin to the points in question.
Edit distance = number of inserts and deletes
to change one string into another.
9
Jaccard Distance for Bit-Vectors
Example: p1 = 10111; p2 = 10011.
Size of intersection = 3; size of union = 4, Jaccard
similarity (not distance) = 3/4.
Need to make a distance function satisfying
triangle inequality and other laws.
d(x,y) = 1 – (Jaccard similarity) works.
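A small Python sketch of the Jaccard distance on bit-vectors, reproducing the p1 = 10111, p2 = 10011 example above (illustrative only):

def jaccard_distance(p1, p2):
    # Treat each bit-vector as the set of positions that hold a 1.
    a = {i for i, bit in enumerate(p1) if bit == '1'}
    b = {i for i, bit in enumerate(p2) if bit == '1'}
    return 1 - len(a & b) / len(a | b)

print(jaccard_distance("10111", "10011"))  # 1 - 3/4 = 0.25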
10
Cosine Distance
Think of a point as a vector from the origin
(0,0,…,0) to its location.
Two points’ vectors make an angle, whose
cosine is the normalized dot-product of the
vectors: p1.p2/|p2||p1|.
Example p1 = 00111; p2 = 10011.
p1.p2 = 2; |p1| = |p2| = √3.
cos(θ) = 2/3; θ is about 48 degrees.
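A short Python sketch of the cosine distance, reproducing the example above (function and variable names are illustrative):

import math

def cosine_distance(x, y):
    # Angle (in degrees) between the vectors from the origin to x and y.
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    return math.degrees(math.acos(dot / (norm_x * norm_y)))

p1 = (0, 0, 1, 1, 1)
p2 = (1, 0, 0, 1, 1)
print(cosine_distance(p1, p2))  # about 48.19 degrees, since cos(θ) = 2/3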
11
Edit Distance
The edit distance of two strings is the number
of inserts and deletes of characters needed to
turn one into the other.
Equivalently: d(x,y) = |x| + |y| -2|LCS(x,y)|.
LCS = longest common subsequence = longest
string obtained both by deleting from x and
deleting from y.
12
Example
x = abcde ; y = bcduve.
Turn x into y by deleting a, then inserting u
and v after d.
Edit-distance = 3.
Or, LCS(x,y) = bcde.
|x| + |y| - 2|LCS(x,y)| = 5 + 6 –2*4 = 3.
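A compact Python sketch of the LCS-based formula, reproducing the abcde / bcduve example (insert/delete edits only; purely illustrative):

def lcs_length(x, y):
    # Standard dynamic-programming table for the longest common subsequence.
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, 1):
        for j, yj in enumerate(y, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if xi == yj else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(x)][len(y)]

def edit_distance(x, y):
    # d(x, y) = |x| + |y| - 2 * |LCS(x, y)|
    return len(x) + len(y) - 2 * lcs_length(x, y)

print(edit_distance("abcde", "bcduve"))  # 3, since LCS = "bcde"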
13
Clustering Algorithms
k -Means Algorithms
Hierarchical Clustering
14
Methods of Clustering
Point assignment (partitioning, “flat” algorithms):
Usually start with a random (partial) partitioning and maintain a set of clusters.
Refine it iteratively:
Place points into their “nearest” cluster.
k-means / k-medoids clustering
Model-based clustering
Hierarchical (Agglomerative):
Initially, each point is in a cluster by itself.
Repeatedly combine the two “nearest” clusters into one.
15
Partitional Clustering
Also called flat clustering
The most famous algorithm is K-Means
16
k-Means Algorithm(s)
Assumes Euclidean space.
Start by picking k, the number of clusters.
Initialize clusters by picking one point per
cluster.
For instance, pick one point at random, then k-1
other points, each as far away as possible from
the previous points.
17
Simple Clustering: K-means
Works with numeric data only
1) Pick a number (K) of cluster centers (at
random)
2) Assign every item to its nearest cluster
center (e.g. using Euclidean distance)
3) Move each cluster center to the mean of its
assigned items
4) Repeat steps 2,3 until convergence (change
in cluster assignments less than a
threshold)
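A compact sketch of the four steps above in plain Python with Euclidean distance (the function name, sample data, and convergence test are illustrative assumptions, not from the slides):

import math
import random

def kmeans(points, k, max_iters=100):
    # 1) Pick k cluster centers at random from the data.
    centers = random.sample(points, k)
    clusters = []
    for _ in range(max_iters):
        # 2) Assign every point to its nearest center (Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        # 3) Move each cluster center to the mean of its assigned points.
        new_centers = [
            tuple(sum(d) / len(cluster) for d in zip(*cluster)) if cluster else centers[c]
            for c, cluster in enumerate(clusters)
        ]
        # 4) Repeat until the centers (and hence the assignments) stop changing.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

# Hypothetical 2-D data.
data = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
centers, clusters = kmeans(data, k=2)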
18
K-means example, step 1
Pick 3 initial cluster centers k1, k2, k3 (randomly).
[Figure: scatter of points in the X-Y plane with the three centers marked.]
19
K-means example, step 2
Assign each point to the closest cluster center.
[Figure: points grouped by their nearest center k1, k2, or k3.]
20
K-means example, step 3
Move each cluster center to the mean of its cluster.
[Figure: k1, k2, k3 shift from their old positions to the cluster means.]
21
K-means example, step 4
Reassign the points that are now closest to a different cluster center.
Q: Which points are reassigned?
[Figure: the same scatter with the moved centers k1, k2, k3.]
22
K-means example, step 4 …
A: three points change clusters (highlighted in the slide animation).
[Figure: the three reassigned points marked near k1, k2, k3.]
23
K-means example, step 4b
Re-compute the cluster means.
[Figure: updated means for the clusters around k1, k2, k3.]
24
K-means example, step 5
Move the cluster centers to the new cluster means.
[Figure: final positions of k1, k2, k3.]
25
Discussion
26
Issue 1: How Many Clusters?
Sometimes the number of clusters k is given:
partition n docs into a predetermined number of clusters.
Otherwise, finding the “right” number of clusters is part of the problem.
27
Getting k Right
Try different k, looking at the change in the
average distance to centroid, as k increases.
Average falls rapidly until right k, then
changes little.
[Figure: average distance to centroid plotted against k; the curve falls steeply until the best value of k, then flattens.]
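A rough sketch of this elbow heuristic, assuming the hypothetical kmeans helper and data sketched earlier and Euclidean distance (illustrative only):

import math

def average_distance_to_centroid(points, k):
    # Run k-means, then average each point's distance to its cluster centroid.
    centers, clusters = kmeans(points, k)
    total, n = 0.0, 0
    for center, cluster in zip(centers, clusters):
        for p in cluster:
            total += math.dist(p, center)
            n += 1
    return total / n

# Try several k and look for the point where the average stops falling quickly.
for k in range(1, 6):
    print(k, average_distance_to_centroid(data, k))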
28
Example
Too few clusters: many long distances to the centroid.
[Figure: the scatter from before grouped into too few clusters.]
29
Example
Just right: distances to the centroids are rather short.
[Figure: the scatter grouped into its natural clusters.]
30
Example
Too many clusters: little improvement in the average distance.
[Figure: the scatter split into more clusters than needed.]
31
Issue 2: Initial Seeds
Result can vary significantly depending on the initial choice of seeds (number and position).
Can get trapped in a local minimum.
Example: [Figure: a set of instances with a poor choice of initial cluster centers.]
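One common mitigation (not spelled out on the slide) is to run k-means several times from different random seeds and keep the best run. A hedged sketch with scikit-learn, assuming that library is available:

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [1.5, 2], [3, 4], [5, 7], [3.5, 5], [4.5, 5]])

# n_init=10 runs the algorithm from 10 random initializations and keeps the
# result with the lowest within-cluster sum of squares.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_, km.inertia_)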
34
K-means clustering: outliers?
What can be done about outliers?
35
K-means clustering summary
Advantages: simple, understandable; items are assigned to clusters automatically.
Disadvantages: must pick the number of clusters beforehand; all items are forced into a cluster.
36
Clustering Algorithms
Hierarchical algorithms
Bottom-up, agglomerative
Top-down, divisive
37
Hierarchical Clustering
Two important questions:
1. How do you determine the “nearness” of
clusters?
2. How do you represent a cluster of more than
one point?
38
Hierarchical Clustering --- (2)
Key problem: as you build clusters, how do
you represent the location of each cluster, to
tell which pair of clusters is closest?
Euclidean case: each cluster has a centroid =
average of its points.
Measure intercluster distances by distances of
centroids.
39
Example
[Figure: data points o at (0,0), (1,2), (2,1), (4,1), (5,0), (5,3); cluster centroids x at (1.5,1.5), (1,1), (4.5,0.5), (4.7,1.3) as the clusters merge.]
40
And in the Non-Euclidean Case?
The only “locations” we can talk about are the
points themselves.
I.e., there is no “average” of two points.
Approach 1: clustroid = point “closest” to
other points.
Treat clustroid as if it were centroid, when
computing intercluster distances.
41
“Closest” Point?
Possible meanings:
1. Smallest maximum distance to the other points.
2. Smallest average distance to other points.
3. Smallest sum of squares of distances to other
points.
4. Etc., etc.
42
Example
[Figure: two clusters of numbered points (1-6); each cluster's clustroid is marked, and the intercluster distance is measured between the two clustroids.]
43
*Hierarchical clustering
Bottom up
Start with single-instance clusters
At each step, join the two closest clusters
Design decision: distance between clusters
E.g. two closest instances in clusters
vs. distance between means
Top down
Start with one universal cluster
Find two clusters
Proceed recursively on each subset
Can be very fast
Both methods produce a
dendrogram
[Figure: dendrogram over the items a through k.]
44
Hierarchical Clustering
Build a tree-based hierarchical taxonomy (dendrogram) from a
set of documents.
[Figure: example taxonomy tree with root “animal” splitting into “vertebrate” and “invertebrate”.]
45
Hierarchical Agglomerative
Clustering (HAC)
Assumes a similarity function for determining
the similarity of two instances.
Starts with all instances in a separate cluster
and then repeatedly joins the two clusters
that are most similar until there is only one
cluster.
The history of merging forms a binary tree or
hierarchy.
46
A Dendrogram: Hierarchical Clustering
Dendrogram: decomposes data objects into several levels of nested partitioning (a tree of clusters).
47
HAC Algorithm
Start with all instances in their own cluster.
Until there is only one cluster:
Among the current clusters, determine the two
clusters, ci and cj, that are most similar.
Replace ci and cj with a single cluster ci ∪ cj.
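A minimal sketch of this merge loop using centroid distance in a Euclidean space (quadratic-time and purely illustrative; the HAC slide itself does not fix a similarity measure):

import math

def centroid(cluster):
    # Component-wise mean of the points in the cluster.
    return tuple(sum(d) / len(cluster) for d in zip(*cluster))

def hac(points):
    # Start with every point in its own cluster.
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        # Find the two clusters whose centroids are closest.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        # Replace c_i and c_j with the single merged cluster c_i ∪ c_j.
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return merges  # the merge history forms the binary tree / dendrogram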
48
Hierarchical Clustering algorithms
Agglomerative (bottom-up):
Start with each item being a single cluster.
Divisive (top-down):
Start with all items belonging to the same cluster.
Eventually each node forms a cluster on its own.
Does not require the number of clusters k in advance
Needs a termination/readout condition
49
Dendrogram: Document Example
As clusters agglomerate, docs likely to fall
into a hierarchy of “topics” or concepts.
[Figure: dendrogram over documents d1-d5: d1 and d2 merge, d4 and d5 merge, then d3 joins d4,d5.]
50
“Closest pair” of clusters
Many variants to defining closest pair of clusters
“Center of gravity”
Clusters whose centroids (centers of gravity) are the most
cosine-similar
Single-link
Similarity of the most similar (single-link)
Complete-link
Similarity of the “furthest” (least similar) points
Average-link
Average similarity between pairs of elements
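These linkage variants are available off the shelf; a hedged SciPy sketch, assuming that library is installed (the slides do not prescribe any particular implementation, and SciPy's linkage defaults to Euclidean rather than cosine distance):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1, 1], [1.5, 2], [3, 4], [5, 7], [3.5, 5], [4.5, 5]])

# method can be 'single', 'complete', 'average', or 'centroid',
# matching the variants listed above.
Z = linkage(X, method='single')
labels = fcluster(Z, t=2, criterion='maxclust')  # cut the dendrogram into 2 clusters
print(labels)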
51
Major issue - labeling
After clustering algorithm finds clusters - how
can they be useful to the end user?
Need pithy label for each cluster
52
How to Label Clusters
Show titles of typical documents
Titles are easy to scan
But you can only show a few titles, which may not fully represent the cluster
Differential labeling
But harder to scan
53
Evaluation of clustering
54
Approaches to evaluating
Anecdotal
User inspection
Ground “truth” comparison
Cluster retrieval
Purely quantitative measures
Average distance between cluster members
Microeconomic / utility
55
Anecdotal evaluation
Probably the commonest (and surely the easiest)
“I wrote this clustering algorithm and look what it
found!”
No benchmarks, no comparison possible
Any clustering algorithm will pick up the easy stuff
like partition by languages
Generally, unclear scientific value.
56
User inspection
Induce a set of clusters or a navigation tree
Have subject matter experts evaluate the results and
score them
some degree of subjectivity
57
Ground “truth” comparison
Take a union of docs from a taxonomy & cluster
Yahoo!, ODP, newspaper sections …
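One standard score for this kind of ground-truth comparison (not named on the slide, but widely used) is purity; a small hedged Python sketch:

from collections import Counter

def purity(cluster_assignments, true_labels):
    # For each cluster, count its most common true label; purity is the fraction
    # of items that carry their cluster's majority label.
    clusters = {}
    for c, t in zip(cluster_assignments, true_labels):
        clusters.setdefault(c, []).append(t)
    majority_total = sum(Counter(labels).most_common(1)[0][1] for labels in clusters.values())
    return majority_total / len(true_labels)

# Hypothetical example: 6 docs in 2 clusters, labels taken from a taxonomy.
print(purity([0, 0, 0, 1, 1, 1], ["sports", "sports", "news", "news", "news", "news"]))  # 5/6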
58
Microeconomic viewpoint
Anything - including clustering - is only as good as
the economic utility it provides
For clustering: net economic gain produced by an
approach (vs. another approach)
Examples
recommendation systems
59
Other Clustering Approaches
EM – probability based clustering
Bayesian clustering
SOM – self-organizing maps
…
60
Soft Clustering
61