
Clustering

Everything Data
CompSci 216 Spring 2018

Announcements (Thu. Mar 2)


•  Homework #7 will be posted today.

•  Project teams and number assignments are posted. Please let me know of changes.
–  Once you submit your proposal on Tuesday, no more changes will be entertained.

Announcements (Thu. Mar 2)


•  Project presentations on Tuesday to instructors
–  3 minutes per team
–  Introduce your team members
–  Describe your problem, dataset, and how you will quantify success
–  You may use 1-2 slides (PDF format)

Geo-tags of tweets
[Figure: world scatter plot of tweet geo-tags; longitude -200 to 200 on the x-axis, latitude -60 to 80 on the y-axis]

Trending topics
•  How would you compute trending topics?
–  Most frequent hashtags
–  Frequent keywords or phrases (which are not stopwords)
–  …

•  But interesting trends in one region may not represent interesting trends in another.

Idea: Cluster tweets by geography


[Figure: the same tweet geo-tags colored by cluster ("geo_data_head10000_kmeans_10"); 10 clusters over the same longitude/latitude axes]

Trending topics by geography


•  We can now compute trending topics within each cluster (region), as sketched below.
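As a quick illustration (not from the slides), here is a minimal Python sketch; the `cluster` and `text` fields are hypothetical names for the cluster assignment and tweet body:

```python
from collections import Counter, defaultdict

def trending_hashtags(tweets, top_n=5):
    """Most frequent hashtags within each cluster (region).
    Assumes each tweet is a dict with hypothetical "cluster" and "text" keys."""
    counts = defaultdict(Counter)
    for tweet in tweets:
        tags = [w for w in tweet["text"].split() if w.startswith("#")]
        counts[tweet["cluster"]].update(tags)
    return {c: ctr.most_common(top_n) for c, ctr in counts.items()}
```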

Example: Market Segmentation


http://www.esriro.ro/library/fliers/pdfs/tapestry_segmentation.pdf#page=2

Example: Phylogenetic Trees



Other Examples
•  Image segmentation
•  Document clustering
•  De-duplication …

Outline
•  K-means Clustering

•  Distance Metrics

•  Using distance metrics for clustering


–  K-medoids
–  Hierarchical Clustering

How did we create 10 clusters?


[Figure: the 10-cluster tweet geo-tag plot again ("geo_data_head10000_kmeans_10")]

Can compare apples vs oranges …


•  … if they are in the same feature space.

•  X = {x1, x2, …, xn} is a dataset

•  Each xi is assumed to be a point in some d-dimensional space
–  xi = [xi1, xi2, …, xid]
–  Each dimension represents a feature

K-means
•  Partition a set of points X = {x1, x2, …, xn} into k partitions C = {C1, C2, …, Ck} that minimizes

$$\mathrm{RSS}(C) = \sum_{i=1}^{k} \sum_{j=1}^{n} a_{ij} \,\lVert x_j - \mu_i \rVert^2$$

where a_{ij} is the assignment function: a_{ij} = 1 if x_j is assigned to cluster C_i (and 0 otherwise).

K-means
•  Partition a set of points X = {x1, x2, …, xn} into k partitions C = {C1, C2, …, Ck} that minimizes

$$\mathrm{RSS}(C) = \sum_{i=1}^{k} \sum_{j=1}^{n} a_{ij} \,\lVert x_j - \mu_i \rVert^2$$

where the cluster representative $\mu_i = [\mu_{i1}, \mu_{i2}, \ldots, \mu_{id}]$ is the mean of the points in cluster C_i:

$$\mu_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$$

K-means
•  Partition a set of points X = {x1, x2, …, xn} into k partitions C = {C1, C2, …, Ck} that minimizes

$$\mathrm{RSS}(C) = \sum_{i=1}^{k} \sum_{j=1}^{n} a_{ij} \,\lVert x_j - \mu_i \rVert^2$$

where $\lVert x_j - \mu_i \rVert^2$ is the square of the straight-line distance between x_j and its center µ_i.

Chicken-and-Egg problem
•  How do we minimize RSS(C)?

–  If we know the cluster representatives (or the means), then it is easy to find the assignment function (which minimizes RSS(C))
•  Assign each point to the closest cluster representative
–  If we know the assignment function, computing the cluster representatives is easy
•  Compute the mean of the points in each cluster

K-means Algorithm
•  Idea: Alternate these two steps (sketched in code below).
–  Pick some initialization for the cluster representatives µ0.
–  E-step: assign each point to the closest representative in µi.
–  M-step: recompute the representatives µi+1 as the means of the current clusters.
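A minimal NumPy sketch of this alternation, assuming random initialization; an illustration, not the reference implementation used for the plots:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain k-means on an (n, d) array X."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]  # initialization mu_0
    for _ in range(n_iters):
        # E-step: assign each point to its closest representative.
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # M-step: recompute representatives as means of the current clusters
        # (an empty cluster keeps its old representative).
        mu_next = np.array([X[assign == i].mean(axis=0) if (assign == i).any()
                            else mu[i] for i in range(k)])
        if np.allclose(mu_next, mu):  # stop once the representatives settle
            break
        mu = mu_next
    return assign, mu
```

For the tweet data this would be called as, e.g., `assign, centers = kmeans(geo, 10)` on a hypothetical (n, 2) array `geo` of longitude/latitude pairs.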

K-means: Initialization
[Figure: random initialization; the tweet geo-tags with 10 cluster centers marked "+" ("geo_data_head10000_run_0")]

K-means: Iteration 1
[Figure: cluster assignments and centers after iteration 1 ("geo_data_head10000_run_1")]

K-means: Iteration 2
[Figure: cluster assignments and centers after iteration 2 ("geo_data_head10000_run_2")]

K-means: Iteration 10
[Figure: the final clustering after iteration 10 ("geo_data_head10000_kmeans_10")]

Initialization
•  Many heuristics
–  Random: K random points in the dataset

–  Farthest First (sketched in code below):
•  Pick the first center at random
•  Pick the ith center as the point “farthest away” from the (i-1) centers chosen so far

–  K-means++: (see paper)
•  Nice theoretical guarantees on quality of clustering
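A sketch of Farthest First, reading “farthest away” as maximizing the minimum distance to the centers chosen so far (a common interpretation; the slide does not pin it down):

```python
import numpy as np

def farthest_first(X, k, seed=0):
    """Pick k initial centers from the rows of X."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]          # first center at random
    for _ in range(k - 1):
        # Distance from every point to its nearest already-chosen center.
        min_dist = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[min_dist.argmax()])     # farthest such point
    return np.array(centers)
```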

Stopping
•  Alternate E and M steps until the cluster representatives do not change.

•  Guaranteed to converge
–  … to a local optimum …
–  … but not necessarily to the global optimum

•  Finding the optimal solution (with the least RSS(C)) is NP-hard, even for 2 clusters.

Where k-means fails …

[Figure: two example datasets (axes x and y) on which k-means fails]

Scaling / changing features can help

[Figure: the same data replotted with a scaled feature (y vs. 0.5 * x) and with a new feature R]

Limitations of k-means
•  Scaling/changing the feature space can change the solution.
•  Clusters points into spherical regions.
•  The number of clusters should be known a priori.

Outline
•  K-means Clustering

•  Distance Metrics

•  Using distance metrics for clustering


–  K-medoids
–  Hierarchical Clustering


Distance Metrics
•  Function d that maps pairs of points x, y to real numbers (usually between 0 and 1)

•  Symmetric: d(x,y) = d(y,x)

•  Triangle Inequality: d(x,y) + d(y,z) ≥ d(x,z)

•  Choice of distance metric is usually application dependent

Euclidean Distance
$$\lVert x - y \rVert_2 = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2}$$

•  Straight-line distance between two points x = [x1, x2, …, xd] and y = [y1, y2, …, yd]

•  K-means minimizes the sum of the squared Euclidean distances between the points and the centers
–  We use the mean as the center

Minkowski (Lp) Distance


$$L_p(x, y) = \left( \sum_{i=1}^{d} |x_i - y_i|^p \right)^{1/p}$$

•  L2 = ?

Minkowski (Lp) Distance

$$L_p(x, y) = \left( \sum_{i=1}^{d} |x_i - y_i|^p \right)^{1/p}$$

•  L2 = Euclidean
•  L1 = ?

Minkowski (Lp) Distance

$$L_p(x, y) = \left( \sum_{i=1}^{d} |x_i - y_i|^p \right)^{1/p}$$

•  L1 = city block / Manhattan

•  L∞ = ?
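In code (the L∞ case, the p → ∞ limit, works out to the maximum coordinate difference):

```python
import numpy as np

def minkowski(x, y, p):
    """L_p distance between two vectors."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

x, y = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(minkowski(x, y, 2))     # 5.0  (L2: Euclidean)
print(minkowski(x, y, 1))     # 7.0  (L1: city block / Manhattan)
print(np.max(np.abs(x - y)))  # 4.0  (L-infinity: max coordinate difference)
```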

Vector-based Similarities
•  Cosine Similarity (inverse of a distance)

$$\cos(x, y) = \frac{x \cdot y}{\lVert x \rVert_2 \,\lVert y \rVert_2}$$

(the dot product divided by the product of the L2 norms)

–  Can be used in conjunction with TF-IDF scores

Vector-based Similarities
•  Pearson’s Correlation Coefficient
–  Cosine similarity on mean-normalized vectors

$$r(x, y) = \frac{(x - \bar{x}) \cdot (y - \bar{y})}{\lVert x - \bar{x} \rVert_2 \,\lVert y - \bar{y} \rVert_2}$$

(here $\bar{x}$ denotes the mean of the $x_i$’s, subtracted from every coordinate)
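Both similarities in a few lines (a sketch; note Pearson is just cosine after centering):

```python
import numpy as np

def cosine_similarity(x, y):
    """Dot product divided by the product of the L2 norms."""
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

def pearson(x, y):
    """Cosine similarity on mean-normalized vectors."""
    return cosine_similarity(x - x.mean(), y - y.mean())

a, b = np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])
print(cosine_similarity(a, b))  # 1.0 (same direction)
print(pearson(a, b))            # 1.0 (perfectly linearly correlated)
```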

Set-based Distances
•  Let A and B be two sets.

$$\mathrm{Jaccard}(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

–  Again, a measure of similarity (inverse of distance)
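And the set-based version in code (treating two empty sets as perfectly similar is an assumption made here, not something the slide specifies):

```python
def jaccard(a, b):
    """|A intersect B| / |A union B| for two sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 1.0

print(jaccard({"cat", "dog"}, {"dog", "fox"}))  # 0.333...
```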

Scaling / Changing features …


•  … can be thought of as using a different distance function.

•  How do we cluster for general distance functions?

Outline
•  K-means Clustering

•  Distance Metrics

•  Using distance metrics for clustering


–  K-medoids
–  Hierarchical Clustering

K-means for general distance functions?

•  The mean of a set of points does not always make sense.
–  What is the mean of a set of movies or a set of documents?

•  The mean m of a set of points P minimizes the sum of squared Euclidean distances between m and every point in P
–  Best cluster representative under Euclidean distance
–  The above is not true for a general distance metric.

K-medoids
•  Allows a general distance metric d(x,y).

•  Same algorithm as K-means …


•  … but we don’t pick the new centers using the mean of the cluster.

K-medoids
–  Pick some initialization for the cluster representatives µ0.
–  E-step: assign each point to the closest representative in µi.
–  M-step: recompute each representative in µi+1 as the medoid of its cluster: the point in the cluster with the minimum total distance to all the other points (sketched in code below).
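A direct, unoptimized sketch of these steps for an arbitrary distance function `d` (real implementations such as PAM are smarter about the M-step):

```python
import numpy as np

def medoid(points, d):
    """The point with the minimum total distance to all others in `points`."""
    totals = [sum(d(p, q) for q in points) for p in points]
    return points[int(np.argmin(totals))]

def kmedoids(X, k, d, n_iters=100, seed=0):
    """K-medoids over a list of points X with distance function d."""
    rng = np.random.default_rng(seed)
    reps = [X[i] for i in rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # E-step: assign each point to its closest representative.
        clusters = [[] for _ in range(k)]
        for x in X:
            clusters[int(np.argmin([d(x, r) for r in reps]))].append(x)
        # M-step: each new representative is its cluster's medoid.
        next_reps = [medoid(c, d) if c else reps[i] for i, c in enumerate(clusters)]
        if all(np.array_equal(r, s) for r, s in zip(reps, next_reps)):
            break  # representatives unchanged: converged
        reps = next_reps
    return reps
```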

Medoid
•  m is the medoid of a set of points P if

$$m = \operatorname*{argmin}_{m' \in P} \sum_{x \in P} d(x, m')$$

The medoid is the point that minimizes the sum of distances to all other points in the set.

Computing the medoid


$$m = \operatorname*{argmin}_{m' \in P} \sum_{x \in P} d(x, m')$$

•  Need to compute all |P|² distances (illustrated below).

•  In comparison, computing the mean in k-means only requires computing d averages involving |P| numbers each.
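To make the |P|² cost concrete, here is one way to compute a medoid from a full pairwise distance matrix (a sketch using SciPy; applies when the points are vectors):

```python
import numpy as np
from scipy.spatial.distance import cdist

P = np.random.default_rng(0).random((500, 2))
D = cdist(P, P, metric="cityblock")  # all |P|^2 pairwise L1 distances
m = P[D.sum(axis=1).argmin()]        # medoid: the row with the smallest sum
```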

K-medoids summary
•  Same algorithm as K-means, but uses medoids instead of means

•  Centers are always points that appear in the original dataset

•  Can use any distance measure for clustering

•  Still need to know the number of clusters a priori …

Outline
•  K-means Clustering

•  Distance Metrics

•  Using distance metrics for clustering


–  K-medoids
–  Hierarchical Clustering

Hierarchical Clustering
•  Rather than compute a single clustering, compute a family of clusterings.

•  Can choose the clusters a posteriori.



Agglomerative Clustering
•  Initialize each point to its own cluster

•  Repeat:
–  Pick the two clusters that are closest
–  Merge them into one cluster
–  Stop when there is only one cluster left (see the SciPy sketch below)
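SciPy implements this loop directly; a sketch on toy 2-D data (the printed labels are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

X = np.array([[0.0, 0.0], [0.1, 0.0], [4.0, 4.0], [4.1, 4.0], [8.0, 0.0]])
Z = linkage(X, method="single")   # merge order; "complete"/"average" also work
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters
print(labels)  # nearby pairs share a label, e.g. [1 1 2 2 3]
# dendrogram(Z) draws the merge tree discussed on the next slides.
```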

Example
Step 1: {1} {2} {3} {4} {5} {6} {7}
Step 2: {1} {2, 3} {4} {5} {6} {7}
Step 3: {1, 7} {2, 3} {4} {5} {6}
Step 4: {1, 7} {2, 3} {4, 5} {6}
Step 5: {1, 7} {2, 3, 6} {4, 5}
Step 6: {1, 7} {2, 3, 4, 5, 6}
Step 7: {1, 2, 3, 4, 5, 6, 7}

Example based on Ryan Tibshirani’s slides

Dendrogram

[Figure: a dendrogram. The root node is the entire dataset, each node is a cluster, and the leaves are the individual points in the dataset. The height of a node is proportional to the distance between its children clusters.]

Dendrogram

A horizontal cut in the dendrogram results in a clustering.

Distance between clusters


Step 1: {1} {2} {3} {4} {5} {6} {7}
Step 2: {1} {2, 3} {4} {5} {6} {7}
Step 3: {1, 7} {2, 3} {4} {5} {6}
Step 4: {1, 7} {2, 3} {4, 5} {6}

What are the next two closest clusters?

Single Linkage
$$d_{\mathrm{single}}(C_i, C_j) = \min_{x \in C_i,\, y \in C_j} d(x, y)$$

The distance between two clusters is the distance between the two closest points in the clusters.

{6} is closer to {4, 5} than to {2, 3} according to single linkage.

Complete Linkage
$$d_{\mathrm{complete}}(C_i, C_j) = \max_{x \in C_i,\, y \in C_j} d(x, y)$$

The distance between two clusters is the distance between the two farthest points in the clusters.

{6} is closer to {2, 3} than to {4, 5} according to complete linkage.

Single vs Complete Linkage


[Figure: a 2-D example comparing the two linkages, with annotated inter-cluster distances 3 > 1; axes x 0-7, y 0-4]

Single Linkage
[Figure: the single-linkage merge on the same example (distances 3 > 1); axes x 0-7, y 0-4]

Single Linkage
[Figure: the resulting single-linkage clusters; axes x 0-7, y 0-4]

Chaining: single linkage can result in clusters that are spread out and not compact.

Complete Linkage
[Figure: the complete-linkage merge on the same example, with annotated inter-cluster distances; axes x 0-7, y 0-4]

Complete Linkage
[Figure: the resulting complete-linkage clusters; axes x 0-7, y 0-4]

Complete linkage returns more compact clusters in this case.

Single vs Complete Linkage


[Figure: a second example with annotated distances 1.02, 5.02, and 6.99; axes x 0-8, y 0-3]

In both cases …
[Figure: the same example points and distances]

Single Linkage
[Figure: the single-linkage result on this example]

Complete Linkage
[Figure: the complete-linkage result on this example]

Complete linkage is sensitive to outliers.

Average Linkage
$$d_{\mathrm{avg}}(C_i, C_j) = \frac{\sum_{x \in C_i,\, y \in C_j} d(x, y)}{|C_i| \cdot |C_j|}$$

The distance between two clusters is the average distance over all pairs of points from the two clusters.
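All three cluster-distance rules side by side, as a direct sketch:

```python
from itertools import product

def linkage_distance(A, B, d, kind="single"):
    """Distance between clusters A and B (lists of points) under a linkage rule."""
    pair_dists = [d(x, y) for x, y in product(A, B)]
    if kind == "single":      # closest pair of points
        return min(pair_dists)
    if kind == "complete":    # farthest pair of points
        return max(pair_dists)
    return sum(pair_dists) / (len(A) * len(B))  # average linkage

d = lambda x, y: abs(x - y)   # toy 1-D metric
A, B = [1.0, 2.0], [5.0, 9.0]
print(linkage_distance(A, B, d, "single"))    # 3.0
print(linkage_distance(A, B, d, "complete"))  # 8.0
print(linkage_distance(A, B, d, "average"))   # 5.5
```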

Hierarchical Clustering summary


•  Creates a family of hierarchical clusterings
–  Visualized using a dendrogram
–  Users can choose the number of clusters after clustering is done

•  Can use any distance function

•  Different choices for measuring the distance between clusters