
Clustering

Everything Data
CompSci 216 Spring 2018

Announcements (Thu. Mar 2)


•  Homework #7 will be posted today.

•  Project teams and number assignments are posted. Please let me know of changes.
–  Once you submit your proposal on Tuesday, no more changes will be entertained.

Announcements (Thu. Mar 2)


•  Project presentations on Tuesday to instructors
–  3 minutes per team
–  Introduce your team members
–  Describe your problem, dataset, and how you will quantify success
–  You may use 1-2 slides (PDF format)

Geo-tags of tweets
[Figure: world scatter plot of tweet geo-tags; longitude -200 to 200 on the x-axis, latitude -60 to 80 on the y-axis]

Trending topics
•  How would you compute trending topics?
–  Most frequent hashtags
–  Frequent keywords or phrases (which are not stopwords)
–  …

•  But interesting trends in one region may not represent interesting trends in another.

Idea: Cluster tweets by geography


[Figure: the same tweet geo-tags colored by cluster ("geo_data_head10000_kmeans_10"); 10 clusters over the same longitude/latitude axes]

Trending topics by geography


•  We can now compute trending topics within each cluster (region), as sketched below.
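As a quick illustration (not from the slides), here is a minimal Python sketch; the `cluster` and `text` fields are hypothetical names for the cluster assignment and tweet body:

```python
from collections import Counter, defaultdict

def trending_hashtags(tweets, top_n=5):
    """Most frequent hashtags within each cluster (region).
    Assumes each tweet is a dict with hypothetical "cluster" and "text" keys."""
    counts = defaultdict(Counter)
    for tweet in tweets:
        tags = [w for w in tweet["text"].split() if w.startswith("#")]
        counts[tweet["cluster"]].update(tags)
    return {c: ctr.most_common(top_n) for c, ctr in counts.items()}
```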

Example: Market Segmentation


http://www.esriro.ro/library/fliers/pdfs/tapestry_segmentation.pdf#page=2

Example: Phylogenetic Trees



Other Examples
•  Image segmentation
•  Document clustering
•  De-duplication …

Outline
•  K-means Clustering

•  Distance Metrics

•  Using distance metrics for clustering


–  K-medoids
–  Hierarchical Clustering

How did we create 10 clusters?


[Figure: the 10-cluster tweet geo-tag plot again ("geo_data_head10000_kmeans_10")]

Can compare apples vs oranges …


•  … if they are in the same feature space.

•  X = {x1, x2, …, xn} is a dataset

•  Each xi is assumed to be a point in some d-dimensional space
–  xi = [xi1, xi2, …, xid]
–  Each dimension represents a feature

K-means
•  Partition a set of points X = {x1, x2, …, xn} into k partitions C = {C1, C2, …, Ck} that minimizes

$$\mathrm{RSS}(C) = \sum_{i=1}^{k} \sum_{j=1}^{n} a_{ij} \,\lVert x_j - \mu_i \rVert^2$$

where a_{ij} is the assignment function: a_{ij} = 1 if x_j is assigned to cluster C_i (and 0 otherwise).

K-means
•  Partition a set of points X = {x1, x2, …, xn} into k partitions C = {C1, C2, …, Ck} that minimizes

$$\mathrm{RSS}(C) = \sum_{i=1}^{k} \sum_{j=1}^{n} a_{ij} \,\lVert x_j - \mu_i \rVert^2$$

where the cluster representative $\mu_i = [\mu_{i1}, \mu_{i2}, \ldots, \mu_{id}]$ is the mean of the points in cluster C_i:

$$\mu_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$$

K-means
•  Partition a set of points X = {x1, x2, …, xn} into k partitions C = {C1, C2, …, Ck} that minimizes

$$\mathrm{RSS}(C) = \sum_{i=1}^{k} \sum_{j=1}^{n} a_{ij} \,\lVert x_j - \mu_i \rVert^2$$

where $\lVert x_j - \mu_i \rVert^2$ is the square of the straight-line distance between x_j and its center µ_i.

Chicken-and-Egg problem
•  How do we minimize RSS(C)?

–  If we know the cluster representatives (or the means), then it is easy to find the assignment function (which minimizes RSS(C))
•  Assign each point to the closest cluster representative
–  If we know the assignment function, computing the cluster representatives is easy
•  Compute the mean of the points in each cluster

K-means Algorithm
•  Idea: Alternate these two steps (sketched in code below).
–  Pick some initialization for the cluster representatives µ0.
–  E-step: assign each point to the closest representative in µi.
–  M-step: recompute the representatives µi+1 as the means of the current clusters.
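A minimal NumPy sketch of this alternation, assuming random initialization; an illustration, not the reference implementation used for the plots:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain k-means on an (n, d) array X."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]  # initialization mu_0
    for _ in range(n_iters):
        # E-step: assign each point to its closest representative.
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # M-step: recompute representatives as means of the current clusters
        # (an empty cluster keeps its old representative).
        mu_next = np.array([X[assign == i].mean(axis=0) if (assign == i).any()
                            else mu[i] for i in range(k)])
        if np.allclose(mu_next, mu):  # stop once the representatives settle
            break
        mu = mu_next
    return assign, mu
```

For the tweet data this would be called as, e.g., `assign, centers = kmeans(geo, 10)` on a hypothetical (n, 2) array `geo` of longitude/latitude pairs.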

K-means: Initialization
[Figure: random initialization; the tweet geo-tags with 10 cluster centers marked "+" ("geo_data_head10000_run_0")]

K-means: Iteration 1
[Figure: cluster assignments and centers after iteration 1 ("geo_data_head10000_run_1")]

K-means: Iteration 2
[Figure: cluster assignments and centers after iteration 2 ("geo_data_head10000_run_2")]

K-means: Iteration 10
[Figure: the final clustering after iteration 10 ("geo_data_head10000_kmeans_10")]

Initialization
•  Many heuristics
–  Random: K random points in the dataset

–  Farthest First (sketched in code below):
•  Pick the first center at random
•  Pick the ith center as the point “farthest away” from the (i-1) centers chosen so far

–  K-means++: (see paper)
•  Nice theoretical guarantees on quality of clustering
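A sketch of Farthest First, reading “farthest away” as maximizing the minimum distance to the centers chosen so far (a common interpretation; the slide does not pin it down):

```python
import numpy as np

def farthest_first(X, k, seed=0):
    """Pick k initial centers from the rows of X."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]          # first center at random
    for _ in range(k - 1):
        # Distance from every point to its nearest already-chosen center.
        min_dist = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[min_dist.argmax()])     # farthest such point
    return np.array(centers)
```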

Stopping
•  Alternate E and M steps until the cluster representatives do not change.

•  Guaranteed to converge
–  … to a local optimum …
–  … but not necessarily to the global optimum

•  Finding the optimal solution (with the least RSS(C)) is NP-hard, even for 2 clusters.

Where k-means fails …

[Figure: two example datasets (axes x and y) on which k-means fails]

Scaling / changing features can help

[Figure: the same data replotted with a scaled feature (y vs. 0.5 * x) and with a new feature R]

Limitations of k-means
•  Scaling/changing the feature space can change the solution.
•  Clusters points into spherical regions.
•  The number of clusters should be known a priori.

Outline
•  K-means Clustering

•  Distance Metrics

•  Using distance metrics for clustering


–  K-medoids
–  Hierarchical Clustering


Distance Metrics
•  Function d that maps pairs of points x, y to real numbers (usually between 0 and 1)

•  Symmetric: d(x,y) = d(y,x)

•  Triangle Inequality: d(x,y) + d(y,z) ≥ d(x,z)

•  Choice of distance metric is usually application dependent

Euclidean Distance
$$\lVert x - y \rVert_2 = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2}$$

•  Straight-line distance between two points x = [x1, x2, …, xd] and y = [y1, y2, …, yd]

•  K-means minimizes the sum of the squared Euclidean distances between the points and the centers
–  We use the mean as the center

Minkowski (Lp) Distance


$$L_p(x, y) = \left( \sum_{i=1}^{d} |x_i - y_i|^p \right)^{1/p}$$

•  L2 = ?

Minkowski (Lp) Distance

$$L_p(x, y) = \left( \sum_{i=1}^{d} |x_i - y_i|^p \right)^{1/p}$$

•  L2 = Euclidean
•  L1 = ?

Minkowski (Lp) Distance

$$L_p(x, y) = \left( \sum_{i=1}^{d} |x_i - y_i|^p \right)^{1/p}$$

•  L1 = city block / Manhattan

•  L∞ = ?
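In code (the L∞ case, the p → ∞ limit, works out to the maximum coordinate difference):

```python
import numpy as np

def minkowski(x, y, p):
    """L_p distance between two vectors."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

x, y = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(minkowski(x, y, 2))     # 5.0  (L2: Euclidean)
print(minkowski(x, y, 1))     # 7.0  (L1: city block / Manhattan)
print(np.max(np.abs(x - y)))  # 4.0  (L-infinity: max coordinate difference)
```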

Vector-based Similarities
•  Cosine Similarity (inverse of a distance)

$$\cos(x, y) = \frac{x \cdot y}{\lVert x \rVert_2 \,\lVert y \rVert_2}$$

(the dot product divided by the product of the L2 norms)

–  Can be used in conjunction with TF-IDF scores

Vector-based Similarities
•  Pearson’s Correlation Coefficient
–  Cosine similarity on mean-normalized vectors

$$r(x, y) = \frac{(x - \bar{x}) \cdot (y - \bar{y})}{\lVert x - \bar{x} \rVert_2 \,\lVert y - \bar{y} \rVert_2}$$

(here $\bar{x}$ denotes the mean of the $x_i$’s, subtracted from every coordinate)
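Both similarities in a few lines (a sketch; note Pearson is just cosine after centering):

```python
import numpy as np

def cosine_similarity(x, y):
    """Dot product divided by the product of the L2 norms."""
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

def pearson(x, y):
    """Cosine similarity on mean-normalized vectors."""
    return cosine_similarity(x - x.mean(), y - y.mean())

a, b = np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])
print(cosine_similarity(a, b))  # 1.0 (same direction)
print(pearson(a, b))            # 1.0 (perfectly linearly correlated)
```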

Set-based Distances
•  Let A and B be two sets.

$$\mathrm{Jaccard}(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

–  Again, a measure of similarity (inverse of distance)
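And the set-based version in code (treating two empty sets as perfectly similar is an assumption made here, not something the slide specifies):

```python
def jaccard(a, b):
    """|A intersect B| / |A union B| for two sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 1.0

print(jaccard({"cat", "dog"}, {"dog", "fox"}))  # 0.333...
```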

Scaling / Changing features …


•  … can be thought of as using a different distance function.

•  How do we cluster for general distance functions?

Outline
•  K-means Clustering

•  Distance Metrics

•  Using distance metrics for clustering


–  K-medoids
–  Hierarchical Clustering

K-means for general distance functions?

•  The mean of a set of points does not always make sense.
–  What is the mean of a set of movies or a set of documents?

•  The mean m of a set of points P minimizes the sum of squared Euclidean distances between m and every point in P
–  Best cluster representative under Euclidean distance
–  The above is not true for a general distance metric.

K-medoids
•  Allows a general distance metric d(x,y).

•  Same algorithm as K-means …


•  … but we don’t pick the new centers using the mean of the cluster.

K-medoids
–  Pick some initialization for the cluster representatives µ0.
–  E-step: assign each point to the closest representative in µi.
–  M-step: recompute each representative in µi+1 as the medoid of its cluster: the point in the cluster with the minimum total distance to all the other points (sketched in code below).
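A direct, unoptimized sketch of these steps for an arbitrary distance function `d` (real implementations such as PAM are smarter about the M-step):

```python
import numpy as np

def medoid(points, d):
    """The point with the minimum total distance to all others in `points`."""
    totals = [sum(d(p, q) for q in points) for p in points]
    return points[int(np.argmin(totals))]

def kmedoids(X, k, d, n_iters=100, seed=0):
    """K-medoids over a list of points X with distance function d."""
    rng = np.random.default_rng(seed)
    reps = [X[i] for i in rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # E-step: assign each point to its closest representative.
        clusters = [[] for _ in range(k)]
        for x in X:
            clusters[int(np.argmin([d(x, r) for r in reps]))].append(x)
        # M-step: each new representative is its cluster's medoid.
        next_reps = [medoid(c, d) if c else reps[i] for i, c in enumerate(clusters)]
        if all(np.array_equal(r, s) for r, s in zip(reps, next_reps)):
            break  # representatives unchanged: converged
        reps = next_reps
    return reps
```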

Medoid
•  m is the medoid of a set of points P if

$$m = \operatorname*{argmin}_{m' \in P} \sum_{x \in P} d(x, m')$$

The medoid is the point that minimizes the sum of distances to all other points in the set.

Computing the medoid


$$m = \operatorname*{argmin}_{m' \in P} \sum_{x \in P} d(x, m')$$

•  Need to compute all |P|² distances (illustrated below).

•  In comparison, computing the mean in k-means only requires computing d averages involving |P| numbers each.
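To make the |P|² cost concrete, here is one way to compute a medoid from a full pairwise distance matrix (a sketch using SciPy; applies when the points are vectors):

```python
import numpy as np
from scipy.spatial.distance import cdist

P = np.random.default_rng(0).random((500, 2))
D = cdist(P, P, metric="cityblock")  # all |P|^2 pairwise L1 distances
m = P[D.sum(axis=1).argmin()]        # medoid: the row with the smallest sum
```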

K-medoids summary
•  Same algorithm as K-means, but uses medoids instead of means

•  Centers are always points that appear in the original dataset

•  Can use any distance measure for clustering

•  Still need to know the number of clusters a priori …

Outline
•  K-means Clustering

•  Distance Metrics

•  Using distance metrics for clustering


–  K-medoids
–  Hierarchical Clustering

Hierarchical Clustering
•  Rather than compute a single clustering, compute a family of clusterings.

•  Can choose the clusters a posteriori.



Agglomerative Clustering
•  Initialize each point to its own cluster

•  Repeat:
–  Pick the two clusters that are closest
–  Merge them into one cluster
–  Stop when there is only one cluster left (see the SciPy sketch below)
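SciPy implements this loop directly; a sketch on toy 2-D data (the printed labels are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

X = np.array([[0.0, 0.0], [0.1, 0.0], [4.0, 4.0], [4.1, 4.0], [8.0, 0.0]])
Z = linkage(X, method="single")   # merge order; "complete"/"average" also work
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters
print(labels)  # nearby pairs share a label, e.g. [1 1 2 2 3]
# dendrogram(Z) draws the merge tree discussed on the next slides.
```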

Example
Step 1: {1} {2} {3} {4} {5} {6} {7}
Step 2: {1} {2, 3} {4} {5} {6} {7}
Step 3: {1, 7} {2, 3} {4} {5} {6}
Step 4: {1, 7} {2, 3} {4, 5} {6}
Step 5: {1, 7} {2, 3, 6} {4, 5}
Step 6: {1, 7} {2, 3, 4, 5, 6}
Step 7: {1, 2, 3, 4, 5, 6, 7}

Example based on Ryan Tibshirani’s slides

Dendrogram

[Figure: a dendrogram. The root node is the entire dataset, each node is a cluster, and the leaves are the individual points in the dataset. The height of a node is proportional to the distance between its children clusters.]

Dendrogram

A horizontal cut in the dendrogram results in a clustering.

Distance between clusters


Step 1: {1} {2} {3} {4} {5} {6} {7}
Step 2: {1} {2, 3} {4} {5} {6} {7}
Step 3: {1, 7} {2, 3} {4} {5} {6}
Step 4: {1, 7} {2, 3} {4, 5} {6}

What are the next two closest clusters?

Single Linkage
$$d_{\mathrm{single}}(C_i, C_j) = \min_{x \in C_i,\, y \in C_j} d(x, y)$$

The distance between two clusters is the distance between the two closest points in the clusters.

{6} is closer to {4, 5} than to {2, 3} according to single linkage.

Complete Linkage
$$d_{\mathrm{complete}}(C_i, C_j) = \max_{x \in C_i,\, y \in C_j} d(x, y)$$

The distance between two clusters is the distance between the two farthest points in the clusters.

{6} is closer to {2, 3} than to {4, 5} according to complete linkage.

Single vs Complete Linkage


[Figure: a 2-D example comparing the two linkages, with annotated inter-cluster distances 3 > 1; axes x 0-7, y 0-4]

Single Linkage
[Figure: the single-linkage merge on the same example (distances 3 > 1); axes x 0-7, y 0-4]

Single Linkage
[Figure: the resulting single-linkage clusters; axes x 0-7, y 0-4]

Chaining: single linkage can result in clusters that are spread out and not compact.

Complete Linkage
[Figure: the complete-linkage merge on the same example, with annotated inter-cluster distances; axes x 0-7, y 0-4]

Complete Linkage
[Figure: the resulting complete-linkage clusters; axes x 0-7, y 0-4]

Complete linkage returns more compact clusters in this case.

Single vs Complete Linkage


[Figure: a second example with annotated distances 1.02, 5.02, and 6.99; axes x 0-8, y 0-3]

In both cases …
[Figure: the same example points and distances]

Single Linkage
[Figure: the single-linkage result on this example]

Complete Linkage
[Figure: the complete-linkage result on this example]

Complete linkage is sensitive to outliers.

Average Linkage
$$d_{\mathrm{avg}}(C_i, C_j) = \frac{\sum_{x \in C_i,\, y \in C_j} d(x, y)}{|C_i| \cdot |C_j|}$$

The distance between two clusters is the average distance over all pairs of points from the two clusters.
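All three cluster-distance rules side by side, as a direct sketch:

```python
from itertools import product

def linkage_distance(A, B, d, kind="single"):
    """Distance between clusters A and B (lists of points) under a linkage rule."""
    pair_dists = [d(x, y) for x, y in product(A, B)]
    if kind == "single":      # closest pair of points
        return min(pair_dists)
    if kind == "complete":    # farthest pair of points
        return max(pair_dists)
    return sum(pair_dists) / (len(A) * len(B))  # average linkage

d = lambda x, y: abs(x - y)   # toy 1-D metric
A, B = [1.0, 2.0], [5.0, 9.0]
print(linkage_distance(A, B, d, "single"))    # 3.0
print(linkage_distance(A, B, d, "complete"))  # 8.0
print(linkage_distance(A, B, d, "average"))   # 5.5
```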

Hierarchical Clustering summary


•  Creates a family of hierarchical clusterings
–  Visualized using a dendrogram
–  Users can choose the number of clusters after clustering is done

•  Can use any distance function

•  Different choices for measuring the distance between clusters