
Chapter 23 Probabilistic Language Processing

Clustering examples

Additional sources used in preparing the slides:


• David Grossman’s clustering slides:
http://ir.iit.edu/~dagr/IRcourse/Notes/08Clustering.pdf
• Subbarao Kambhampati’s clustering slides:
http://rakaposhi.eas.asu.edu/cse494/notes/f02-clustering.ppt
• Jeffrey Ullman’s clustering slides:
www-db.stanford.edu/~ullman/cs345-notes.html
• Ernest Davis’ clustering slides:
www.cs.nyu.edu/courses/fall02/G22.3033-008/index.htm

1
Unsupervised learning

2
Example: a cholera outbreak in London

Many years ago, during a cholera outbreak in London, a physician plotted the location of cases on a map. Properly visualized, the data indicated that cases clustered around certain intersections, where there were polluted wells, not only exposing the cause of cholera, but also indicating what to do about the problem.

[Map sketch: X marks showing the cases clustering around certain intersections]
3
Conceptual Clustering

The clustering problem


Given
• a collection of unclassified objects, and
• a means for measuring the similarity of objects
(distance metric),
find
• classes (clusters) of objects such that some
standard of quality is met (e.g., maximize the
similarity of objects in the same class.)
Essentially, it is an approach to discover a useful
summary of the data.

4
Conceptual Clustering (cont’d)

Ideally, we would like to represent clusters along with their semantic explanations. In other words, we would like to define clusters intensionally (i.e., by general rules) rather than extensionally (i.e., by enumeration).
For instance, compare
{ X | X teaches AI at MTU CS}, and
{ John Lowther, Nilufer Onder}

5
Curse of dimensionality

• While clustering looks intuitive in 2 dimensions, many applications involve 10 or 10,000 dimensions
• High-dimensional spaces look different: the probability of random points being close drops quickly as the dimensionality grows

6
Higher dimensional examples

• The observation that customers who buy diapers are more likely to buy beer than average allowed supermarkets to place beer and diapers nearby, knowing many customers would walk between them. Placing potato chips between them increased the sales of all three items.

7
SkyServer

8
Sloan Digital Sky Survey

• A cool tool to “map the universe”
• Objects are represented by their radiation in 9 dimensions (each dimension represents radiation in one band of the spectrum)
• Clustered 2 x 10^9 sky objects into similar objects, e.g., stars, galaxies, quasars, etc.
• The objective was to catalog and cluster the entire visible universe. Clustering sky objects by their radiation levels in different bands allowed astronomers to distinguish between galaxies, nearby stars, and many other kinds of celestial objects.

9
Clustering CDs

• Intuition: music divides into categories and customers prefer a few categories
• But what are categories, really?
• Represent a CD by the customers who bought it
• Similar CDs have similar sets of customers, and vice versa

10
The space of CDs

• Think of a space with one dimension for each customer
• Values in a dimension may be 0 or 1 only
• A CD’s point in this space is (x1, x2, …, xn), where xi = 1 iff the ith customer bought the CD
• Compare this with the correlated items matrix:
rows = customers
columns = CDs

11
Clustering documents

• Query “salsa” submitted to MetaCrawler returns the following documents among others:
 How to dance salsa
 Gourmet salsa
 Diet seen on Rachael Ray
 Michigan Salsa

• It also asks: “Are you looking for?”
 Music salsa
 Salsa recipe
 Homemade salsa recipe
 Salsa dancing

• The clusters are: dance, recipe, clubs, sauces, buy, Mexican, bands, natural, …
12
Clustering documents (cont’d)

• Documents may be thought of as points in a high-dimensional space, where each dimension corresponds to one possible word.
• Clusters of documents in this space often correspond to groups of documents on the same topic, i.e., documents with similar sets of words may be about the same topic.
• Represent a document by a vector (x1, x2, …, xn), where xi = 1 iff the ith word (in some order) appears in the document.
• n can be infinite

13
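For concreteness, here is a minimal sketch of this binary bag-of-words representation in Python; the vocabulary, documents, and function name are illustrative, not from the slides.

# Minimal sketch: binary bag-of-words vectors for documents.
# The vocabulary and documents below are made up for illustration.

def binary_vector(document, vocabulary):
    """Return (x1, ..., xn) where xi = 1 iff the i-th word appears in the document."""
    words = set(document.lower().split())
    return [1 if term in words else 0 for term in vocabulary]

vocabulary = ["salsa", "dance", "recipe", "music", "tomato"]
doc_a = "How to dance salsa salsa music"
doc_b = "Gourmet salsa recipe with tomato"

print(binary_vector(doc_a, vocabulary))  # [1, 1, 0, 1, 0]
print(binary_vector(doc_b, vocabulary))  # [1, 0, 1, 0, 1]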
Analyzing protein sequences

• Objects are sequences over the alphabet {C, A, T, G} (i.e., DNA sequences)
• Distance between sequences is “edit distance,” the minimum number of inserts and deletes needed to turn one into the other
• Note that there is a “distance,” but no convenient space of points

14
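A small sketch of this insert/delete-only edit distance; a dynamic-programming table is one standard way to compute it, and the function name and test strings are illustrative.

def indel_distance(s, t):
    """Edit distance allowing only insertions and deletions (no substitutions)."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i          # delete all of s[:i]
    for j in range(n + 1):
        d[0][j] = j          # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s[i - 1] == t[j - 1]:
                d[i][j] = d[i - 1][j - 1]       # characters match, no edit needed
            else:
                d[i][j] = 1 + min(d[i - 1][j],  # delete s[i-1]
                                  d[i][j - 1])  # insert t[j-1]
    return d[m][n]

print(indel_distance("CATG", "CTGA"))  # 2 (delete one A, then append an A)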
Measuring distance

• To decide whether a set of points is close enough to be considered a cluster, we need a distance measure D(x,y) that tells how far apart points x and y are.
• The axioms for a distance measure D are:

1. D(x,x) = 0 (a point is distance 0 from itself)
2. D(x,y) = D(y,x) (distance is symmetric)
3. D(x,y) ≤ D(x,z) + D(z,y) (the triangle inequality)
4. D(x,y) ≥ 0 (distance is non-negative)

15
K-dimensional Euclidean space

The distance between any two points, say a = [a1, a2, …, ak] and b = [b1, b2, …, bk], is given in some manner such as:

1. Common distance (“L2 norm”): $\sqrt{\sum_{i=1}^{k} (a_i - b_i)^2}$

2. Manhattan distance (“L1 norm”): $\sum_{i=1}^{k} |a_i - b_i|$

3. Max of dimensions (“L∞ norm”): $\max_{i=1}^{k} |a_i - b_i|$

[Sketches of two points a and b illustrating each distance]

16
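The three distances can be written directly in Python; this is a minimal sketch with made-up example points (any dimension works).

import math

def l2(a, b):
    """Euclidean (L2) distance."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def l1(a, b):
    """Manhattan (L1) distance."""
    return sum(abs(ai - bi) for ai, bi in zip(a, b))

def linf(a, b):
    """Max-of-dimensions (L-infinity) distance."""
    return max(abs(ai - bi) for ai, bi in zip(a, b))

a, b = [0.0, 0.0], [3.0, 4.0]
print(l2(a, b), l1(a, b), linf(a, b))   # 5.0 7.0 4.0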
Non-Euclidean spaces

Here are some examples where a distance measure without a Euclidean space makes sense.
• Web pages: Roughly a 10^8-dimensional space where each dimension corresponds to one word. Rather, use vectors that deal with only the words actually present in documents a and b.
• Character strings, such as DNA sequences: Rather, use a metric based on the LCS (longest common subsequence).
• Objects represented as sets of symbolic, rather than numeric, features: Rather, base similarity on the proportion of features that they have in common.

17
Non-Euclidean spaces (cont’d)

object1 = {small, red, rubber, ball}


object2 = {small, blue, rubber, ball}
object3 = {large, black, wooden, ball}

similarity(object1, object2) = 3/4
similarity(object1, object3) = similarity(object2, object3) = 1/4

Note that it is possible to assign different weights to features.

18
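A sketch of this feature-overlap similarity, with optional weights. The denominator (the larger of the two objects' feature totals) is an assumption chosen to reproduce the 3/4 and 1/4 values above; the slide does not spell out the exact formula.

def feature_similarity(obj1, obj2, weights=None):
    """Fraction of feature weight shared by two objects (sets of symbolic features).
    With no weights this is |shared features| / |features per object|,
    matching the 3/4 and 1/4 values above (all example objects have 4 features)."""
    if weights is None:
        weights = {}
    w = lambda f: weights.get(f, 1.0)          # unweighted features count as 1
    shared = sum(w(f) for f in obj1 & obj2)
    total = max(sum(w(f) for f in obj1), sum(w(f) for f in obj2))
    return shared / total

object1 = {"small", "red", "rubber", "ball"}
object2 = {"small", "blue", "rubber", "ball"}
object3 = {"large", "black", "wooden", "ball"}

print(feature_similarity(object1, object2))  # 0.75
print(feature_similarity(object1, object3))  # 0.25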
Approaches to Clustering

Broadly speaking, there are two classes of clustering algorithms:
1. Centroid approaches: We guess the centroid
(central point) in each cluster, and assign points
to the cluster of their nearest centroid.
2. Hierarchical approaches: We begin assuming
that each point is a cluster by itself. We
repeatedly merge nearby clusters, using some
measure of how close two clusters are (e.g.,
distance between their centroids), or how good a
cluster the resulting group would be (e.g., the
average distance of points in the cluster from the
resulting centroid.)

19
The k-means algorithm
• Pick k cluster centroids.
• Assign points to clusters by picking the closest centroid to the point in question. As points are assigned to clusters, the centroid of the cluster may migrate.

Example: Suppose that k = 2 and we assign points 1, 2, 3, 4, 5, in that order. Outline circles represent points, filled circles represent centroids.

[Sketch: the five points, labeled 1-5, before clustering]
20
The k-means algorithm example (cont’d)

[Four snapshots showing the two centroids migrating as points 1-5 are assigned in order]

21
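A minimal k-means sketch on 2-D points, following the outline above; the random initialization, fixed iteration count, and example points are illustrative choices, not from the slides.

import math
import random

def kmeans(points, k, iterations=10, seed=0):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    random.seed(seed)
    centroids = random.sample(points, k)          # pick k points as initial centroids
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:                            # keep the old centroid if a cluster empties
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, clusters

points = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
centroids, clusters = kmeans(points, k=2)
print(centroids)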
Issues

• How to initialize the k centroids? Pick points sufficiently far away from any other centroid, until there are k.
• As computation progresses, one can decide to
split one cluster and merge two, to keep the total
at k. A test for whether to do so might be to ask
whether doing so reduces the average distance
from points to their centroids.
• Having located the centroids of k clusters, we
can reassign all points, since some points that
were assigned early may actually wind up closer
to another centroid, as the centroids move about.

22
Issues (cont’d)

• How to determine k? One can try different values for k, stopping at the smallest k such that increasing k does not much decrease the average distance of points to their centroids.

[Scatter of points forming three apparent clusters]

23
Determining k
[Scatter of points forming three apparent clusters, shown next to each case]

When k = 1, all the points are in one cluster, and the average distance to the centroid will be high.

When k = 2, one of the clusters will be by itself and the other two will be forced into one cluster. The average distance of points to the centroid will shrink considerably.
24
Determining k (cont’d)
When k = 3, each of the apparent clusters should be a cluster by itself, and the average distance from the points to their centroids shrinks again.

When k = 4, one of the true clusters will be artificially partitioned into two nearby clusters. The average distance to the centroids will drop a bit, but not much.
25
Determining k (cont’d)

[Plot: average radius versus k, dropping sharply up to k = 3 and flattening afterwards]

This failure to drop further suggests that k = 3 is right. This conclusion can be drawn even if the data is in so many dimensions that we cannot visualize the clusters.

26
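One way to automate this "stop when the average radius stops dropping" rule, as a sketch; the 20% drop threshold and the example radii are illustrative assumptions, not from the slides.

import math

def average_radius(clusters, centroids):
    """Mean distance from each point to its cluster's centroid (one radius per value of k)."""
    total, count = 0.0, 0
    for members, c in zip(clusters, centroids):
        for p in members:
            total += math.dist(p, c)
            count += 1
    return total / count

def pick_k(radii, drop_threshold=0.2):
    """Pick the smallest k beyond which the average radius stops dropping much.
    `radii` maps k -> average radius; the 20% threshold is an illustrative choice."""
    ks = sorted(radii)
    for prev, k in zip(ks, ks[1:]):
        if radii[prev] - radii[k] < drop_threshold * radii[prev]:
            return prev
    return ks[-1]

# Hypothetical radii from runs with k = 1..4, mirroring the curve sketched above.
print(pick_k({1: 10.0, 2: 6.0, 3: 2.0, 4: 1.8}))   # 3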
The CLUSTER/2 algorithm

1. Select k seeds from the set of observed objects. This may be done randomly or according to some selection function.
2. For each seed, using that seed as a positive
instance and all other seeds as negative
instances, produce a maximally general
definition that covers all of the positive and
none of the negative instances (multiple
classifications of non-seed objects are
possible.)

27
The CLUSTER/2 algorithm (cont’d)

3. Classify all objects in the sample according to these descriptions. Replace each maximally general description with a maximally specific description that covers all objects in the category (to decrease the likelihood that classes overlap on unseen objects).
4. Adjust remaining overlapping definitions.
5. Using a distance metric, select an element
closest to the center of each class.
6. Repeat steps 1-5 using the new central elements
as seeds. Stop when clusters are satisfactory.

28
The CLUSTER/2 algorithm (cont’d)

7. If the clusters are unsatisfactory and no improvement occurs over several iterations, select the new seeds closest to the edge of the cluster.

29
The steps of a CLUSTER/2 run

30
Document clustering

Automatically group related documents into clusters, given some measure of similarity. For example:
• medical documents
• legal documents
• financial documents
• web search results

31
Hierarchical Agglomerative Clustering (HAC)

• Given n documents, create an n x n doc-doc similarity matrix.
• Each document starts as a cluster of size one.
• do until there is only one cluster
 Combine the two clusters with the greatest similarity (if X and Y are the most mergeable pair of clusters, we create X-Y as the parent of X and Y; hence the name “hierarchical”.)
 Update the doc-doc matrix.

32
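A sketch of HAC with the min-similarity (complete-link) merge rule used in the worked example that follows; the helper names are mine, and the similarity matrix is the one from the next slide.

def hac(sim, labels):
    """Hierarchical agglomerative clustering with complete linkage
    (similarity of merged clusters = min of the members' pairwise similarities).
    `sim` is a dict keyed by frozenset({a, b}); `labels` is the list of items."""
    clusters = [frozenset([x]) for x in labels]

    def similarity(c1, c2):
        return min(sim[frozenset([a, b])] for a in c1 for b in c2)

    history = []
    while len(clusters) > 1:
        # find the most similar pair of clusters and merge them
        best = max(((c1, c2) for i, c1 in enumerate(clusters)
                    for c2 in clusters[i + 1:]),
                   key=lambda pair: similarity(*pair))
        c1, c2 = best
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
        history.append(sorted(c1 | c2))
    return history

labels = ["A", "B", "C", "D", "E"]
pairs = {("A", "B"): 2, ("A", "C"): 7, ("A", "D"): 9, ("A", "E"): 4,
         ("B", "C"): 9, ("B", "D"): 11, ("B", "E"): 14,
         ("C", "D"): 4, ("C", "E"): 8, ("D", "E"): 2}
sim = {frozenset(k): v for k, v in pairs.items()}
print(hac(sim, labels))
# [['B', 'E'], ['A', 'D'], ['B', 'C', 'E'], ['A', 'B', 'C', 'D', 'E']]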
Example

Consider A, B, C, D, E as documents with the following similarities:

      A    B    C    D    E
 A    -    2    7    9    4
 B    2    -    9   11   14
 C    7    9    -    4    8
 D    9   11    4    -    2
 E    4   14    8    2    -

The pair with the highest similarity is: B-E = 14

33
Example

So let’s cluster B and E. We now have the following structure:

[Tree: BE is the parent of B and E; A, C, and D remain singletons]

34
Example

Update the doc-doc matrix:

      A   BE    C    D
 A    -    2    7    9
 BE   2    -    8    2
 C    7    8    -    4
 D    9    2    4    -

To compute (A, BE): take the minimum of (A,B) = 2 and (A,E) = 4. This is called complete linkage.

35
Example

Highest link is A-D. So let’s cluster A and D. We now have the following structure:

[Tree: AD above A and D; BE above B and E; C remains a singleton]

36
Example

Update the doc-doc matrix:

AD BE C

AD - 2 4

BE 2 - 8

C 4 8 -

37
Example

• Highest link is BE-C. So let’s cluster BE and C. We now have the following structure:

[Tree: BCE above BE and C; AD above A and D]

38
Example

• At this point, only two clusters (AD and BCE) remain, so we cluster them. We now have the following structure, and everything has been clustered:

[Tree: ABCDE at the root, above AD and BCE]

39
Time complexity analysis

Hierarchical agglomerative clustering (HAC) requires:
• O(n²) time to compute the doc-doc similarity matrix
• One node is added during each round of clustering, so there are O(n) clustering steps
• For each clustering step we must update the doc-doc matrix; this requires O(n) time
• So we have n² + (n)(n) = O(n²) – it’s expensive!
• For 500,000 documents, n² is 250,000,000,000!!
40
One pass clustering

• Choose a document and declare it to be in a cluster of size 1.
• Now compute the distance from this cluster to all the remaining nodes.
• Add the “closest” node to the cluster. If no node is really close (within some threshold), start a new cluster between the two closest nodes.

41
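A sketch of one common reading of this one-pass scheme: assign each point, in the order given, to the nearest existing cluster if its centroid is within a threshold, otherwise start a new cluster. The threshold and example points are illustrative.

import math

def one_pass(points, threshold):
    """Single-pass clustering over points processed in the order given."""
    clusters = []          # list of lists of points
    centroids = []         # running centroid for each cluster
    for p in points:
        if centroids:
            i = min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            if math.dist(p, centroids[i]) <= threshold:
                clusters[i].append(p)
                members = clusters[i]
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
                continue
        clusters.append([p])   # nothing close enough: start a new cluster
        centroids.append(p)
    return clusters

points = [(0, 0), (1, 0), (10, 10), (11, 10), (0, 1)]
print(one_pass(points, threshold=3.0))
# [[(0, 0), (1, 0), (0, 1)], [(10, 10), (11, 10)]]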
Example

• Consider the following nodes

[Scatter of points A, B, C, D in the plane; point E appears in the later slides]

42
Example

• Choose node A as the first cluster.
• Now compute the distance between A and the others. B is the closest, so cluster A and B.
• Compute the centroid of the cluster just formed.

[Sketch: A and B grouped into cluster AB with its centroid; C and D still unclustered]

43
Example

• Compute the distance between A-B and all the remaining clusters, using the centroid of A-B.
• Let’s assume all the others are too far from AB. Choose one of these non-clustered elements and place it in a cluster. Let’s choose E.

[Sketch: cluster AB with its centroid; C, D, E still unclustered]

44
Example

• Compute the distance from E to D and from E to C.
• E to D is closer, so we form a cluster of E and D.

[Sketch: clusters AB and DE, each with its centroid; C still unclustered]

45
Example

• Compute the distance from D-E to C.
• It is within the threshold, so include C in this cluster. Everything has been clustered.

[Sketch: final clusters AB and CDE]

46
Time complexity analysis

One pass requires:

• n passes, since one node is added per pass
• The first pass requires n−1 comparisons
• The second pass requires n−2 comparisons
• The last pass needs 1
• So we have 1 + 2 + 3 + … + (n−1) = n(n−1)/2
• (n² − n)/2 = O(n²)
• The constant is lower for one pass, but we are still at n².

47
Remember k-means clustering

• Pick k points as the seeds of k clusters.
• At the outset, there are k clusters of size one.
• do until all nodes are clustered
 Pick a point and put it into the cluster whose centroid is
closest.
 Recompute the centroid of the modified cluster.

48
Time complexity analysis

K-means requires:
• Each node gets added to a cluster, so there
are n clustering steps
• For each addition, we need to compare to k
centroids
• We also need to recompute the centroid after adding the new node; this takes a constant amount of time (say c)
• The total time needed is (k + c) n = O(n)
• So it is a linear algorithm!

49
But there are problems…

• k needs to be known in advance, or trials are needed to determine it
• Tends to converge to local minima that are sensitive to the starting centroids:

A B C

D E F

If the seeds are B and E, the resulting clusters are {A,B,C} and {D,E,F}.
If the seeds are D and F, the resulting clusters are {A,B,D,E} and {C,F}.
50
Two questions for you

1. Why did the computer go to the restaurant?


2. What do you do when you have a slow
algorithm that produces quality results, and a
fast algorithm that cannot guarantee quality?

1. To get a byte.

2. Many things…
One option is to use the slow algorithm on a
portion of the problem to obtain a better
starting point for the fast algorithm.

51
Buckshot clustering

• The goal is to reduce the run time by combining HAC and k-means clustering.
• Select d documents, where d = √n.
• Cluster these d documents using HAC; this takes O(n) time.
• Use the results of HAC as initial seeds for k-means.
• In other words, it uses HAC to bootstrap k-means.
• The overall algorithm is O(n) and avoids the problems of bad seed selection.

52
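A buckshot sketch combining the two pieces: centroid-linkage HAC on a √n sample to obtain seeds, then ordinary k-means over all points. The helper names, the centroid-linkage choice, and the grid of demo points are illustrative assumptions, not the slides' exact procedure.

import math
import random

def centroid(points):
    return tuple(sum(c) / len(points) for c in zip(*points))

def hac_seeds(sample, k):
    """Agglomerate the sample (centroid linkage) until k clusters remain; return their centroids."""
    clusters = [[p] for p in sample]
    while len(clusters) > k:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: math.dist(centroid(clusters[ij[0]]),
                                            centroid(clusters[ij[1]])))
        clusters[i] += clusters[j]     # merge the closest pair
        del clusters[j]
    return [centroid(c) for c in clusters]

def buckshot(points, k, iterations=10, seed=0):
    """Buckshot: HAC on a sqrt(n)-sized sample to get seeds, then k-means on everything."""
    random.seed(seed)
    sample = random.sample(points, max(k, math.isqrt(len(points))))
    centroids = hac_seeds(sample, k)
    for _ in range(iterations):        # standard k-means iterations
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: math.dist(p, centroids[i]))].append(p)
        centroids = [centroid(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids, clusters

points = [(i % 10, i // 10) for i in range(100)]   # a 10 x 10 grid of demo points
cents, _ = buckshot(points, k=4)
print(cents)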
Getting the k clusters

Cut where you have k clusters

ABCDE

AD BCE

BE

A D C B E

53
Effect of document order

• With hierarchical clustering, we get the same clusters every time.
• With one-pass clustering, we get different clusters depending on the order in which we process the documents.
• With k-means clustering, we get different clusters depending on the selected seeds.

54
Computing the distance (time)

• In our time complexity analysis, we finessed the time required to compute the distance between two nodes.
• Sometimes this is an expensive task, depending on the analysis required.

55
Computing the distance (methods)

• To compute the intra-cluster distance, (sum/min/max/avg) the (absolute/squared) distance between
 all pairs of points in the cluster, or
 the centroid and all points in the cluster

• To compute the inter-cluster distance for HAC:
 Single-link: distance between closest neighbors
 Complete-link: distance between farthest neighbors
 Group-average: average distance between all pairs of neighbors
 Centroid-distance: distance between centroids (most commonly used)

56
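The four inter-cluster distances for HAC, written as small Python functions over lists of points; the example clusters are made up.

import math

def single_link(c1, c2):
    """Distance between the closest pair of points across the two clusters."""
    return min(math.dist(a, b) for a in c1 for b in c2)

def complete_link(c1, c2):
    """Distance between the farthest pair of points across the two clusters."""
    return max(math.dist(a, b) for a in c1 for b in c2)

def group_average(c1, c2):
    """Average distance over all cross-cluster pairs."""
    return sum(math.dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

def centroid_distance(c1, c2):
    """Distance between the two cluster centroids."""
    cent = lambda c: tuple(sum(x) / len(c) for x in zip(*c))
    return math.dist(cent(c1), cent(c2))

c1 = [(0, 0), (1, 0)]
c2 = [(4, 0), (6, 0)]
print(single_link(c1, c2), complete_link(c1, c2),
      group_average(c1, c2), centroid_distance(c1, c2))  # 3.0 6.0 4.5 4.5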
More on document clustering
• Applications
 Structuring search results
 Suggesting related pages
 Automatic directory construction / update
 Finding near identical pages
 Finding mirror pages (e.g., for propagating updates)
 Eliminate near-duplicates from results page
 Plagiarism detection
 Lost and found (find identical pages at different URLs at different times)
• Problems
 Polysemy, e.g., “bat,” “Washington,” “Banks”
 Multiple aspects of a single topic
 Ultimately amounts to general problem of information
structuring
57
Clustering vs. classification

• Clustering is when the clusters are not known
• If the system of clusters is known, and the problem is to place a new item into the proper cluster, this is classification

58
How many possible clusterings?

If we have n points and would like to cluster them into k clusters, then there are k clusters the first point can go to, and k clusters for each of the remaining points. So the total number of possible clusterings is k^n.
Brute-force enumeration will not work. That is why we have iterative optimization algorithms that start with a clustering and iteratively improve it.
Finally, note that noise (outliers) is a problem for clustering too. One can use statistical techniques to identify outliers.

59
Cluster structure

• Hierarchical vs flat
• Overlap
 Disjoint partitioning, e.g., partition congressmen by state
 Multiple dimensions of partitioning, each disjoint, e.g.,
partition congressmen by state; by party; by
House/Senate
 Arbitrary overlap, e.g., partition bills by congressmen
who voted for them

• Exhaustive vs. non-exhaustive


• Outliers: what to do?
• How many clusters? How large?

60
Measuring the quality of the clusters

A good clustering is one where
• (intra-cluster distance) the sum of distances between objects in the same cluster is minimized,
• (inter-cluster distance) while the distances between different clusters are maximized.

The objective is to minimize: F(intra, inter)

61
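The slides leave the exact form of F(intra, inter) open. As one illustrative choice, here is a sketch that scores a clustering by the ratio of mean intra-cluster distance to mean inter-centroid distance, so that lower scores mean better clusterings; the scoring formula and example points are assumptions.

import math

def centroid(c):
    return tuple(sum(x) / len(c) for x in zip(*c))

def quality(clusters):
    """One illustrative F(intra, inter): mean point-to-centroid distance
    divided by mean distance between cluster centroids (lower is better)."""
    cents = [centroid(c) for c in clusters]
    intra = [math.dist(p, cents[i]) for i, c in enumerate(clusters) for p in c]
    inter = [math.dist(cents[i], cents[j])
             for i in range(len(cents)) for j in range(i + 1, len(cents))]
    return (sum(intra) / len(intra)) / (sum(inter) / len(inter))

good = [[(0, 0), (0, 1)], [(10, 10), (10, 11)]]   # tight, well-separated clusters
bad  = [[(0, 0), (10, 10)], [(0, 1), (10, 11)]]   # spread-out, overlapping clusters
print(quality(good) < quality(bad))   # True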
Related communities

• data mining (in databases, over the web)


• statistics
• clustering algorithms
• visualization
• databases

62
