ISTE-612 Knowledge Processing Technologies: Week 12 Text Clustering
Introduction to Information Retrieval
Clustering algorithms
- Partitional
- Hierarchical
WHAT IS CLUSTERING? (Ch. 16)
What is clustering?
- Clustering: the process of grouping a set of objects into classes of similar objects
- Documents within a cluster should be similar.
- Documents from different clusters should be dissimilar.
Applications of clustering in IR (Sec. 16.1)
- Whole corpus analysis/navigation
  - Explore data
  - Better user interface: search without typing
CLUSTERING ISSUES
Notion of similarity/distance (Sec. 16.2)
- Ideal: semantic similarity.
- Practical: term-statistical similarity (docs as vectors)
  - Cosine similarity
- For many algorithms, it is easier to think in terms of a distance (rather than a similarity) between docs.
- But real implementations use cosine similarity.
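As a concrete illustration, here is a minimal numpy sketch of cosine similarity between two toy term-weight vectors (the vectors and values are hypothetical, not from the slides):

    import numpy as np

    def cosine_similarity(x: np.ndarray, y: np.ndarray) -> float:
        """Cosine of the angle between document vectors x and y."""
        return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

    doc1 = np.array([2.0, 1.0, 0.0, 3.0])   # toy term weights
    doc2 = np.array([1.0, 1.0, 1.0, 2.0])
    print(cosine_similarity(doc1, doc2))     # ~0.91: fairly similar

    # When an algorithm wants a distance instead of a similarity:
    print(1.0 - cosine_similarity(doc1, doc2))   # cosine distance, ~0.09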
Clustering Algorithms
- Flat algorithms
  - Usually start with a random partitioning
  - Refine it iteratively
  - K-means clustering
- Hierarchical algorithms
  - Bottom-up, agglomerative
K-Means (Sec. 16.4)
- Assumes documents are real-valued vectors.
- Clusters based on centroids (aka the center of gravity or mean) of the points in a cluster c:

  \vec{\mu}(c) = \frac{1}{|c|} \sum_{\vec{x} \in c} \vec{x}

- Reassignment of instances to clusters is based on distance to the current cluster centroids. (Or one can equivalently phrase it in terms of similarities.)
K-Means Algorithm
Select K random docs {s1, s2, …, sK} as seeds.
Until clustering converges (or other stopping criterion):
    For each doc di:
        Assign di to the cluster cj such that dist(xi, sj) is minimal.
    (Next, update the seeds to the centroid of each cluster)
    For each cluster cj:
        Compute the new centroid.
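A compact numpy sketch of the algorithm above, under the slides' assumptions (real-valued doc vectors, Euclidean distance, stopping when the doc partition is unchanged). Names like kmeans and max_iter are illustrative, not from the slides:

    import numpy as np

    def kmeans(X: np.ndarray, k: int, max_iter: int = 100, seed=None):
        """Select k random docs as seeds, then alternate assignment
        and centroid-update steps until the partition stops changing."""
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
        labels = None
        for _ in range(max_iter):
            # Assignment step: each doc goes to its nearest centroid.
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            new_labels = dists.argmin(axis=1)
            if labels is not None and np.array_equal(new_labels, labels):
                break                       # doc partition unchanged: converged
            labels = new_labels
            # Update step: move each centroid to the mean of its cluster.
            for j in range(k):
                members = X[labels == j]
                if len(members) > 0:        # guard against empty clusters
                    centroids[j] = members.mean(axis=0)
        return labels, centroids

    X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])
    labels, centroids = kmeans(X, k=2, seed=0)
    print(labels)      # e.g. [0 0 1 1] (cluster ids are arbitrary)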
K-Means Example (K=2)
[Figure: pick seeds → reassign clusters → compute centroids → reassign clusters → compute centroids → reassign clusters → converged!]
Termination conditions
Several possibilities, e.g.,
- A fixed number of iterations.
- Doc partition unchanged.
- Centroid positions don't change.
Does this mean that the docs in a cluster are unchanged?
Seed Choice
- Results can vary based on random seed selection.
- Some seeds can result in a poor convergence rate, or convergence to sub-optimal clusterings.
  - Select good seeds using a heuristic (e.g., the doc least similar to any existing mean)
  - Try out multiple starting points (see the sketch below)
  - Initialize with the results of another method.
[Figure: example showing sensitivity to seeds]
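Assuming scikit-learn is available, its KMeans already implements two of these remedies: n_init tries multiple starting points and keeps the best run, and the default "k-means++" init spreads seeds apart, in the same spirit as picking the doc least similar to any existing mean. The toy data is hypothetical:

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [8.2, 7.9]])
    km = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0)
    labels = km.fit_predict(X)
    print(labels)        # e.g. [1 1 0 0]
    print(km.inertia_)   # within-cluster sum of squares of the best run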
HIERARCHICAL CLUSTERING (Ch. 17)
Hierarchical Clustering (Sec. 17.1)
Build a tree-based hierarchical taxonomy (dendrogram) from a set of documents, e.g.:

    animal
        vertebrate: fish, reptile, amphib., mammal
        invertebrate: worm, insect, crustacean
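Assuming scipy and matplotlib are available, a minimal sketch of building such a dendrogram bottom-up (agglomeratively) from toy document vectors:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import dendrogram, linkage

    docs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
    Z = linkage(docs, method="average", metric="cosine")  # sequence of merges
    dendrogram(Z, labels=["d1", "d2", "d3", "d4"])
    plt.show()   # tree of merges, leaves = documents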
Closest pair of clusters (Sec. 17.2)
Many variants to defining the closest pair of clusters:
- Single-link: similarity of the closest points, the most cosine-similar
- Complete-link: similarity of the furthest points, the least cosine-similar
- Centroid: clusters whose centroids (centers of gravity) are the most cosine-similar
- Average-link: average cosine between all pairs of elements
Complete Link
Use minimum similarity of pairs:

\text{sim}(c_i, c_j) = \min_{\vec{x} \in c_i,\, \vec{y} \in c_j} \text{sim}(\vec{x}, \vec{y})

[Figure: two clusters Cj and Ck, joined via their least-similar pair of points]
Group Average (Sec. 17.3)
Similarity of two clusters = average similarity of all pairs within the merged cluster:

\text{sim}(c_i, c_j) = \frac{1}{|c_i \cup c_j|\,(|c_i \cup c_j| - 1)} \sum_{\vec{x} \in c_i \cup c_j} \; \sum_{\vec{y} \in c_i \cup c_j,\; \vec{y} \neq \vec{x}} \text{sim}(\vec{x}, \vec{y})

- Compromise between single and complete link.
- Two options (contrasted in the sketch below):
  - Averaged across all pairs in the merged cluster
  - Averaged over all pairs between the two original clusters
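A small sketch contrasting the two averaging options on hypothetical toy vectors, using cosine similarity as in the slides:

    import numpy as np
    from itertools import combinations, product

    def cos(x, y):
        return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

    ci = [np.array([1.0, 0.0]), np.array([0.9, 0.3])]
    cj = [np.array([0.0, 1.0]), np.array([0.2, 0.8])]

    # Option 1: average over all pairs in the merged cluster
    # (equivalent to the formula above, which sums ordered pairs).
    merged = ci + cj
    pairs = list(combinations(merged, 2))
    opt1 = sum(cos(x, y) for x, y in pairs) / len(pairs)

    # Option 2: average only over pairs between the two original clusters.
    opt2 = sum(cos(x, y) for x, y in product(ci, cj)) / (len(ci) * len(cj))

    print(opt1, opt2)   # option 1 also counts within-cluster pairs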
Purity example (Sec. 16.3)
Purity: assign each cluster to its most frequent gold-standard class; purity is the fraction of documents assigned correctly:

\text{purity}(\Omega, C) = \frac{1}{N} \sum_k \max_j |\omega_k \cap c_j|

where \Omega = \{\omega_1, \ldots, \omega_K\} are the clusters and C = \{c_1, \ldots, c_J\} the classes.