ISTE-612 Knowledge Processing Technologies: Week 12 Text Clustering
Introduction to Information Retrieval
Clustering algorithms
- Partitional
- Hierarchical
WHAT IS CLUSTERING? (Ch. 16)
What is clustering?
- Clustering: the process of grouping a set of objects into classes of similar objects
- Documents within a cluster should be similar.
- Documents from different clusters should be dissimilar.
Applications of clustering in IR (Sec. 16.1)
- Whole corpus analysis/navigation
  - Explore data
  - Better user interface: search without typing
CLUSTERING ISSUES
Notion of similarity/distance (Sec. 16.2)
- Ideal: semantic similarity.
- Practical: term-statistical similarity (docs as vectors)
  - Cosine similarity
- For many algorithms, it is easier to think in terms of a distance (rather than a similarity) between docs.
- But real implementations use cosine similarity.
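As a concrete illustration, here is a minimal numpy sketch of cosine similarity between two toy term-weight vectors (the vectors and values are hypothetical, not from the slides):

    import numpy as np

    def cosine_similarity(x: np.ndarray, y: np.ndarray) -> float:
        """Cosine of the angle between document vectors x and y."""
        return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

    doc1 = np.array([2.0, 1.0, 0.0, 3.0])   # toy term weights
    doc2 = np.array([1.0, 1.0, 1.0, 2.0])
    print(cosine_similarity(doc1, doc2))     # ~0.91: fairly similar

    # When an algorithm wants a distance instead of a similarity:
    print(1.0 - cosine_similarity(doc1, doc2))   # cosine distance, ~0.09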
Clustering Algorithms
- Flat algorithms
  - Usually start with a random partitioning
  - Refine it iteratively
  - K-means clustering
- Hierarchical algorithms
  - Bottom-up, agglomerative
K-Means (Sec. 16.4)
- Assumes documents are real-valued vectors.
- Clusters based on centroids (aka the center of gravity or mean) of the points in a cluster c:

  \vec{\mu}(c) = \frac{1}{|c|} \sum_{\vec{x} \in c} \vec{x}

- Reassignment of instances to clusters is based on distance to the current cluster centroids. (Or one can equivalently phrase it in terms of similarities.)
K-Means Algorithm
Select K random docs {s1, s2, …, sK} as seeds.
Until clustering converges (or other stopping criterion):
    For each doc di:
        Assign di to the cluster cj such that dist(xi, sj) is minimal.
    (Next, update the seeds to the centroid of each cluster)
    For each cluster cj:
        Compute the new centroid.
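A compact numpy sketch of the algorithm above, under the slides' assumptions (real-valued doc vectors, Euclidean distance, stopping when the doc partition is unchanged). Names like kmeans and max_iter are illustrative, not from the slides:

    import numpy as np

    def kmeans(X: np.ndarray, k: int, max_iter: int = 100, seed=None):
        """Select k random docs as seeds, then alternate assignment
        and centroid-update steps until the partition stops changing."""
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
        labels = None
        for _ in range(max_iter):
            # Assignment step: each doc goes to its nearest centroid.
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            new_labels = dists.argmin(axis=1)
            if labels is not None and np.array_equal(new_labels, labels):
                break                       # doc partition unchanged: converged
            labels = new_labels
            # Update step: move each centroid to the mean of its cluster.
            for j in range(k):
                members = X[labels == j]
                if len(members) > 0:        # guard against empty clusters
                    centroids[j] = members.mean(axis=0)
        return labels, centroids

    X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])
    labels, centroids = kmeans(X, k=2, seed=0)
    print(labels)      # e.g. [0 0 1 1] (cluster ids are arbitrary)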
K-Means Example (K=2)
[Figure: pick seeds → reassign clusters → compute centroids → reassign clusters → compute centroids → reassign clusters → converged!]
Termination conditions
Several possibilities, e.g.,
- A fixed number of iterations.
- Doc partition unchanged.
- Centroid positions don't change.
Does this mean that the docs in a cluster are unchanged?
Seed Choice
- Results can vary based on random seed selection.
- Some seeds can result in a poor convergence rate, or convergence to sub-optimal clusterings.
  - Select good seeds using a heuristic (e.g., the doc least similar to any existing mean)
  - Try out multiple starting points (see the sketch below)
  - Initialize with the results of another method.
[Figure: example showing sensitivity to seeds]
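Assuming scikit-learn is available, its KMeans already implements two of these remedies: n_init tries multiple starting points and keeps the best run, and the default "k-means++" init spreads seeds apart, in the same spirit as picking the doc least similar to any existing mean. The toy data is hypothetical:

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [8.2, 7.9]])
    km = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0)
    labels = km.fit_predict(X)
    print(labels)        # e.g. [1 1 0 0]
    print(km.inertia_)   # within-cluster sum of squares of the best run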
HIERARCHICAL CLUSTERING (Ch. 17)
Hierarchical Clustering (Sec. 17.1)
Build a tree-based hierarchical taxonomy (dendrogram) from a set of documents, e.g.:

    animal
        vertebrate: fish, reptile, amphib., mammal
        invertebrate: worm, insect, crustacean
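Assuming scipy and matplotlib are available, a minimal sketch of building such a dendrogram bottom-up (agglomeratively) from toy document vectors:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import dendrogram, linkage

    docs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
    Z = linkage(docs, method="average", metric="cosine")  # sequence of merges
    dendrogram(Z, labels=["d1", "d2", "d3", "d4"])
    plt.show()   # tree of merges, leaves = documents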
Closest pair of clusters (Sec. 17.2)
Many variants to defining the closest pair of clusters:
- Single-link: similarity of the closest points, the most cosine-similar
- Complete-link: similarity of the furthest points, the least cosine-similar
- Centroid: clusters whose centroids (centers of gravity) are the most cosine-similar
- Average-link: average cosine between all pairs of elements
Complete Link
Use minimum similarity of pairs:

\text{sim}(c_i, c_j) = \min_{\vec{x} \in c_i,\, \vec{y} \in c_j} \text{sim}(\vec{x}, \vec{y})

[Figure: two clusters Cj and Ck, joined via their least-similar pair of points]
Group Average (Sec. 17.3)
Similarity of two clusters = average similarity of all pairs within the merged cluster:

\text{sim}(c_i, c_j) = \frac{1}{|c_i \cup c_j|\,(|c_i \cup c_j| - 1)} \sum_{\vec{x} \in c_i \cup c_j} \; \sum_{\vec{y} \in c_i \cup c_j,\; \vec{y} \neq \vec{x}} \text{sim}(\vec{x}, \vec{y})

- Compromise between single and complete link.
- Two options (contrasted in the sketch below):
  - Averaged across all pairs in the merged cluster
  - Averaged over all pairs between the two original clusters
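A small sketch contrasting the two averaging options on hypothetical toy vectors, using cosine similarity as in the slides:

    import numpy as np
    from itertools import combinations, product

    def cos(x, y):
        return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

    ci = [np.array([1.0, 0.0]), np.array([0.9, 0.3])]
    cj = [np.array([0.0, 1.0]), np.array([0.2, 0.8])]

    # Option 1: average over all pairs in the merged cluster
    # (equivalent to the formula above, which sums ordered pairs).
    merged = ci + cj
    pairs = list(combinations(merged, 2))
    opt1 = sum(cos(x, y) for x, y in pairs) / len(pairs)

    # Option 2: average only over pairs between the two original clusters.
    opt2 = sum(cos(x, y) for x, y in product(ci, cj)) / (len(ci) * len(cj))

    print(opt1, opt2)   # option 1 also counts within-cluster pairs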
Purity example (Sec. 16.3)
Purity: assign each cluster to its most frequent gold-standard class; purity is the fraction of documents assigned correctly:

\text{purity}(\Omega, C) = \frac{1}{N} \sum_k \max_j |\omega_k \cap c_j|

where \Omega = \{\omega_1, \ldots, \omega_K\} are the clusters and C = \{c_1, \ldots, c_J\} the classes.