Unsupervised Algorithms Unit3
• Input data: a set of documents to classify; no class labels are
provided
• Task of the classifier: separate the documents into subsets (clusters)
automatically; this separating procedure is called clustering
Applications of Clustering
• IR: presentation of results (clustering of documents)
• Summarisation:
1. clustering of similar documents for multi-document summarisation
2. clustering of similar sentences for re-generation of sentences
• Topic Segmentation: clustering of similar paragraphs (adjacent or non-
adjacent) for detection of topic structure/importance
• Lexical semantics: clustering of words by cooccurrence patterns
Example
• Class labels can be generated automatically, but they usually differ
from the labels specified by humans.
• Thus, solving the whole classification problem with no human
intervention is hard. If class labels are provided, clustering is more
effective.
The Cluster Hypothesis
• “Similar documents tend to be relevant to the same requests”
• Issues:
1. Variants: “Documents that are relevant to the same topics are
similar”
2. Simple vs. complex topics
3. Evaluation, prediction
• The cluster hypothesis is the main motivation behind document
clustering
Similarity Coefficients
1. Simple matching:
2. Dice’s Coefficient:
3. Cosine Coefficient:
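The formulas for these coefficients are not shown above. As a minimal sketch, assuming the common set-based definitions (each document is a set of index terms X and Y; simple matching is |X ∩ Y|, Dice is 2|X ∩ Y| / (|X| + |Y|), cosine is |X ∩ Y| / sqrt(|X| · |Y|)), they could be computed as follows. The function names and the sample documents are illustrative only.

```python
import math


def simple_matching(x: set, y: set) -> int:
    """Number of terms the two documents share (assumed definition: |X ∩ Y|)."""
    return len(x & y)


def dice(x: set, y: set) -> float:
    """Dice's coefficient: shared terms, normalised by the total set sizes."""
    if not x and not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def cosine(x: set, y: set) -> float:
    """Cosine coefficient for binary (set) document representations."""
    if not x or not y:
        return 0.0
    return len(x & y) / math.sqrt(len(x) * len(y))


# Illustrative documents, represented as sets of index terms.
doc_a = {"cluster", "document", "retrieval"}
doc_b = {"cluster", "document", "summary", "sentence"}

print(simple_matching(doc_a, doc_b))
print(round(dice(doc_a, doc_b), 3))
print(round(cosine(doc_a, doc_b), 3))
```

Unlike simple matching, Dice and cosine are normalised by document size, so a long document does not score highly merely because it contains many terms.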
Document-document similarity
• Document representative
> Select features to characterize the document: terms, phrases, citations
> Select a weighting scheme for these features:
a) Binary, raw/relative frequency, divergence measure
b) Title / body / abstract, controlled vocabulary, selected topics,
taxonomy
• Similarity / association coefficient or dissimilarity/ distance metric
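The two choices above (features, then weights) can be illustrated with a small sketch. It assumes whitespace-separated terms as features and raw term frequency as the weighting scheme, and compares the resulting document representatives with the cosine measure. The helper names `term_frequencies` and `weighted_cosine` are hypothetical.

```python
import math
from collections import Counter


def term_frequencies(text: str) -> Counter:
    """Document representative: lower-cased terms weighted by raw frequency."""
    return Counter(text.lower().split())


def weighted_cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two weighted term vectors."""
    dot = sum(weight * v[term] for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


d1 = term_frequencies("clustering groups similar documents")
d2 = term_frequencies("similar documents form one cluster")
d3 = term_frequencies("kohonen maps")

print(weighted_cosine(d1, d2))  # shares "similar" and "documents"
print(weighted_cosine(d1, d3))  # no shared terms
```

A dissimilarity can be derived from such a coefficient, for example as 1 minus the similarity, when a clustering method expects distances.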
Clustering methods
• Non-hierarchic methods
• => partitions
> High efficiency, low effectiveness
• Hierarchic methods
• => hierarchic structures - small clusters of highly similar documents
nested within larger clusters of less similar documents
• Divisive => monothetic classifications
• Agglomerative => polythetic classifications
Partitioning method
• Generic procedure:
• The first object becomes the first cluster
• Each subsequent object is matched against existing clusters
1. It is assigned to the most similar cluster if the similarity measure is
above a set threshold
2. Otherwise it forms a new cluster
• Re-shuffling of documents into clusters can be done iteratively to
increase cluster similarity
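The generic procedure above can be sketched as a single pass over the objects. This is a minimal illustration under stated assumptions: objects are numeric vectors, similarity to a cluster is the cosine between the object and the cluster centroid, and the centroid is a running mean. None of these choices is fixed by the procedure itself, and the iterative re-shuffling step is not included.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def single_pass(vectors, threshold):
    """Assign each vector to its most similar cluster, or start a new one."""
    clusters = []  # each cluster: {"members": [indices], "centroid": [floats]}
    for i, vec in enumerate(vectors):
        best, best_sim = None, -1.0
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim > threshold:
            # Similar enough: join the cluster and update its running mean.
            best["members"].append(i)
            n = len(best["members"])
            best["centroid"] = [
                (m * (n - 1) + x) / n for m, x in zip(best["centroid"], vec)
            ]
        else:
            # The first object, or no cluster above the threshold.
            clusters.append({"members": [i], "centroid": list(vec)})
    return clusters


vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
result = single_pass(vectors, threshold=0.8)
print([c["members"] for c in result])
```

The outcome depends on both the threshold and the order in which objects arrive, which is one reason the re-shuffling step is useful.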
Representation of clustered hierarchies
Kohonen feature maps on text
• Clustering is used in information retrieval systems to enhance the
efficiency and effectiveness of the retrieval process.
• Clustering is achieved by partitioning the documents in a collection
into classes such that documents that are associated with each other
are assigned to the same cluster.
Types of Clustering
Desiderata for clustering
Non-hierarchical (partitioning) clustering
• Partitional clustering algorithms produce a set of k non-nested
partitions corresponding to k clusters of n objects.
• Advantage: it is not necessary to compare each object with every other
object; only comparisons between objects and cluster centroids are needed
• Optimal partitioning clustering algorithms are O(kn)
• Main algorithm: K-means
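To make the cost argument concrete, the sketch below performs one assignment pass of n objects to k centroids and counts the comparisons, which come to n × k rather than the roughly n² needed for all object pairs. Squared Euclidean distance and the sample points are assumptions made for illustration; this is only the assignment step, not a complete clustering algorithm.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign_to_centroids(objects, centroids):
    """Return the nearest-centroid index per object and the comparison count."""
    assignments = []
    comparisons = 0
    for obj in objects:
        best_index, best_dist = 0, float("inf")
        for index, centroid in enumerate(centroids):
            comparisons += 1  # one object-centroid comparison
            dist = squared_distance(obj, centroid)
            if dist < best_dist:
                best_index, best_dist = index, dist
        assignments.append(best_index)
    return assignments, comparisons


objects = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [9.0, 10.0]]
centroids = [[0.0, 0.5], [9.5, 10.0]]
assignments, comparisons = assign_to_centroids(objects, centroids)
print(assignments, comparisons)
```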
K-means Clustering