Module 5 Document Clustering
Outline
1 Introduction
The Problem and the Motivation
Approach
2 Methodology
Document Representation
Similarity Measures
Clustering Algorithms
Evaluation
3 Related Work
Past Results
References
4 The End
Introduction Methodology Related Work The End
Approach
Document Representation
Document 2
I can’t wait for this to get over.
Document Representation
Document 2
[1,0,0,0,0,0,1,1,1,1,1,1,1]
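A minimal sketch of how such a binary term vector can be built. Document 2 is the example above; Document 1 is hypothetical (its text is not available here) and was chosen only so that the vocabulary has 13 terms. The helper names `tokenize`, `build_vocabulary` and `binary_vector` are illustrative assumptions.

```python
def tokenize(text):
    """Lowercase, drop periods, and split on whitespace."""
    return text.lower().replace(".", "").split()


def build_vocabulary(documents):
    """Collect distinct terms in order of first appearance."""
    vocabulary = []
    for doc in documents:
        for term in tokenize(doc):
            if term not in vocabulary:
                vocabulary.append(term)
    return vocabulary


def binary_vector(text, vocabulary):
    """Return 1 for each vocabulary term present in the text, else 0."""
    terms = set(tokenize(text))
    return [1 if term in terms else 0 for term in vocabulary]


# Hypothetical Document 1 (an assumption, not taken from the slides).
doc1 = "I love studying machine learning algorithms."
# Document 2 as given on the slide.
doc2 = "I can't wait for this to get over."

vocabulary = build_vocabulary([doc1, doc2])
vector2 = binary_vector(doc2, vocabulary)
print(vector2)
```

Each position of the vector corresponds to one vocabulary term, so the vector length equals the vocabulary size regardless of document length.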
Document Representation
Pre-processing
Document Representation
TF-IDF
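A minimal sketch of TF-IDF weighting, assuming one common variant: tfidf(d, t) = tf(d, t) * log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t. Other variants (smoothing, length normalization) exist; the function name and the toy documents are illustrative assumptions.

```python
import math


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document.

    Assumed weighting: tf(d, t) * log(N / df(t)).
    """
    n_docs = len(documents)

    # Document frequency: number of documents containing each term.
    df = {}
    for doc in documents:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1

    weights = []
    for doc in documents:
        doc_weights = {}
        for term in set(doc):
            tf = doc.count(term)  # raw term frequency in this document
            doc_weights[term] = tf * math.log(n_docs / df[term])
        weights.append(doc_weights)
    return weights


# Hypothetical, already tokenized documents.
docs = [["apple", "banana", "apple"], ["banana", "cherry"]]
weights = tf_idf(docs)
```

Under this variant a term that occurs in every document gets weight 0, which is the intended effect: such terms do not help to tell documents apart.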
Similarity Measures
Metric
Similarity Measures
Euclidean Distance
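A minimal sketch of the Euclidean distance between two term-weight vectors, that is, the square root of the summed squared coordinate differences. The function name and the length check are illustrative assumptions.

```python
import math


def euclidean_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

For example, `euclidean_distance([0, 0], [3, 4])` evaluates to `5.0`.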
Similarity Measures
Cosine Similarity
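A minimal sketch of cosine similarity, the dot product of two vectors divided by the product of their lengths. Returning 0.0 when either vector is all zeros is an assumption made here to avoid division by zero; the function name is illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention for empty documents
    return dot / (norm_a * norm_b)
```

Because the result depends only on direction, a document and a copy of it repeated twice have similarity 1, which makes the measure insensitive to document length.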
Similarity Measures
Jaccard Coefficient
$$\mathrm{SIM}_J(\vec{t_a}, \vec{t_b}) = \frac{\vec{t_a} \cdot \vec{t_b}}{|\vec{t_a}|^2 + |\vec{t_b}|^2 - \vec{t_a} \cdot \vec{t_b}}$$
where $\vec{t_a}$ and $\vec{t_b}$ are m-dimensional vectors over the term set T.
The coefficient is non-negative and bounded in [0, 1].
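The Jaccard coefficient above can be sketched as follows for real-valued term vectors. Returning 1.0 when both vectors are all zeros (two identical empty documents) is an assumption made here to avoid division by zero; the function name is illustrative.

```python
def jaccard_coefficient(a, b):
    """Extended Jaccard coefficient of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    denominator = sum(x * x for x in a) + sum(y * y for y in b) - dot
    if denominator == 0:
        return 1.0  # assumed convention: two empty documents are identical
    return dot / denominator
```

Identical vectors give 1 and vectors with no shared terms give 0, matching the stated bounds.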
Similarity Measures
Manhattan Distance
$$D_M(\vec{t_a}, \vec{t_b}) = \sum_{t=1}^{m} |w_{t,a} - w_{t,b}|$$
where $\vec{t_a}$ and $\vec{t_b}$ are m-dimensional vectors over the term set T and $w_{t,a} = \mathrm{tfidf}(d_a, t)$.
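A minimal sketch of the Manhattan distance, the sum of absolute differences between corresponding term weights. The function name is an illustrative assumption.

```python
def manhattan_distance(a, b):
    """Manhattan (L1) distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(abs(x - y) for x, y in zip(a, b))
```

For example, `manhattan_distance([1, 2, 3], [4, 0, 3])` evaluates to `5`.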
Similarity Measures
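Chebyshev Distance
$$D_C(\vec{t_a}, \vec{t_b}) = \max_{1 \le t \le m} |w_{t,a} - w_{t,b}|$$
where $\vec{t_a}$ and $\vec{t_b}$ are m-dimensional vectors over the term set T and $w_{t,a} = \mathrm{tfidf}(d_a, t)$.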
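A minimal sketch of this distance, which takes the largest absolute difference over all term weights. The function name is an illustrative assumption.

```python
def chebyshev_distance(a, b):
    """Chebyshev (L-infinity) distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return max(abs(x - y) for x, y in zip(a, b))
```

For example, `chebyshev_distance([1, 2, 3], [4, 0, 3])` evaluates to `3`, because the first coordinate differs the most.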
Clustering Algorithms
Hierarchical Algorithms
Clustering Algorithms
http://www.cs.utexas.edu/~mooney/cs391L/slides/clustering.ppt
Clustering Algorithms
k-means Algorithm
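A minimal sketch of the standard k-means iteration (assign each vector to its nearest centroid, then recompute each centroid as the mean of its members). It assumes squared Euclidean distance and plain random seeding from the input vectors; the k-means++ seeding listed in the references is not implemented. All function and parameter names are illustrative assumptions.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def k_means(vectors, k, max_iterations=100, seed=0):
    """Return (assignments, centroids) after the assignments stop changing."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)  # k distinct input vectors as seeds
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: index of the nearest centroid for every vector.
        new_assignments = [
            min(range(k), key=lambda j: squared_distance(v, centroids[j]))
            for v in vectors
        ]
        if new_assignments == assignments:
            break  # converged: no vector changed cluster
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = mean(members)
    return assignments, centroids


# Hypothetical 2-D points forming two well-separated groups.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
assignments, centroids = k_means(points, k=2)
```

For document clustering the vectors would be TF-IDF vectors, and the distance function could be swapped for one of the similarity measures listed earlier.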
Clustering Algorithms
k-means: In action
http://www.codeproject.com/Articles/439890/Text-Documents-Clustering-using-K-Means-Algorithm
Evaluation
Entropy
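$$E(C_i) = -\frac{1}{\log c} \sum_{h=1}^{c} \frac{n_i^h}{n_i} \log \frac{n_i^h}{n_i}$$
where c is the total number of categories in the data set, $n_i$ is the number of documents in cluster $C_i$, and $n_i^h$ is the number of documents from the h-th class that were assigned to cluster $C_i$.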
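A minimal sketch of the entropy of a single cluster, assuming the normalized form in which the Shannon entropy of the class distribution inside the cluster is divided by log c, so that the value lies in [0, 1]. The function and parameter names are illustrative assumptions, and `n_classes` must be at least 2.

```python
import math
from collections import Counter


def cluster_entropy(labels_in_cluster, n_classes):
    """Normalized entropy of the true class labels inside one cluster."""
    n_i = len(labels_in_cluster)
    counts = Counter(labels_in_cluster)  # n_i^h for every class h present
    entropy = -sum(
        (n_ih / n_i) * math.log(n_ih / n_i) for n_ih in counts.values()
    )
    return entropy / math.log(n_classes)
```

A cluster that contains documents of one class only has entropy 0; a cluster split evenly across all c classes has entropy 1. Lower values indicate better clusters.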
Evaluation
Purity
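A minimal sketch of overall purity, assuming the usual definition: each cluster is credited with the size of its largest class, and the credits are summed and divided by the total number of documents. The function name and input layout (one list of true class labels per cluster) are illustrative assumptions.

```python
from collections import Counter


def purity(clusters):
    """Overall purity of a clustering.

    clusters: list of lists, each holding the true class labels of the
    documents assigned to one cluster.
    """
    total = sum(len(cluster) for cluster in clusters)
    majority = sum(
        max(Counter(cluster).values()) for cluster in clusters if cluster
    )
    return majority / total
```

For example, `purity([["a", "a", "b"], ["b", "b"]])` evaluates to `0.8`: four of the five documents belong to the dominant class of their cluster. Higher values indicate better clusters.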
Evaluation
Datasets
Past Results
Anna Huang
References
Anna Huang.
Similarity measures for document clustering.
In Proceedings of the Sixth New Zealand Computer
Science Research Student Conference (NZCSRSC2008),
Christchurch, New Zealand, pages 49-56, 2008.
D. Arthur and S. Vassilvitskii.
k-means++: The advantages of careful seeding.
In Symposium on Discrete Algorithms, 2007.
Y. Zhao and G. Karypis.
Empirical and theoretical comparisons of selected criterion
functions for document clustering.
Machine Learning, 55(3), 2004.
Thank You!
Questions?
Suggestions?