Text Clustering
Clustering
Partition unlabeled examples into disjoint subsets (clusters), such that:
Examples within a cluster are very similar.
Examples in different clusters are very different.
Discover new categories in an unsupervised manner (no sample category labels provided).
Clustering Example
[Figure: scatter of unlabeled points grouped into clusters]
Hierarchical Clustering
Build a tree-based hierarchical taxonomy (dendrogram) from a set of unlabeled examples.
[Dendrogram: animal → vertebrate (fish, reptile, amphibian, mammal), invertebrate (worm, insect, crustacean)]
HAC Algorithm
Start with all instances in their own cluster.
Until there is only one cluster:
Among the current clusters, determine the two clusters, ci and cj, that are most similar.
Replace ci and cj with a single cluster ci ∪ cj.
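A minimal Python sketch of this greedy loop (a naive O(n³) version; cluster_sim is any of the cluster-similarity criteria discussed next, and all names are illustrative):

def hac(instances, cluster_sim):
    # Start with every instance in its own cluster.
    clusters = [[x] for x in instances]
    merges = []                       # record of merges, i.e. the dendrogram
    while len(clusters) > 1:
        # Find the most similar pair of current clusters.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = cluster_sim(clusters[i], clusters[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        merges.append((clusters[i], clusters[j]))
        # Replace ci and cj with their union.
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges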
Cluster Similarity
Assume a similarity function that determines the similarity of two instances: sim(x,y).
Cosine similarity of document vectors.
How to compute similarity of two clusters each possibly containing multiple instances?
Single Link: similarity of the two most similar members.
Complete Link: similarity of the two least similar members.
Group Average: average similarity between members.
Single link can result in straggly (long and thin) clusters due to a chaining effect, though this is appropriate in some domains, such as clustering islands.
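A small Python sketch of the three criteria in terms of an instance-level similarity such as cosine (these functions can be plugged into the cluster_sim argument of the HAC sketch above; all names are illustrative):

import numpy as np

def cosine(x, y):
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def single_link(ci, cj, sim=cosine):
    # Similarity of the two most similar members.
    return max(sim(x, y) for x in ci for y in cj)

def complete_link(ci, cj, sim=cosine):
    # Similarity of the two least similar members.
    return min(sim(x, y) for x in ci for y in cj)

def group_average(ci, cj, sim=cosine):
    # Average similarity over all ordered pairs of distinct members of the
    # merged cluster ci ∪ cj (the definition used on the group-average slide).
    merged = list(ci) + list(cj)
    sims = [sim(x, y) for x in merged for y in merged if x is not y]
    return sum(sims) / len(sims)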
Computational Complexity
In the first iteration, all HAC methods need to compute the similarity of all pairs of the n individual instances, which is O(n²). In each of the subsequent n − 2 merging iterations, they must compute the distance between the most recently created cluster and all other existing clusters. To maintain overall O(n²) performance, computing similarity to each other cluster must be done in constant time.
Group average agglomerative clustering is a compromise between single and complete link. Similarity is averaged across all ordered pairs within the merged cluster, rather than over unordered pairs between the two clusters, to encourage tighter final clusters.
Maintaining the sum of the vectors in each cluster, \vec{s}(c) = \sum_{\vec{x} \in c} \vec{x}, and assuming unit-length (cosine-normalized) instance vectors, the group-average similarity of a merged cluster can be computed in constant time:

\mathrm{sim}(c_i, c_j) = \frac{\big(\vec{s}(c_i) + \vec{s}(c_j)\big) \cdot \big(\vec{s}(c_i) + \vec{s}(c_j)\big) - (|c_i| + |c_j|)}{(|c_i| + |c_j|)\,(|c_i| + |c_j| - 1)}
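A minimal Python sketch of this constant-time computation, assuming unit-length vectors and precomputed per-cluster vector sums (names are illustrative):

import numpy as np

def group_average_of_merge(s_ci, s_cj, n_ci, n_cj):
    # s_ci, s_cj: precomputed vector sums s(c) of the two clusters
    # n_ci, n_cj: the cluster sizes
    # Assumes unit-length instance vectors, so every self-similarity x.x = 1,
    # which is what the "- (|ci| + |cj|)" term removes.
    s = s_ci + s_cj
    n = n_ci + n_cj
    return (np.dot(s, s) - n) / (n * (n - 1))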
Non-Hierarchical Clustering
Typically must provide the number of desired clusters, k.
Randomly choose k instances as seeds, one per cluster.
Form initial clusters based on these seeds.
Iterate, repeatedly reallocating instances to different clusters to improve the overall clustering.
Stop when clustering converges or after a fixed number of iterations.
K-Means
Assumes instances are real-valued vectors. Clusters are based on centroids, i.e. the center of gravity or mean of the points in a cluster c:

\vec{\mu}(c) = \frac{1}{|c|} \sum_{\vec{x} \in c} \vec{x}
Distance Metrics
Euclidean distance (L2 norm):

L_2(\vec{x}, \vec{y}) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}

L1 norm:

L_1(\vec{x}, \vec{y}) = \sum_{i=1}^{m} |x_i - y_i|
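The centroid and the two distance metrics written out with NumPy (a straightforward sketch; the function names are illustrative):

import numpy as np

def centroid(c):
    # Mean of the points in cluster c; c is an array of shape [|c|, m].
    return np.asarray(c).mean(axis=0)

def l2_distance(x, y):
    # Euclidean (L2) distance between two m-dimensional vectors.
    return float(np.sqrt(((np.asarray(x) - np.asarray(y)) ** 2).sum()))

def l1_distance(x, y):
    # L1 distance.
    return float(np.abs(np.asarray(x) - np.asarray(y)).sum())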
K-Means Algorithm
Let d be the distance measure between instances.
Select k random instances {s1, s2, …, sk} as seeds.
Until clustering converges or another stopping criterion is met:
For each instance xi:
Assign xi to the cluster cj such that d(xi, sj) is minimal.
(Update the seeds to the centroid of each cluster.)
For each cluster cj: sj = μ(cj)
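A compact NumPy sketch of this loop (for illustration; a library implementation such as scikit-learn's KMeans is preferable in practice):

import numpy as np

def kmeans(X, k, iters=100, seed=0):
    # X: array of shape [n, m]; pick k random instances as initial seeds.
    rng = np.random.default_rng(seed)
    seeds = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    assign = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # Assign each instance to the nearest seed (Euclidean distance).
        d = np.linalg.norm(X[:, None, :] - seeds[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        # Update each seed to the centroid of its cluster (keep seeds of empty clusters).
        new_seeds = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                              else seeds[j] for j in range(k)])
        if np.allclose(new_seeds, seeds):   # converged
            break
        seeds = new_seeds
    return assign, seeds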
K-Means Example (K = 2)
[Figure: pick seeds → reassign clusters → compute centroids → reassign clusters → compute centroids → reassign clusters → converged]
Time Complexity
Assume computing the distance between two instances is O(m), where m is the dimensionality of the vectors.
Reassigning clusters: O(kn) distance computations, i.e. O(knm).
Computing centroids: each instance vector gets added once to some centroid: O(nm).
Assuming these two steps are each done once in each of I iterations: O(Iknm).
This is linear in all relevant factors (for a fixed number of iterations) and more efficient than the O(n²) HAC.
K-Means Objective
The objective of k-means is to minimize the total sum of the squared distance of every point to its corresponding cluster centroid.
\sum_{l=1}^{K} \sum_{\vec{x}_i \in X_l} \| \vec{x}_i - \vec{\mu}_l \|^2
Finding the global optimum is NP-hard. The k-means algorithm is guaranteed to converge to a local optimum.
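Given the assignments and centroids from the k-means sketch above, the objective can be computed as follows (illustrative helper):

import numpy as np

def kmeans_objective(X, assign, centroids):
    # Total squared distance of every point to its assigned cluster centroid.
    return float(((X - centroids[assign]) ** 2).sum())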
Seed Choice
Results can vary based on random seed selection. Some seeds can result in poor convergence rate, or convergence to sub-optimal clusterings. Select good seeds using a heuristic or the results of another method.
Buckshot Algorithm
Combines HAC and K-Means clustering. First randomly take a sample of instances of size √n. Run group-average HAC on this sample, which takes only O(n) time. Use the results of HAC as initial seeds for K-means. The overall algorithm is O(n) and avoids problems of bad seed selection.
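One possible sketch of the buckshot idea, using SciPy's average-linkage HAC on a √n-sized sample and returning the resulting cluster centroids as k-means seeds (this uses Euclidean rather than cosine group-average linkage, purely for illustration):

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def buckshot_seeds(X, k, seed=0):
    # Sample ~sqrt(n) instances (at least k of them).
    rng = np.random.default_rng(seed)
    n = len(X)
    sample = X[rng.choice(n, size=max(k, int(np.sqrt(n))), replace=False)]
    # Group-average ("average" linkage) HAC on the sample, cut into k clusters.
    tree = linkage(sample, method="average")
    labels = fcluster(tree, t=k, criterion="maxclust")
    # Centroids of the HAC clusters become the k-means seeds.
    return np.array([sample[labels == j].mean(axis=0)
                     for j in range(1, labels.max() + 1)])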
Text Clustering
HAC and K-Means have been applied to text in a straightforward way. Typically use normalized, TF/IDF-weighted vectors and cosine similarity, and optimize computations for sparse vectors (see the sketch after this list). Applications:
During retrieval, add other documents in the same cluster as the initially retrieved documents to improve recall.
Clustering the results of retrieval to present more organized results to the user (à la Northernlight folders).
Automated production of hierarchical taxonomies of documents for browsing purposes (à la Yahoo & DMOZ).
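A minimal scikit-learn illustration of this pipeline (the tiny document collection is made up for the example):

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["stocks fell sharply today", "the team won the final game",
        "markets rallied after the report", "the coach praised the players"]

# L2-normalized TF/IDF vectors: with unit-length rows, Euclidean k-means
# behaves much like clustering by cosine similarity, and the sparse matrix
# keeps the computation efficient.
X = TfidfVectorizer(norm="l2").fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)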
Soft Clustering
Clustering typically assumes that each instance is given a hard assignment to exactly one cluster. This does not allow for uncertainty in class membership, or for an instance to belong to more than one cluster. Soft clustering instead gives probabilities that an instance belongs to each of a set of clusters: each instance is assigned a probability distribution across the set of discovered categories (the probabilities over all categories must sum to 1).
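As a small illustration of soft assignments, a Gaussian mixture model (one possible soft-clustering model, used here only as an example) returns a probability distribution over clusters for each instance:

import numpy as np
from sklearn.mixture import GaussianMixture

X = np.array([[0.0, 0.0], [0.2, 0.1], [3.0, 3.1], [2.9, 3.0], [1.5, 1.6]])
gm = GaussianMixture(n_components=2, random_state=0).fit(X)
# One distribution over the discovered clusters per instance; each row sums to 1.
print(gm.predict_proba(X))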
EM Algorithm
An iterative method for learning a probabilistic categorization model from unsupervised data.
Initially assume a random assignment of examples to categories.
Learn an initial probabilistic model by estimating the model parameters θ from this randomly labeled data.
Iterate the following two steps until convergence:
Expectation (E-step): Compute P(ci | E) for each example E given the current model, and probabilistically re-label the examples based on these posterior probability estimates.
Maximization (M-step): Re-estimate the model parameters, θ, from the probabilistically re-labeled data.
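A schematic of this loop in Python; train_model and posteriors are hypothetical helpers standing in for the model-specific M-step estimation and E-step posterior computation:

import random

def em(examples, categories, train_model, posteriors, iters=20, seed=0):
    rng = random.Random(seed)
    # Initialize: one random probability distribution over categories per example.
    labels = []
    for _ in examples:
        w = [rng.random() for _ in categories]
        labels.append([v / sum(w) for v in w])
    theta = train_model(examples, labels)                   # initial model
    for _ in range(iters):
        labels = [posteriors(theta, e) for e in examples]   # E-step: P(c_i | E)
        theta = train_model(examples, labels)               # M-step: re-estimate theta
    return theta, labels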
EM (illustrated)
[Figure sequence:
Initialize: assign random probabilistic labels to the unlabeled examples.
Initialize: give the soft-labeled training data to a probabilistic learner.
Initialize: produce a probabilistic classifier.
E-step: relabel the unlabeled data using the trained classifier.
M-step: retrain the classifier on the relabeled data.]
Naïve Bayes EM
Randomly assign examples probabilistic category labels.
Use standard naïve Bayes training to learn a probabilistic model with parameters θ from the labeled data.
Until convergence or until the maximum number of iterations is reached:
E-step: Use the naïve Bayes model θ to compute P(ci | E) for each category and example, and re-label each example using these probability values as soft category labels.
M-step: Use standard naïve Bayes training to re-estimate the parameters θ from these new probabilistic category labels.
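A minimal NumPy sketch of these two steps for a multinomial naïve Bayes model over term counts, assuming dense matrices and Laplace smoothing (an illustration, not the exact estimator used in the original experiments):

import numpy as np

def nb_m_step(X, resp, alpha=1.0):
    # X: [n_docs, n_words] term counts; resp: [n_docs, n_classes] soft labels P(c | d).
    # Re-estimate priors and word distributions from fractional (expected) counts.
    priors = (resp.sum(axis=0) + alpha) / (resp.sum() + alpha * resp.shape[1])
    word_counts = resp.T @ X                 # expected count of word w in class c
    cond = (word_counts + alpha) / (word_counts.sum(axis=1, keepdims=True)
                                    + alpha * X.shape[1])
    return np.log(priors), np.log(cond)

def nb_e_step(X, log_priors, log_cond):
    # Posterior P(c | d) proportional to P(c) * prod_w P(w | c)^count(w, d).
    joint = log_priors + X @ log_cond.T
    joint -= joint.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(joint)
    return p / p.sum(axis=1, keepdims=True)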
Semi-Supervised Learning
For supervised categorization, generating labeled training data is expensive. Idea: Use unlabeled data to aid supervised categorization. Use EM in a semi-supervised mode by training EM on both labeled and unlabeled data.
Train the initial probabilistic model on the user-labeled subset of the data instead of randomly labeled unsupervised data. The labels of the user-labeled examples are frozen and never relabeled during EM iterations. The unlabeled examples are repeatedly, probabilistically relabeled by EM.
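Building on the naïve Bayes sketch above, freezing the user-labeled examples amounts to overwriting their rows after every E-step (a minimal illustration):

import numpy as np

def freeze_labeled(resp, fixed_resp, is_labeled):
    # resp:       [n_docs, n_classes] posteriors from the E-step
    # fixed_resp: [n_docs, n_classes] frozen one-hot distributions for user-labeled docs
    # is_labeled: boolean mask over documents marking the user-labeled subset
    out = resp.copy()
    out[is_labeled] = fixed_resp[is_labeled]
    return out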
Semi-Supervised EM (illustrated)
[Figure sequence: the user-labeled training examples (labels frozen) and the unlabeled examples are given to the probabilistic learner; on each iteration the resulting classifier probabilistically relabels only the unlabeled examples, and the learner is retrained.]
Semi-Supervised EM Results
Experiments on assigning messages from 20 Usenet newsgroups to their proper newsgroup label. With very few labeled examples (2 examples per class), semi-supervised EM significantly improved predictive accuracy:
27% with 40 labeled messages only. 43% with 40 labeled + 10,000 unlabeled messages.
With more labeled examples, semi-supervision can actually decrease accuracy, but refinements to standard EM can help prevent this.
The labeled data must be weighted appropriately (more heavily than the unlabeled data).
For semi-supervised EM to work, the natural clustering of the data must be consistent with the desired categories.
It failed when applied to English POS tagging (Merialdo, 1994).
Semi-Supervised EM Example
Assume Catholic is present in both of the labeled documents for soc.religion.christian, but Baptist occurs in none of the labeled data for this class. From the labeled data, we learn that Catholic is highly indicative of the Christian category. When labeling the unsupervised data, several documents containing both Catholic and Baptist are correctly labeled with the Christian category. When retraining, we learn that Baptist is also indicative of a Christian document. The final learned model is then able to correctly assign documents containing only Baptist to the Christian category.
Issues in Clustering
How to evaluate clustering?
Internal:
Tightness and separation of clusters (e.g., the k-means objective).
Fit of a probabilistic model to the data.
External:
Compare to known class labels on benchmark data (see the sketch below).
Improving search to converge faster and avoid local minima.
Overlapping clustering.
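For the external comparison, permutation-invariant measures such as the adjusted Rand index or normalized mutual information are commonly used, e.g. with scikit-learn:

from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

true_labels  = [0, 0, 1, 1, 2, 2]
found_labels = [1, 1, 0, 0, 2, 2]   # cluster IDs need not match label IDs
print(adjusted_rand_score(true_labels, found_labels))          # 1.0: perfect agreement
print(normalized_mutual_info_score(true_labels, found_labels)) # 1.0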
Conclusions
Unsupervised learning induces categories from unlabeled data. There are a variety of approaches, including:
HAC k-means EM
Semi-supervised learning uses both labeled and unlabeled data to improve results.