Clustering
K-means algorithm
Classification vs. Clustering
https://colah.github.io/posts/2015-01-Visualizing-Representations/
Types of Clustering
Hard clustering vs. soft clustering (see the sketch below)
Hierarchical clustering: pairs of the most-similar clusters are iteratively merged until all
objects are linked into a single hierarchy
Advantage: it is not necessary to compare each object to every other object; only
object-to-centroid comparisons are needed
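To make the hard vs. soft distinction concrete, here is a minimal NumPy sketch (the toy points, centroids, and the softmax-style weighting are illustrative assumptions, not part of any particular algorithm above): a hard assignment puts each object in exactly one cluster, while a soft assignment gives each object a weight for every cluster.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(6, 2))          # 6 toy objects in 2 dimensions
    centroids = rng.normal(size=(3, 2))  # 3 cluster centroids

    # Squared distance from every object to every centroid: shape (6, 3)
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)

    # Hard clustering: each object belongs to exactly one cluster
    hard = d2.argmin(axis=1)                     # one cluster index per object

    # Soft clustering: each object gets a weight for every cluster (rows sum to 1)
    w = np.exp(-d2)
    soft = w / w.sum(axis=1, keepdims=True)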
K-means typically converges in around 10-20 iterations (if we don’t care about a few documents switching back and
forth)
Non-optimality
If we start with a bad set of seeds, the resulting clustering can be horrible.
Solution 2: Use a prior hierarchical clustering step to find seeds with good coverage of the document space
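One way this seeding step can look in practice (a sketch only, assuming scikit-learn and a synthetic matrix X; the sample size and K are arbitrary choices): hierarchically cluster a small sample, take each group's mean as a seed, and pass those seeds to K-means as its initial centroids.

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering, KMeans

    rng = np.random.default_rng(1)
    X = rng.normal(size=(1000, 20))      # stand-in for a document-vector matrix
    K = 5

    # Hierarchically cluster a small random sample of the data
    sample = X[rng.choice(len(X), size=100, replace=False)]
    labels = AgglomerativeClustering(n_clusters=K).fit_predict(sample)

    # The mean of each hierarchical cluster becomes a K-means seed
    seeds = np.vstack([sample[labels == k].mean(axis=0) for k in range(K)])

    km = KMeans(n_clusters=K, init=seeds, n_init=1).fit(X)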
Time complexity of K-means
Reassignment step: O(KNM) (we need to compute KN document-centroid distances, each of which costs O(M))
Recomputation step: O(NM) (we need to add each document's < M values to one of the centroids)
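Tying the loop and the costs together, here is a minimal NumPy sketch of K-means (X, K, and the iteration cap are assumptions for illustration); the two steps are marked with their per-iteration costs from above.

    import numpy as np

    def kmeans(X, K, max_iter=20, seed=0):
        rng = np.random.default_rng(seed)
        N, M = X.shape
        centroids = X[rng.choice(N, size=K, replace=False)]    # random seeds
        assign = np.full(N, -1)
        for _ in range(max_iter):                              # typically ~10-20 iterations
            # Reassignment: K*N object-centroid distances, each O(M)  ->  O(KNM)
            d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
            new_assign = d2.argmin(axis=1)
            if np.array_equal(new_assign, assign):             # no object switched clusters
                break
            assign = new_assign
            # Recomputation: add each object's values to one centroid  ->  O(NM)
            for k in range(K):
                members = X[assign == k]
                if len(members):                               # keep old centroid if a cluster empties
                    centroids[k] = members.mean(axis=0)
        return assign, centroids

    labels, centers = kmeans(np.random.default_rng(2).normal(size=(200, 10)), K=4)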
Start with each document in its own singleton cluster, then repeatedly merge the two clusters that are most similar
HAC can also be applied when K cannot be determined in advance (it can start without knowing K)
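Both points can be seen in a short SciPy sketch (illustrative only; X and the linkage method are assumptions): the merge sequence is built once, and only afterwards is the tree cut into however many clusters are wanted.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.random.default_rng(3).normal(size=(50, 8))

    # Repeatedly merge the two most-similar clusters; Z records every merge
    Z = linkage(X, method="average")     # "single" and "complete" also work

    # K is chosen only after the full hierarchy exists
    labels_3 = fcluster(Z, t=3, criterion="maxclust")   # cut into (at most) 3 clusters
    labels_7 = fcluster(Z, t=7, criterion="maxclust")   # or 7, from the same tree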
Take-away
Partitional clustering
Provides less information but is more efficient (best: O(KN))
K-means
Hierarchical clustering
Best algorithms: O(N²) complexity
Single-link vs. complete-link (vs. group-average)
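The three criteria differ only in how the similarity of two clusters is reduced from the pairwise similarities of their members; a minimal sketch (cosine similarity, NumPy, and the toy clusters A and B are assumptions):

    import numpy as np

    def pairwise_cosine(A, B):
        """Cosine similarity between every row of A and every row of B."""
        A = A / np.linalg.norm(A, axis=1, keepdims=True)
        B = B / np.linalg.norm(B, axis=1, keepdims=True)
        return A @ B.T

    def cluster_similarity(A, B, link):
        sims = pairwise_cosine(A, B)
        if link == "single":       # similarity of the closest pair of members
            return sims.max()
        if link == "complete":     # similarity of the most distant pair of members
            return sims.min()
        return sims.mean()         # group-average (here: mean over between-cluster pairs)

    rng = np.random.default_rng(4)
    A, B = rng.normal(size=(4, 5)), rng.normal(size=(6, 5))
    for link in ("single", "complete", "average"):
        print(link, cluster_similarity(A, B, link))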