DOC fix: The algorithm explained - and implemented - in K-Medoids is not PAM #44
Conversation
See the given reference, this is a very different algorithm.
You are right. I believe this implements "A simple and fast algorithm for K-medoids clustering" (H.S. Park, 2009), which is also widely cited, rather than the original PAM algorithm. As far as I understand, it is also the implementation described in "Elements of Statistical Learning". The user manual is indeed incorrect in claiming PAM and should be fixed. The main advantage of the current algorithm is performance: according to the above paper, for N samples and k classes the PAM run time scales as O(k(N-k)²), while this one is O(kN) per iteration (though given that it needs to pre-compute the pairwise distance matrix, which is O(N²), I don't see how the overall cost could be lower; maybe I am missing something).
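For concreteness, here is a rough sketch (my own, not the package's exact code) of that k-means-style alternating iteration, assuming a precomputed pairwise distance matrix `D` and a set of initial medoid indices. The assignment step is O(kN) given `D`; the update step only searches within each cluster:

```python
import numpy as np

def alternate_kmedoids(D, medoids, max_iter=100):
    """k-means-style K-Medoids loop over a precomputed (N, N) distance
    matrix D, starting from the given medoid indices. Sketch only."""
    medoids = np.asarray(medoids).copy()
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest medoid.
        labels = np.argmin(D[medoids], axis=0)
        new_medoids = medoids.copy()
        # Update step: inside each cluster, move the medoid to the point
        # with the smallest sum of distances to the other members.
        for k in range(len(medoids)):
            members = np.flatnonzero(labels == k)
            if members.size:
                within = D[np.ix_(members, members)].sum(axis=1)
                new_medoids[k] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break  # no medoid moved: a local optimum of this scheme
        medoids = new_medoids
    return medoids, labels
```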
Did you get worse results than with PAM? On what data, and compared to what other implementation? Would you be interested in contributing the PAM implementation? cc @zdog234
The performance is fast (but still O(N²)) precisely because it fails to optimize well: the k-means-style iterations easily get stuck in local optima that PAM can escape from. If you miss many improvements, you only need a few iterations, so you are fast. But the result quality is much worse: on the standard ORlib benchmark data sets it found solutions that were 18% worse on average.
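For contrast, here is a naive sketch of PAM's SWAP phase (greedy best-improvement variant; hypothetical code, not from any of the implementations discussed here). It evaluates every (medoid, non-medoid) exchange against the global objective, which is what lets it escape the local optima the alternating scheme gets stuck in. A real PAM implementation caches nearest/second-nearest distances rather than recomputing the cost from scratch:

```python
import numpy as np

def pam_swap(D, medoids):
    """Greedy best-improvement SWAP loop on a precomputed distance
    matrix D; terminates when no exchange improves total deviation."""
    medoids = list(medoids)
    n = D.shape[0]
    while True:
        best_cost = D[medoids].min(axis=0).sum()  # current total deviation
        best_swap = None
        for i in range(len(medoids)):
            for h in range(n):
                if h in medoids:
                    continue
                trial = medoids[:i] + [h] + medoids[i + 1:]
                cost = D[trial].min(axis=0).sum()
                if cost < best_cost:
                    best_cost, best_swap = cost, (i, h)
        if best_swap is None:
            return medoids  # no improving exchange left
        medoids[best_swap[0]] = best_swap[1]
```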
Ah, my bad. Most of the implementation (and documentation) here came from two previous PRs, and I should have been more careful with updating the documentation.
Thanks for the feedback @kno10! Yes, I think we should fix the documentation. It's not PAM, but what we have is a fairly good baseline (and relatively simple) implementation. Putting aside the Park paper, which indeed doesn't provide very detailed experiments, the algorithm being outlined in the ESL book makes it "standard" enough IMO. I don't doubt there are possible improvements, though, so adding a PAM solver in a separate PR would certainly be very welcome (I opened #46). @zdog234 No worries, I should have reviewed more thoroughly.
* "Clustering by Means of Medoids'" | ||
Kaufman, L. and Rousseeuw, P.J., | ||
Statistical Data Analysis Based on the L1Norm and Related Methods, edited | ||
by Y. Dodge, North-Holland, 405416. 1987 |
OK let's then add Maranzana (1963) and Park (2009) references here and in the docstring below.
But the references should reflect what was actually used and implemented. For example, Park specifies a different initialization strategy. I don't think retrofitting references is the proper way to go. Maybe the ESL book should then be cited instead.
It's not about retrofitting: we can cite ESL, but it is not the primary source, and there are barely two pages on K-medoids there. Maranzana (1963) does seem to describe this algorithm with random initialization. The initialization is indeed different in Park (2009), but I would still mention it (I understand that you don't like it :) ), as otherwise the iterative step is the same, and it has a more recent bibliography review of the topic. We could add their initialization as an option as well.
> For example Park specifies a different initialization strategy.
Actually, init="heuristic"
(scikit-learn-extra/sklearn_extra/cluster/_k_medoids.py, lines 336 to 339 at ab1a7ef)

```python
elif self.init == "heuristic":  # Initialization by heuristic
    # Pick K first data points that have the smallest sum distance
    # to every other point. These are the initial medoids.
    medoids = np.argpartition(np.sum(D, axis=1), n_clusters - 1)[:n_clusters]
```
is not that different from what they do up to a normalization factor I think?
That is quite similar, except for the missing normalization term. My intuition is that this will work very poorly, because these medoids will most likely be close to each other at the center of the data set, so none of them will be a good medoid. If you benchmark this, it will likely work worse than uniform random initialization.
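To make the difference concrete, here is a side-by-side sketch of the two initializations as I read them (illustrative code, not from this repository): the current "heuristic" init ranks points by their raw distance sums, while Park (2009) normalizes each distance d(i, j) by point i's total distance sum before ranking, i.e. v_j = Σ_i d(i, j) / Σ_l d(i, l):

```python
import numpy as np

def init_heuristic(D, n_clusters):
    # Points with the smallest unnormalized sum of distances win.
    return np.argpartition(np.sum(D, axis=1), n_clusters - 1)[:n_clusters]

def init_park(D, n_clusters):
    # Normalize row i by its total distance sum, then rank columns:
    # v_j = sum_i d(i, j) / sum_l d(i, l); smallest v_j win.
    v = np.sum(D / np.sum(D, axis=1, keepdims=True), axis=0)
    return np.argpartition(v, n_clusters - 1)[:n_clusters]
```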
sklearn_extra/cluster/_k_medoids.py (outdated)

```diff
@@ -90,6 +90,8 @@ class KMedoids(BaseEstimator, ClusterMixin, TransformerMixin):

     References
     ----------
+    A different algorithm, that finds higher quality results, is explained in:
```
I would remove this sentence in favor of Maranzana (1963) and Park (2009) references.
doc/user_guide.rst (outdated)

    two alternating steps commonly called the
    Assignment and Update steps (BUILD and SWAP in Kaufmann and Rousseeuw, 1987).
    ...
    currently only supports a non-standard version of K-Medoids substantially
    different from the well-known PAM algorithm.
maybe a bit more neutrally:
"""
currently only supports a K-Medoids solver analogous to K-Means. Another frequently used approach is partitioning around medoids (PAM).
"""
Just to keep you updated as I go through some related literature. I just came across this note from 1979:
The Teitz and Bart heuristic is largely the same as PAM (but randomly initialized). The Maranzana heuristic is the k-means-style approach. I cannot confirm the note's claim that the results of PAM are optimal with "high" regularity on general problems. The authors of the note used geographic data (and it appeared in "Geographical Analysis"), so their data is likely rather well behaved. On the ORLib data sets, the success rate of PAM is about 33%; with random initialization and 10 tries this increases to over 50% (and can probably be increased further with additional restarts). The k-means-style approach solves 0% with 10 tries, unless initialized with the BUILD phase of PAM (in which case it was optimal exactly when BUILD already was optimal).
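The restart pattern above is the usual one; as a sketch (hypothetical helper names, reusing the `pam_swap` sketch from earlier in this thread): run the solver several times from random initial medoids and keep the solution with the lowest total deviation.

```python
import numpy as np

def best_of_restarts(D, n_clusters, n_init=10, seed=None):
    """Run pam_swap n_init times from random starts; keep the best."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    best_medoids, best_cost = None, np.inf
    for _ in range(n_init):
        init = rng.choice(n, size=n_clusters, replace=False)
        medoids = pam_swap(D, init)
        cost = D[medoids].min(axis=0).sum()  # total deviation
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    return best_medoids, best_cost
```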
Hey, I saw this issue after realizing that the implementation of PAM here is not the original one. I would suggest removing the claim about PAM from the source code and documentation first, and thinking about a suitable implementation later; someone might not see this discussion and use the algorithm assuming it is PAM. In comparison to the implementation in R, which I believe is close to the original one, the currently implemented algorithm computed different results for my data set, so I used the R implementation instead. Its documentation gives good insight into the original algorithm and different approaches to optimizing the time complexity. Is there a reason for not implementing the original PAM and the optimizations?
@Solosneros this pull request aims at fixing the documentation of the current state, yes. IMHO, references should document what was actually used for the implementation, and apparently neither the Maranzana nor the Park work was used. My guess is that it is based on the ESL book (which, IIRC, doesn't include a reference for its k-medoids section). But since I did not write the code, I do not know what was used - just that it clearly is not PAM, and not the currently cited reference.