DOC fix: The algorithm explained - and implemented - in K-Medoids is not PAM #44
Conversation
See the given reference, this is a very different algorithm.
You are right. I believe this implements "A simple and fast algorithm for K-medoids clustering" (H.S. Park, 2009), which is also widely cited, rather than the original PAM algorithm. As far as I understand, it is also the implementation described in "Elements of Statistical Learning". The user manual is indeed incorrect in claiming PAM and should be fixed. The main advantage of the current algorithm is performance: according to the above paper, for N samples and k classes the PAM run time scales as O(k(N-k)²), while this one is O(kN) per iteration (though given that it needs to pre-compute the pairwise distance matrix, which is O(N²), I don't see how the overall cost could be lower; maybe I am missing something).
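For concreteness, here is a rough sketch (my own, not the package's exact code) of that k-means-style alternating iteration, assuming a precomputed pairwise distance matrix `D` and a set of initial medoid indices. The assignment step is O(kN) given `D`; the update step only searches within each cluster:

```python
import numpy as np

def alternate_kmedoids(D, medoids, max_iter=100):
    """k-means-style K-Medoids loop over a precomputed (N, N) distance
    matrix D, starting from the given medoid indices. Sketch only."""
    medoids = np.asarray(medoids).copy()
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest medoid.
        labels = np.argmin(D[medoids], axis=0)
        new_medoids = medoids.copy()
        # Update step: inside each cluster, move the medoid to the point
        # with the smallest sum of distances to the other members.
        for k in range(len(medoids)):
            members = np.flatnonzero(labels == k)
            if members.size:
                within = D[np.ix_(members, members)].sum(axis=1)
                new_medoids[k] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break  # no medoid moved: a local optimum of this scheme
        medoids = new_medoids
    return medoids, labels
```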
Did you get worse results than with PAM? On what data, and compared to what other implementation? Would you be interested in contributing the PAM implementation? cc @zdog234
The performance is fast (but still O(N²)) precisely because it fails to optimize well: the k-means-style iterations easily get stuck in local optima that PAM can escape from. If you miss many improvements, you only need a few iterations, so you are fast. But the result quality is much worse: on the standard ORlib benchmark data sets it found solutions that were 18% worse on average.
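For contrast, here is a naive sketch of PAM's SWAP phase (greedy best-improvement variant; hypothetical code, not from any of the implementations discussed here). It evaluates every (medoid, non-medoid) exchange against the global objective, which is what lets it escape the local optima the alternating scheme gets stuck in. A real PAM implementation caches nearest/second-nearest distances rather than recomputing the cost from scratch:

```python
import numpy as np

def pam_swap(D, medoids):
    """Greedy best-improvement SWAP loop on a precomputed distance
    matrix D; terminates when no exchange improves total deviation."""
    medoids = list(medoids)
    n = D.shape[0]
    while True:
        best_cost = D[medoids].min(axis=0).sum()  # current total deviation
        best_swap = None
        for i in range(len(medoids)):
            for h in range(n):
                if h in medoids:
                    continue
                trial = medoids[:i] + [h] + medoids[i + 1:]
                cost = D[trial].min(axis=0).sum()
                if cost < best_cost:
                    best_cost, best_swap = cost, (i, h)
        if best_swap is None:
            return medoids  # no improving exchange left
        medoids[best_swap[0]] = best_swap[1]
```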
Ah, my bad. Most of the implementation (and documentation) here came from two previous PRs, and I should have been more careful with updating the documentation.
Thanks for the feedback @kno10! Yes, I think we should fix the documentation. It's not PAM, but what we have is a fairly good baseline (and relatively simple) implementation. Putting aside the Park paper, which indeed doesn't provide very detailed experiments, the algorithm being outlined in the ESL book makes it "standard" enough IMO. I don't doubt there are possible improvements, though, so adding a PAM solver in a separate PR would certainly be very welcome (I opened #46). @zdog234 No worries, I should have reviewed more thoroughly.
* "Clustering by Means of Medoids'" | ||
Kaufman, L. and Rousseeuw, P.J., | ||
Statistical Data Analysis Based on the L1Norm and Related Methods, edited | ||
by Y. Dodge, North-Holland, 405416. 1987 |
OK let's then add Maranzana (1963) and Park (2009) references here and in the docstring below.
But the references should reflect what was actually used and implemented. For example, Park specifies a different initialization strategy. I don't think retrofitting references is the proper way to go. Maybe the ESL book should then be cited instead.
It's not about retrofitting: we can cite ESL, but it is not the primary source, and there are barely two pages on K-medoids there. Maranzana (1963) does seem to describe this algorithm with random initialization. The initialization is indeed different in Park (2009), but I would still mention it (I understand that you don't like it :) ), as otherwise the iterative step is the same, and it has a more recent bibliography review of the topic. We could add their initialization as an option as well.
> For example Park specifies a different initialization strategy.
Actually, init="heuristic"
(scikit-learn-extra/sklearn_extra/cluster/_k_medoids.py, lines 336 to 339 at ab1a7ef)

```python
elif self.init == "heuristic":  # Initialization by heuristic
    # Pick K first data points that have the smallest sum distance
    # to every other point. These are the initial medoids.
    medoids = np.argpartition(np.sum(D, axis=1), n_clusters - 1)[:n_clusters]
```
is not that different from what they do up to a normalization factor I think?
That is quite similar, except for the missing normalization term. My intuition is that this will work very poorly, because these medoids will most likely be close to each other at the center of the data set, so none of them will be a good medoid. If you benchmark this, it will likely work worse than uniform random initialization.
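To make the difference concrete, here is a side-by-side sketch of the two initializations as I read them (illustrative code, not from this repository): the current "heuristic" init ranks points by their raw distance sums, while Park (2009) normalizes each distance d(i, j) by point i's total distance sum before ranking, i.e. v_j = Σ_i d(i, j) / Σ_l d(i, l):

```python
import numpy as np

def init_heuristic(D, n_clusters):
    # Points with the smallest unnormalized sum of distances win.
    return np.argpartition(np.sum(D, axis=1), n_clusters - 1)[:n_clusters]

def init_park(D, n_clusters):
    # Normalize row i by its total distance sum, then rank columns:
    # v_j = sum_i d(i, j) / sum_l d(i, l); smallest v_j win.
    v = np.sum(D / np.sum(D, axis=1, keepdims=True), axis=0)
    return np.argpartition(v, n_clusters - 1)[:n_clusters]
```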
sklearn_extra/cluster/_k_medoids.py (outdated)

```diff
@@ -90,6 +90,8 @@ class KMedoids(BaseEstimator, ClusterMixin, TransformerMixin):

     References
     ----------
+    A different algorithm, that finds higher quality results, is explained in:
```
I would remove this sentence in favor of Maranzana (1963) and Park (2009) references.
doc/user_guide.rst (outdated)

    two alternating steps commonly called the
    Assignment and Update steps (BUILD and SWAP in Kaufmann and Rousseeuw, 1987).
    ...
    currently only supports a non-standard version of K-Medoids substantially
    different from the well-known PAM algorithm.
maybe a bit more neutrally:
"""
currently only supports a K-Medoids solver analogous to K-Means. Another frequently used approach is partitioning around medoids (PAM).
"""
Just to keep you updated as I go through some related literature. I just came across this note from 1979:
The Teitz and Bart heuristic is largely the same as PAM (but randomly initialized). The Maranzana heuristic is the k-means-style approach. I cannot confirm the note's claim that the results of PAM are optimal with "high" regularity on general problems. The authors of the note used geographic data (and it appeared in "Geographical Analysis"), so their data is likely rather well behaved. On the ORLib data sets, the success rate of PAM is about 33%; with random initialization and 10 tries this increases to over 50% (and can probably be increased further with additional restarts). The k-means-style approach solves 0% with 10 tries, unless initialized with the BUILD phase of PAM (in which case it was optimal exactly when BUILD already was optimal).
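The restart pattern above is the usual one; as a sketch (hypothetical helper names, reusing the `pam_swap` sketch from earlier in this thread): run the solver several times from random initial medoids and keep the solution with the lowest total deviation.

```python
import numpy as np

def best_of_restarts(D, n_clusters, n_init=10, seed=None):
    """Run pam_swap n_init times from random starts; keep the best."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    best_medoids, best_cost = None, np.inf
    for _ in range(n_init):
        init = rng.choice(n, size=n_clusters, replace=False)
        medoids = pam_swap(D, init)
        cost = D[medoids].min(axis=0).sum()  # total deviation
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    return best_medoids, best_cost
```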
Hey, I saw this issue after realizing that the implementation of PAM here is not the original one. I would suggest removing the claim about PAM from the source code and documentation first, and thinking about a suitable implementation later; someone might not see this discussion and use the algorithm assuming it is PAM. In comparison to the implementation in R, which I believe is close to the original one, the currently implemented algorithm computed different results for my data set, so I used the R implementation instead. Its documentation gives good insight into the original algorithm and different approaches to optimizing the time complexity. Is there a reason for not implementing the original PAM and the optimizations?
@Solosneros this pull request aims at fixing the documentation of the current state, yes. IMHO, references should document what was actually used for the implementation, and apparently neither the Maranzana nor the Park work was used. My guess is that it is based on the ESL book (which, IIRC, doesn't include a reference for its k-medoids section). But since I did not write the code, I do not know what was used - just that it clearly is not PAM, and not the currently cited reference.