FEA Add Rand Index and pair confusion matrix #17412
Thanks for the PR. At some point (I can't find the issue) I proposed that we have a function which turns a binary classification metric into a clustering metric via the pairwise transformation. In that framework, this would just be accuracy... Not sure if we want to go about implementing the generic solution. I know that in some spaces, F1 at least is calculated on pairs of clustered elements. (Another generic class of clustering metrics is the max-sum assignment, which here we only implement as the consensus score for biclustering.) |
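To make the pairwise transformation concrete, here is a minimal sketch in plain Python. It is not code from this PR or from scikit-learn: `pairwise_clustering_metric` and the other helper names are hypothetical, and the accuracy and F1 functions are simple stand-ins for any binary classification metric. Each unordered pair of samples becomes one binary example ("same cluster" or "different cluster"), so applying accuracy to the pairs gives the Rand index.

```python
from itertools import combinations


def pairwise_same_cluster(labels):
    """For every unordered pair of samples, record whether both share a label."""
    return [a == b for a, b in combinations(labels, 2)]


def accuracy(y_true, y_pred):
    """Fraction of positions where the two binary sequences agree."""
    if not y_true:
        raise ValueError("need at least two samples to form a pair")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1(y_true, y_pred):
    """F1 of the positive class, where positive means 'same cluster'."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum(p and not t for t, p in zip(y_true, y_pred))
    fn = sum(t and not p for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def pairwise_clustering_metric(binary_metric):
    """Hypothetical helper: lift a binary metric to a clustering metric."""
    def clustering_metric(labels_true, labels_pred):
        if len(labels_true) != len(labels_pred):
            raise ValueError("label sequences must have the same length")
        return binary_metric(
            pairwise_same_cluster(labels_true),
            pairwise_same_cluster(labels_pred),
        )
    return clustering_metric


# Accuracy on pairs is the (unadjusted) Rand index.
rand_index = pairwise_clustering_metric(accuracy)
pairwise_f1 = pairwise_clustering_metric(f1)

print(rand_index([0, 0, 1, 1], [0, 0, 1, 2]))
print(pairwise_f1([0, 0, 1, 1], [0, 0, 1, 2]))
```

Enumerating pairs costs O(n^2) time and memory, so this only illustrates the idea; a practical implementation would count pairs from a contingency matrix instead.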
Currently it passes the linter but it fails to fully build because the test code is looking for the new methods to be part of the package, but as they are not yet merged they are not found. That seems like a generic chicken-and-egg problem, but maybe I am misunderstanding something on how this is set up.
|
@jnothman I made the requested changes. Haven't seen any update on this for a week. I don't know how fast these things usually progress. Is there anything else you are waiting on from me? |
Hi @ufmayer, thanks for your work. Some checks didn't pass, do you mind having a look at them? Thanks! |
Please forgive me, @ufmayer: you had already noticed the unsuccessful checks, so my comment wasn't helpful. |
@cmarmo Thank you very much. That was the information I needed. All checks have now passed. What's the next step? |
@ufmayer, you're welcome... :)
'Waiting for reviewer' ... :), and pinging from time to time so the pull request doesn't get lost. It is also useful to check whether other merges have created conflicts, and to fix them when they arise. |
Sorry, my availability lately has been very scattered. I commented on this
because I was interested in it, not because I had excess time for review!
:D
I'll try to give it another shot.
|
@jnothman Thanks for saying you might find time to give this another shot. I think it's ready to go. How about it? |
I've tagged it for the current release milestone, and hopefully it will be picked up by another reviewer as we work towards release over the next 6 weeks. Sorry for the delays, but there's a lot to review! |
@jnothman Thanks! |
First round of reviews.
|
"""
labels_true, labels_pred = check_clusterings(labels_true, labels_pred)
n_samples = np.int64(labels_true.shape[0])
Suggested change:
- n_samples = np.int64(labels_true.shape[0])
+ n_samples = labels_true.shape[0]
I left this unchanged for now. The entire calculations below are using np.int64 and the return data type of the ndarray is np.int64, and this cast makes this explicitly clear. The cast will of course otherwise happen implicitly anyway but I think this makes it unambiguous. I am not opposed to the change, but if you don't feel strongly I would prefer to let it stand.
Some remaining comments.
But in general it looks good.
@ufmayer You can ignore the MacOS build failure. We are currently investigating it. |
@glemaitre Thank you for the thorough review. I implemented the majority of your suggestions and resolved the corresponding auto-generated conversations in this PR. Please have a look at the remaining ones, I left brief comments on each that I didn't resolve. |
These are the remaining comments. Just some nitpicks. Once they are addressed, this is mergeable :)
@@ -113,14 +108,20 @@ def contingency_matrix(labels_true, labels_pred, *, eps=None, sparse=False):

    .. versionadded:: 0.18

dtype : numeric data type, default=np.int.
    See the notes on ``eps`` below.
Can we specify the following instead (if I am not misinterpreting the note below)?
Output dtype. Ignored if `eps` is not `None`.
@glemaitre Done. |
@glemaitre, ready for a second approval? Thanks! |
Thanks @ufmayer LGTM merging. |
Reference Issues/PRs
Adding new functionality
What does this implement/fix? Explain your changes.
Two changes:
adjusted_rand_score method.
Any other comments?
Rand Index is essentially computing the non-chance-adjusted accuracy of a clustering, which is a standard measure for clusterings.
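As an illustration of that statement, the sketch below computes the Rand index from contingency counts in plain Python. It is not the implementation in this PR: the function name `rand_index_from_counts` is hypothetical, it counts unordered pairs, and returning 1.0 when fewer than two samples are given is a convention chosen for this sketch. A pair counts as an agreement when both clusterings put its two samples together, or both keep them apart; the Rand index is the fraction of pairs that agree.

```python
from collections import Counter
from math import comb


def rand_index_from_counts(labels_true, labels_pred):
    """Unadjusted Rand index: fraction of sample pairs on which both clusterings agree."""
    n = len(labels_true)
    if n != len(labels_pred):
        raise ValueError("label sequences must have the same length")

    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # assumption of this sketch: no pairs means nothing to disagree on

    # Contingency counts: cell (i, j) is the number of samples with
    # true label i and predicted label j.
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)

    same_both = sum(comb(c, 2) for c in cells.values())  # together in both
    same_true = sum(comb(c, 2) for c in rows.values())   # together in labels_true
    same_pred = sum(comb(c, 2) for c in cols.values())   # together in labels_pred

    # Pairs apart in both, by inclusion-exclusion over the pair sets.
    apart_both = total_pairs - same_true - same_pred + same_both
    return (same_both + apart_both) / total_pairs


print(rand_index_from_counts([0, 0, 1, 1], [0, 0, 1, 2]))
```

Unlike the adjusted Rand index, this value is not corrected for chance, so unrelated labelings generally do not score near 0.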