FEA Add Rand Index and pair confusion matrix #17412

Merged
glemaitre merged 65 commits into scikit-learn:master on Oct 28, 2020

Contributor

@ufmayer ufmayer commented Jun 1, 2020

Reference Issues/PRs

Adding new functionality

What does this implement/fix? Explain your changes.

Two changes:

  1. contingency_matrix() currently causes an integer overflow on large data, for example when adjusted_rand_score is used on large clusterings. This implementation adds an optional dtype parameter, which can be used to store the counts as np.int64 and avoid the overflow.
  2. Adds computation of the unadjusted Rand Index (https://en.wikipedia.org/wiki/Rand_index). The computation is based on an efficient pair comparison helper method, similar to the existing adjusted_rand_score method.
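
As a rough illustration of the pair comparison idea in item 2, the sketch below derives pair counts from a contingency table instead of looping over every pair. It is plain Python, not the code from this PR: the names `pair_confusion_counts` and `rand_index`, and the choice to count unordered pairs, are assumptions made for the example, and the merged helper may count pairs differently.

```python
from collections import Counter


def pair_confusion_counts(labels_true, labels_pred):
    """Count unordered sample pairs by agreement type.

    Returns (tn, fp, fn, tp), where "positive" means the two samples
    of a pair share a cluster. Hypothetical helper for illustration.
    """
    n = len(labels_true)
    if n != len(labels_pred):
        raise ValueError("label sequences must have the same length")

    def n_pairs(k):
        # Number of unordered pairs among k items: k choose 2.
        return k * (k - 1) // 2

    # Contingency table: number of samples per (true, predicted) label pair.
    contingency = Counter(zip(labels_true, labels_pred))
    true_sizes = Counter(labels_true)
    pred_sizes = Counter(labels_pred)

    tp = sum(n_pairs(c) for c in contingency.values())  # together in both
    same_true = sum(n_pairs(c) for c in true_sizes.values())
    same_pred = sum(n_pairs(c) for c in pred_sizes.values())
    fn = same_true - tp  # together in the reference labels only
    fp = same_pred - tp  # together in the predicted labels only
    tn = n_pairs(n) - tp - fn - fp  # separated in both
    return tn, fp, fn, tp


def rand_index(labels_true, labels_pred):
    """Fraction of pairs on which the two labelings agree."""
    tn, fp, fn, tp = pair_confusion_counts(labels_true, labels_pred)
    total = tn + fp + fn + tp
    # Assumed convention: with fewer than two samples there are no pairs.
    return 1.0 if total == 0 else (tp + tn) / total
```

The work is proportional to the number of samples rather than the number of pairs. Python integers do not overflow, so the dtype concern from item 1 does not show up in this sketch; it matters once the same sums are held in fixed-width integer arrays.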

Any other comments?

The Rand Index is essentially the accuracy of a clustering without adjustment for chance, computed over pairs of samples; it is a standard measure for comparing clusterings.
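
For reference, the usual definition can be written as follows. The symbols are assumptions of this note rather than notation from the PR: $n$ is the number of samples, $a$ the number of pairs placed in the same cluster by both labelings, and $b$ the number of pairs placed in different clusters by both; TP, TN, FP, FN are the same counts viewed as a binary decision per pair.

```latex
\mathrm{RI}
  = \frac{a + b}{\binom{n}{2}}
  = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{FP} + \mathrm{FN} + \mathrm{TN}},
\qquad
\binom{n}{2} = \frac{n(n-1)}{2}
```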

@jnothman
Member

jnothman commented Jun 2, 2020

Thanks for the PR.
The linter is unhappy.

At some point (I can't find the issue) I proposed that we have a function which turns a binary classification metric into a clustering metric via the pairwise transformation. In that framework, this would just be accuracy... Not sure if we want to go about implementing the generic solution. I know that in some spaces, F1 at least is calculated on pairs of clustered elements.

(Another generic class of clustering metrics is the max-sum assignment, which here we only implement as the consensus score for biclustering.)
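
The pairwise transformation mentioned above can be sketched in a few lines: treat every pair of samples as one binary decision ("same cluster" or not) and feed the resulting confusion counts to any binary metric. This is only an illustration of the idea, not an existing scikit-learn API; `pairwise_clustering_metric`, `accuracy` and `f1` are hypothetical names, and the brute-force loop over pairs is chosen for clarity, not speed.

```python
from itertools import combinations


def pairwise_clustering_metric(labels_true, labels_pred, binary_metric):
    """Apply a binary classification metric to all unordered sample pairs."""
    tn = fp = fn = tp = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            tp += 1
        elif same_pred:
            fp += 1  # grouped together by the prediction only
        elif same_true:
            fn += 1  # grouped together by the reference only
        else:
            tn += 1
    return binary_metric(tn, fp, fn, tp)


def accuracy(tn, fp, fn, tp):
    # Pairwise accuracy is the unadjusted Rand Index.
    total = tn + fp + fn + tp
    return (tp + tn) / total if total else 1.0


def f1(tn, fp, fn, tp):
    # The value returned for the degenerate case is an assumption.
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0
```

With this framing, `pairwise_clustering_metric(y_true, y_pred, accuracy)` gives the Rand Index and swapping in `f1` gives a pairwise F1.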

@ufmayer
Contributor Author

ufmayer commented Jun 2, 2020

It currently passes the linter but fails to build fully: the test code expects the new methods to be part of the package, and since they are not yet merged, they are not found. That seems like a generic chicken-and-egg problem, but maybe I am misunderstanding how this is set up.

sklearn/tests/test_supervised.py:6: error: Module 'sklearn.metrics' has no attribute 'pair_confusion_matrix'
sklearn/tests/test_supervised.py:6: error: Module 'sklearn.metrics' has no attribute 'rand_score'
sklearn/metrics/cluster/tests/test_common.py:8: error: Module 'sklearn.metrics.cluster' has no attribute 'rand_score'

@ufmayer
Contributor Author

ufmayer commented Jun 11, 2020

@jnothman I made the requested changes. I haven't seen any update on this for a week, and I don't know how fast these things usually progress. Is there anything else you are waiting on from me?

@cmarmo
Contributor

cmarmo commented Jun 12, 2020

Hi @ufmayer, thanks for your work. Some checks didn't pass, do you mind having a look at them? Thanks!

@cmarmo
Contributor

cmarmo commented Jun 12, 2020

Please forgive me, @ufmayer: you had already noticed the unsuccessful checks, so my comment wasn't helpful.
The mypy check tells you that rand_score and pair_confusion_matrix should be exposed in sklearn/metrics/cluster/__init__.py so that they can be imported correctly. Doing that will let the build know where the new functions are.
Hope this helps you move forward.
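
To illustrate why the export matters, here is a self-contained sketch that builds two throwaway packages and checks which one exposes a function at package level. Everything in it is made up for the demonstration (`demo_without_export`, `demo_with_export`, the `_impl` module and its stub `rand_score`); it does not reproduce the actual scikit-learn layout.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def make_package(root, name, init_source):
    """Create a throwaway package with one private implementation module."""
    pkg_dir = Path(root) / name
    pkg_dir.mkdir()
    (pkg_dir / "_impl.py").write_text("def rand_score():\n    return 1.0\n")
    (pkg_dir / "__init__.py").write_text(init_source)


with tempfile.TemporaryDirectory() as root:
    # Same implementation in both packages; only __init__.py differs.
    make_package(root, "demo_without_export", "")
    make_package(
        root,
        "demo_with_export",
        "from ._impl import rand_score\n__all__ = ['rand_score']\n",
    )
    sys.path.insert(0, root)
    try:
        importlib.invalidate_caches()
        without_export = importlib.import_module("demo_without_export")
        with_export = importlib.import_module("demo_with_export")
        # The function exists in _impl either way, but only the package
        # that re-exports it has it as a package-level attribute.
        missing = not hasattr(without_export, "rand_score")
        exported = with_export.rand_score() == 1.0
    finally:
        sys.path.remove(root)

print(missing, exported)
```

A static checker such as mypy reports the same situation without running anything, which is what the "has no attribute" errors quoted earlier in the thread point to.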

@ufmayer
Contributor Author

ufmayer commented Jun 12, 2020

@cmarmo Merci beaucoup. That was the information I needed. All checks have now passed. What's the next step?

@cmarmo
Contributor

cmarmo commented Jun 12, 2020

@ufmayer, je vous en prie... :)

What's the next step?

'Waiting for reviewer' ... :) , pinging from time to time so that the pull request doesn't get lost. It is also useful to check whether other merges have created conflicts, and to fix them when they arise.
Last suggestion: you can add [MRG] to the PR title (as specified in the documentation) to signal that your PR is ready for review. Good luck!

@ufmayer ufmayer changed the title from "Adding an implementation of unadjusted Rand Index" to "[MRG] Adding an implementation of unadjusted Rand Index" on Jun 12, 2020
@jnothman
Member

jnothman commented Jun 13, 2020 via email

@ufmayer
Contributor Author

ufmayer commented Jun 23, 2020

@jnothman Thanks for saying you might find time to give this another shot. I think it's ready to go. How about it?

@ufmayer
Contributor Author

ufmayer commented Sep 1, 2020

@jnothman @cmarmo It's been 3 months today since I filed this PR. Is there anything you could suggest to move this along?

@jnothman jnothman added this to the 0.24 milestone Sep 1, 2020
@jnothman
Member

jnothman commented Sep 1, 2020

I've tagged it for the current release milestone, and hopefully it will be picked up by another reviewer as we work towards release over the next 6 weeks.

Sorry for the delays, but there's a lot to review!

@ufmayer
Contributor Author

ufmayer commented Sep 1, 2020

@jnothman Thanks!

@glemaitre glemaitre self-requested a review October 15, 2020 08:01
Member

@glemaitre glemaitre left a comment

First round of reviews.


"""
labels_true, labels_pred = check_clusterings(labels_true, labels_pred)
n_samples = np.int64(labels_true.shape[0])
Member

Suggested change
- n_samples = np.int64(labels_true.shape[0])
+ n_samples = labels_true.shape[0]

Contributor Author

I left this unchanged for now. All of the calculations below use np.int64, and the returned ndarray has dtype np.int64; this cast makes that explicit. The cast would otherwise happen implicitly anyway, but I think spelling it out makes it unambiguous. I am not opposed to the change, but if you don't feel strongly, I would prefer to let it stand.
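
Some back-of-the-envelope arithmetic shows why the integer width matters here. The sketch assumes that the pair counting works with quantities on the order of `n_samples ** 2` (the number of ordered pairs), which is not visible in the excerpt above. Python integers never overflow, so the code only compares values against the signed 32-bit and 64-bit limits; the helper names are hypothetical.

```python
import math

INT32_MAX = 2**31 - 1
INT64_MAX = 2**63 - 1


def fits(value, max_value):
    """Whether a non-negative count is representable below the given limit."""
    return 0 <= value <= max_value


def largest_n_with_square_below(max_value):
    # math.isqrt returns floor(sqrt(max_value)), i.e. the largest n
    # such that n * n still fits.
    return math.isqrt(max_value)


print(largest_n_with_square_below(INT32_MAX))  # sample count limit for 32 bits
print(largest_n_with_square_below(INT64_MAX))  # sample count limit for 64 bits
```

Under that assumption, a 32-bit count stops being safe at a few tens of thousands of samples, while a 64-bit count leaves room for billions.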

Member

@glemaitre glemaitre left a comment

Some remaining comments.
But in general it looks good.

@glemaitre glemaitre changed the title from "[MRG] Adding an implementation of unadjusted Rand Index" to "FEA Add Rand Index and pair confusion matrix" on Oct 20, 2020
@glemaitre
Member

@ufmayer You can ignore the MacOS build failure. We are currently investigating it.
However, the other potential failure should be looked at :)

@ufmayer
Contributor Author

ufmayer commented Oct 21, 2020

@glemaitre Thank you for the thorough review. I implemented the majority of your suggestions and resolved the corresponding auto-generated conversations in this PR. Please have a look at the remaining ones; I left a brief comment on each one that I didn't resolve.

Member

@glemaitre glemaitre left a comment

Here are the remaining comments. Just some nitpicks. Once they are done, this is mergeable :)

@@ -113,14 +108,20 @@ def contingency_matrix(labels_true, labels_pred, *, eps=None, sparse=False):

.. versionadded:: 0.18

dtype : numeric data type, default=np.int.
See the notes on ``eps`` below.
Member

Can we specify the following instead (if I am not misreading the note below)?

Output dtype. Ignored if `eps` is not `None`.

@ufmayer
Contributor Author

ufmayer commented Oct 22, 2020

@glemaitre Done.

@cmarmo
Contributor

cmarmo commented Oct 26, 2020

@glemaitre, ready for a second approval? Thanks!

@glemaitre glemaitre merged commit 662cc64 into scikit-learn:master Oct 28, 2020
@glemaitre
Member

Thanks @ufmayer. LGTM, merging.
