FEA Add Rand Index and pair confusion matrix #17412

ufmayer · 2020-06-01T17:46:31Z

Reference Issues/PRs

Adding new functionality

What does this implement/fix? Explain your changes.

Two changes:

contingency_matrix() currently overflows its integer counts on large inputs, for example when adjusted_rand_score is called on large clusterings. This PR adds an optional dtype parameter so the counts can be stored as np.int64, which avoids the overflow.
Adds computation of the unadjusted Rand index (https://en.wikipedia.org/wiki/Rand_index). The computation is based on an efficient pair-comparison helper, similar to the one used by the existing adjusted_rand_score function.
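
As a rough illustration of the pair-counting idea described above, the sketch below derives the 2x2 pair confusion matrix and the unadjusted Rand index from contingency counts instead of enumerating every pair. It is a pure-Python approximation written for this note, not the code from the PR; the helper names `pair_confusion` and `rand_index` are hypothetical. Python integers have arbitrary precision, so the overflow that motivates the dtype parameter cannot occur here.

```python
from collections import Counter


def pair_confusion(labels_true, labels_pred):
    """Return [[C00, C01], [C10, C11]] counted over ordered sample pairs.

    C11: pairs placed together in both labelings.
    C00: pairs separated in both labelings.
    C10: together in labels_true, separated in labels_pred.
    C01: separated in labels_true, together in labels_pred.
    """
    n = len(labels_true)
    if n != len(labels_pred):
        raise ValueError("labelings must have the same length")
    # Contingency cells and their row/column totals.
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells_sq = sum(c * c for c in cells.values())
    sum_true_sq = sum(c * c for c in Counter(labels_true).values())
    sum_pred_sq = sum(c * c for c in Counter(labels_pred).values())
    c11 = sum_cells_sq - n
    c10 = sum_true_sq - sum_cells_sq
    c01 = sum_pred_sq - sum_cells_sq
    c00 = n * n - c01 - c10 - sum_cells_sq
    return [[c00, c01], [c10, c11]]


def rand_index(labels_true, labels_pred):
    """Fraction of ordered pairs on which the two labelings agree."""
    (c00, c01), (c10, c11) = pair_confusion(labels_true, labels_pred)
    total = c00 + c01 + c10 + c11
    if total == 0:
        # Fewer than two samples: treat the labelings as identical.
        return 1.0
    return (c00 + c11) / total


matrix = pair_confusion([0, 0, 1, 1], [0, 0, 1, 2])
score = rand_index([0, 0, 1, 1], [0, 0, 1, 2])
```

Counting ordered pairs doubles every cell compared with unordered pairs, which leaves the ratio unchanged.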

Any other comments?

The Rand index is essentially the accuracy of a clustering measured over pairs of samples, without adjustment for chance. It is a standard measure for comparing clusterings.

jnothman · 2020-06-02T11:48:44Z

Thanks for the PR.
The linter is unhappy.

At some point (I can't find the issue) I proposed that we have a function which turns a binary classification metric into a clustering metric via the pairwise transformation. In that framework, this would just be accuracy... Not sure if we want to go about implementing the generic solution. I know that in some spaces, F1 at least is calculated on pairs of clustered elements.
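
The pairwise transformation mentioned above can be sketched as follows: every unordered pair of samples becomes one binary label ("same cluster" or not), and any binary classification metric is then applied to the two resulting label vectors. This is only an illustration of the idea; the names `pairwise_labels`, `pairwise_clustering_metric`, `accuracy` and `f1` are hypothetical, the metrics are minimal stand-ins rather than library functions, and the brute-force enumeration is quadratic in the number of samples.

```python
from itertools import combinations


def pairwise_labels(labels):
    """One boolean per unordered pair: True if both samples share a cluster."""
    return [labels[i] == labels[j]
            for i, j in combinations(range(len(labels)), 2)]


def accuracy(y_true, y_pred):
    """Share of matching entries; needs at least one pair (two samples)."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1(y_true, y_pred):
    """F1 of the positive ("same cluster") class; 0.0 if it never occurs."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def pairwise_clustering_metric(binary_metric):
    """Turn a binary classification metric into a clustering metric."""
    def metric(labels_true, labels_pred):
        return binary_metric(pairwise_labels(labels_true),
                             pairwise_labels(labels_pred))
    return metric


rand_index_bruteforce = pairwise_clustering_metric(accuracy)
pair_f1 = pairwise_clustering_metric(f1)

ri = rand_index_bruteforce([0, 0, 1, 1], [0, 0, 1, 2])
pf1 = pair_f1([0, 0, 1, 1], [0, 0, 1, 2])
```

In this framing the Rand index is the accuracy case, and pairwise F1 falls out of the same wrapper.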

(Another generic class of clustering metrics is the max-sum assignment, which here we only implement as the consensus score for biclustering.)

sklearn/metrics/cluster/_supervised.py

ufmayer · 2020-06-02T20:59:33Z

Currently it passes the linter but it fails to fully build because the test code is looking for the new methods to be part of the package, but as they are not yet merged they are not found. That seems like a generic chicken-and-egg problem, but maybe I am misunderstanding something on how this is set up.

sklearn/tests/test_supervised.py:6: error: Module 'sklearn.metrics' has no attribute 'pair_confusion_matrix'
sklearn/tests/test_supervised.py:6: error: Module 'sklearn.metrics' has no attribute 'rand_score'
sklearn/metrics/cluster/tests/test_common.py:8: error: Module 'sklearn.metrics.cluster' has no attribute 'rand_score'

ufmayer · 2020-06-11T21:49:41Z

@jnothman I made the requested changes. I haven't seen any update on this for a week, and I don't know how fast these things usually progress. Is there anything else you are waiting on from me?

cmarmo · 2020-06-12T08:28:23Z

Hi @ufmayer, thanks for your work. Some checks didn't pass, do you mind having a look at them? Thanks!

cmarmo · 2020-06-12T12:33:17Z

Please, forgive me @ufmayer you already noticed the unsuccessful checks, my comment wasn't helpful.
The mypy check is telling you that rand_score and pair_confusion_matrix need to be exposed in sklearn/metrics/cluster/__init__.py so that they can be imported correctly. Doing that lets the build find the new functions.
Hope this helps you move forward.
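
The effect of the missing export can be reproduced at runtime with a throwaway package. The sketch below is an assumption-laden demo, not scikit-learn code: the package names `demo_without_export` and `demo_with_export` and the stub `rand_score` are invented here, and mypy reports the problem statically, whereas this demo only checks the attribute after import.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def make_package(root, name, export):
    """Create a tiny package whose function lives in a private submodule."""
    pkg = Path(root) / name
    pkg.mkdir()
    (pkg / "_supervised.py").write_text(
        "def rand_score(labels_true, labels_pred):\n    return 1.0\n"
    )
    # Only the exporting variant re-exports the function in __init__.py.
    init_src = (
        "from ._supervised import rand_score\n\n__all__ = ['rand_score']\n"
        if export else ""
    )
    (pkg / "__init__.py").write_text(init_src)


def exposes_rand_score(name):
    """Import the package and report whether rand_score is an attribute."""
    importlib.invalidate_caches()
    module = importlib.import_module(name)
    return hasattr(module, "rand_score")


with tempfile.TemporaryDirectory() as root:
    sys.path.insert(0, root)
    try:
        make_package(root, "demo_without_export", export=False)
        make_package(root, "demo_with_export", export=True)
        found_without = exposes_rand_score("demo_without_export")
        found_with = exposes_rand_score("demo_with_export")
    finally:
        sys.path.remove(root)
```

Only the variant whose `__init__.py` imports the function makes it visible on the package.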

ufmayer · 2020-06-12T18:58:15Z

@cmarmo Merci beaucoup. That was the information I needed. All checks have now passed. What's the next step?

cmarmo · 2020-06-12T19:29:21Z

@ufmayer, je vous en prie... :)

What's the next step?

'Waiting for reviewer' ... :) Ping from time to time so the pull request doesn't get lost. It is also useful to check whether other merges have introduced conflicts, and to fix them as they arise.
Last suggestion: you can add [MRG] to the PR title (as described in the documentation) to signal that your PR is ready for review. Good luck!

jnothman · 2020-06-13T22:48:41Z

Sorry, my availability lately has been very scattered. I commented on this because I was interested in it, not because I had excess time for review! :D I'll try to give it another shot.

ufmayer · 2020-06-23T22:00:39Z

@jnothman Thanks for saying you might find time to give this another shot. I think it's ready to go. How about it?

ufmayer · 2020-09-01T19:18:16Z

@jnothman @cmarmo It's been 3 months today since I filed this PR. Is there anything you could suggest to move this along?

jnothman · 2020-09-01T21:48:33Z

I've tagged it for the current release milestone, and hopefully it will be picked up by another reviewer as we work towards release over the next 6 weeks.

Sorry for the delays, but there's a lot to review!

ufmayer · 2020-09-01T21:51:48Z

@jnothman Thanks!

glemaitre

First round of reviews.

doc/modules/clustering.rst

sklearn/metrics/cluster/_supervised.py

glemaitre · 2020-10-20T14:16:37Z

sklearn/metrics/cluster/_supervised.py

+
+    """
+    labels_true, labels_pred = check_clusterings(labels_true, labels_pred)
+    n_samples = np.int64(labels_true.shape[0])


Suggested change

-    n_samples = np.int64(labels_true.shape[0])
+    n_samples = labels_true.shape[0]

ufmayer · 2020-10-21

I left this unchanged for now. All the calculations below use np.int64, and the returned ndarray has dtype np.int64, so this cast makes that explicit. The cast would otherwise happen implicitly anyway, but I think spelling it out removes any ambiguity. I am not opposed to the change, but if you don't feel strongly I would prefer to let it stand.
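
To see why a 64-bit type matters for these counts, note that the pair counts grow with the square of the number of samples. The sketch below simulates signed 32-bit wraparound in pure Python; it assumes the wraparound behavior of fixed-width integers and does not call NumPy, and `wrap_int32` is a hypothetical helper introduced for this illustration.

```python
INT32_MAX = 2**31 - 1
INT64_MAX = 2**63 - 1


def wrap_int32(value):
    """Value a signed 32-bit integer would hold after silent wraparound."""
    return (value + 2**31) % 2**32 - 2**31


n_samples = 100_000
exact = n_samples * n_samples      # 10_000_000_000 as an exact Python int
wrapped = wrap_int32(exact)        # what a 32-bit counter would store
overflows_int32 = exact > INT32_MAX
fits_int64 = exact <= INT64_MAX
```

With 100,000 samples the squared count already exceeds the 32-bit range, while it remains far below the 64-bit limit.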

sklearn/metrics/cluster/_supervised.py

glemaitre

Some remaining comments.
But in general it looks good.

sklearn/metrics/cluster/_supervised.py

sklearn/metrics/cluster/tests/test_supervised.py

glemaitre · 2020-10-21T18:07:15Z

@ufmayer You can ignore the MacOS build failure. We are currently investigating it.
However, the other potential failure should be looked at :)

ufmayer · 2020-10-21T22:44:47Z

@glemaitre Thank you for the thorough review. I implemented the majority of your suggestions and resolved the corresponding auto-generated conversations in this PR. Please have a look at the remaining ones; I left a brief comment on each one that I didn't resolve.

doc/modules/clustering.rst

glemaitre

These are the remaining comments, just some nitpicks. Once they are addressed, this is mergeable :)

sklearn/metrics/cluster/_supervised.py

glemaitre · 2020-10-22T07:40:50Z

sklearn/metrics/cluster/_supervised.py

@@ -113,14 +108,20 @@ def contingency_matrix(labels_true, labels_pred, *, eps=None, sparse=False):

        .. versionadded:: 0.18

+    dtype : numeric data type, default=np.int.
+        See the notes on ``eps`` below.


Can we write the following instead (assuming I am reading the note below correctly)?

Output dtype. Ignored if `eps` is not `None`.
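
The interaction between dtype and eps described in this suggestion can be pictured with a small stand-in. The function below is a hypothetical pure-Python re-implementation named `contingency_counts`, not scikit-learn's contingency_matrix; it assumes eps is a float that is added to every cell, which turns the counts into floats and leaves nothing for dtype to apply to.

```python
def contingency_counts(labels_true, labels_pred, eps=None, dtype=int):
    """Nested-list contingency table: rows are classes, columns are clusters."""
    classes = sorted(set(labels_true))
    clusters = sorted(set(labels_pred))
    row = {c: i for i, c in enumerate(classes)}
    col = {k: j for j, k in enumerate(clusters)}
    table = [[0] * len(clusters) for _ in classes]
    for t, p in zip(labels_true, labels_pred):
        table[row[t]][col[p]] += 1
    if eps is not None:
        # Adding a float eps makes every cell a float, so dtype is ignored.
        return [[count + eps for count in r] for r in table]
    return [[dtype(count) for count in r] for r in table]


as_int = contingency_counts([0, 0, 1, 1], [0, 0, 1, 2])
as_float = contingency_counts([0, 0, 1, 1], [0, 0, 1, 2], dtype=float)
smoothed = contingency_counts([0, 0, 1, 1], [0, 0, 1, 2], eps=0.5, dtype=int)
```

In the last call dtype=int has no effect because eps is set.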

sklearn/metrics/cluster/_supervised.py

ufmayer · 2020-10-22T17:48:53Z

@glemaitre Done.

cmarmo · 2020-10-26T15:44:13Z

@glemaitre, ready for a second approval? Thanks!

glemaitre · 2020-10-28T13:28:31Z

Thanks @ufmayer. LGTM, merging.

ufmayer added 2 commits June 1, 2020 10:35

Add dtype arg to contingency_matrix to avoid int overflow for larger …

82db6bc

…matrices

Add pair_confusion_matrix and rand_score

b5416ae

github-actions bot added the module:metrics label Jun 1, 2020

jnothman reviewed Jun 2, 2020

ufmayer added 9 commits June 2, 2020 12:38

Added tests for rand_score and pair_confusion_matrix

365b0d6

Changed return type of pair_confusion_matrix from list to array

88e5cca

Removed some superfluous code comments from pair_confusion_matrix

44f2e10

Remove doctest: +ELLIPSIS from comments

a291419

Break up long lines

cc62595

Break up long lines

c521412

Remove duplicate imports

47a981d

Whitespace changes for linter

ab77648

Another whitespace change for linter

026bcca

ufmayer added 6 commits June 12, 2020 09:43

Fixed bugs in test_supervised

fded1c0

Merged tests/test_supervised.py into metrics/cluster/tests/test_super…

f4f43be

…vised.py

Added rand_score and pair_confusion_matrix to various init, test, and…

d02f9fb

… scorer files

added rand_score and pair_confusion_matrix to __all__ in __init__

0404d9a

changed whitespace to make linter happy

5d5ada9

Update comments in pair_confusion_matrix for doctest

f0d4efa

cmarmo added the Waiting for Reviewer label Jun 12, 2020

ufmayer changed the title ~~Adding an implementation of unadjusted Rand Index~~ [MRG] Adding an implementation of unadjusted Rand Index Jun 12, 2020

ufmayer added 3 commits August 17, 2020 08:31

Merged in master from from the upstream repository

16c7e2b

Re-establish order of 0.24 changes in master

daf32ec

Merged in master from from the upstream repository

cf5ca69

jnothman added this to the 0.24 milestone Sep 1, 2020

ufmayer added 2 commits September 28, 2020 15:16

Merged in master from from the upstream repository

10cca87

Merge branch 'master' of https://github.com/scikit-learn/scikit-learn …

95080f2

…into master

glemaitre self-requested a review October 15, 2020 08:01

Merged in master from from the upstream repository

7335fdc

glemaitre reviewed Oct 20, 2020

glemaitre changed the title ~~[MRG] Adding an implementation of unadjusted Rand Index~~ FEA Add Rand Index and pair confusion matrix Oct 20, 2020

ufmayer added 2 commits October 21, 2020 10:49

Implement changes suggested by reviewer

470c1ef

Fix whitespace for linter

5b0feac

ufmayer added 4 commits October 21, 2020 11:28

Fix typo

f900092

Fix doc formatting

bafd69d

Remove tests replaced by parameterized implementation

0516ddd

Combine unadjusted and adjusted Rand index doc

132cd88

glemaitre reviewed Oct 22, 2020

glemaitre reviewed Oct 22, 2020

cmarmo removed the Waiting for Reviewer label Oct 22, 2020

Merging in latest changes requested by reviewers

22637bb

glemaitre merged commit 662cc64 into scikit-learn:master Oct 28, 2020
