Skip to content

FIX Remove warnings when fitting a dataframe #21578

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

thomasjpfan
Copy link
Member

@thomasjpfan thomasjpfan commented Nov 6, 2021

Reference Issues/PRs

Fixes #21577
Fixes #21618

What does this implement/fix? Explain your changes.

This PR removes input validation for methods that already validate the input. The common test was updated to catch warnings during fit.

@thomasjpfan thomasjpfan added this to the 1.0.2 milestone Nov 6, 2021
Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. If @jeremiedbb calls look at it since we work on this before the release but did not think about these methods.

I assume that this is relatively difficult to have a general test since it would depend on some parameters for some estimators. We might still have such of corner case for some of the estimators.

@@ -512,7 +513,7 @@ def _compute_oob_predictions(self, X, y):
(n_samples, 1, n_outputs)
The OOB predictions.
"""
X = self._validate_data(X, dtype=DTYPE, accept_sparse="csr", reset=False)
X = check_array(X, dtype=DTYPE, accept_sparse="csr", copy=False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like we either want to validate the data or not. If this is a private method and we know validation is done somewhere else, why do we need to call check_array?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, check_array is required to convert the CSC matrix (during fit) into a CSR matrix for prediction. I'm undecided on who would be responsibility for this. It's either the caller of _compute_oob_predictions or _compute_oob_predictions itself.

In any case, I updated the PR with a comment regarding this behavior.

@@ -745,4 +749,4 @@ def _global_clustering(self, X=None):
self.subcluster_labels_ = clusterer.fit_predict(self.subcluster_centers_)

if compute_labels:
self.labels_ = self.predict(X)
self.labels_ = self._predict(X)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wouldn't a cleaner API be like self.predict(X, validate_input=False), or self.validate_input(predict=False).predict(X)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.predict(X, validate_input=False)

I don't think introducing a public arg for that is cleaner. I find it clean that we use a private function internally and expose a public function that does extra validation.

self.validate_input(predict=False).predict(X)

I don't get the predict=False arg, could explain more what you have in mind ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are other reasons why a public, or a developer API, would be nice to have when it comes to [skipping] validation: #16653 (comment)

The predict=False would kinda set a flag in the estimator to skip the validation in a certain method.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The predict=False would kinda set a flag in the estimator to skip the validation in a certain method.

I think adding more state to the estimator after __init__ is outside the scope of this PR, but we can use this PR as a motivation to do it. It would kind of be like "inference mode".

self.predict(X, validate_input=False)

I think it would be very nice to have this type of kwarg everywhere. It would be similar to the check_finite flag in SciPy. (Every year I see the "Scikit-learn is slow during prediction" and it comes down to the validation we do.)

In both cases, I do not think we should change public API with a bug fix PR.

@adrinjalali
Copy link
Member

I find the introduction of a private method like _transform in this PR just to handle cases where we want validation and cases where we don't, quite a hacky solution.

I think this would be a nice trigger to introduce the developer or public API we've talked about, to disable input validation in these estimators. Not sure what others think @scikit-learn/core-devs

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I agree that having a generic public API to disable validation checks at inference is beyond the scope of this bugfix PR and just calling ad-hoc private functions when necessary is good enough for now.

I just have a question:

"""
# Prediction requires X to be in CSR format
if issparse(X):
X = check_array(X, accept_sparse="csr", force_all_finite=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why force_all_finite=True here if input validation has already been performed in the caller?

Wouldn't the following be enough?

Suggested change
X = check_array(X, accept_sparse="csr", force_all_finite=True)
X = X.tocsr()

@ogrisel
Copy link
Member

ogrisel commented Nov 15, 2021

Also, this comment about adding a comment to explain the test has not been addressed: https://github.com/scikit-learn/scikit-learn/pull/21578/files#r745491797

@thomasjpfan
Copy link
Member Author

I updated PR with comment and suggestion.

@ogrisel
Copy link
Member

ogrisel commented Nov 19, 2021

Some test froze in the macos CI and triggered the 60 minutes timeout. I pushed another commit to check whether this is deterministically happening in this PR or a rare random event.

@ogrisel
Copy link
Member

ogrisel commented Nov 19, 2021

Here is where it happened in the run for the 5c22dec commit:

[...]
......................................s.s..s........s.s.s.......s.s.s... [ 47%]
....s.s.s.......s.s.s.......s.s.s....................................... [ 47%]
............
##[error]The operation was canceled.

https://dev.azure.com/scikit-learn/scikit-learn/_build/results?buildId=35125&view=logs&j=97641769-79fb-5590-9088-a30ce9b850b9&t=4745baa1-36b5-56c8-9a8e-6480742db1a6

@lorentzenchr
Copy link
Member

I find the introduction of a private method like _transform in this PR just to handle cases where we want validation and cases where we don't, quite a hacky solution.

I think this would be a nice trigger to introduce the developer or public API we've talked about, to disable input validation in these estimators. Not sure what others think @scikit-learn/core-devs

Should we open an issue for this to be discussed? Or is there already one?

@adrinjalali
Copy link
Member

@lorentzenchr we could open a new issue, some background is present here: #16653

Copy link
Member

@jjerphan jjerphan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thank you, @thomasjpfan.

I agree with you: this PR is fine as is and a general API is to be introduced in another.
And as always, Julien is nitpicking with some tiny suggestions when nothing has to fundamentally be changed.

@lorentzenchr
Copy link
Member

@lorentzenchr we could open a new issue, some background is present here: #16653

Done in #21804.

Copy link
Member

@jjerphan jjerphan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK with the follow-up PR for MLP and to merge it after resolving problem on the CI.

@jjerphan jjerphan merged commit 071f98f into scikit-learn:main Nov 29, 2021
samronsin pushed a commit to samronsin/scikit-learn that referenced this pull request Nov 30, 2021
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Dec 24, 2021
glemaitre pushed a commit that referenced this pull request Dec 25, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
7 participants