Skip to content

RFC: Cloning estimators in pipeline #8157

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
GaelVaroquaux opened this issue Jan 5, 2017 · 24 comments
Open

RFC: Cloning estimators in pipeline #8157

GaelVaroquaux opened this issue Jan 5, 2017 · 24 comments

Comments

@GaelVaroquaux
Copy link
Member

This is an RFC on the lack of cloning of estimators in the pipeline.

Current situation

The Pipeline modifies in the fit method the estimators that are given in it's steps argument, in violation of the scikit-learn convention (bad bad coder).

Specific issue raised

To implement caching of the fit of transformers reliably, cloning of them is crucial (#7990), this way the caching is dependent only on the model parameters of the transformers.

In the PR (#7990), the first attempt was to implement a new class, CachedPipeline that would behave like the Pipeline but clone the transformers and cache them. The drawback is the multiplication of classes that are very similar, which makes features harder to discover and give surprises as there is a subttle difference between Pipeline and CachedPipeline.

Proposal

The proposal (put forward IRL by @ogrisel, but that has my favor too) is to deprecate the fact that pipeline.steps is modified and introduce a .steps_ attribute (and a .named_steps_ attribute). That way we could merge Pipeline and CachedPipeline.

Implementation

The difficulty is the deprecation path, as often. Two options:

  1. Make steps and named_steps be properties, and add a warning upon access. Make it so that, for two releases, they return the modified estimators, ie what is stored in steps_ and named_steps_. In two releases, named_steps dies in favor of named_steps_, and we remove the properties.

  2. Create a new class, for instance Pipe, that has the new logic with optional caching.

I have no strong opinion on which deprecation path is best. Option 1 is more work but may less intrusive on the users.

@GaelVaroquaux
Copy link
Member Author

Cc @glemaitre @ogrisel. I'd also really like to get feedback from @jnothman and @amueller

@ogrisel
Copy link
Member

ogrisel commented Jan 5, 2017

I think it would be possible to implement caching without renaming the steps attribute (as @glemaitre did) but I think it would be cleaner to respect our own API convention. I think we can untie the two issues but it would probably be more natural to only store the fitted estimator in the named_steps_ attribute as you suggest.

@ogrisel
Copy link
Member

ogrisel commented Jan 5, 2017

To make things more explicit, I am 👍 with proposal 1. My point is that we could decide to implement before or after we merge caching support (#7990).

@glemaitre
Copy link
Member

glemaitre commented Jan 5, 2017

less intrusive on the users

+1 for 1.

@jnothman
Copy link
Member

jnothman commented Jan 6, 2017

I think this change will add some confusion for new and existing users, but I agree it is consistent with our API principles. Having tried to implement a memoizing pipeline too, I appreciate its value. Overall, I am +1.

In terms of deprecation, I would only warn when steps or named_steps is accessed after fit. Note that it is also accessed by get_/set_params, and indeed, get_params()[step_name] currently returns a fitted step; in the future it will only return the pre-cloned step.

Note also that we could use an OrderedDict for steps_ and not worry about implementing named_steps_ separately. Not sure this provides real benefit.

@agramfort
Copy link
Member

no objection from my side. I prefer 1 as it avoids a new class.

thx guys for taking a stab at this

@GaelVaroquaux
Copy link
Member Author

OK, so consensus is to go for approach 1.

I suggest to first implement caching support in a single object, finishing #7990 (cc @glemaitre ) and then do a PR for deprecating the access to steps.

@glemaitre
Copy link
Member

PR #7990 is ready for review with a single object.

@ogrisel
Copy link
Member

ogrisel commented Jan 17, 2017

Actually I am steps is a constructor parameter so we should not remove it in the future but instead keep storing the unchanged estimators rather than a reference to the fitted estimators.

@GaelVaroquaux
Copy link
Member Author

GaelVaroquaux commented Jan 17, 2017 via email

@jnothman
Copy link
Member

jnothman commented Jan 18, 2017

One of my concerns with making Pipeline conform to the API in this way is that formerly it was possible to take steps from a fitted pipeline and create a new pipeline consisting of a range of those steps.

pre_final_pipeline = Pipeline(full_pipeline.steps[:-1])
pre_final_pipeline.transform(X)

It is now also possible to do

full_pipeline.set_params(last_step=None).transform(X)

to similar effect. With caching

full_pipeline.set_params(last_step=None).fit(X_train).transform(X)

would work, but

full_pipeline.set_params(last_step=None)

would put the pipeline into an unusual state, I suppose.

Either way, constructing pipelines from pre-fitted components is something users could do. Under the proposal of this issue, it's no longer an option. Do we provide a method to do it? (We've previously considered using indexing for this -- pre_final_pipeline = full_pipeline[:-1] -- but shied away from such magic.)

@ogrisel
Copy link
Member

ogrisel commented Jan 19, 2017

In that case we can decide to keep the fitted estimators as a steps_ attribute to get:

pre_final_pipeline = Pipeline(full_pipeline.steps_[:-1])
pre_final_pipeline.transform(X)

or alternatively we could use an OrderedDict for named_steps_ and get the following to work although it's more verbose:

pre_final_pipeline = Pipeline(list(full_pipeline.named_steps_.items())[:-1])
pre_final_pipeline.transform(X)

I don't understand the cases with last_step. Where does this parameter come from?

@jnothman
Copy link
Member

jnothman commented Jan 19, 2017 via email

@jnothman
Copy link
Member

jnothman commented Jan 19, 2017 via email

@jnothman
Copy link
Member

jnothman commented Jan 19, 2017 via email

@ogrisel
Copy link
Member

ogrisel commented Jan 19, 2017

That doesn't work, because the Pipeline construct will set steps, not steps_

Indeed...

Those are very good remarks. Maybe we could introduce a warm_start=True param for the pipeline itself. But the semantics of that guy might become complex. That requires more thinking.

@ogrisel
Copy link
Member

ogrisel commented Jan 19, 2017

More simply we could just make the Pipeline.transform method use the steps attribute (constructor param) if the named_steps_ attribute has not been set yet.

@ogrisel
Copy link
Member

ogrisel commented Jan 24, 2017

@jnothman what do you think about my last comment?

@jnothman
Copy link
Member

I don't hate it, but it's a bit magic, and again sets Pipeline apart from other estimator conventions.

@glemaitre
Copy link
Member

What about using make_pipeline to create a "proper" (deep-copy?) Pipeline instance given the steps to keep.

full_pipeline.set_params(last_step=None) seems convenient but the in-place modifications of the meta-estimator bother me :)

@jnothman
Copy link
Member

jnothman commented Jan 25, 2017 via email

@glemaitre
Copy link
Member

I started the PR #8350 to have a concrete example to play with.

@amueller
Copy link
Member

amueller commented Jun 5, 2017

We should discuss this in person - I need to read your proposal again first, though.

@jnothman
Copy link
Member

jnothman commented Sep 6, 2017

I note that some current users of Pipeline are liable to have things that are impossible to clone, e.g. parameters that refuse to be deepcopied (a spaCy model instance is one case).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants