RFC: Cloning estimators in pipeline #8157
Comments
Cc @glemaitre @ogrisel. I'd also really like to get feedback from @jnothman and @amueller |
I think it would be possible to implement caching without renaming the |
To make things more explicit, I am 👍 with proposal 1. My point is that we could decide to implement before or after we merge caching support (#7990). |
+1 for 1. |
I think this change will add some confusion for new and existing users, but I agree it is consistent with our API principles. Having tried to implement a memoizing pipeline too, I appreciate its value. Overall, I am +1. In terms of deprecation, I would only warn when Note also that we could use an |
no objection from my side. I prefer 1 as it avoids a new class. thx guys for taking a stab at this |
OK, so consensus is to go for approach 1. I suggest to first implement caching support in a single object, finishing #7990 (cc @glemaitre ) and then do a PR for deprecating the access to |
PR #7990 is ready for review with a single object. |
Actually I am steps is a constructor parameter so we should not remove it in
the future but instead keep storing the unchanged estimators rather than a
reference to the fitted estimators.
+1
|
One of my concerns with making
pre_final_pipeline = Pipeline(full_pipeline.steps[:-1])
pre_final_pipeline.transform(X)
It is now also possible to do
full_pipeline.set_params(last_step=None).transform(X)
to similar effect. With caching,
full_pipeline.set_params(last_step=None).fit(X_train).transform(X)
would work, but
full_pipeline.set_params(last_step=None)
would put the pipeline into an unusual state, I suppose.
Either way, constructing pipelines from pre-fitted components is something users could do. Under the proposal of this issue, it's no longer an option. Do we provide a method to do it? (We've previously considered using indexing for this -- |
In that case we can decide to keep the fitted estimators as a steps_ attribute to get:
pre_final_pipeline = Pipeline(full_pipeline.steps_[:-1])
pre_final_pipeline.transform(X)
or alternatively we could use an OrderedDict for named_steps_ and get the following to work, although it's more verbose:
pre_final_pipeline = Pipeline(list(full_pipeline.named_steps_.items())[:-1])
pre_final_pipeline.transform(X)
I don't understand the cases with last_step. Where does this parameter come from? |
That doesn't work, because the Pipeline constructor will set steps, not steps_
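To make the objection above concrete, here is a minimal sketch using toy classes (ToyPipeline and Scaler are hypothetical stand-ins, not scikit-learn code). It assumes a pipeline that clones its steps in fit and stores the fitted copies in steps_, as proposed in this issue. Passing the fitted steps back to the constructor only populates steps, so the new object is still considered unfitted.

```python
import copy


class Scaler:
    """Toy transformer: divides by the maximum seen during fit."""

    def fit(self, X):
        self.max_ = max(X)
        return self

    def transform(self, X):
        return [x / self.max_ for x in X]


class ToyPipeline:
    """Hypothetical pipeline that clones its steps in fit (not scikit-learn's class)."""

    def __init__(self, steps):
        self.steps = steps  # constructor parameter, left untouched by fit

    def fit(self, X):
        self.steps_ = []
        for name, est in self.steps:
            fitted = copy.deepcopy(est).fit(X)  # fit a copy, never the original
            X = fitted.transform(X)
            self.steps_.append((name, fitted))
        return self

    def transform(self, X):
        if not hasattr(self, "steps_"):
            raise RuntimeError("pipeline is not fitted: steps_ is missing")
        for _, est in self.steps_:
            X = est.transform(X)
        return X


full = ToyPipeline([("scale", Scaler()), ("rescale", Scaler())]).fit([1.0, 2.0, 4.0])

# The fitted estimators end up in `steps`, not `steps_`, of the new object.
sub = ToyPipeline(full.steps_[:-1])
try:
    sub.transform([2.0])
    worked = True
except RuntimeError:
    worked = False
```

Under these assumptions `worked` ends up False, which is why a dedicated way to build a sub-pipeline from fitted steps comes up later in the thread.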
|
I meant, assuming the last step has name 'last_step'
|
(And we now support setting step estimators by their name)
|
Indeed... Those are very good remarks. Maybe we could introduce a |
More simply we could just make the |
@jnothman what do you think about my last comment? |
I don't hate it, but it's a bit magic, and again sets Pipeline apart from other estimator conventions. |
What about using make_pipeline to create a "proper" (deep-copy?) Pipeline instance given the steps to keep? full_pipeline.set_params(last_step=None) seems convenient, but the in-place modifications of the meta-estimator bother me :)
|
No, I think set_params should not modify the fitted steps. I am not entirely comfortable with ogrisel's proposal, though it could be used during a deprecation period. Rather, we should have a get_steps() method which extracts a pipeline with the selected fitted steps.
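A rough sketch of what such a get_steps() method might look like, on a toy class. No signature was agreed on in this thread, so the slice argument, ToyPipeline and Shift are all assumptions made for illustration, and this is not scikit-learn code.

```python
import copy


class Shift:
    """Toy transformer: subtracts the minimum seen during fit."""

    def fit(self, X):
        self.min_ = min(X)
        return self

    def transform(self, X):
        return [x - self.min_ for x in X]


class ToyPipeline:
    """Hypothetical pipeline keeping unfitted `steps` and fitted `steps_` apart."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, X):
        self.steps_ = []
        for name, est in self.steps:
            fitted = copy.deepcopy(est).fit(X)
            X = fitted.transform(X)
            self.steps_.append((name, fitted))
        return self

    def transform(self, X):
        for _, est in self.steps_:
            X = est.transform(X)
        return X

    def get_steps(self, index):
        """Hypothetical: return a new, already fitted pipeline over a slice of steps."""
        sub = ToyPipeline(self.steps[index])   # unfitted constructor parameters
        sub.steps_ = list(self.steps_[index])  # reuse the fitted estimators
        return sub


full = ToyPipeline([("a", Shift()), ("b", Shift())]).fit([3, 5])
pre_final = full.get_steps(slice(None, -1))  # everything but the last step
result = pre_final.transform([10])
```

The sub-pipeline shares the fitted estimators with the original rather than copying them. Whether it should copy is a design choice this sketch leaves open.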
|
I started the PR #8350 to have a concrete example to play with. |
We should discuss this in person - I need to read your proposal again first, though. |
I note that some current users of Pipeline are liable to have things that are impossible to clone, e.g. parameters that refuse to be deepcopied (a spaCy model instance is one case). |
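To illustrate the concern, a small sketch with made-up classes (HeavyModel, Tagger and naive_clone are hypothetical). It assumes cloning rebuilds an estimator from deep copies of its constructor parameters, which is roughly how I understand sklearn.base.clone to treat non-estimator parameters. A parameter that refuses to be deep-copied then makes the whole clone fail.

```python
import copy


class HeavyModel:
    """Stand-in for an object that refuses to be deep-copied (hypothetical)."""

    def __deepcopy__(self, memo):
        raise TypeError("this model cannot be deep-copied")


class Tagger:
    """Toy estimator holding the model as a constructor parameter."""

    def __init__(self, model):
        self.model = model

    def get_params(self):
        return {"model": self.model}


def naive_clone(estimator):
    """Rebuild an estimator from deep copies of its parameters (illustrative only)."""
    params = {k: copy.deepcopy(v) for k, v in estimator.get_params().items()}
    return type(estimator)(**params)


try:
    naive_clone(Tagger(HeavyModel()))
    clone_error = None
except TypeError as exc:
    clone_error = str(exc)
```

Such a Tagger can be fitted in place today, but it would break as soon as the pipeline starts cloning its steps.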
This is an RFC on the lack of cloning of estimators in the pipeline.

Current situation
The Pipeline modifies, in its fit method, the estimators that are given in its steps argument, in violation of the scikit-learn convention (bad bad coder).

Specific issue raised
To implement caching of the fit of transformers reliably, cloning them is crucial (#7990): this way the caching depends only on the model parameters of the transformers.
In the PR (#7990), the first attempt was to implement a new class, CachedPipeline, that would behave like the Pipeline but clone the transformers and cache them. The drawback is the multiplication of very similar classes, which makes features harder to discover and causes surprises, as there is a subtle difference between Pipeline and CachedPipeline.

Proposal
The proposal (put forward IRL by @ogrisel, but that has my favor too) is to deprecate the fact that pipeline.steps is modified and introduce a .steps_ attribute (and a .named_steps_ attribute). That way we could merge Pipeline and CachedPipeline.

Implementation
The difficulty is the deprecation path, as often. Two options:
1. Make steps and named_steps be properties, and add a warning upon access. Make it so that, for two releases, they return the modified estimators, i.e. what is stored in steps_ and named_steps_. In two releases, named_steps dies in favor of named_steps_, and we remove the properties.
2. Create a new class, for instance Pipe, that has the new logic with optional caching.
I have no strong opinion on which deprecation path is best. Option 1 is more work but may be less intrusive for users.
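Option 1 could look roughly like the sketch below, written against toy classes (ToyPipeline and Centerer are hypothetical, and the warning text is made up). It assumes the warning is only emitted once the pipeline is fitted, which is one possible reading of the deprecation path and not something settled here.

```python
import copy
import warnings


class Centerer:
    """Toy transformer: subtracts the mean seen during fit."""

    def fit(self, X):
        self.mean_ = sum(X) / len(X)
        return self

    def transform(self, X):
        return [x - self.mean_ for x in X]


class ToyPipeline:
    """Hypothetical pipeline where `steps` is a property during the deprecation."""

    def __init__(self, steps):
        self._steps = steps

    @property
    def steps(self):
        if hasattr(self, "steps_"):
            # Transitional behavior: keep returning the fitted estimators, but warn.
            warnings.warn(
                "Reading fitted estimators from `steps` is deprecated; use `steps_`.",
                DeprecationWarning,
                stacklevel=2,
            )
            return self.steps_
        return self._steps

    def fit(self, X):
        fitted_steps = []
        for name, est in self._steps:
            fitted = copy.deepcopy(est).fit(X)  # the originals are never modified
            X = fitted.transform(X)
            fitted_steps.append((name, fitted))
        self.steps_ = fitted_steps
        return self


pipe = ToyPipeline([("center", Centerer())])
unfitted_view = pipe.steps  # before fit: the constructor value, no warning
pipe.fit([1.0, 2.0, 3.0])
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy_view = pipe.steps  # after fit: fitted estimators plus a warning
```

Once the deprecation period is over, the property would go away and steps would simply stay equal to what was passed to the constructor.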