Cache final transformer in pipeline with memory setting #23112

Open
@bmreiniger

Description

Describe the bug

When setting the memory parameter of a transformer Pipeline (i.e., one whose last step is a transformer), the final transformer is not cached: on repeated fit calls, the earlier steps are loaded from the cache, but the final transformer is refit every time.

Discovered at https://stackoverflow.com/q/71812869/10495893.
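
If I'm reading sklearn/pipeline.py (1.0.2) correctly, the cause is that Pipeline.fit only routes the non-final steps through joblib's Memory.cache; the final step's fit is invoked directly and never cached. A minimal model of that logic (my paraphrase under that reading, not sklearn's actual code):

from joblib import Memory

def _fit_transform_one(transformer, X, y):
    # What gets cached for each intermediate step: the whole fit + transform.
    return transformer.fit(X, y).transform(X), transformer

def pipeline_fit(steps, X, y=None, memory_path="tmp/cache"):
    memory = Memory(memory_path, verbose=0)
    fit_transform_one_cached = memory.cache(_fit_transform_one)
    Xt = X
    for name, transformer in steps[:-1]:
        # Non-final steps: served from the cache on a repeat call.
        Xt, fitted = fit_transform_one_cached(transformer, Xt, y)
    name, final_estimator = steps[-1]
    if final_estimator != "passthrough":
        # Final step: fit is invoked unconditionally, even when it is a
        # transformer, so it is re-run on every call.
        final_estimator.fit(Xt, y)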

Steps/Code to Reproduce

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
import time

class Test(BaseEstimator, TransformerMixin):
    def __init__(self, col):
        self.col = col

    def fit(self, X, y=None):
        print(self.col)
        return self

    def transform(self, X, y=None):
        # Deliberately slow, to make caching (or its absence) visible.
        for t in range(5):
            print(".")
            time.sleep(1)
        return X

pipeline = Pipeline(
    [
        ("test", Test(col="this_column")),
        ("test2", Test(col="that_column")),
    ],
    memory="tmp/cache",
)

pipeline.fit(None)
pipeline.fit(None)
pipeline.fit(None)

Expected Results

Only the first fit should produce any output; the second and third calls should be served entirely from the cache:

this_column
.
.
.
.
.
that_column

Actual Results

The first fit behaves as expected, but the final transformer is refit on the second and third calls:

this_column
.
.
.
.
.
that_column
that_column
that_column

Versions

System:
    python: 3.7.13 (default, Mar 16 2022, 17:37:17)  [GCC 7.5.0]
executable: /usr/bin/python3
   machine: Linux-5.4.144+-x86_64-with-Ubuntu-18.04-bionic

Python dependencies:
          pip: 21.1.3
   setuptools: 57.4.0
      sklearn: 1.0.2
        numpy: 1.21.5
        scipy: 1.4.1
       Cython: 0.29.28
       pandas: 1.3.5
   matplotlib: 3.2.2
       joblib: 1.1.0
threadpoolctl: 3.1.0

Built with OpenMP: True
