Missing features removal with SimpleImputer #16426

@vitorsrg

Code sample

In the sample below, the second column is NaN in every training row, so the imputer drops it and transform returns one column instead of two:

>>> from sklearn.impute import SimpleImputer
>>> import numpy as np
>>> imp = SimpleImputer()
>>> imp.fit([[0, np.nan], [1, np.nan]])
>>> imp.transform([[0, np.nan], [1, 1]])
array([[0.],
       [1.]])

Problem description

Currently, sklearn.impute.SimpleImputer silently removes any feature whose value is np.nan in every training sample.

This can break later pipeline steps because the transformed array has fewer columns than the input, e.g.

dataset[:, columns_to_impute_with_median] = imp.fit_transform(dataset[:, columns_to_impute_with_median])
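
A minimal sketch of how the affected columns could be detected before fitting, so that the assignment above can be restricted to the columns that survive. It uses only NumPy and assumes missing values are encoded as np.nan; `all_nan_columns` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np


def all_nan_columns(X):
    """Return the indices of columns that are NaN in every row of X."""
    X = np.asarray(X, dtype=float)
    return np.flatnonzero(np.isnan(X).all(axis=0))


X_train = np.array([[0.0, np.nan], [1.0, np.nan]])

# Column 1 has no observed value, so a mean or median cannot be computed for it.
empty = all_nan_columns(X_train)

# Columns that still exist after such features are dropped.
kept = np.setdiff1d(np.arange(X_train.shape[1]), empty)

print("empty:", empty.tolist(), "kept:", kept.tolist())
```

Checking `empty` before calling `fit_transform` makes the column loss explicit instead of surfacing later as a shape mismatch.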

Possible solutions

For the problematic features, either keep their values when they are valid or impute fill_value during transform. I suggest adding a new parameter that enables this behaviour and emits a warning naming the affected features.
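
The behaviour proposed above could look roughly like the following sketch. It is written against plain NumPy rather than SimpleImputer itself; the function names `fit_means` and `transform_keep_empty` and the default `fill_value=0.0` are assumptions made for illustration, not an existing scikit-learn API.

```python
import warnings

import numpy as np


def fit_means(X_train):
    """Per-column mean of the observed values; NaN where a column is empty."""
    X_train = np.asarray(X_train, dtype=float)
    observed = ~np.isnan(X_train)
    counts = observed.sum(axis=0)
    sums = np.where(observed, X_train, 0.0).sum(axis=0)
    means = np.full(X_train.shape[1], np.nan)
    # Divide only where at least one value was observed.
    np.divide(sums, counts, out=means, where=counts > 0)
    return means


def transform_keep_empty(X, means, fill_value=0.0):
    """Impute NaNs without dropping columns that were empty during fit."""
    X = np.array(X, dtype=float)  # copy so the caller's data is untouched
    empty = np.isnan(means)
    if empty.any():
        warnings.warn(
            f"Features {np.flatnonzero(empty).tolist()} had no observed "
            f"values during fit; imputing fill_value={fill_value} for them."
        )
    # Use fill_value as the statistic for empty features, the mean otherwise.
    stats = np.where(empty, fill_value, means)
    missing = np.isnan(X)
    # Valid values are kept as they are; only NaNs are replaced.
    X[missing] = np.broadcast_to(stats, X.shape)[missing]
    return X


means = fit_means([[0, np.nan], [1, np.nan]])
result = transform_keep_empty([[0, np.nan], [1, 1]], means)
print(result.shape)
```

With the data from the code sample, the output keeps both columns: the valid 1 in the second row is preserved and the NaN in the first row is replaced by fill_value.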

I'm willing to implement this feature and would welcome any advice.

Labels: Enhancement, Moderate, module:impute
