Closed
Labels
Enhancement, Moderate (anything that requires some knowledge of conventions and best practices), module:impute
Description
Code sample
In the sample code below, the second column is all np.nan during fit, so transform drops it from the output:
>>> from sklearn.impute import SimpleImputer
>>> import numpy as np
>>> imp = SimpleImputer()
>>> imp.fit([[0, np.nan], [1, np.nan]])
>>> imp.transform([[0, np.nan], [1, 1]])
array([[0.],
       [1.]])
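To see which columns are affected before they disappear, one option is to inspect the fitted imputer. The sketch below assumes the default behaviour described in this issue, where a feature with no observed training values ends up with a NaN entry in `statistics_`; the variable names `dropped` and `X_out` are only illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer

imp = SimpleImputer()
imp.fit([[0, np.nan], [1, np.nan]])

# Assumption: features that were all-NaN during fit have a NaN statistic,
# and those are the columns that transform will discard.
dropped = np.flatnonzero(np.isnan(imp.statistics_))

X_out = imp.transform([[0, np.nan], [1, 1]])
print("dropped column indices:", dropped.tolist())
print("output shape:", X_out.shape)
```

Note that the valid value `1` in the second row of the transform input is discarded along with the column.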
Problem description
Currently, sklearn.impute.SimpleImputer silently removes features that are np.nan in every training sample.
This can break later pipeline steps because the dataset's shape has changed, e.g.
dataset[:, columns_to_impute_with_median] = imp.fit_transform(dataset[:, columns_to_impute_with_median])
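A minimal reproduction of that assignment failing, assuming the default dropping behaviour described above. The array contents and the column selection `cols` are made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data: column 1 has no observed values at all.
dataset = np.array([[0.0, np.nan, 5.0, 9.0],
                    [1.0, np.nan, np.nan, 8.0]])
cols = [0, 1, 2]  # hypothetical columns to impute with the median

imp = SimpleImputer(strategy="median")
imputed = imp.fit_transform(dataset[:, cols])

shape_in = dataset[:, cols].shape   # three columns go in
shape_out = imputed.shape           # fewer columns come out

try:
    dataset[:, cols] = imputed
    assignment_failed = False
except ValueError as exc:
    # NumPy cannot broadcast the narrower result into the selected columns.
    assignment_failed = True
    print("assignment failed:", exc)
```

The failure is at least visible here. When only one column survives, NumPy can broadcast the result across all selected columns, so the assignment may succeed while silently writing wrong values.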
Possible solutions
For the problematic features, either keep their values when valid or impute fill_value during transform. I suggest adding a new parameter to enable this behaviour, with a warning that lists the affected features.
I'm willing to implement this feature and would welcome any advice.
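The proposed behaviour could look roughly like the sketch below. It is a user-side workaround, not an implementation inside scikit-learn: the helper name `fit_transform_keep_shape` and its `fill_value` argument are hypothetical, and it handles dense numeric input only.

```python
import warnings

import numpy as np
from sklearn.impute import SimpleImputer


def fit_transform_keep_shape(X_fit, X_new, fill_value=0.0):
    """Hypothetical helper: impute X_new without dropping all-NaN training columns."""
    X_fit = np.asarray(X_fit, dtype=float)
    X_new = np.asarray(X_new, dtype=float)

    # Features with no observed value in the training data.
    empty = np.isnan(X_fit).all(axis=0)
    if empty.any():
        warnings.warn(
            f"Features without observed values: {np.flatnonzero(empty).tolist()}"
        )

    out = X_new.copy()

    # Regular columns: delegate to SimpleImputer as usual.
    if (~empty).any():
        imp = SimpleImputer().fit(X_fit[:, ~empty])
        out[:, ~empty] = imp.transform(X_new[:, ~empty])

    # Empty columns: keep valid values, fill only the missing ones.
    block = out[:, empty]
    block[np.isnan(block)] = fill_value
    out[:, empty] = block
    return out


result = fit_transform_keep_shape(
    [[0, np.nan], [1, np.nan]],
    [[0, np.nan], [1, 1]],
)
print(result)
```

With the data from the code sample, the output keeps both columns: the missing entry in the second column is filled with `fill_value` and the valid `1` is preserved. Newer scikit-learn releases expose a `keep_empty_features` option on `SimpleImputer`; check the documentation for your installed version before relying on a workaround like this one.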