ENH Adds support for drop + handle_unknown=ignore in the OneHotEncoder #19041

thomasjpfan · 2020-12-19T02:52:40Z

Reference Issues/PRs

Fixes #18072

What does this implement/fix? Explain your changes.

Adds support for the suggestions stated in #18072

When an unknown category is encountered and a category has been dropped, the unknown category is encoded the same way as the dropped category (all zeros).
The inverse transform will map all zeros to the dropped category if a category was dropped.
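The two rules above can be illustrated with a minimal pure-Python sketch. It is not scikit-learn's implementation; the helper names `fit_categories`, `encode`, and `decode` are hypothetical and exist only to show the mapping for a single feature, assuming the first category is the dropped one (analogous to `drop='first'`).

```python
def fit_categories(values):
    # Learn the sorted set of categories seen during "fit".
    return sorted(set(values))


def encode(value, categories, dropped):
    # One-hot encode against the kept categories only.
    # The dropped category and any unknown category both yield all zeros.
    kept = [c for c in categories if c != dropped]
    return [1 if value == c else 0 for c in kept]


def decode(row, categories, dropped):
    # An all-zero row is mapped back to the dropped category.
    kept = [c for c in categories if c != dropped]
    if not any(row):
        return dropped
    return kept[row.index(1)]


categories = fit_categories(["a", "b", "c"])
dropped = categories[0]  # assumption: drop the first category

# "d" was never seen during fit, so it is unknown.
encoded = [encode(v, categories, dropped) for v in ["d", "b", "a"]]
decoded = [decode(r, categories, dropped) for r in encoded]

print(encoded)
print(decoded)
```

In this sketch the unknown value "d" and the dropped value "a" share the same all-zero row, so the round trip cannot tell them apart and returns "a" for both.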

glemaitre

I think that we should update the docstring of the OneHotEncoder as well.

glemaitre · 2020-12-19T10:21:36Z

doc/modules/preprocessing.rst

+
+All the categories in `X_test` are unknown during transform and will be mapped
+to all zeros. This means that unknown categories will have the same mapping
+as the dropped category.


It might be good to show the inverse_transform here.

glemaitre · 2020-12-19T10:28:45Z

Do you think it would be a good idea to expose an attribute containing the columns with unknown categories? I am wondering whether the warning would be too annoying.

I am thinking that we could have two attributes, one containing the column indices and another containing the unknown categories, when applicable. In that case, we could avoid warning, and you could still check the attributes as a sanity check.
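As a rough sketch of what such a sanity check could look like from the user's side, the snippet below collects, per column index, the categories in new data that were not seen during fit. The function `find_unknown_categories` and its return shape are hypothetical; nothing in this thread confirms that the encoder exposes attributes like this.

```python
def find_unknown_categories(X, categories_per_column):
    # X: list of rows, each row a list of category values.
    # categories_per_column: known categories for each column,
    # in the spirit of a fitted encoder's per-feature category lists.
    unknown = {}
    for col_idx, known in enumerate(categories_per_column):
        seen = {row[col_idx] for row in X}
        extra = sorted(seen - set(known))
        if extra:
            # Only report columns that actually contain unknown values.
            unknown[col_idx] = extra
    return unknown


fitted = [["a", "b"], ["x", "y"]]
X_test = [["a", "z"], ["c", "x"], ["b", "w"]]

report = find_unknown_categories(X_test, fitted)
print(report)
```

The keys of `report` play the role of the proposed column-index attribute and the values play the role of the proposed unknown-categories attribute.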

thomasjpfan · 2020-12-25T01:06:16Z

> Do you think it would be a good idea to expose an attribute containing the columns with unknown categories? I am wondering whether the warning would be too annoying.

With categories='auto', we would not know whether a category is unknown until transform. Could you expand on your idea of using the attributes?

If the goal is to avoid warnings, we can hope that the documentation is clear enough and remove the warning.

glemaitre · 2021-01-05T09:16:40Z

> With categories='auto', we would not know whether a category is unknown until transform. Could you expand on your idea of using the attributes?

Yep, this is true. Since it would only be rare, we should not warn that much though.

glemaitre

LGTM

ogrisel

LGTM. I would be +0 on having handle_unknown == "ignore" not warn and adding handle_unknown == "warn" instead, but we can always do that in a later PR.

In particular @amueller wasn't a big fan of warnings: #18072 (comment)

thomasjpfan added 4 commits November 9, 2020 14:49

WIP

3ff524c

Merge remote-tracking branch 'upstream/master' into ohe_drop_handle_unknown_ignore_support

8100826

ENH Adds support for handle_unknown=ignore and drop

39d5263

ENH Adds inverse_tranform with dropped category

e4cc677

github-actions bot added the module:preprocessing label Dec 19, 2020

thomasjpfan added 2 commits December 18, 2020 21:54

DOC Adds whats new

53c0eda

ENH Uses another parameter

1d70589

glemaitre reviewed Dec 19, 2020

thomasjpfan added 3 commits December 24, 2020 18:31

Merge remote-tracking branch 'upstream/master' into ohe_drop_handle_unknown_ignore_support

29e0409

DOC Adds inverse_transform to user guide

78ed39e

DOC Fix

e5234f4

glemaitre approved these changes Jan 5, 2021

Base automatically changed from master to main January 22, 2021 10:53

glemaitre mentioned this pull request Jan 27, 2021

Have handle_unknown="ignore" by default in OneHotEncoder #19286

Closed

glemaitre mentioned this pull request Feb 4, 2021

Allowing drop='first' and handle_unknown='ignore' in OneHotEncoding #19346

Closed

Merge branch 'main' into ohe_drop_handle_unknown_ignore_support

a628ee5

ogrisel approved these changes Mar 31, 2021

ogrisel merged commit c9c89cf into scikit-learn:main Mar 31, 2021

ogrisel mentioned this pull request Apr 1, 2021

drop must be None for OneHotEncoder to utilize handle_unknown = 'ignore' #18072

Closed

albertvillanova mentioned this pull request Apr 4, 2021

DOC Fix order of whatsnew entries #19822

Merged

thomasjpfan mentioned this pull request Apr 4, 2021

ENH Adds infrequent categories to OneHotEncoder #16018

Merged

glemaitre mentioned this pull request Apr 22, 2021

Release 0.24.2 #19954

Merged

12 tasks

glemaitre mentioned this pull request Jan 30, 2023

Use handle_unknown=ignore in SuperVectorizer skrub-data/skrub#473

Merged
