
ENH Adds support for drop + handle_unknown=ignore in the OneHotEncoder #19041


Merged


thomasjpfan
Member

Reference Issues/PRs

Fixes #18072

What does this implement/fix? Explain your changes.

Adds support for the suggestions in #18072:

  1. When an unknown category is encountered and a category was dropped, the unknown category is encoded the same way as the dropped category (all zeros).
  2. If a category was dropped, the inverse transform maps all zeros back to the dropped category.

Member

@glemaitre glemaitre left a comment


I think that we should update the docstring of the OneHotEncoder as well.


All the categories in `X_test` are unknown during transform and will be mapped
to all zeros. This means that unknown categories will have the same mapping
as the dropped category.
Member


It might be good to show the inverse_transform here.

@glemaitre
Member

glemaitre commented Dec 19, 2020

Do you think it would be a good idea to expose an attribute containing the columns with unknown categories? I am wondering whether the warning will be too annoying.

I am thinking that we could have two attributes, one containing the column indices and another containing the unknown categories, when applicable. In this case, we could avoid warning, and you could always check the attributes as a sanity check.

@thomasjpfan
Member Author

Do you think it would be a good idea to expose an attribute containing the columns with unknown categories? I am wondering whether the warning will be too annoying.

With categories='auto', we would not know whether a category is unknown until transform. Could you expand on your idea of using the attributes?

If the goal is to avoid warnings, we can hope that the documentation is clear enough and remove the warning.

@glemaitre
Member

With categories='auto', we would not know whether a category is unknown until transform. Could you expand on your idea of using the attributes?

Yep, this is true. Since it would only happen rarely, we should not warn so much, though.

Member

@glemaitre glemaitre left a comment


LGTM

Member

@ogrisel ogrisel left a comment


LGTM. I would be +0 for having handle_unknown == "ignore" not warn and having handle_unknown == "warn" do so instead, but we can always do that in a later PR.

In particular @amueller wasn't a big fan of warnings: #18072 (comment)


Successfully merging this pull request may close these issues.

drop must be None for OneHotEncoder to utilize handle_unknown = 'ignore'
3 participants