A set of scikit-learn-style transformers for encoding categorical variables into numeric representations using a variety of techniques. While the ordinal, one-hot, and hashing encoders have equivalents in scikit-learn itself, the transformers in this library all share a few useful properties:
- First-class support for pandas dataframes as an input (and optionally as output)
- Can explicitly configure which columns in the data are encoded by name or index, or infer non-numeric columns regardless of input type
- Can optionally drop columns with very low variance, based on the training set
- Portability: train a transformer on data, pickle it, and reuse it later to get the same output.
- Full compatibility with sklearn pipelines: pass in an array-like dataset like any other transformer (*)
(*) For full compatibility with Pipelines and ColumnTransformers, and for consistent behaviour of get_feature_names_out, it is recommended to upgrade scikit-learn to version 1.2.0 or later and to set the transform output to pandas:
import sklearn
sklearn.set_config(transform_output="pandas")
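As an illustration of the point above, with pandas output enabled the feature names reported by get_feature_names_out should line up with the columns of the transformed DataFrame. The following is a minimal sketch; the column name and values are invented for illustration:

import pandas as pd
import sklearn
import category_encoders as ce

sklearn.set_config(transform_output="pandas")

# toy data; "city" is a made-up categorical column
X = pd.DataFrame({"city": ["Paris", "Rome", "Paris"]})
enc = ce.OneHotEncoder(cols=["city"]).fit(X)

# feature names and transformed columns should agree
print(enc.get_feature_names_out())
print(enc.transform(X).columns.tolist())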
Install with pip:
pip install category_encoders
or with conda:
conda install -c conda-forge category_encoders
To use:
import category_encoders as ce
encoder = ce.BackwardDifferenceEncoder(cols=[...])
encoder = ce.BaseNEncoder(cols=[...])
encoder = ce.BinaryEncoder(cols=[...])
encoder = ce.CatBoostEncoder(cols=[...])
encoder = ce.CountEncoder(cols=[...])
encoder = ce.GLMMEncoder(cols=[...])
encoder = ce.GrayEncoder(cols=[...])
encoder = ce.HashingEncoder(cols=[...])
encoder = ce.HelmertEncoder(cols=[...])
encoder = ce.JamesSteinEncoder(cols=[...])
encoder = ce.LeaveOneOutEncoder(cols=[...])
encoder = ce.MEstimateEncoder(cols=[...])
encoder = ce.OneHotEncoder(cols=[...])
encoder = ce.OrdinalEncoder(cols=[...])
encoder = ce.PolynomialEncoder(cols=[...])
encoder = ce.QuantileEncoder(cols=[...])
encoder = ce.RankHotEncoder(cols=[...])
encoder = ce.SumEncoder(cols=[...])
encoder = ce.TargetEncoder(cols=[...])
encoder = ce.WOEEncoder(cols=[...])
encoder.fit(X, y)
X_cleaned = encoder.transform(X_dirty)
All of these are fully compatible sklearn transformers, so they can be used in pipelines or in your existing scripts. If the cols parameter isn't passed, every non-numeric column will be converted. See below for detailed documentation.
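For instance, when cols is omitted the non-numeric columns are detected automatically. A minimal sketch (column names and values are invented for illustration):

import pandas as pd
import category_encoders as ce

# "fuel" is a string column, "doors" is numeric
X = pd.DataFrame({"fuel": ["gas", "diesel", "gas", "diesel"], "doors": [2, 4, 4, 2]})
y = pd.Series([1, 0, 1, 0])

# no cols given: only "fuel" is inferred as categorical and encoded,
# the numeric column "doors" is passed through unchanged
encoder = ce.TargetEncoder()
X_encoded = encoder.fit_transform(X, y)
print(X_encoded.head())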
category_encoders works internally with pandas DataFrames, whereas scikit-learn works with numpy arrays. This can cause problems with scikit-learn versions prior to 1.2.0. To ensure full compatibility, configure scikit-learn to also output DataFrames. This can be done by
sklearn.set_config(transform_output="pandas")
for a whole project or just for a single pipeline using
Pipeline(
    steps=[
        ("preprocessor", SomePreprocessor().set_output(transform="pandas")),
        ("encoder", SomeEncoder()),
    ]
)
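Putting the pieces together, here is a sketch of an encoder inside a scikit-learn Pipeline with pandas output enabled (the data, column names, and the choice of LogisticRegression are invented for illustration):

import pandas as pd
import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

import category_encoders as ce

sklearn.set_config(transform_output="pandas")

# made-up training data with one categorical and one numeric column
X = pd.DataFrame({"color": ["red", "blue", "red", "green"], "size": [1, 3, 2, 4]})
y = [0, 1, 0, 1]

pipe = Pipeline(
    steps=[
        ("encoder", ce.OneHotEncoder(cols=["color"])),
        ("model", LogisticRegression()),
    ]
)
pipe.fit(X, y)
print(pipe.predict(X))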
If you encounter a bug, feel free to report it on [github](https://github.com/scikit-learn-contrib/category_encoders/issues).
.. toctree::
   :maxdepth: 3

   backward_difference
   basen
   binary
   catboost
   count
   glmm
   gray
   hashing
   helmert
   jamesstein
   leaveoneout
   mestimate
   onehot
   ordinal
   polynomial
   quantile
   rankhot
   sum
   summary
   targetencoder
   woe
   wrapper