Author: | Thomas J Fan |
---|---|
Status: | Rejected |
Type: | Standards Track |
Created: | 2020-10-03 |
This SLEP proposes adding the get_feature_names_out method to all transformers and the feature_names_in_ attribute for all estimators. The feature_names_in_ attribute is set during fit if the input, X, contains the feature names.
scikit-learn is commonly used as a part of a larger data processing pipeline. When this pipeline is used to transform data, the result is a NumPy array, discarding column names. The current workflow for extracting the feature names requires calling get_feature_names on the transformer that created the feature. This interface can be cumbersome when used together with a pipeline with multiple column names:
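
    import numpy as np
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    X = pd.DataFrame({'letter': ['a', 'b', 'c'],
                      'pet': ['dog', 'snake', 'dog'],
                      'distance': [1, 2, 3]})
    y = [0, 0, 1]
    orig_cat_cols, orig_num_cols = ['letter', 'pet'], ['distance']

    ct = ColumnTransformer(
        [('cat', OneHotEncoder(), orig_cat_cols),
         ('num', StandardScaler(), orig_num_cols)])
    pipe = make_pipeline(ct, LogisticRegression()).fit(X, y)

    # The one-hot encoder is registered under the name 'cat' above.
    cat_names = (pipe['columntransformer']
                 .named_transformers_['cat']
                 .get_feature_names(orig_cat_cols))

    feature_names = np.r_[cat_names, orig_num_cols]
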
The feature_names extracted above correspond to the features directly passed into LogisticRegression. As demonstrated above, the process of extracting feature_names requires knowing the order of the selected categories in the ColumnTransformer. Furthermore, if there is feature selection in the pipeline, such as SelectKBest, the get_support method would need to be used to infer the column names that were selected.
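
The extra bookkeeping that feature selection adds can be illustrated with a minimal sketch. It filters a list of names with a boolean mask of the kind that get_support returns. The names and the mask below are made-up values for illustration, not the output of a fitted selector, and select_names is a hypothetical helper rather than part of scikit-learn.

```python
# Hypothetical values: the names entering a selector, and a boolean mask of
# the shape that a fitted selector's get_support() would return for them.
names_in = ["letter_a", "letter_b", "letter_c", "pet_dog", "pet_snake", "distance"]
support = [True, False, False, True, False, True]


def select_names(names, mask):
    """Keep only the names whose mask entry is True (illustrative helper)."""
    if len(names) != len(mask):
        raise ValueError("names and mask must have the same length")
    return [name for name, keep in zip(names, mask) if keep]


selected = select_names(names_in, support)
print(selected)
```

Every selection step in a pipeline would need this kind of manual masking, applied in the right order, which is the burden the proposal aims to remove.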
This SLEP proposes adding a feature_names_in_ attribute to all estimators, which stores the feature names of X extracted during fit. The attribute will also be used for validation during non-fit methods such as transform or predict. If X is not a recognized container with columns, then feature_names_in_ can be undefined. If feature_names_in_ is undefined, then it will not be validated.
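
The behavior described above could look roughly like the following sketch. It is not scikit-learn's implementation: NamedTable and SketchTransformer are hypothetical classes, a "recognized container" is assumed to be anything with a columns attribute, and skipping validation when the input at transform time has no names is an assumption the SLEP does not spell out.

```python
class NamedTable:
    """Hypothetical stand-in for a container with column names, e.g. a DataFrame."""

    def __init__(self, columns, rows):
        self.columns = list(columns)
        self.rows = [list(row) for row in rows]


class SketchTransformer:
    """Illustrative identity transformer that records and checks feature names."""

    def fit(self, X, y=None):
        names = getattr(X, "columns", None)
        if names is not None:
            # X carries names: remember them for later validation.
            self.feature_names_in_ = list(names)
        elif hasattr(self, "feature_names_in_"):
            # Refitting on unnamed data: drop names left over from an earlier fit.
            del self.feature_names_in_
        return self

    def _check_feature_names(self, X):
        expected = getattr(self, "feature_names_in_", None)
        names = getattr(X, "columns", None)
        if expected is None or names is None:
            # Assumption: nothing to compare, so validation is skipped.
            return
        if list(names) != expected:
            raise ValueError(
                f"feature names {list(names)} do not match those seen in fit {expected}"
            )

    def transform(self, X):
        self._check_feature_names(X)
        # Identity transform: return the rows of a NamedTable, or X itself.
        return getattr(X, "rows", X)


train = NamedTable(["letter", "pet"], [["a", "dog"]])
est = SketchTransformer().fit(train)
print(est.feature_names_in_)
```

Calling est.transform with a NamedTable whose columns are reordered raises a ValueError in this sketch, while an estimator fitted on a plain list never defines feature_names_in_ and accepts any input.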
Secondly, this SLEP proposes adding get_feature_names_out(input_features=None) to all transformers. By default, the input features will be determined by the feature_names_in_ attribute. The feature names of a pipeline can then be easily extracted as follows:

    pipe[:-1].get_feature_names_out()
    # ['cat__letter_a', 'cat__letter_b', 'cat__letter_c',
    #  'cat__pet_dog', 'cat__pet_snake', 'num__distance']

Note that get_feature_names_out does not require input_features because the feature names were stored in the pipeline itself. These features will be passed to each step's get_feature_names_out method to obtain the output feature names of the Pipeline itself.
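
The chaining described above can be sketched as a simple loop in which each step maps the names it receives to the names it produces. The classes and the pipeline_feature_names_out function below are hypothetical stand-ins written for this illustration. They do not reproduce scikit-learn's Pipeline or its naming scheme (for example, the cat__ and num__ prefixes shown above are omitted).

```python
class OneHotSketch:
    """Hypothetical encoder: expands each input name into one name per category."""

    def __init__(self, categories):
        self.categories = categories  # dict mapping input name -> list of categories

    def get_feature_names_out(self, input_features):
        return [
            f"{name}_{category}"
            for name in input_features
            for category in self.categories[name]
        ]


class SelectSketch:
    """Hypothetical selector: keeps the names whose mask entry is True."""

    def __init__(self, mask):
        self.mask = mask

    def get_feature_names_out(self, input_features):
        return [name for name, keep in zip(input_features, self.mask) if keep]


def pipeline_feature_names_out(steps, feature_names_in):
    """Feed the names through every step in order, as a pipeline might do."""
    names = list(feature_names_in)
    for step in steps:
        names = step.get_feature_names_out(names)
    return names


steps = [
    OneHotSketch({"letter": ["a", "b", "c"], "pet": ["dog", "snake"]}),
    SelectSketch([True, False, False, True, True]),
]
names_out = pipeline_feature_names_out(steps, ["letter", "pet"])
print(names_out)
```

Because each step only needs the names produced by the previous one, the caller never has to know how the steps are ordered or which columns each one touched.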
The following enhancements are not a part of this SLEP. These features are made possible if this SLEP gets accepted.
- This SLEP enables us to implement an array_out keyword argument to all transform methods to specify the array container outputted by transform. An implementation of array_out requires feature_names_in_ to validate that the names in fit and transform are consistent. An implementation of array_out needs a way to map from the input feature names to output feature names, which is provided by get_feature_names_out.
- An alternative to array_out: Transformers in a pipeline may wish to receive X together with its feature names. This can be enabled by adding an array_input parameter to Pipeline:

      pipe = make_pipeline(ct, MyTransformer(), LogisticRegression(), array_input='pandas')

  In this case, the pipeline will construct a pandas DataFrame to be inputted into MyTransformer and LogisticRegression. The feature names will be constructed by calling get_feature_names_out as data is passed through the Pipeline. This feature implies that Pipeline is doing the construction of the DataFrame.
- The output of get_feature_names_out will be constructed using the name generation specification from :ref:`slep_007`.
- For a Pipeline with only one estimator, slicing will not work and one would need to access the feature names directly:

      pipe1 = make_pipeline(StandardScaler(), LogisticRegression())
      pipe1[:-1].feature_names_in_  # Works

      pipe2 = make_pipeline(LogisticRegression())
      pipe2[:-1].feature_names_in_  # Does not work

  This is because pipe2[:-1] would result in a pipeline with no steps, which raises an error. We can work around this by allowing pipelines with no steps.
- feature_names_in_ can be any 1-D Sequence, such as a list or an ndarray.
- Meta-estimators will delegate the setting and validation of feature_names_in_ to their inner estimators. The meta-estimator will define feature_names_in_ by referencing its inner estimators. For example, the Pipeline can use steps[0].feature_names_in_ as the input feature names. If the inner estimators do not define feature_names_in_, then the meta-estimator will not define feature_names_in_ either.
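
One way to picture the delegation described for meta-estimators is a read-only property that forwards to the inner estimator. This is only a sketch under assumptions: InnerSketch and MetaSketch are hypothetical classes, and the SLEP does not prescribe a property-based implementation.

```python
from types import SimpleNamespace


class InnerSketch:
    """Hypothetical inner estimator that stores names only when X provides them."""

    def fit(self, X, y=None):
        names = getattr(X, "columns", None)
        if names is not None:
            self.feature_names_in_ = list(names)
        return self


class MetaSketch:
    """Hypothetical meta-estimator that defers feature names to its inner estimator."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y=None):
        # Setting (and any validation) of the names is left to the inner estimator.
        self.estimator.fit(X, y)
        return self

    @property
    def feature_names_in_(self):
        # If the inner estimator never defined the attribute, this lookup raises
        # AttributeError, so hasattr(meta, "feature_names_in_") is False.
        return self.estimator.feature_names_in_


# SimpleNamespace stands in for any container exposing a `columns` attribute.
named_X = SimpleNamespace(columns=["letter", "pet"])
meta_named = MetaSketch(InnerSketch()).fit(named_X)
meta_unnamed = MetaSketch(InnerSketch()).fit([[1, 2]])
print(meta_named.feature_names_in_)
print(hasattr(meta_unnamed, "feature_names_in_"))
```

With this design the meta-estimator stores no names of its own, so it cannot drift out of sync with the estimator it wraps.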
- This SLEP is fully backward compatible with previous versions. With the introduction of get_feature_names_out, get_feature_names will be deprecated. Note that get_feature_names_out's signature will always contain input_features, which can be used or ignored. This helps standardize the interface for the get feature names method.
- The inclusion of a get_feature_names_out method will not introduce any overhead to estimators.
- The inclusion of a feature_names_in_ attribute will increase the size of estimators because they would store the feature names. Users who want to drop the stored names and disable validation can remove the attribute by calling del est.feature_names_in_.
There have been many attempts to address this issue:
- array_out keyword parameter in transform: This approach requires third party estimators to unwrap and wrap array containers in transform, which introduces more burden for third party estimator maintainers. Furthermore, array_out with sparse data will introduce an overhead when being passed along in a Pipeline. This overhead comes from the construction of the sparse data container that has the feature names.
- :ref:`slep_007`: SLEP007 introduces a feature_names_out_ attribute while this SLEP proposes a get_feature_names_out method to accomplish the same task. The benefit of the get_feature_names_out method is that it can be used even if the feature names were not passed into fit through a DataFrame. For example, in a Pipeline the feature names are not passed through to each step, and a get_feature_names_out method can be used to get the names of each step with slicing.
- :ref:`slep_012`: The InputArray was developed to work around the overhead of using a pandas DataFrame or an xarray DataArray. The introduction of another data structure into the Python data ecosystem would lead to more burden for third party estimator maintainers.
[1] | Each SLEP must either be explicitly labeled as placed in the public domain (see this SLEP as an example) or licensed under the Open Publication License. |
This document has been placed in the public domain. [1]