Author: | Adrin Jalali |
---|---|
Status: | Final |
Type: | Standards Track |
Created: | 2019-04 |
Vote opened: | 2021-10-26 |
Vote closed: | 2021-11-29 |
Implemented with v1.1.0.
This SLEP proposes the introduction of the feature_names_in_
attribute for
all estimators, and the get_feature_names_out
method for all transformers.
We here discuss the generation of such attributes and their propagation through
pipelines. Since for most estimators there are multiple ways to generate
feature names, this SLEP does not intend to define how exactly feature names
are generated for all of them.
scikit-learn
has been making it easier to build complex workflows with the
ColumnTransformer
and it has been seeing widespread adoption. However,
using it results in pipelines where it's not clear what the input features to
the final predictor are, even more so than before. For example, after fitting
the following pipeline, users should ideally be able to inspect the features
going into the final predictor:
X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True) # We will train our classifier with the following features: # Numeric Features: # - age: float. # - fare: float. # Categorical Features: # - embarked: categories encoded as strings {'C', 'S', 'Q'}. # - sex: categories encoded as strings {'female', 'male'}. # - pclass: ordinal integers {1, 2, 3}. # We create the preprocessing pipelines for both numeric and categorical data. numeric_features = ['age', 'fare'] numeric_transformer = Pipeline(steps=[ ('imputer', SimpleImputer(strategy='median')), ('scaler', StandardScaler())]) categorical_features = ['embarked', 'sex', 'pclass'] categorical_transformer = Pipeline(steps=[ ('imputer', SimpleImputer(strategy='constant', fill_value='missing')), ('onehot', OneHotEncoder(handle_unknown='ignore'))]) preprocessor = ColumnTransformer( transformers=[ ('num', numeric_transformer, numeric_features), ('cat', categorical_transformer, categorical_features)]) # Append classifier to preprocessing pipeline. # Now we have a full prediction pipeline. clf = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', LogisticRegression())]) X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2) clf.fit(X_train, y_train)
However, it's impossible to interpret or even sanity-check the
LogisticRegression
instance that's produced in the example, because the
correspondence of the coefficients to the input features is basically
impossible to figure out.
This proposal suggests adding feature_names_in_
attribute and
get_feature_names_out
method to fitted estimators: , such that in the
abovementioned example clf[-1].feature_names_in_
and
clf[-2].get_feature_names_out()
will be:
['num__age', 'num__fare', 'cat__embarked_C', 'cat__embarked_Q', 'cat__embarked_S', 'cat__embarked_missing', 'cat__sex_female', 'cat__sex_male', 'cat__pclass_1', 'cat__pclass_2', 'cat__pclass_3']
Ideally the generated feature names describe how a feature is generated at each
stage of a pipeline. For instance, cat__sex_female
shows that the feature
has been through a categorical preprocessing pipeline, was originally the
column sex
, and has been one hot encoded and is one if it was originally
female
. However, this is not always possible or desirable especially when a
generated column is based on many columns, since the generated feature names
will be too long, for example in PCA
. As a rule of thumb, the following
types of transformers may generate feature names which corresponds to the
original features:
- Leave columns unchanged, e.g.
StandardScaler
- Select a subset of columns, e.g.
SelectKBest
- create new columns where each column depends on at most one input column,
e.g
OneHotEncoder
- Algorithms that create combinations of a fixed number of features, e.g.
PolynomialFeatures
, as opposed to all of them where there are many. Note that verbosity considerations andverbose_feature_names_out
as explained later can apply here.
This proposal talks about how feature names are generated and not how they are propagated.
The API for input and output feature names includes a feature_names_in_
attribute for all estimators, and a get_feature_names_out
method for any
estimator with a transform
method, i.e. they expose the generated feature
names via the get_feature_names_out
method.
Note that this SLEP also applies to resamplers the same way as transformers.
The input feature names are stored in a fitted estimator in a
feature_names_in_
attribute, and are taken from the given input data, for
instance a pandas
data frame. This attribute will be None
if the input
provides no feature names. The feature_names_in_
attribute is a 1d NumPy
array with object dtype and all elements in the array are strings.
A fitted estimator exposes the output feature names through the
get_feature_names_out
method. The output of get_feature_names_out
is a
1d NumPy array with object dtype and all elements in the array are strings. Here
we discuss more in detail how these feature names are generated. Since for most
estimators there are multiple ways to generate feature names, this SLEP does not
intend to define how exactly feature names are generated for all of them. It is
instead a guideline on how they could generally be generated.
As detailed bellow, some generated output features names are the same or a
derived from the input feature names. In such cases, if no input feature names
are provided, x0
to xn
are assumed to be their names.
This includes transformers which output a subset of the input features, w/o
changing them. For example, if a SelectKBest
transformer selects the first
and the third features, and no names are provided, the
get_feature_names_out
will be [x0, x2]
.
The simplest category of transformers in this section are the ones which
generate a column based on a single given column. These would simply
preserve the input feature names if a single new feature is generated,
such as in StandardScaler
, which would map 'age'
to 'age'
.
If an input feature maps to multiple new
features, a postfix is added, so that OneHotEncoder
might map
'gender'
to 'gender_female'
'gender_fluid'
etc.
Transformers where each output feature depends on a fixed number of input
features may generate descriptive names as well. For instance, a
PolynomialTransformer
on a small subset of features can generate an output
feature name such as x[0] * x[2] ** 3
.
And finally, the transformers where each output feature depends on many or all
input features, generate feature names which has the form of name0
to
namen
, where name
represents the transformer. For instance, a PCA
transformer will output [pca0, ..., pcan]
, n
being the number of PCA
components.
Meta estimators can choose to prefix the output feature names given by the estimators they are wrapping or not.
By default, Pipeline
adds no prefix, i.e its get_feature_names_out()
is the same as the get_feature_names_out()
of the last step, and None
if the last step is not a transformer.
ColumnTransformer
by default adds a prefix to the output feature names,
indicating the name of the transformer applied to them. If a column is in the output
as a part of passthrough
, it won't be prefixed since no operation has been
applied on it.
Here we include some examples to demonstrate the behavior of output feature names:
100 features (no names) -> PCA(n_components=3) get_feature_names_out(): [pca0, pca1, pca2] 100 features (no names) -> SelectKBest(k=3) get_feature_names_out(): [x2, x17, x42] [f1, ..., f100] -> SelectKBest(k=3) get_feature_names_out(): [f2, f17, f42] [cat0] -> OneHotEncoder() get_feature_names_out(): [cat0_cat, cat0_dog, ...] [f1, ..., f100] -> Pipeline( [SelectKBest(k=30), PCA(n_components=3)] ) get_feature_names_out(): [pca0, pca1, pca2] [model, make, numeric0, ..., numeric100] -> ColumnTransformer( [('cat', Pipeline(SimpleImputer(), OneHotEncoder()), ['model', 'make']), ('num', Pipeline(SimpleImputer(), PCA(n_components=3)), ['numeric0', ..., 'numeric100'])] ) get_feature_names_out(): ['cat_model_100', 'cat_model_200', ..., 'cat_make_ABC', 'cat_make_XYZ', ..., 'num_pca0', 'num_pca1', 'num_pca2']
However, the following examples produce a somewhat redundant feature names:
[model, make, numeric0, ..., numeric100] -> ColumnTransformer([ ('ohe', OneHotEncoder(), ['model', 'make']), ('pca', PCA(n_components=3), ['numeric0', ..., 'numeric100']) ]) get_feature_names_out(): ['ohe_model_100', 'ohe_model_200', ..., 'ohe_make_ABC', 'ohe_make_XYZ', ..., 'pca_pca0', 'pca_pca1', 'pca_pca2']
To provide more control over feature names, we could add a boolean
verbose_feature_names_out
constructor argument to certain transformers.
The default would reflect the description above, but changes would allow more verbose
names in some transformers, say having StandardScaler
map 'age'
to 'scale(age)'
.
In case of the ColumnTransformer
example above verbose_feature_names_out
could remove the estimator names, leading to shorter and less redundant names:
[model, make, numeric0, ..., numeric100] -> make_column_transformer( (OneHotEncoder(), ['model', 'make']), (PCA(n_components=3), ['numeric0', ..., 'numeric100']), verbose_feature_names_out=False ) get_feature_names_out(): ['model_100', 'model_200', ..., 'make_ABC', 'make_XYZ', ..., 'pca0', 'pca1', 'pca2']
Alternative solutions to a boolean flag could include:
- an integer: fine tuning the verbosity of the generated feature names.
- a
callable
which would give further flexibility to the user to generate user defined feature names.
These alternatives may be discussed and implemented in the future if deemed necessary.
All estimators should implement the feature_names_in_
and
get_feature_names_out()
API. This is checked in check_estimator
, and
the transition is done with a FutureWarning
for at least two versions to
give time to third party developers to implement the API.