:Author: Adrin Jalali
:Status: Withdrawn (superseded by :ref:`SLEP007 <slep_007>` and :ref:`SLEP018 <slep_018>`)
:Type: Standards Track
:Created: 2019-12-20
This proposal offers a solution for propagating feature names through
transformers, pipelines, and the column transformer. Ideally, we would have::

    df = pd.read_csv('tabular.csv')
    # transforming the data in an arbitrary way
    transformer0 = ColumnTransformer(...)
    # a pipeline preprocessing the data and then a classifier (or a regressor)
    clf = make_pipeline(transformer0, ..., SVC())
    # now we can investigate features at each stage of the pipeline
    clf[-1].input_feature_names_
The feature names are propagated throughout the pipeline, and the user can
investigate them at each step.
This proposal suggests adding a new data structure, called ``InputArray``,
which augments the data array ``X`` with additional meta-data. In this proposal
we assume the feature names (and other potential meta-data) are attached to the
data when passed to an estimator. Alternative solutions are discussed later in
this document.
A main constraint of this data structure is that it should be backward
compatible, i.e. code which expects a ``numpy.ndarray`` as the output of a
transformer would not break. This SLEP focuses on feature names as the only
meta-data attached to the data; support for other meta-data can be added later.
Since transformers currently return a ``numpy`` or a ``scipy`` array, backward
compatibility in this context means that the operations which are valid on
those arrays should also be valid on the new data structure.
All operations are delegated to the data part of the container, and the
meta-data is lost immediately after each operation: operations result in a
``numpy.ndarray``. This includes indexing and slicing, i.e. to avoid
performance degradation, ``__getitem__`` is not overloaded; if the user
wishes to preserve the meta-data, they shall do so by explicitly calling a
method such as ``select()``. Operations between two ``InputArray`` objects
will not try to align rows and/or columns of the two given objects.
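The behaviour described above can be sketched as a thin wrapper around the
data array. The class and method names follow the SLEP, but the
implementation details below are assumptions for illustration only:

```python
import numpy as np


class InputArray:
    """Minimal sketch of the proposed container (implementation assumed)."""

    def __init__(self, data, feature_names=None):
        self._data = np.asarray(data)
        # Feature names are an object ndarray of strings, or None.
        self.feature_names = (
            None if feature_names is None
            else np.asarray(feature_names, dtype=object)
        )

    def __array__(self, dtype=None):
        # Anything coercing to numpy (np.asarray, ufuncs, estimators)
        # receives the plain data array; the meta-data is lost then.
        return self._data if dtype is None else self._data.astype(dtype)

    def __getitem__(self, key):
        # Indexing is delegated to the data and returns a plain ndarray:
        # __getitem__ is deliberately not overloaded to carry meta-data.
        return self._data[key]

    def select(self, columns):
        # The explicit way to select columns while keeping feature names.
        cols = np.asarray(columns)
        names = (
            None if self.feature_names is None else self.feature_names[cols]
        )
        return InputArray(self._data[:, cols], feature_names=names)
```

With this sketch, ``X.select([0, 2])`` keeps the corresponding names, while
``X[:, [0, 2]]`` or ``np.asarray(X)`` degrade to a plain ``numpy.ndarray``.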
``pandas`` compatibility ideally comes as ``pd.DataFrame(inputarray)``, for
which ``pandas`` does not provide a clean API at the moment. Alternatively,
``inputarray.todataframe()`` would return a ``pandas.DataFrame`` with the
relevant meta-data attached.
Feature names are an object ``ndarray`` of strings aligned with the columns.
They can be ``None``.
Estimators understand the ``InputArray`` and extract the feature names from
the given data before applying the operations and transformations on the data.

All transformers return an ``InputArray`` with feature names attached to it.
The way feature names are generated is discussed in SLEP007 - The Style of The
Feature Names.
Ideally sparse arrays follow the same pattern, but since ``scipy.sparse`` does
not provide the kind of API provided by ``numpy``, we may need to find
compromises.
There will be factory methods creating an ``InputArray`` given a
``pandas.DataFrame`` or an ``xarray.DataArray`` or simply an ``np.ndarray``
or an ``sp.SparseMatrix`` and a given set of feature names.

An ``InputArray`` can also be converted to a ``pandas.DataFrame`` using a
``todataframe()`` method.
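A sketch of such a factory and the ``todataframe()`` round-trip follows.
Only ``make_inputarray`` and ``todataframe`` come from the SLEP; the
container used here is a deliberately minimal stand-in:

```python
import numpy as np
import pandas as pd


class InputArray:
    # Minimal illustrative container: a data array plus feature names.
    def __init__(self, data, feature_names=None):
        self.data = np.asarray(data)
        self.feature_names = feature_names

    def todataframe(self):
        # Convert back to pandas, re-attaching the names as columns.
        return pd.DataFrame(self.data, columns=self.feature_names)


def make_inputarray(X, feature_names=None):
    # Factory sketch: extract names from a DataFrame automatically,
    # or accept them explicitly for a plain numpy array.
    if isinstance(X, pd.DataFrame):
        return InputArray(X.to_numpy(), feature_names=list(X.columns))
    return InputArray(np.asarray(X), feature_names=feature_names)
```

Sparse input would need an analogous branch for ``scipy.sparse`` matrices,
subject to the compromises mentioned above.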
``X`` being an ``InputArray``::

    >>> np.array(X)
    >>> X.todataframe()
    >>> pd.DataFrame(X)  # only if pandas implements the API
And given ``X`` a ``np.ndarray`` or an ``sp.sparse`` matrix and a set of
feature names, one can make the right ``InputArray`` using::

    >>> make_inputarray(X, feature_names)
Since we expect the feature names to be attached to the data given to an estimator, there are a few potential approaches we can take:
- ``pandas`` in, ``pandas`` out: this means we expect the user to give the
  data as a ``pandas.DataFrame``, and if so, the transformer would output a
  ``pandas.DataFrame`` which also includes the [generated] feature names.
  This is not a feasible solution since ``pandas`` plans to move to a
  per-column representation, which means ``pd.DataFrame(np.asarray(df))``
  has two guaranteed memory copies.
- ``XArray``: we could accept a ``pandas.DataFrame``, and use
  ``xarray.DataArray`` as the output of transformers, including feature
  names. However, ``xarray`` has a hard dependency on ``pandas``, and uses
  ``pandas.Index`` to handle row labels and aligns rows when an operation
  between two ``xarray.DataArray`` objects is done, which can be time
  consuming, and is not the semantic expected in ``scikit-learn``; we only
  expect the number of rows to be equal, and that the rows always correspond
  to one another in the same order.
As a result, we need to have another data structure which we'll use to
transfer data-related information (such as feature names), which is
lightweight and doesn't interfere with existing user code.
Another alternative to the problem of passing meta-data around is to pass it
as a parameter to ``fit``. This would heavily involve modifying
meta-estimators, since they'd need to pass that information along, and
extract the relevant information from the estimators to pass to the next
estimator. Our prototype implementations showed significant challenges
compared to attaching the meta-data to the data.
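To make the burden on meta-estimators concrete, here is a sketch of this
rejected alternative. Every class and method name below is hypothetical, not
scikit-learn API:

```python
import numpy as np


class PrefixTransformer:
    """Toy transformer: passes data through, prefixing feature names."""

    def fit(self, X, y=None, feature_names=None):
        self.input_feature_names_ = feature_names
        return self

    def transform(self, X):
        return np.asarray(X)

    def get_output_feature_names(self):
        return [f"t__{name}" for name in self.input_feature_names_]


class NamedPipeline:
    """Sketch of a meta-estimator under the rejected scheme: it must
    thread feature names through every ``fit`` call and re-extract the
    generated names after each step."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, y=None, feature_names=None):
        for step in self.steps:
            step.fit(X, y, feature_names=feature_names)
            X = step.transform(X)
            # The meta-estimator, not the data, carries the names forward.
            feature_names = step.get_output_feature_names()
        self.feature_names_ = feature_names
        return self
```

Every meta-estimator (pipelines, column transformers, grid search wrappers)
would need this extra bookkeeping, which is what made the prototypes
challenging.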