[MRG] Fast PolynomialFeatures on dense arrays #12251
Conversation
PolynomialFeatures

```diff
@@ -1454,7 +1462,7 @@ def transform(self, X):
        """
        check_is_fitted(self, ['n_input_features_', 'n_output_features_'])

        X = check_array(X, dtype=FLOAT_DTYPES, accept_sparse='csc')
```
If the original data is not F-contiguous, this will trigger a big copy of the whole numpy array. I wonder if it would not be possible to do it by chunks instead.
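For context, here is a minimal sketch of the copy being discussed (the shapes are made up): asking `check_array` for `order='F'` on a C-ordered input returns a new Fortran-ordered buffer rather than a view.

```python
import numpy as np
from sklearn.utils import check_array

X = np.random.rand(100_000, 100)   # NumPy arrays are C-ordered by default
X_f = check_array(X, order='F')    # requesting 'F' forces a copy of the whole array

print(X_f.flags['F_CONTIGUOUS'])   # True
print(np.shares_memory(X, X_f))    # False: a full copy was made
```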
I agree, though such a copy is only of size `n_samples * n_features`, while the transformed array is much bigger, `n_samples * n_features * n_features` when `degree=2`. We can definitely do the transform by chunks, but this would only be useful with a `partial_transform`. With `transform`, the entire transformed array is stored in memory at the end.

In terms of speed, the first benchmark shows that the cost of the initial copy is balanced by the speed of the following computations over the columns.
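To make the size comparison concrete, a back-of-the-envelope sketch (the shapes are hypothetical, and the output size uses the same rough `n_features * n_features` upper bound as above):

```python
n_samples, n_features = 100_000, 100   # hypothetical shapes
bytes_per_float = 8                    # float64

copy_mb = n_samples * n_features * bytes_per_float / 1e6
# Rough upper bound; the exact degree-2 output (with bias) has
# 1 + n_features + n_features * (n_features + 1) / 2 columns.
output_mb = n_samples * n_features**2 * bytes_per_float / 1e6

print(copy_mb)    # 80.0 MB for the initial F-ordered copy
print(output_mb)  # 8000.0 MB, two orders of magnitude larger
```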
I agree that if the input is dense, there can't be that much data anyway.
> I agree, though such a copy is only of size `n_samples * n_features`, while the transformed array is much bigger, `n_samples * n_features * n_features` when `degree=2`.

I also agree; indeed, that's a good point. If users have memory issues they will call `transform` by chunks themselves, possibly using tools such as dask-ml, but there is not much to gain by chunking inside the `transform` method in scikit-learn.
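As an illustration of that last point, user-side chunking could look like the following minimal sketch (the helper name and chunk size are made up; dask-ml would achieve something similar with dask arrays):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

def transform_in_chunks(poly, X, chunk_rows=10_000):
    # Hypothetical helper: transform is stateless per row, so row blocks
    # can be processed independently. In a truly memory-constrained setting
    # each block would be written out instead of stacked in memory.
    return np.vstack([poly.transform(X[i:i + chunk_rows])
                      for i in range(0, X.shape[0], chunk_rows)])

poly = PolynomialFeatures(degree=2).fit(np.random.rand(10, 5))
XP = transform_in_chunks(poly, np.random.rand(50_000, 5))
print(XP.shape)   # (50000, 21): 1 + 5 + 5 * 6 / 2 output features
```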
doc/whats_new/v0.21.rst (outdated diff)

```rst
:mod:`sklearn.preprocessing`
............................

- |Efficiency| Speed improvement in :class:`preprocessing.PolynomialFeatures`,
```
Also add |API|
LGTM as well. I merged master to resolve a conflict in the changelog.
Idea:

Computations in `PolynomialFeatures` are performed on columns, so the best performance is obtained with Fortran-ordered ('F') arrays.
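A small self-contained check of that claim (array sizes and the timing loop are arbitrary): column slices of a Fortran-ordered array are contiguous in memory, so per-column products run faster.

```python
import numpy as np
from timeit import timeit

X_c = np.random.rand(100_000, 100)   # default C order (row-major)
X_f = np.asfortranarray(X_c)         # F order (column-major) copy

# The kind of column product used to build degree-2 features.
t_c = timeit(lambda: X_c[:, 3] * X_c[:, 7], number=1_000)
t_f = timeit(lambda: X_f[:, 3] * X_f[:, 7], number=1_000)
print(f"C order: {t_c:.3f}s, F order: {t_f:.3f}s")   # F is typically faster
```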
Proposed changes:

- Use `order='F'` in `check_array` when checking the input, to speed up computations.
- Add an `order` parameter to control the order of the output, for further speed improvements. It defaults to `'C'` so as not to change the current behavior.

Here is a benchmark for the first change (i.e. output `order='C'`, as in master):

[benchmark image]

Here is a benchmark for both changes (i.e. output `order='F'`):

[benchmark image]

We can see that the code is 2 to 5 times faster for large input arrays.
Benchmark script in the details.
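The collapsed details block holding the actual script did not survive extraction. Purely as an illustration, a benchmark of this kind might look like the sketch below (the shapes and the harness are assumptions, not the author's script); it would be run once on master and once on this branch to compare timings.

```python
import numpy as np
from time import perf_counter
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical timing harness: run on each branch and compare the printed times.
for n_samples, n_features in [(10_000, 50), (100_000, 100)]:
    X = np.random.rand(n_samples, n_features)
    tic = perf_counter()
    PolynomialFeatures(degree=2).fit_transform(X)
    print(f"{n_samples} x {n_features}: {perf_counter() - tic:.2f}s")
```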