[MRG] Fast PolynomialFeatures on dense arrays #12251


Merged: 6 commits into scikit-learn:master on Oct 3, 2018

Conversation

@TomDLT (Member) commented Oct 2, 2018

Idea:

Computations in PolynomialFeatures are performed column-wise, so the best performance is obtained with Fortran-ordered ('F') arrays.
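
A minimal sketch (not part of the PR) of why column access favors F order: in a Fortran-ordered array each column is contiguous in memory, so column-wise operations read memory sequentially.

import numpy as np
from timeit import timeit

X_c = np.random.randn(100_000, 64)   # C-ordered: rows are contiguous
X_f = np.asfortranarray(X_c)         # F-ordered: columns are contiguous

# Column-wise product, the kind of access pattern transform relies on
print(timeit(lambda: X_c[:, 3] * X_c[:, 7], number=100))
print(timeit(lambda: X_f[:, 3] * X_f[:, 7], number=100))  # typically faster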

Proposed changes:

  1. Add order='F' in check_array when validating the input, to speed up the computations.
  2. Add a new parameter order to control the memory layout of the output, for further speed improvements. It defaults to 'C' so as not to change the current behavior (a usage sketch follows below).
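
A usage sketch of the proposed API, assuming the new order parameter lands as described above:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.random.randn(10_000, 64)

# order='F' (proposed here) makes the output Fortran-ordered as well;
# the default order='C' preserves the current behavior.
poly = PolynomialFeatures(degree=2, order='F')
XP = poly.fit_transform(X)
print(np.isfortran(XP))  # True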

Here is a benchmark for the first change only (output order='C', as in master):
[benchmark figure: duration and proposed/master ratio plots, order='C']

Here is a benchmark for both changes (output order='F'):
[benchmark figure: duration and proposed/master ratio plots, order='F']

We can see that the code is 2 to 5 times faster for large input arrays.

Benchmark script (from the collapsed details block):

from time import time

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.preprocessing import PolynomialFeatures
from sklearn.utils.validation import check_array, FLOAT_DTYPES

order = 'C'  # set to 'F' to reproduce the second benchmark


###############################################################################
# Function to benchmark: fit_transform re-implemented with an optional
# F-ordered input check (proposed=True) and a configurable output order.
def fit_transform(self, X, proposed, order):
    self.fit(X)

    if proposed:
        X = check_array(X, order='F', dtype=FLOAT_DTYPES, accept_sparse='csc')
    else:
        X = check_array(X, dtype=FLOAT_DTYPES, accept_sparse='csc')

    n_samples, n_features = X.shape
    if n_features != self.n_input_features_:
        raise ValueError("X shape does not match training shape")
    combinations = self._combinations(n_features, self.degree,
                                      self.interaction_only, self.include_bias)

    order = order if proposed else 'C'

    XP = np.empty((n_samples, self.n_output_features_), dtype=X.dtype,
                  order=order)
    for i, comb in enumerate(combinations):
        # Column-wise product over the selected features; this access
        # pattern is what benefits from an F-ordered X.
        XP[:, i] = X[:, comb].prod(1)

    return XP


###############################################################################
# Benchmark
est = PolynomialFeatures()
results = []
for n_samples in [100, 300, 1000, 3000, 10000]:
    for n_features in [4, 16, 64, 256]:
        X = np.random.randn(n_samples, n_features)
        # keep the total amount of work roughly constant across sizes
        n_repeat = max(1, 6553600 // n_samples // n_features // n_features)

        for proposed in [False, True]:
            print('.', end='', flush=True)
            t0 = time()
            for _ in range(n_repeat):
                XP = fit_transform(est, X, proposed, order)
            duration = (time() - t0) / n_repeat
            # print(np.isfortran(XP))
            results.append((n_samples, n_features, proposed, duration))

###############################################################################
# Plot with pandas
df = pd.DataFrame(results,
                  columns=['n_samples', 'n_features', 'proposed', 'duration'])
fig, axes = plt.subplots(ncols=2, figsize=(12, 5))

# first plot
ax = axes[0]
table = df.pivot_table(index='n_samples', columns=['n_features', 'proposed'],
                       values='duration')
table.plot(ax=ax, logy=True, logx=True, marker='o', colormap='viridis')
ax.set_title('Time duration (sec)')

# second plot
ax = axes[1]
table_proposed = df[df['proposed']].pivot_table(
    index='n_samples', columns='n_features', values='duration')
table_master = df[~df['proposed']].pivot_table(
    index='n_samples', columns='n_features', values='duration')
table_normed = table_proposed / table_master
table_normed.plot(ax=ax, logx=True, marker='o', colormap='viridis')
ax.set_title('Time ratio: proposed / master')

plt.subplots_adjust(top=0.90)
fig.suptitle("order = '%s'" % order, fontsize=16)
plt.show()


@TomDLT changed the title from "ENH speed perforamnce of dense" to "[MRG] Fast PolynomialFeatures on dense arrays" on Oct 2, 2018
@@ -1454,7 +1462,7 @@ def transform(self, X):
"""
check_is_fitted(self, ['n_input_features_', 'n_output_features_'])

X = check_array(X, dtype=FLOAT_DTYPES, accept_sparse='csc')
Member:

If the original data is not F-contiguous, this will trigger a big copy of the whole numpy array. I wonder whether it would be possible to do it by chunks instead.
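
A quick sketch (not from the PR) showing the copy in question: check_array with order='F' must copy a C-ordered input wholesale.

import numpy as np
from sklearn.utils.validation import check_array

X = np.random.randn(10_000, 64)       # C-ordered by default
X_f = check_array(X, order='F')
print(X_f is X, np.isfortran(X_f))    # False True -> a full copy was made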

TomDLT (Member Author):

I agree, though such a copy is only of size n_samples * n_features, while the transformed array is much bigger: n_samples * n_features * n_features when degree=2. We could definitely do the transform by chunks, but that would only be useful with a partial_transform; with transform, the entire transformed array is stored in memory at the end.

In terms of speed, the first benchmark shows that the cost of the initial copy is outweighed by the speedup of the subsequent computations over the columns.
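
Back-of-the-envelope arithmetic behind this point (illustrative sizes, not from the PR): with degree=2 and include_bias, PolynomialFeatures produces (n_features + 1)(n_features + 2) / 2 output columns, so the input copy is tiny next to the output.

n_samples, n_features = 10_000, 256
n_output = (n_features + 1) * (n_features + 2) // 2  # bias + linear + quadratic
copy_mb = n_samples * n_features * 8 / 1e6    # float64 input copy: ~20 MB
out_mb = n_samples * n_output * 8 / 1e6       # transformed array: ~2650 MB
print(copy_mb, out_mb)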

Member:

I agree that if the input is dense there can't be that much data anyway

Member:

> I agree, though such a copy is only of size n_samples * n_features, while the transformed array is much bigger: n_samples * n_features * n_features when degree=2.

I also agree, that's a good point. If users have memory issues they will call transform by chunks themselves (a sketch follows), possibly using tools such as dask-ml, but there is not much to gain by chunking inside the transform method in scikit-learn.
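
For reference, a hypothetical sketch of the user-side chunking mentioned here (transform_in_chunks is an illustrative helper, not scikit-learn or dask-ml API):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

def transform_in_chunks(poly, X, chunk_size=1000):
    # Rows are transformed independently, so stacking chunked outputs
    # is equivalent to a single transform(X) call.
    return np.vstack([poly.transform(X[i:i + chunk_size])
                      for i in range(0, X.shape[0], chunk_size)])

poly = PolynomialFeatures(degree=2).fit(np.random.randn(10, 8))
XP = transform_in_chunks(poly, np.random.randn(5000, 8))
print(XP.shape)  # (5000, 45)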

:mod:`sklearn.preprocessing`
............................

- |Efficiency| Speed improvement in :class:`preprocessing.PolynomialFeatures`,
Member:

Also add |API|

@ogrisel (Member) left a comment:

LGTM as well. I merged master to resolve a conflict in the changelog.

@ogrisel ogrisel merged commit 3e5777a into scikit-learn:master Oct 3, 2018
@ogrisel ogrisel deleted the polynomial branch October 3, 2018 17:06