[MRG] Fast PolynomialFeatures on dense arrays #12251


Merged: 6 commits into scikit-learn:master on Oct 3, 2018

Conversation

@TomDLT (Member) commented Oct 2, 2018

Idea:

Computations in PolynomialFeatures are performed column-wise, so the best performance is obtained with Fortran-ordered ('F') arrays.
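
A minimal sketch (not part of the PR) of why column access favors F order: in a Fortran-ordered array each column is contiguous in memory, so column-wise operations read memory sequentially.

import numpy as np
from timeit import timeit

X_c = np.random.randn(100_000, 64)   # C-ordered: rows are contiguous
X_f = np.asfortranarray(X_c)         # F-ordered: columns are contiguous

# Column-wise product, the kind of access pattern transform relies on
print(timeit(lambda: X_c[:, 3] * X_c[:, 7], number=100))
print(timeit(lambda: X_f[:, 3] * X_f[:, 7], number=100))  # typically faster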

Proposed changes:

  1. Add order='F' in check_array when validating the input, to speed up the computations.
  2. Add a new parameter order to control the memory layout of the output, for further speed improvements. It defaults to 'C' so as not to change the current behavior (a usage sketch follows below).
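
A usage sketch of the proposed API, assuming the new order parameter lands as described above:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.random.randn(10_000, 64)

# order='F' (proposed here) makes the output Fortran-ordered as well;
# the default order='C' preserves the current behavior.
poly = PolynomialFeatures(degree=2, order='F')
XP = poly.fit_transform(X)
print(np.isfortran(XP))  # True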

Here is a benchmark for the first change only (output order='C', as in master):
[benchmark figure: duration and proposed/master ratio plots, order='C']

Here is a benchmark for both changes (output order='F'):
[benchmark figure: duration and proposed/master ratio plots, order='F']

We can see that the code is 2 to 5 times faster for large input arrays.

Benchmark script (from the collapsed details block):

from time import time

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.preprocessing import PolynomialFeatures
from sklearn.utils.validation import check_array, FLOAT_DTYPES

order = 'C'  # set to 'F' to reproduce the second benchmark


###############################################################################
# Function to benchmark: fit_transform re-implemented with an optional
# F-ordered input check (proposed=True) and a configurable output order.
def fit_transform(self, X, proposed, order):
    self.fit(X)

    if proposed:
        X = check_array(X, order='F', dtype=FLOAT_DTYPES, accept_sparse='csc')
    else:
        X = check_array(X, dtype=FLOAT_DTYPES, accept_sparse='csc')

    n_samples, n_features = X.shape
    if n_features != self.n_input_features_:
        raise ValueError("X shape does not match training shape")
    combinations = self._combinations(n_features, self.degree,
                                      self.interaction_only, self.include_bias)

    order = order if proposed else 'C'

    XP = np.empty((n_samples, self.n_output_features_), dtype=X.dtype,
                  order=order)
    for i, comb in enumerate(combinations):
        # Column-wise product over the selected features; this access
        # pattern is what benefits from an F-ordered X.
        XP[:, i] = X[:, comb].prod(1)

    return XP


###############################################################################
# Benchmark
est = PolynomialFeatures()
results = []
for n_samples in [100, 300, 1000, 3000, 10000]:
    for n_features in [4, 16, 64, 256]:
        X = np.random.randn(n_samples, n_features)
        # keep the total amount of work roughly constant across sizes
        n_repeat = max(1, 6553600 // n_samples // n_features // n_features)

        for proposed in [False, True]:
            print('.', end='', flush=True)
            t0 = time()
            for _ in range(n_repeat):
                XP = fit_transform(est, X, proposed, order)
            duration = (time() - t0) / n_repeat
            # print(np.isfortran(XP))
            results.append((n_samples, n_features, proposed, duration))

###############################################################################
# Plot with pandas
df = pd.DataFrame(results,
                  columns=['n_samples', 'n_features', 'proposed', 'duration'])
fig, axes = plt.subplots(ncols=2, figsize=(12, 5))

# first plot
ax = axes[0]
table = df.pivot_table(index='n_samples', columns=['n_features', 'proposed'],
                       values='duration')
table.plot(ax=ax, logy=True, logx=True, marker='o', colormap='viridis')
ax.set_title('Time duration (sec)')

# second plot
ax = axes[1]
table_proposed = df[df['proposed']].pivot_table(
    index='n_samples', columns='n_features', values='duration')
table_master = df[~df['proposed']].pivot_table(
    index='n_samples', columns='n_features', values='duration')
table_normed = table_proposed / table_master
table_normed.plot(ax=ax, logx=True, marker='o', colormap='viridis')
ax.set_title('Time ratio: proposed / master')

plt.subplots_adjust(top=0.90)
fig.suptitle("order = '%s'" % order, fontsize=16)
plt.show()


@TomDLT changed the title from "ENH speed perforamnce of dense" to "[MRG] Fast PolynomialFeatures on dense arrays" on Oct 2, 2018
@@ -1454,7 +1462,7 @@ def transform(self, X):
"""
check_is_fitted(self, ['n_input_features_', 'n_output_features_'])

X = check_array(X, dtype=FLOAT_DTYPES, accept_sparse='csc')
Member:

If the original data is not F-contiguous, this will trigger a big copy of the whole numpy array. I wonder whether it would be possible to do it by chunks instead.
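
A quick sketch (not from the PR) showing the copy in question: check_array with order='F' must copy a C-ordered input wholesale.

import numpy as np
from sklearn.utils.validation import check_array

X = np.random.randn(10_000, 64)       # C-ordered by default
X_f = check_array(X, order='F')
print(X_f is X, np.isfortran(X_f))    # False True -> a full copy was made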

TomDLT (Member Author):

I agree, though such a copy is only of size n_samples * n_features, while the transformed array is much bigger: n_samples * n_features * n_features when degree=2. We could definitely do the transform by chunks, but that would only be useful with a partial_transform; with transform, the entire transformed array is stored in memory at the end.

In terms of speed, the first benchmark shows that the cost of the initial copy is outweighed by the speedup of the subsequent computations over the columns.
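
Back-of-the-envelope arithmetic behind this point (illustrative sizes, not from the PR): with degree=2 and include_bias, PolynomialFeatures produces (n_features + 1)(n_features + 2) / 2 output columns, so the input copy is tiny next to the output.

n_samples, n_features = 10_000, 256
n_output = (n_features + 1) * (n_features + 2) // 2  # bias + linear + quadratic
copy_mb = n_samples * n_features * 8 / 1e6    # float64 input copy: ~20 MB
out_mb = n_samples * n_output * 8 / 1e6       # transformed array: ~2650 MB
print(copy_mb, out_mb)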

Member:

I agree that if the input is dense there can't be that much data anyway

Member:

> I agree, though such a copy is only of size n_samples * n_features, while the transformed array is much bigger: n_samples * n_features * n_features when degree=2.

I also agree, that's a good point. If users have memory issues they will call transform by chunks themselves (a sketch follows), possibly using tools such as dask-ml, but there is not much to gain by chunking inside the transform method in scikit-learn.
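
For reference, a hypothetical sketch of the user-side chunking mentioned here (transform_in_chunks is an illustrative helper, not scikit-learn or dask-ml API):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

def transform_in_chunks(poly, X, chunk_size=1000):
    # Rows are transformed independently, so stacking chunked outputs
    # is equivalent to a single transform(X) call.
    return np.vstack([poly.transform(X[i:i + chunk_size])
                      for i in range(0, X.shape[0], chunk_size)])

poly = PolynomialFeatures(degree=2).fit(np.random.randn(10, 8))
XP = transform_in_chunks(poly, np.random.randn(5000, 8))
print(XP.shape)  # (5000, 45)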

:mod:`sklearn.preprocessing`
............................

- |Efficiency| Speed improvement in :class:`preprocessing.PolynomialFeatures`,
Member:

Also add |API|

@ogrisel (Member) left a comment:

LGTM as well. I merged master to resolve a conflict in the changelog.

@ogrisel ogrisel merged commit 3e5777a into scikit-learn:master Oct 3, 2018
@ogrisel ogrisel deleted the polynomial branch October 3, 2018 17:06