
Conversation

AlexanderFabisch
Member

Here's my contribution from the EuroSciPy 2023 sprint. It's still a work in progress, and I won't have time to continue the work before October. So if anyone else wants to take it from here, feel free to do so.

Reference Issues/PRs

See also #26024

What does this implement/fix? Explain your changes.

Make StandardScaler compatible with the Array API.

Any other comments?

Unfortunately, the current implementation breaks some unit tests of StandardScaler that are related to dtypes. That's because I wanted to make it work for torch.float16, but maybe that is not necessary and we should just support float32 and float64.

I'll also add some comments to the diff. See below.

@github-actions

github-actions bot commented Aug 19, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 784e189. Link to the linter CI: here

@AlexanderFabisch AlexanderFabisch changed the title [WIP] Make standard scaler compatible to Array API WIP: Make standard scaler compatible to Array API Aug 19, 2023
@AlexanderFabisch AlexanderFabisch marked this pull request as draft August 19, 2023 17:07
@EdAbati
Contributor

EdAbati commented Aug 20, 2023

Hi @AlexanderFabisch, I'm happy to continue this if it cannot wait until October. Waiting to see what the maintainers think. :)

Here are a few things I learned while working on my PR that might be helpful if you decide to keep working on it:

  • update your branch with main to get some useful functions like _array_api.supported_float_dtypes
  • testing the Array API compliance could be done by using a function that looks like this
  • in other places, a scalar array is created using xp.asarray(0.0, device=device(...))
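To make the last bullet concrete, here is a minimal sketch of the pattern, using NumPy as a stand-in for the array namespace (in scikit-learn, `xp` would come from `get_namespace(X)`, and GPU namespaces would additionally take the `device=` keyword so the scalar lands next to the data):

```python
import numpy as np

xp = np  # stand-in for the namespace scikit-learn gets from get_namespace(X)

# Create a scalar constant as an array in the consumer's namespace rather
# than as a Python float; for GPU namespaces you would also pass device=.
zero = xp.asarray(0.0)
print(float(zero), zero.dtype)
```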

@AlexanderFabisch
Member Author

I'm happy to continue this if it cannot wait until October.

Sure, I could also give you write access to my fork if needed. That way we could collaborate better.

@EdAbati
Contributor

EdAbati commented Sep 16, 2023

Hi @AlexanderFabisch , thank you for sharing the fork :)
I continued a bit, and tried to resolve some comments based on what I saw in the other PRs.

There are still a couple of TODOs:

Another thing to bear in mind is that device='mps' does not support float64. #27232 introduces something we could use.
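A hedged sketch of what such a helper might look like (the name and behavior here are my assumption for illustration, not necessarily what #27232 actually adds): probe whether the namespace/device combination supports float64 and fall back to float32 otherwise.

```python
import numpy as np

def max_precision_float_dtype(xp, device=None):
    # Hypothetical helper: widest float dtype usable on this namespace/device.
    kwargs = {} if device is None else {"device": device}
    try:
        xp.asarray(0.0, dtype=xp.float64, **kwargs)
        return xp.float64
    except Exception:
        # e.g. torch with device="mps", which has no float64 support
        return xp.float32

print(max_precision_float_dtype(np) == np.float64)  # True for NumPy
```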

@AlexanderFabisch
Member Author

That looks a lot better, @EdAbati. Thanks for continuing this PR.


try:
    return op(x, *args, **kwargs, dtype=target_dtype)
except TypeError:
Member

A reason to inspect the signature is that it is more explicit/specific. It covers us for the case where a TypeError is raised for a reason other than "this function doesn't have a dtype kwarg". I don't know off the top of my head what all the reasons are that a TypeError could be raised, so maybe this isn't an issue. But maybe the fact that I don't know off the top of my head is a reason to be more specific?

A reason to not use inspection would be that it is slow (compared to the time spent on op).
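To make the trade-off concrete, here is an illustrative sketch of both options (the helper names and signatures are mine, not the PR's actual code):

```python
import inspect

import numpy as np

def call_with_dtype_lbyl(op, x, *args, target_dtype=None, **kwargs):
    # Explicit: only pass dtype if it appears in op's signature.
    if "dtype" in inspect.signature(op).parameters:
        return op(x, *args, dtype=target_dtype, **kwargs)
    return op(x, *args, **kwargs)

def call_with_dtype_eafp(op, x, *args, target_dtype=None, **kwargs):
    # As in the diff: cheaper than inspection, but swallows any TypeError,
    # including ones unrelated to a missing dtype keyword.
    try:
        return op(x, *args, dtype=target_dtype, **kwargs)
    except TypeError:
        return op(x, *args, **kwargs)

# np.sum accepts dtype; np.sort does not and triggers the fallback.
print(call_with_dtype_lbyl(np.sum, np.ones(3), target_dtype=np.float32).dtype)
print(call_with_dtype_eafp(np.sort, np.asarray([2.0, 1.0])))
```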

@betatim
Member

betatim commented Jul 16, 2025

@OmarManzoor the script in your comment uses PolynomialFeatures instead of StandardScaler. Is that on purpose?

@OmarManzoor
Contributor

@OmarManzoor the script in your comment uses PolynomialFeatures instead of StandardScaler. Is that on purpose?

Really sorry about that. I used the prior script and forgot to replace StandardScaler.

@betatim
Member

betatim commented Jul 16, 2025

No worries. I was just wondering if I was missing something :D

When I make the swap, I see numbers roughly similar to yours. A ten-times slowdown for fitting seems weird, no?

edit: these are the numbers I see if I use StandardScaler
Avg fit time for numpy: 0.036
Avg transform time for numpy: 0.011
Avg fit time for torch mps: 0.028, speed-up: 1.3x
Avg transform time for torch mps: 0.003 speed-up: 3.6x

So I think the results above came from the code you pasted. I see a roughly 3x speed up if I add a 0th run of the StandardScaler on MPS before the for loop and exclude its number from the average calculation

Details
from time import time

import numpy as np
import torch as xp

from sklearn._config import config_context
from sklearn.preprocessing import StandardScaler

X_np = np.random.rand(100000, 100).astype(np.float32)
X_xp = xp.asarray(X_np, device="mps")

# Numpy benchmarks
fit_times = []
transform_times = []
n_iter = 10
for _ in range(n_iter):
    start = time()
    pf_np = StandardScaler()
    pf_np.fit(X_np)
    fit_times.append(time() - start)

    start = time()
    pf_np.transform(X_np)
    transform_times.append(time() - start)

avg_fit_time_numpy = sum(fit_times) / n_iter
avg_transform_time_numpy = sum(transform_times) / n_iter
print(f"Avg fit time for numpy: {avg_fit_time_numpy:.3f}")
print(f"Avg transform time for numpy: {avg_transform_time_numpy:.3f}")


# Torch mps benchmarks
with config_context(array_api_dispatch=True):
    pf_xp = StandardScaler()
    pf_xp.fit(X_xp)

fit_times = []
transform_times = []
for _ in range(n_iter):
    with config_context(array_api_dispatch=True):
        start = time()
        pf_xp = StandardScaler()
        pf_xp.fit(X_xp)
        fit_times.append(time() - start)

        start = time()
        float(pf_xp.transform(X_xp)[0, 0])
        transform_times.append(time() - start)

avg_fit_time_mps = sum(fit_times) / n_iter
avg_transform_time_mps = sum(transform_times) / n_iter
print(
    f"Avg fit time for torch mps: {avg_fit_time_mps:.3f}, "
    f"speed-up: {avg_fit_time_numpy / avg_fit_time_mps:.1f}x"
)
print(
    f"Avg transform time for torch mps: {avg_transform_time_mps:.3f} "
    f"speed-up: {avg_transform_time_numpy / avg_transform_time_mps:.1f}x"
)

@OmarManzoor
Contributor

OmarManzoor commented Jul 16, 2025

So I think the results above came from the code you pasted. I see a roughly 3x speed up if I add a 0th run of the StandardScaler on MPS before the for loop and exclude its number from the average calculation

Here are the results from my run. I increased the dataset size to (1000000, 200). I just used the original code and replaced StandardScaler.

Avg fit time for numpy: 0.893
Avg transform time for numpy: 0.201

Avg fit time for torch mps: 0.245, speed-up: 3.6x
Avg transform time for torch mps: 0.094 speed-up: 2.1x
from time import time

import numpy as np
import torch as xp
from tqdm import tqdm

from sklearn._config import config_context
from sklearn.preprocessing import StandardScaler

X_np = np.random.rand(1000000, 200).astype(np.float32)
X_xp = xp.asarray(X_np, device="mps")

# Numpy benchmarks
fit_times = []
transform_times = []
n_iter = 10
for _ in tqdm(range(n_iter), desc="Numpy Flow"):
    start = time()
    pf_np = StandardScaler()
    pf_np.fit(X_np)
    fit_times.append(time() - start)

    start = time()
    pf_np.transform(X_np)
    transform_times.append(time() - start)

avg_fit_time_numpy = sum(fit_times) / n_iter
avg_transform_time_numpy = sum(transform_times) / n_iter
print(f"Avg fit time for numpy: {avg_fit_time_numpy:.3f}")
print(f"Avg transform time for numpy: {avg_transform_time_numpy:.3f}")


# Torch mps benchmarks
fit_times = []
transform_times = []
for _ in tqdm(range(n_iter), desc="Torch mps Flow"):
    with config_context(array_api_dispatch=True):
        start = time()
        pf_xp = StandardScaler()
        pf_xp.fit(X_xp)
        fit_times.append(time() - start)

        start = time()
        float(pf_xp.transform(X_xp)[0, 0])
        transform_times.append(time() - start)

avg_fit_time_mps = sum(fit_times) / n_iter
avg_transform_time_mps = sum(transform_times) / n_iter
print(
    f"Avg fit time for torch mps: {avg_fit_time_mps:.3f}, "
    f"speed-up: {avg_fit_time_numpy / avg_fit_time_mps:.1f}x"
)
print(
    f"Avg transform time for torch mps: {avg_transform_time_mps:.3f} "
    f"speed-up: {avg_transform_time_numpy / avg_transform_time_mps:.1f}x"
)


codecov bot commented Jul 18, 2025

❌ Unsupported file format

Upload processing failed due to unsupported file format. Please review the parser error message:
Error deserializing json

Caused by:
expected value at line 1 column 1

For more help, visit our troubleshooting guide.

Member

@lesteve lesteve left a comment


Some cosmetic comments that were in draft for a while. I'll try to come back to this for a closer look in the not-too-distant future.

Member

@betatim betatim left a comment


I think this looks good.

One thing to address in a new PR is the whole story around "everything follows X" -

@@ -1106,9 +1113,9 @@ def transform(self, X, copy=None):
                 inplace_column_scale(X, 1 / self.scale_)
         else:
             if self.with_mean:
-                X -= self.mean_
+                X -= xp.astype(self.mean_, X.dtype)
Member

For my education: why do we (now) need this additional astype? Is it because the type of X in transform can be different from what is used in fit? Why did we not need it before?

Contributor

@OmarManzoor OmarManzoor Aug 19, 2025

The inner computation tries to use the maximum float precision available and sets the computed values and attributes accordingly. Since StandardScaler preserves the input dtype, we need the cast here: from what I remember, self.mean_ can be stored at the maximum float precision (float64) while X is float32.
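A minimal NumPy illustration of the situation (values made up): the fitted statistic is float64 while X is float32, and the explicit cast keeps the in-place subtraction dtype-clean.

```python
import numpy as np

X = np.asarray([[1.0, 2.0], [3.0, 4.0]], dtype=np.float32)
mean_ = np.asarray([2.0, 3.0], dtype=np.float64)  # computed at max precision

# NumPy happens to accept the mixed-dtype in-place subtraction (same-kind
# casting), but array-api-strict rejects it, hence the explicit cast:
X -= mean_.astype(X.dtype)
print(X.dtype)  # float32
```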

Member

Commenting out the xp.astype and running the tests, I think only array-api-strict is picky about this. For the other namespaces, X -= self.mean_ works fine if X has dtype float32 and self.mean_ has dtype float64, which is why it was not needed before with NumPy.

Member

Thanks for figuring it out. array-api-strict is strict :-/

@@ -1050,7 +1056,7 @@ def partial_fit(self, X, y=None, sample_weight=None):
         # for backward-compatibility, reduce n_samples_seen_ to an integer
         # if the number of samples is the same for each feature (i.e. no
         # missing values)
-        if np.ptp(self.n_samples_seen_) == 0:
+        if xp.max(self.n_samples_seen_) == xp.min(self.n_samples_seen_):
Member

@lesteve lesteve Aug 27, 2025

FYI, I double-checked and np.ptp does nothing smart (like computing both the min and max in a single pass) so this is fine to replace it by max == min.
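For reference, the identity being relied on: np.ptp(a) is defined as max(a) - min(a), so ptp(a) == 0 holds exactly when max(a) == min(a).

```python
import numpy as np

n_samples_seen = np.asarray([10, 10, 10])
print(np.ptp(n_samples_seen) == 0)                       # True
print(np.max(n_samples_seen) == np.min(n_samples_seen))  # True
```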

@lesteve
Member

lesteve commented Aug 27, 2025

LGTM, thanks to everyone involved in this PR over the last 2 years: @AlexanderFabisch, @EdAbati, @ogrisel, @charlesjhill, @OmarManzoor and @betatim!

@lesteve lesteve merged commit 48cba5a into scikit-learn:main Aug 27, 2025
34 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to Done in Array API Aug 27, 2025
9 participants