Make standard scaler compatible to Array API #27113
Conversation
Hi @AlexanderFabisch, I'm happy to continue this if it cannot wait until October. Waiting to see what the maintainers think. :) Here are a few things I learned while working on my PR that might be helpful if you decide to keep working on it:
Sure, I could also give you write access to my fork if needed. That way we could collaborate better.
Force-pushed from 3d9293a to fe6409c
Hi @AlexanderFabisch , thank you for sharing the fork :) There are still a couple of TODOs:
Another thing to bear in mind is that …

That looks a lot better @EdAbati . Thanks for continuing this PR.
sklearn/utils/extmath.py (outdated)

```python
try:
    return op(x, *args, **kwargs, dtype=target_dtype)
except TypeError:
```
A reason to inspect the signature is that it is more explicit/specific. It covers us for the case where a `TypeError` is raised for a reason other than "this function doesn't have a `dtype` kwarg". I don't know off the top of my head all the reasons a `TypeError` could be raised, so maybe this isn't an issue. But maybe the fact that I don't know off the top of my head is a reason to be more specific?

A reason not to use inspection is that it is slow (compared to the time spent in `op`).
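To make the trade-off concrete, here is a minimal sketch of the two dispatch strategies being discussed. The helper names are hypothetical, not the PR's actual code:

```python
import inspect

import numpy as np


def call_with_dtype_try(op, x, *args, target_dtype=None, **kwargs):
    """Try passing dtype=..., fall back if the op rejects it.

    Risk: a TypeError raised inside ``op`` for an unrelated reason is
    silently swallowed and the call is retried without ``dtype``.
    """
    try:
        return op(x, *args, dtype=target_dtype, **kwargs)
    except TypeError:
        return op(x, *args, **kwargs)


def call_with_dtype_inspect(op, x, *args, target_dtype=None, **kwargs):
    """Pass dtype=... only if the signature advertises it.

    More explicit, but inspect.signature() adds per-call overhead.
    """
    if "dtype" in inspect.signature(op).parameters:
        return op(x, *args, dtype=target_dtype, **kwargs)
    return op(x, *args, **kwargs)
```

For example, `np.sum` accepts a `dtype` kwarg while `np.sort` does not, so the first helper hits its fallback path for `np.sort` and the second one never attempts the bad call.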
@OmarManzoor the script in your comment uses …

Really sorry about that. I used the prior script and forgot to replace StandardScaler.
No worries. I was just wondering if I was missing something :D When I make the swap, I see numbers roughly similar to what you do. A ten times slowdown for fitting seems weird, no?

edit: these are the numbers I see if I use … So I think the results above came from the code you pasted. I see a roughly 3x speed up if I add a 0th run of the …

Details:

```python
from time import time

import numpy as np
import torch as xp

from sklearn._config import config_context
from sklearn.preprocessing import StandardScaler

X_np = np.random.rand(100000, 100).astype(np.float32)
X_xp = xp.asarray(X_np, device="mps")

# Numpy benchmarks
fit_times = []
transform_times = []
n_iter = 10
for _ in range(n_iter):
    start = time()
    pf_np = StandardScaler()
    pf_np.fit(X_np)
    fit_times.append(time() - start)
    start = time()
    pf_np.transform(X_np)
    transform_times.append(time() - start)

avg_fit_time_numpy = sum(fit_times) / n_iter
avg_transform_time_numpy = sum(transform_times) / n_iter
print(f"Avg fit time for numpy: {avg_fit_time_numpy:.3f}")
print(f"Avg transform time for numpy: {avg_transform_time_numpy:.3f}")

# Torch mps benchmarks (0th warm-up run, excluded from the timings)
with config_context(array_api_dispatch=True):
    pf_xp = StandardScaler()
    pf_xp.fit(X_xp)

fit_times = []
transform_times = []
for _ in range(n_iter):
    with config_context(array_api_dispatch=True):
        start = time()
        pf_xp = StandardScaler()
        pf_xp.fit(X_xp)
        fit_times.append(time() - start)
        start = time()
        float(pf_xp.transform(X_xp)[0, 0])
        transform_times.append(time() - start)

avg_fit_time_mps = sum(fit_times) / n_iter
avg_transform_time_mps = sum(transform_times) / n_iter
print(
    f"Avg fit time for torch mps: {avg_fit_time_mps:.3f}, "
    f"speed-up: {avg_fit_time_numpy / avg_fit_time_mps:.1f}x"
)
print(
    f"Avg transform time for torch mps: {avg_transform_time_mps:.3f} "
    f"speed-up: {avg_transform_time_numpy / avg_transform_time_mps:.1f}x"
)
```
Here are the results that I ran. I increased the dataset size to (1000000, 200). I just used the original code and swapped in StandardScaler.

```python
from time import time

import numpy as np
import torch as xp
from tqdm import tqdm

from sklearn._config import config_context
from sklearn.preprocessing import StandardScaler

X_np = np.random.rand(1000000, 200).astype(np.float32)
X_xp = xp.asarray(X_np, device="mps")

# Numpy benchmarks
fit_times = []
transform_times = []
n_iter = 10
for _ in tqdm(range(n_iter), desc="Numpy Flow"):
    start = time()
    pf_np = StandardScaler()
    pf_np.fit(X_np)
    fit_times.append(time() - start)
    start = time()
    pf_np.transform(X_np)
    transform_times.append(time() - start)

avg_fit_time_numpy = sum(fit_times) / n_iter
avg_transform_time_numpy = sum(transform_times) / n_iter
print(f"Avg fit time for numpy: {avg_fit_time_numpy:.3f}")
print(f"Avg transform time for numpy: {avg_transform_time_numpy:.3f}")

# Torch mps benchmarks
fit_times = []
transform_times = []
for _ in tqdm(range(n_iter), desc="Torch mps Flow"):
    with config_context(array_api_dispatch=True):
        start = time()
        pf_xp = StandardScaler()
        pf_xp.fit(X_xp)
        fit_times.append(time() - start)
        start = time()
        float(pf_xp.transform(X_xp)[0, 0])
        transform_times.append(time() - start)

avg_fit_time_mps = sum(fit_times) / n_iter
avg_transform_time_mps = sum(transform_times) / n_iter
print(
    f"Avg fit time for torch mps: {avg_fit_time_mps:.3f}, "
    f"speed-up: {avg_fit_time_numpy / avg_fit_time_mps:.1f}x"
)
print(
    f"Avg transform time for torch mps: {avg_transform_time_mps:.3f} "
    f"speed-up: {avg_transform_time_numpy / avg_transform_time_mps:.1f}x"
)
```
Some cosmetic comments that had been sitting in draft for a while. I'll try to come back to this for a closer look in the not-too-distant future.
I think this looks good.
One thing to address in a new PR is the whole story around "everything follows X" -
```diff
@@ -1106,9 +1113,9 @@ def transform(self, X, copy=None):
                 inplace_column_scale(X, 1 / self.scale_)
             else:
                 if self.with_mean:
-                    X -= self.mean_
+                    X -= xp.astype(self.mean_, X.dtype)
```
For my education: why do we (now) need this additional `astype`? Is it because the dtype of `X` in `transform` can be different from what was used in `fit`? Why did we not need it before?
The inner computation tries to use the maximum float precision available and sets the computed values and attributes accordingly. Since StandardScaler preserves the dtype, we need this here: from what I remember, `self.mean_` can be set to the max float (float64) while `X` is float32.
Commenting out the `xp.astype` and running the tests, I think only `array-api-strict` is picky about this. For other namespaces `X -= self.mean_` works fine if `X` has dtype float32 and `self.mean_` has dtype float64, which is why it was not needed before with numpy.
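A minimal numpy sketch of the situation described above (variable names are illustrative, not the PR's code):

```python
import numpy as np

# fit computes statistics at the highest float precision, but
# transform may receive float32 data.
X = np.asarray([[1.0, 2.0], [3.0, 4.0]], dtype=np.float32)
mean_ = X.mean(axis=0, dtype=np.float64)  # float64, like self.mean_ after fit

# NumPy accepts the in-place float64 -> float32 "same-kind" cast:
Y = X.copy()
Y -= mean_
assert Y.dtype == np.float32

# array-api-strict rejects mixed-dtype in-place ops, so the portable
# spelling casts explicitly first, which is what xp.astype does in the diff:
Z = X.copy()
Z -= mean_.astype(X.dtype)
assert Z.dtype == np.float32
assert np.allclose(Y, Z)
```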
Thanks for figuring it out. array-api-strict is strict :-/
```diff
@@ -1050,7 +1056,7 @@ def partial_fit(self, X, y=None, sample_weight=None):
         # for backward-compatibility, reduce n_samples_seen_ to an integer
         # if the number of samples is the same for each feature (i.e. no
         # missing values)
-        if np.ptp(self.n_samples_seen_) == 0:
+        if xp.max(self.n_samples_seen_) == xp.min(self.n_samples_seen_):
```
FYI, I double-checked and `np.ptp` does nothing smart (like computing both the `min` and `max` in a single pass), so it is fine to replace it with `max == min`.
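For illustration, the two spellings agree, since `np.ptp` ("peak to peak") is literally `max - min`:

```python
import numpy as np

# All features saw the same sample count -> both conditions hold.
same = np.asarray([100, 100, 100])
assert np.ptp(same) == 0
assert same.max() == same.min()

# Unequal counts (e.g. missing values in one column) -> both fail.
mixed = np.asarray([100, 98, 100])
assert np.ptp(mixed) == 2
assert mixed.max() != mixed.min()
```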
LGTM, thanks to everyone involved in this PR over the last 2 years: @AlexanderFabisch, @EdAbati, @ogrisel, @charlesjhill, @OmarManzoor, @betatim!
Here's my contribution from the EuroSciPy 2023 sprint. It's still work in progress and I won't have the time to continue the work before October. So if anyone else wants to take it from here, feel free to do so.
Reference Issues/PRs
See also #26024
What does this implement/fix? Explain your changes.
Make StandardScaler compatible with the Array API.
Any other comments?
Unfortunately, the current implementation breaks some unit tests of the standard scaler that are related to dtypes. That's because I wanted to make it work for torch.float16, but maybe that is not necessary and we should just support float32 and float64.
I'll also add some comments to the diff. See below.