PERF: Series.corr/Series.cov for EA dtypes #52502

Merged 2 commits on Apr 7, 2023

Conversation

lukemanley
Member

import pandas as pd
import numpy as np

N = 1_000_000

ser1 = pd.Series(np.random.randn(N), dtype="Float64")
ser2 = pd.Series(np.random.randn(N), dtype="Float64")

%timeit ser1.corr(ser2)

# 849 ms ± 56.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)    -> main
# 33.7 ms ± 320 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)  -> PR

%timeit ser1.cov(ser2)

# 739 ms ± 15.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)    -> main
# 32.9 ms ± 713 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)  -> PR

This also aligns the error raised for non-numeric data with that of DataFrame.corr and DataFrame.cov.

lukemanley added the Performance (memory or execution speed performance) and ExtensionArray (extending pandas with custom dtypes or arrays) labels on Apr 6, 2023
@@ -2701,8 +2701,8 @@ def corr(
         if len(this) == 0:
             return np.nan
 
-        this_values = np.asarray(this._values)
-        other_values = np.asarray(other._values)
+        this_values = this.to_numpy(dtype=float, na_value=np.nan, copy=False)
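To make the new conversion call concrete, here is a small sketch of what `to_numpy(dtype=float, na_value=np.nan, copy=False)` returns for a nullable Series that contains a missing value. The example data is an assumption for illustration and does not come from the PR.

```python
import numpy as np
import pandas as pd

# None is stored as pd.NA in the nullable "Float64" dtype.
ser = pd.Series([1.5, None, 3.0], dtype="Float64")

# The call pattern from the diff: request a float ndarray and map NA to NaN.
values = ser.to_numpy(dtype=float, na_value=np.nan, copy=False)

print(values.dtype)
print(values)
```

The result is a plain float64 ndarray with NaN in place of the missing value, which is the form the downstream correlation and covariance routines work with.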
Member

any chance this gets weird if we have non-numeric?

is this the right thing to do if we have NAs?

Member Author

> any chance this gets weird if we have non-numeric?

This raises for non-numeric data that cannot be cast to float (same behavior as the main branch):

ser = pd.Series(["a", "b"])
ser.corr(ser)

# ValueError: could not convert string to float: 'a'

It allows non-numeric data that can be cast to float (same behavior as the main branch):

ser = pd.Series(["1", "2"])
ser.corr(ser)

# 0.9999999999999999
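The two cases above come down to whether the float conversion itself succeeds. A minimal sketch of that step in isolation follows; `to_float_values` is a hypothetical helper, not a pandas function, and `dtype=object` is set explicitly (an assumption) so the example does not depend on how a given pandas version infers string dtypes.

```python
import numpy as np
import pandas as pd


def to_float_values(ser):
    """Hypothetical helper mirroring the conversion call shown in the diff."""
    return ser.to_numpy(dtype=float, na_value=np.nan, copy=False)


castable = pd.Series(["1", "2"], dtype=object)
not_castable = pd.Series(["a", "b"], dtype=object)

# Strings that parse as numbers are converted to floats.
converted = to_float_values(castable)
print(converted)

# Strings that do not parse raise ValueError during the conversion.
try:
    to_float_values(not_castable)
except ValueError as exc:
    error_message = str(exc)
else:
    error_message = None
print(error_message)
```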

> is this the right thing to do if we have NAs?

NAs get dropped either way in the underlying nanops.py functions nancorr and nancov.

FYI, this is the same approach used in DataFrame.corr:

mat = data.to_numpy(dtype=float, na_value=np.nan, copy=False)

Member

> FYI, this is the same approach used in DataFrame.corr:

good point. works for me

mroeschke added this to the 2.1 milestone on Apr 7, 2023
mroeschke merged commit 6187155 into pandas-dev:main on Apr 7, 2023
@mroeschke
Member

Thanks @lukemanley

lukemanley deleted the perf-series-corr-cov-ea branch on April 18, 2023 11:03