Labels: API - Consistency, Index, Missing-data, PDEP missing values
The table below gives an overview of the result of `missing_value in idx`, i.e. how `Index.__contains__` handles the various missing-value sentinels as input for the different data types.
dtype | None | nan | <NA> | NaT |
---|---|---|---|---|
object-none | True | False | False | False |
object-nan | False | True | False | False |
object-NA | False | False | True | False |
datetime | True | True | True | True |
period | True | True | True | True |
timedelta | True | True | True | True |
float64 | False | True | False | False |
categorical | True | True | True | True |
interval | True | True | True | False |
nullable_int | False | False | True | False |
nullable_float | False | False | True | False |
string-python | False | False | False | False |
string-pyarrow | False | False | False | False |
str-python | False | False | False | False |
The last three rows, which contain not a single True, are especially problematic; this looks like a bug in `StringDtype`.
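The string case can be isolated with a minimal sketch like the one below. It uses only the Python-backed `string[python]` dtype so that it does not require pyarrow. The names `idx`, `sentinels` and `membership` are illustrative. No membership results are shown as expected output because they may differ between pandas versions.

```python
import numpy as np
import pandas as pd

# Python-backed StringDtype index holding one real value and one missing value.
idx = pd.Index(["a", None], dtype="string[python]")

# The index itself reports that it holds a missing value ...
has_missing = bool(idx.isna().any())
print("has missing:", has_missing)

# ... while the membership test for each sentinel may still come back False,
# depending on the pandas version in use.
sentinels = {"None": None, "nan": np.nan, "<NA>": pd.NA, "NaT": pd.NaT}
membership = {label: (value in idx) for label, value in sentinels.items()}
print(membership)
```

If `has_missing` is True while every entry of `membership` is False, the index contains a missing value that no sentinel can find through `in`.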
More generally, the behavior is quite inconsistent:
- For object dtype, we require exact match
- For datetimelike and categorical, we match any missing-like
- For interval, we match any missing-like except NaT (NaT is not matched even for a datetimelike interval dtype)
- For float we only match NaN
- For nullable dtypes (int/float), we only match NA
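For comparison, the "match any missing-like" rule that datetimelike and categorical indexes follow can be applied uniformly with a small wrapper. The sketch below is only an illustration: `contains_missing_like` is a hypothetical helper, not a pandas API and not a proposed resolution, and it assumes `value` is a scalar.

```python
import numpy as np
import pandas as pd


def contains_missing_like(idx: pd.Index, value) -> bool:
    """Hypothetical helper (not a pandas API); assumes `value` is a scalar.

    Any missing-value sentinel matches if the index holds any missing value;
    every other value falls back to the regular membership test.
    """
    if pd.isna(value):
        return bool(idx.isna().any())
    return value in idx


float_idx = pd.Index([2.0, np.nan], dtype="float64")
object_idx = pd.Index(["a", None], dtype=object)
int_idx = pd.Index([1, 2], dtype="int64")  # holds no missing value

for sentinel in [None, np.nan, pd.NA, pd.NaT]:
    print(
        repr(sentinel),
        contains_missing_like(float_idx, sentinel),
        contains_missing_like(object_idx, sentinel),
        contains_missing_like(int_idx, sentinel),
    )
```

Because the helper relies on `Index.isna()` rather than on `__contains__`, its result does not depend on which sentinel the dtype uses internally.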
The code to generate the table above:

    import numpy as np
    import pandas as pd

    # from conftest.py: one index per dtype, each with one valid and one missing value
    indices_dict = {
        "object-none": pd.Index(["a", None], dtype=object),
        "object-nan": pd.Index(["a", np.nan], dtype=object),
        "object-NA": pd.Index(["a", pd.NA], dtype=object),
        "datetime": pd.DatetimeIndex(["2024-01-01", "NaT"]),
        "period": pd.PeriodIndex(["2024-01-01", None], freq="D"),
        "timedelta": pd.TimedeltaIndex(["1 days", "NaT"]),
        "float64": pd.Index([2.0, np.nan], dtype="float64"),
        "categorical": pd.CategoricalIndex(["a", None]),
        "interval": pd.IntervalIndex.from_tuples([(1, 2), np.nan]),
        "nullable_int": pd.Index([2, None], dtype="Int64"),
        "nullable_float": pd.Index([2.0, None], dtype="Float32"),
        "string-python": pd.Index(["a", None], dtype="string[python]"),
        "string-pyarrow": pd.Index(["a", None], dtype="string[pyarrow]"),
        # NOTE: the key says "python", but the storage passed here is "pyarrow"
        "str-python": pd.Index(["a", None], dtype=pd.StringDtype("pyarrow", na_value=np.nan)),
    }

    # Test membership of each missing-value sentinel in each index
    results = []
    for dtype, data in indices_dict.items():
        for val in [None, np.nan, pd.NA, pd.NaT]:
            res = val in data
            results.append((dtype, str(val), res))

    # Pivot to one row per dtype and one column per sentinel, keeping the original order
    df = pd.DataFrame(results, columns=["dtype", "val", "result"])
    df_overview = df.pivot(columns="val", index="dtype", values="result").reindex(
        columns=df["val"].unique(), index=df["dtype"].unique()
    )
    # to_markdown() requires the optional tabulate package
    print(df_overview.astype(str).to_markdown())
cc @jbrockmendel: I would have expected us to already have issues about this, but I couldn't find any.