
BUG(string dtype): groupby/resampler.min/max returns float on all NA strings #60985


Open
rhshadrach wants to merge 8 commits into base: main

Conversation

rhshadrach
Member

Built on top of #60936

@rhshadrach rhshadrach added Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Strings String extension data type and string data Reduction Operations sum, mean, min, max, etc. Bug labels Mar 19, 2025
@rhshadrach rhshadrach added this to the 2.3 milestone Mar 19, 2025
@rhshadrach rhshadrach marked this pull request as ready for review March 23, 2025 12:07
expected_dtype, expected_value = dtype, pd.NA
if reduction_func in ["all", "any"]:
    expected_dtype = "bool"
    # TODO: For skipna=False, bool(pd.NA) raises; should groupby?
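For context on the TODO above, a quick illustration of the underlying issue: pd.NA deliberately has no truth value, so coercing it to bool raises, which is what makes skipna=False all/any on all-NA groups awkward.

```python
import pandas as pd

# pd.NA has no defined truthiness, so coercing it to bool raises
# TypeError ("boolean value of NA is ambiguous").
try:
    bool(pd.NA)
    raised = False
except TypeError:
    raised = True
```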
Member

It looks like there are a few TODOs / inconsistencies in our interface here. I think that rather than branching and trying to document all of them, it would help to simplify this test and just skip/xfail the cases where things are not consistent. It may even be helpful to split this up into multiple tests that are more focused.

Member Author

I think rather than branching and trying to document all of them, it would help to simplify this test and just skip/xfail the cases where things are not consistent.

If they added significant complexity, I would agree. However, the complexity added seems minimal to me, and testing the current behavior tells us when it changes. So even if it's not the final behavior we'd desire, testing it seems better than skipping or xfailing.

It may even be helpful to split this up into multiple tests that are more focused

If there were a good way of doing this while ensuring we are going through all the reduction funcs, I'd definitely be on board. However, to my knowledge there is not.

Member

It's challenging in the current state because it's nearly impossible to tell what this test is trying to say about the expected behavior of things. Perhaps I am misreading: how would you summarize this unit test?

Member Author

how would you summarize this unit test?

Testing groupby on input of string dtype where all values are NA. But I fear I might be misunderstanding the question.
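As a minimal sketch of the behavior under test (the column names here are illustrative; the actual test parametrizes over all reduction funcs):

```python
import pandas as pd

# An all-NA string column reduced per group. On a fixed pandas,
# min preserves the "string" dtype and yields pd.NA; the bug
# reported here produced a float64 NaN instead.
df = pd.DataFrame(
    {"key": [1, 1], "val": pd.array([pd.NA, pd.NA], dtype="string")}
)
result = df.groupby("key")["val"].min()
```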

Member Author

@WillAyd - friendly ping.

Member
@WillAyd WillAyd Apr 14, 2025

I still think this needs to be split into multiple tests and/or simplified so that it remains a unit test (or tests).

Member Author

  • I would be happy to do so if there is a way to ensure it runs across all reducers, but I do not see a way to make that happen.
  • A large number of bugs I've worked on over the years existed because so much of our test suite only checks one or a couple of ops, rather than all of them.
  • This is a common pattern in our tests and by no means the worst offender, e.g.
    def test_empty_groupby(columns, keys, values, method, op, dropna, using_infer_string):
        # GH8093 & GH26411
        override_dtype = None

        if isinstance(values, BooleanArray) and op in ["sum", "prod"]:
            # We expect to get Int64 back for these
            override_dtype = "Int64"

        if isinstance(values[0], bool) and op in ("prod", "sum"):
            # sum/product of bools is an integer
            override_dtype = "int64"

        df = DataFrame({"A": values, "B": values, "C": values}, columns=list("ABC"))

        if hasattr(values, "dtype"):
            # check that we did the construction right
            assert (df.dtypes == values.dtype).all()

        df = df.iloc[:0]

        gb = df.groupby(keys, group_keys=False, dropna=dropna, observed=False)[columns]

        def get_result(**kwargs):
            if method == "attr":
                return getattr(gb, op)(**kwargs)
            else:
                return getattr(gb, method)(op, **kwargs)

        def get_categorical_invalid_expected():
            # Categorical is special without 'observed=True', we get an NaN entry
            # corresponding to the unobserved group. If we passed observed=True
            # to groupby, expected would just be 'df.set_index(keys)[columns]'
            # as below
            lev = Categorical([0], dtype=values.dtype)
            if len(keys) != 1:
                idx = MultiIndex.from_product([lev, lev], names=keys)
            else:
                # all columns are dropped, but we end up with one row
                # Categorical is special without 'observed=True'
                idx = Index(lev, name=keys[0])

            if using_infer_string:
                columns = Index([], dtype="str")
            else:
                columns = []
            expected = DataFrame([], columns=columns, index=idx)
            return expected

        is_per = isinstance(df.dtypes.iloc[0], pd.PeriodDtype)
        is_dt64 = df.dtypes.iloc[0].kind == "M"
        is_cat = isinstance(values, Categorical)
        is_str = isinstance(df.dtypes.iloc[0], pd.StringDtype)

        if (
            isinstance(values, Categorical)
            and not values.ordered
            and op in ["min", "max", "idxmin", "idxmax"]
        ):
            if op in ["min", "max"]:
                msg = f"Cannot perform {op} with non-ordered Categorical"
                klass = TypeError
            else:
                msg = f"Can't get {op} of an empty group due to unobserved categories"
                klass = ValueError
            with pytest.raises(klass, match=msg):
                get_result()

            if op in ["min", "max", "idxmin", "idxmax"] and isinstance(columns, list):
                # i.e. DataframeGroupBy, not SeriesGroupBy
                result = get_result(numeric_only=True)
                expected = get_categorical_invalid_expected()
                tm.assert_equal(result, expected)
            return

        if op in ["prod", "sum", "skew", "kurt"]:
            # ops that require more than just ordered-ness
            if is_dt64 or is_cat or is_per or (is_str and op != "sum"):
                # GH#41291
                # datetime64 -> prod and sum are invalid
                if is_dt64:
                    msg = "datetime64 type does not support"
                elif is_per:
                    msg = "Period type does not support"
                elif is_str:
                    msg = f"dtype 'str' does not support operation '{op}'"
                else:
                    msg = "category type does not support"
                if op in ["skew", "kurt"]:
                    msg = "|".join([msg, f"does not support operation '{op}'"])
                with pytest.raises(TypeError, match=msg):
                    get_result()

                if not isinstance(columns, list):
                    # i.e. SeriesGroupBy
                    return
                elif op in ["skew", "kurt"]:
                    # TODO: test the numeric_only=True case
                    return
                else:
                    # i.e. op in ["prod", "sum"]:
                    # i.e. DataFrameGroupBy
                    # ops that require more than just ordered-ness
                    # GH#41291
                    result = get_result(numeric_only=True)

                    # with numeric_only=True, these are dropped, and we get
                    # an empty DataFrame back
                    expected = df.set_index(keys)[[]]
                    if is_cat:
                        expected = get_categorical_invalid_expected()
                    tm.assert_equal(result, expected)
                    return

        result = get_result()
        expected = df.set_index(keys)[columns]
        if op in ["idxmax", "idxmin"]:
            expected = expected.astype(df.index.dtype)
        if override_dtype is not None:
            expected = expected.astype(override_dtype)
        if len(keys) == 1:
            expected.index.name = keys[0]
        tm.assert_equal(result, expected)

Member Author

@WillAyd - friendly ping

Member

I still have the same feedback as before about making this look more like a unit test. If you don't want to do that, it's no problem, but it's probably worth pinging someone else to review then.

Member

I think ideally we would have had some EA tests like BaseReduceTests, but for groupby, which would give us a more "standard" way to cover this. Although this test appears onerous, I'm fine having it like this for now so as not to block this PR.

    res_values = res_values.astype(object, copy=False)
elif is_string_dtype(dtype) and how in ["min", "max"]:
Member

Is there a way to avoid special-casing these functions here? Where is the return value from other functions being handled?

Member Author

Yeah, good call. We only get here with min/max today. If we do end up here with other ops at some point in the future, either (a) the dtype is already correct, in which case _from_sequence is O(1), or (b) we want to cast. So I've removed the condition.
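For readers following along, a minimal sketch of the cast-back step being discussed (this uses the pandas-internal `ExtensionDtype.construct_array_type` hook and the private `_from_sequence` constructor; the helper name is hypothetical and signatures may vary across versions):

```python
import pandas as pd

# Hypothetical helper mirroring the discussion above: rebuild an
# extension array of the requested dtype from raw result values.
# If the values already match the dtype this is cheap; otherwise
# it performs the desired cast.
def cast_back(res_values, dtype):
    cls = dtype.construct_array_type()
    return cls._from_sequence(res_values, dtype=dtype)

out = cast_back(["a", pd.NA], pd.StringDtype())
```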

@rhshadrach rhshadrach requested a review from WillAyd March 29, 2025 12:49
Labels
Bug Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Reduction Operations sum, mean, min, max, etc. Strings String extension data type and string data

Successfully merging this pull request may close these issues.

BUG: Inconsistent dtype with GroupBy for StrDtype and all missing values
3 participants